healthcare - RISC2 Project

Developing Efficient Scientific Gateways for Bioinformatics in Supercomputer Environments Supported by Artificial Intelligence

wp_risc — Mon, 20 Mar 2023 09:37:46 +0000

Scientific gateways bring enormous benefits to end users by simplifying access and hiding the complexity of the underlying distributed computing infrastructure, but they require significant development and maintenance efforts. BioinfoPortal^[1], through its CSGrid^[2] middleware, takes advantage of the heterogeneous resources of the Santos Dumont^[3] supercomputer. However, task submission still requires a substantial step: deciding the configuration that leads to efficient execution. This project aims to develop green and intelligent scientific gateways for BioinfoPortal, supported by high-performance computing (HPC) environments and specialised technologies such as scientific workflows, data mining, machine learning, and deep learning. The efficient analysis and interpretation of Big Data opens new challenges in exploring molecular biology, genetics, biomedicine, and healthcare to improve personalised diagnostics and therapeutics; finding new avenues to deal with this massive amount of information becomes necessary. New Bioinformatics and Computational Biology paradigms drive data storage, management, and access. HPC and Big Data advances in this domain represent both a vast new field of opportunities for bioinformatics researchers and a significant challenge. The BioinfoPortal science gateway is a multiuser Brazilian infrastructure. We present several challenges for efficiently executing applications and discuss the findings on improving the use of computational resources. We performed several large-scale bioinformatics experiments that are considered computationally intensive and time-consuming. We are currently coupling artificial intelligence to generate models that analyze computational and bioinformatics metadata, to understand how machine learning can predict the efficient use of computational resources.
The computational executions are conducted at Santos Dumont, the largest supercomputer in Latin America, dedicated to the research community with 5.1 Petaflops and 36,472 computational cores distributed in 1,134 computational nodes.
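
To make the "best configuration" decision more concrete, here is a minimal Python sketch. It is not the BioinfoPortal implementation: it only shows one simple way past execution metadata could guide the choice of core count, by computing speedup and parallel efficiency and keeping the largest core count whose efficiency stays above a threshold. The runtimes in `past_runs`, the 0.7 threshold, and both function names are hypothetical.

```python
def parallel_efficiency(runtimes):
    """Map core count -> (speedup, efficiency), relative to the smallest core count.

    `runtimes` maps a core count to the measured wall-clock time in seconds.
    """
    base_cores = min(runtimes)
    base_time = runtimes[base_cores]
    metrics = {}
    for cores, seconds in sorted(runtimes.items()):
        speedup = base_time / seconds
        # Efficiency is the speedup divided by the increase in core count.
        efficiency = speedup * base_cores / cores
        metrics[cores] = (speedup, efficiency)
    return metrics


def pick_core_count(runtimes, min_efficiency=0.7):
    """Return the largest core count whose efficiency is still acceptable."""
    metrics = parallel_efficiency(runtimes)
    acceptable = [c for c, (_, eff) in metrics.items() if eff >= min_efficiency]
    # The baseline always has efficiency 1.0, so `acceptable` is never empty.
    return max(acceptable)


# Hypothetical runtimes (seconds) of one application at several core counts.
past_runs = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 180.0, 16: 150.0}

for cores, (speedup, eff) in parallel_efficiency(past_runs).items():
    print(f"{cores:>2} cores: speedup {speedup:.2f}, efficiency {eff:.2f}")
print("suggested core count:", pick_core_count(past_runs))
```

A learned model, as described above, would replace the lookup in `past_runs` with a prediction for configurations that have not been run yet; the selection rule could stay the same.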

By:

A. Carneiro, B. Fagundes, C. Osthoff, G. Freire, K. Ocaña, L. Cruz, L. Gadelha, M. Coelho, M. Galheigo, and R. Terra are with the National Laboratory of Scientific Computing, Rio de Janeiro, Brazil.

D. Carvalho is with the Federal Center for Technological Education Celso Suckow da Fonseca, Rio de Janeiro, Brazil.

Douglas Cardoso is with the Polytechnic Institute of Tomar, Portugal.

F. Boito and L. Teylo are with the University of Bordeaux, CNRS, Bordeaux INP, INRIA, LaBRI, Talence, France.

P. Navaux is with the Informatics Institute, Federal University of Rio Grande do Sul, Rio Grande do Sul, Brazil.

References:

Ocaña, K. A. C. S.; Galheigo, M.; Osthoff, C.; Gadelha, L. M. R.; Porto, F.; Gomes, A. T. A.; Oliveira, D.; Vasconcelos, A. T. BioinfoPortal: A scientific gateway for integrating bioinformatics applications on the Brazilian national high-performance computing network. Future Generation Computer Systems, v. 107, p. 192-214, 2020.

Mondelli, M. L.; Magalhães, T.; Loss, G.; Wilde, M.; Foster, I.; Mattoso, M. L. Q.; Katz, D. S.; Barbosa, H. J. C.; Vasconcelos, A. T. R.; Ocaña, K. A. C. S.; Gadelha, L. BioWorkbench: A High-Performance Framework for Managing and Analyzing Bioinformatics Experiments. PeerJ, v. 1, p. 1, 2018.

Coelho, M.; Freire, G.; Ocaña, K.; Osthoff, C.; Galheigo, M.; Carneiro, A. R.; Boito, F.; Navaux, P.; Cardoso, D. O. Desenvolvimento de um Framework de Aprendizado de Máquina no Apoio a Gateways Científicos Verdes, Inteligentes e Eficientes: BioinfoPortal como Caso de Estudo Brasileiro. In: XXIII Simpósio em Sistemas Computacionais de Alto Desempenho – WSCAD 2022 (https://wscad.ufsc.br/), 2022.

Terra, R.; Ocaña, K.; Osthoff, C.; Cruz, L.; Boito, F.; Navaux, P.; Carvalho, D. Framework para a Construção de Redes Filogenéticas em Ambiente de Computação de Alto Desempenho. In: XXIII Simpósio em Sistemas Computacionais de Alto Desempenho – WSCAD 2022 (https://wscad.ufsc.br/), 2022.

Ocaña, K.; Cruz, L.; Coelho, M.; Terra, R.; Galheigo, M.; Carneiro, A.; Carvalho, D.; Gadelha, L.; Boito, F.; Navaux, P.; Osthoff, C. ParslRNA-Seq: an efficient and scalable RNAseq analysis workflow for studies of differentiated gene expression. In: Latin America High-Performance Computing Conference (CARLA), 2022, Rio Grande do Sul, Brazil. Proceedings of the Latin American High-Performance Computing Conference – CARLA 2022 (http://www.carla22.org/), 2022.

^[1] https://bioinfo.lncc.br/

^[2] https://git.tecgraf.puc-rio.br/csbase-dev/csgrid/-/tree/CSGRID-2.3-LNCC

^[3] https://sdumont.lncc.br

The post Developing Efficient Scientific Gateways for Bioinformatics in Supercomputer Environments Supported by Artificial Intelligence first appeared on RISC2 Project.

Using supercomputing for accelerating life science solutions

wp_risc — Tue, 01 Nov 2022 14:11:06 +0000

The world of High Performance Computing (HPC) is now moving towards exascale performance, i.e. the ability to perform 10¹⁸ operations per second. A variety of applications will be improved to take advantage of this computing power, leading to better predictions and models in different fields, such as Environmental Sciences, Artificial Intelligence, Material Sciences and Life Sciences.

In Life Sciences, HPC advancements can improve different areas:

a reduced time to scientific discovery;
the ability to generate the predictions necessary for precision medicine;
new healthcare and genomics-driven research approaches;
the processing of huge datasets for deep and machine learning;
the optimization of modeling, such as Computer Aided Drug Design (CADD);
enhanced security and protection of healthcare data in HPC environments, in compliance with European GDPR regulations;
management of massive amounts of data, for example for clinical trials, drug development and genomics data analytics.

The outbreak of COVID-19 has further accelerated this progress from different points of view. Some European projects aim at reusing known active ingredients to prepare new drugs as therapies against COVID-19 [Exscalate4CoV, Ligate], while others focus on the management and monitoring of contagion clusters, providing an innovative approach to learn from the SARS-CoV-2 crisis and derive recommendations for future waves and pandemics [Orchestra].

The ability to deal with massive amounts of data in HPC environments is also used to create databases of nucleic acid sequencing data and use them to detect allelic variant frequencies, as in the NIG project [Nig], a collaboration with the Network for Italian Genomes. Another example of this capability is the set-up of a data sharing platform based on novel Federated Learning schemes, to advance research in personalised medicine for haematological diseases [Genomed4All].

Supercomputing is widely used in Drug Design (the process of finding medicines for diseases for which there are no or insufficient treatments), and many projects, like RISC2, are active in this field.

Sometimes, when there is no previous knowledge of the biological target, as happened with COVID-19, discovering new drugs requires creating new molecules from scratch [Novartis]. This process involves billion-dollar investments to produce and test thousands of molecules, and it usually has a low success rate: only about 12% of potential drugs entering clinical development are approved [Engitix]. The whole process, from identifying a possible compound to the end of the clinical trial, can take up to 10 years. Nowadays the coverage of diseases is uneven: most compounds are used for genetic conditions, while only a few antivirals and antibiotics have been found.

The search for candidate drugs occurs mainly through two different approaches: high-throughput screening and virtual screening. The first is more reliable but also very expensive and time-consuming: it is usually applied to well-known targets, mainly by pharmaceutical companies. The second is a good compromise between cost and accuracy and is typically applied to relatively new targets in academic laboratories, where it is also used to discover or better understand the mechanisms of these targets [Liu2016].

Candidate drugs are usually small molecules that bind to a specific protein or part of it, inhibiting the usual activity of the protein itself. For example, binding the correct ligand to a viral enzyme may stop a viral infection. In virtual screening, millions of compounds are screened against the target protein at different levels: the most basic one simply takes into account whether the shape fits correctly into the protein; at higher levels, other features are also considered, such as specific interactions, protein flexibility, solubility, human tolerance, and so on. A “score” is assigned to each docked ligand, and the compounds with the highest scores are studied further. With massively parallel computers, we can rapidly filter extremely large molecule databases (e.g. billions of molecules).
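
The score-and-filter step can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not the scoring used by any real docking code: `toy_score` is a hypothetical shape-only score that compares a ligand volume with a pocket volume, and the ligand names and volumes are invented. The point is the filtering pattern: ligands are scored one at a time and only the best few are kept, so the whole library never has to be held in memory.

```python
import heapq


def toy_score(ligand_volume, pocket_volume):
    """Hypothetical shape-only score: higher is better, 0.0 is a perfect fit."""
    return -abs(ligand_volume - pocket_volume)


def screen(ligands, pocket_volume, keep):
    """Return the `keep` best (score, name) pairs, best first.

    `ligands` is any iterable of (name, volume) pairs, so it can be a
    generator that streams a very large library from disk.
    """
    scored = ((toy_score(volume, pocket_volume), name) for name, volume in ligands)
    return heapq.nlargest(keep, scored)


# Invented example library: (ligand name, volume in cubic angstroms).
library = [("lig-A", 310.0), ("lig-B", 455.0), ("lig-C", 402.0), ("lig-D", 390.0)]

hits = screen(library, pocket_volume=400.0, keep=2)
for score, name in hits:
    print(f"{name}: score {score:.1f}")
```

On a parallel machine, each worker could run `screen` on its own slice of the library and the per-worker hit lists could then be merged with one more `heapq.nlargest` call.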

The current computational power of HPC clusters allows us to analyze up to 3 million compounds per second [Exscalate]. Even though vaccines were developed remarkably quickly, effective drug treatments for people already suffering from COVID-19 were scarce at the beginning of the pandemic. At that time, supercomputers around the world were asked to help with drug design, a real-world example of the power of Urgent Computing. CINECA participates in Exscalate4CoV [Exscalate4CoV], currently the most advanced center of competence for fighting the coronavirus, combining the most powerful supercomputing resources and Artificial Intelligence with experimental facilities and clinical validation.
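
As a rough back-of-the-envelope check of what that throughput means, the short Python sketch below converts library sizes into screening times. It assumes the quoted peak rate of 3 million compounds per second is sustained for the whole run, which is an idealisation; the function name and the library sizes are chosen only for illustration.

```python
def screening_time_seconds(n_compounds, compounds_per_second=3_000_000):
    """Idealised screening time, assuming the peak rate is sustained throughout."""
    return n_compounds / compounds_per_second


# Illustrative library sizes: one million, one billion, one trillion compounds.
for n_compounds in (10**6, 10**9, 10**12):
    seconds = screening_time_seconds(n_compounds)
    print(f"{n_compounds:.0e} compounds: {seconds:,.1f} s ({seconds / 3600:.2f} h)")
```

Under this assumption a library of one billion molecules takes about 333 seconds, i.e. under six minutes.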

References

[Engitix] https://engitix.com/technology/

[Exscalate] https://www.exscalate.eu/en/projects.html

[Exscalate4CoV] https://www.exscalate4cov.eu/

[Genomed4All] https://genomed4all.eu/

[Ligate] https://www.ligateproject.eu/

[Liu2016] T. Liu, D. Lu, H. Zhang, M. Zheng, H. Yang, Ye. Xu, C. Luo, W. Zhu, K. Yu, and H. Jiang, “Applying high-performance computing in drug discovery and molecular simulation,” Natl Sci Rev. 2016 Mar; 3(1): 49–63.

[Nig] http://www.nig.cineca.it/

[Novartis] https://www.novartis.com/stories/art-drug-design-technological-age

[Orchestra] https://orchestra-cohort.eu/

By CINECA

National Laboratory for Scientific Computing participated in the ISC2021

wp_risc — Fri, 13 Aug 2021 09:55:06 +0000

The National Laboratory for Scientific Computing (LNCC), one of the RISC2 partners from Brazil, presented two posters at the Event for High Performance Computing, Machine Learning and Data Analysis (ISC) 2021.

The posters “Developing Efficient Scientific Gateways for Bioinformatics in Supercomputing Environments Supported by Artificial Intelligence” and “Scalable Numerical Method for Biphasic Flows in Heterogeneous Porous Media in High-Performance Computational Environments” are part of the activities of the LNCC RISC2 projects.

According to Carla Osthoff (LNCC), the former poster presents a collaboration project that aims to develop green and intelligent scientific gateways for bioinformatics supported by high-performance computing (HPC) environments and specialized technologies such as scientific workflows, data mining, machine learning, and deep learning. The efficient analysis and interpretation of Big Data opens new challenges in exploring molecular biology, genetics, biomedicine, and healthcare to improve personalized diagnostics and therapeutics; finding new avenues to deal with this massive amount of information therefore becomes necessary. New paradigms in Bioinformatics and Computational Biology drive the storing, managing, and accessing of data. HPC and Big Data advances in this domain represent both a vast new field of opportunities for bioinformatics researchers and a significant challenge. The BioinfoPortal science gateway is a multiuser Brazilian infrastructure for bioinformatics applications, benefiting from the HPC infrastructure. We present several challenges for efficiently executing applications and discuss the findings on how to improve the use of computational resources. We performed several large-scale bioinformatics experiments that are considered computationally intensive and time-consuming. We are currently coupling artificial intelligence to generate models that analyze computational and bioinformatics metadata, to understand how machine learning can predict the efficient use of computational resources. The computational executions are carried out on the Santos Dumont supercomputer. This is a multi-disciplinary project requiring expertise from several knowledge areas from four research institutes (LNCC, UFRGS, INRIA Bordeaux, and CENAT in Costa Rica). Finally, Brazilian funding agencies (CNPq, CAPES) and the RISC2 project from the European Economic and Social Committee (EESC) support the project.

The latter poster presents a project that aims to develop a scalable numerical approach for biphasic flows in heterogeneous porous media in high-performance computing environments, based on a high-performance numerical methodology. In this system, an elliptic subsystem determines the velocity field, and a non-linear hyperbolic equation represents the transport of the flowing phases (the saturation equation). The model applies a locally conservative finite element method for the mixture velocity. Furthermore, the model employs a high-order non-oscillatory finite volume method, based on central schemes, for the non-linear hyperbolic equation that governs phase saturation. Specifically, the project aims to build scalable codes for a high-performance environment. Having identified the bottlenecks in the code, the project is now working in four research areas: parallel I/O routines and high-performance visualization, to decrease the I/O transfer bottleneck; parallel programming, to reduce code bottlenecks on multicore and manycore architectures; and Adaptive MPI, to decrease the message communication bottleneck. The poster presents the first performance evaluation results used to guide the project research areas. This endeavor is a multi-disciplinary project requiring expertise from several knowledge areas from four research institutes (LNCC, UFRGS, UFLA in Brazil, and CENAT in Costa Rica). Finally, Brazilian funding agencies (CNPq, CAPES) and the RISC2 project support the project.
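
To give a feel for what a central scheme for the saturation equation looks like, here is a minimal one-dimensional Python sketch. It is not the project's code and it is much simpler than the method described above: it uses the first-order Lax-Friedrichs scheme (the simplest member of the central-scheme family) instead of a high-order non-oscillatory one, it assumes a constant total velocity so that no elliptic solve is needed, and it assumes the classic Buckley-Leverett fractional flow with quadratic relative permeabilities and a viscosity ratio of 1. All function names and parameter values are illustrative.

```python
def fractional_flow(s, viscosity_ratio=1.0):
    """Buckley-Leverett flux f(s) = s^2 / (s^2 + M (1 - s)^2)."""
    return s * s / (s * s + viscosity_ratio * (1.0 - s) ** 2)


def lax_friedrichs_step(s, dt, dx):
    """Advance the saturation profile `s` by one time step of size `dt`."""
    flux = [fractional_flow(value) for value in s]
    new = s[:]
    for i in range(1, len(s) - 1):
        # Central average of the neighbours minus a central flux difference.
        new[i] = 0.5 * (s[i - 1] + s[i + 1]) - dt / (2.0 * dx) * (flux[i + 1] - flux[i - 1])
    new[0] = 1.0       # injection boundary: fully saturated inflow
    new[-1] = new[-2]  # outflow boundary: copy the neighbouring cell
    return new


def simulate(n_cells=100, t_end=0.1, cfl=0.4):
    """Solve ds/dt + df(s)/dx = 0 on [0, 1], starting from s = 0 in the interior."""
    dx = 1.0 / n_cells
    # For viscosity ratio 1 the largest wave speed is f'(0.5) = 2.
    dt = cfl * dx / 2.0
    s = [0.0] * n_cells
    s[0] = 1.0
    t = 0.0
    while t < t_end:
        s = lax_friedrichs_step(s, dt, dx)
        t += dt
    return s


profile = simulate()
for i in range(0, len(profile), 10):
    print(f"x = {i / len(profile):.2f}  saturation = {profile[i]:.3f}")
```

The update of each cell depends only on its two neighbours, which is what makes schemes of this kind attractive for domain decomposition with MPI: each process only needs to exchange boundary cells with its neighbours at every step.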