exascale computing - RISC2 Project
https://www.risc2-project.eu

Subsequent Progress And Challenges Concerning The México-UE Project ENERXICO: Supercomputing And Energy For México
https://www.risc2-project.eu/2023/05/24/subsequent-progress-and-challenges-concerning-the-mexico-ue-project-enerxico-supercomputing-and-energy-for-mexico/
Wed, 24 May 2023 09:38:01 +0000

In this short note, we briefly describe some subsequent advances and challenges concerning two work packages developed in the ENERXICO Project. This work opened the possibility of collaborating with colleagues from institutions that did not participate in the project, for example from the University of Santander in Colombia and from the University of Vigo in Spain. This exemplifies the importance of the RISC2 project, in the sense that strengthening collaboration and finding joint research areas and applied HPC ventures is of great benefit to both our Latin American countries and the EU. We are now initiating talks to target several energy-related topics with some of the RISC2 partners. 

The ENERXICO Project focused on developing advanced simulation software solutions for the oil & gas, wind energy and transportation powertrain industries. The institutions that collaborated in the project were, for México: ININ (institution responsible for México), Centro de Investigación y de Estudios Avanzados del IPN (Cinvestav), Universidad Nacional Autónoma de México (UNAM IINGEN, FCUNAM), Universidad Autónoma Metropolitana-Azcapotzalco, Instituto Mexicano del Petróleo, Instituto Politécnico Nacional (IPN) and Pemex; and for the European Union: Barcelona Supercomputing Center (institution responsible for the EU), Technische Universität München (TUM, Germany), Université Grenoble Alpes (UGA, France), CIEMAT (Spain), Repsol, Iberdrola, Bull (France) and Universidad Politécnica de Valencia (Spain). 

The project comprised four work packages (WPs): 

WP1 Exascale Enabling: This was a cross-cutting work package that focused on assessing performance bottlenecks and improving the efficiency of the HPC codes proposed in the vertical WPs (EU coordinator: BULL; MEX coordinator: CINVESTAV-COMPUTACIÓN); 

WP2 Renewable energies: This WP deployed new applications required to design, optimize and forecast the production of wind farms (EU coordinator: IBR; MEX coordinator: ININ); 

WP3 Oil and gas energies: This WP addressed the impact of HPC on the entire oil industry chain (EU coordinator: REPSOL; MEX coordinator: ININ); 

WP4 Biofuels for transport: This WP carried out advanced numerical simulations of biofuels under conditions similar to those of an engine (EU coordinator: UPV-CMT; MEX coordinator: UNAM). 

For WP1, the following codes were optimized for exascale computers: Alya, BSIT, DualSPHysics, ExaHyPE, SeisSol, SEM46 and WRF. 

As an example, we present some of the results for the DualSPHysics code. We evaluated two architectures. The first set of hardware consisted of identical nodes, each equipped with two Intel Xeon Gold 6248 processors clocked at 2.5 GHz and about 192 GB of system memory; each node contained four NVIDIA Tesla V100 GPUs with 32 GB of memory each. The second set of hardware consisted of identical nodes, each equipped with two AMD Milan 7763 processors clocked at 2.45 GHz and about 512 GB of system memory; each node contained four NVIDIA A100 (Ampere) GPUs with 40 GB of memory each. The code was compiled and linked with CUDA 10.2 and OpenMPI 4. The application was executed using one GPU per MPI rank. 

In Figures 1 and 2 we show the scalability of the code for the strong and weak scaling tests, which indicate that the scaling is very good. Motivated by these excellent results, we are in the process of performing new SPH simulations on the LUMI supercomputer with up to 26,834 million particles, run on up to 500 GPUs, i.e. 53.7 million particles per GPU. These simulations will be done initially for a Wave Energy Converter (WEC) farm (see Figure 3), and later for turbulent models. 
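
As a rough illustration of how such scaling results are typically quantified (this is a generic sketch, not the ENERXICO benchmark scripts, and the timing numbers are invented for the example), strong-scaling speedup and parallel efficiency can be computed as follows:

```python
# Minimal sketch of how strong- and weak-scaling results are usually quantified.
# The timing numbers below are made up for illustration; they are NOT ENERXICO results.

def strong_scaling(times_by_gpus, base_gpus):
    """Speedup and parallel efficiency for a fixed total problem size."""
    t_base = times_by_gpus[base_gpus]
    out = {}
    for gpus, t in sorted(times_by_gpus.items()):
        speedup = t_base / t
        ideal = gpus / base_gpus
        out[gpus] = (speedup, speedup / ideal)   # (speedup, parallel efficiency)
    return out

def weak_scaling(times_by_gpus, base_gpus):
    """Efficiency for a problem size that grows with the number of GPUs."""
    t_base = times_by_gpus[base_gpus]
    return {gpus: t_base / t for gpus, t in sorted(times_by_gpus.items())}

if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds for a fixed-size run (strong scaling).
    strong = {4: 1000.0, 8: 520.0, 16: 270.0, 32: 145.0}
    # Hypothetical times when particles per GPU are kept constant (weak scaling).
    weak = {4: 100.0, 8: 102.0, 16: 105.0, 32: 110.0}

    for gpus, (s, e) in strong_scaling(strong, base_gpus=4).items():
        print(f"strong: {gpus:3d} GPUs  speedup={s:5.2f}  efficiency={e:6.2%}")
    for gpus, e in weak_scaling(weak, base_gpus=4).items():
        print(f"weak:   {gpus:3d} GPUs  efficiency={e:6.2%}")
```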

Figure 1. Strong scaling test with a fixed number of particles but an increasing number of GPUs.

Figure 2. Weak scaling test with an increasing number of particles and GPUs.

Figure 3. Wave Energy Converter (WEC) Farm (taken from https://corpowerocean.com/).

As part of WP3, ENERXICO developed a first version of a computer code called Black Hole (or BH code) for the numerical simulation of oil reservoirs, based on the numerical technique known as Smoothed Particle Hydrodynamics (SPH). This new code is an extension of the DualSPHysics code (https://dual.sphysics.org/) and is the first SPH-based code developed for the numerical simulation of oil reservoirs; it has important benefits over commercial codes based on other numerical techniques. 

The BH code is a large-scale, massively parallel reservoir simulator capable of performing simulations with billions of “particles” or fluid elements that represent the system under study. It contains improved multi-physics modules that automatically combine the effects of interrelated physical and chemical phenomena to accurately simulate in-situ recovery processes. This has led to the development of a graphical user interface, conceived as a multi-platform application for code execution and visualization, for carrying out simulations with data provided by industrial partners, and for performing comparisons with available commercial packages. 

Furthermore, a considerable effort is presently being made to simplify the process of setting up the input for reservoir simulations from exploration data, by means of a workflow fully integrated in our industrial partners’ software environment. A crucial part of the numerical simulations is the equation of state. We have developed an equation of state based on crude oil data (the so-called PVT data) in two forms: the first as a subroutine that is integrated into the code, and the second as a subroutine that interpolates property tables generated from the equation-of-state subroutine. 
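
As a purely illustrative sketch of the second, table-based form (the property names, pressure grid, values and linear-interpolation scheme below are assumptions for this example, not the actual BH/PVT implementation), such a lookup could be structured like this:

```python
# Illustrative sketch of interpolating fluid properties from a precomputed PVT table.
# The properties, grid and values are invented; the actual BH code may use different
# properties, units and interpolation schemes.
import numpy as np

# Pressure grid (bar) and tabulated properties generated offline by an
# equation-of-state routine.
pressure_bar = np.array([50.0, 100.0, 150.0, 200.0, 250.0])
oil_density = np.array([780.0, 800.0, 815.0, 827.0, 836.0])      # kg/m^3
oil_viscosity = np.array([2.1, 1.8, 1.6, 1.45, 1.35])            # mPa*s

def pvt_lookup(p_bar: float) -> dict:
    """Linearly interpolate tabulated PVT properties at a given pressure."""
    p = np.clip(p_bar, pressure_bar[0], pressure_bar[-1])  # stay inside the table
    return {
        "density": float(np.interp(p, pressure_bar, oil_density)),
        "viscosity": float(np.interp(p, pressure_bar, oil_viscosity)),
    }

print(pvt_lookup(120.0))  # e.g. {'density': 806.0, 'viscosity': 1.72}
```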

An oil reservoir is composed of a porous medium (rock and other solids) containing a multiphase fluid of oil, gas and water. The aim of the code is to simulate fluid flow in a porous medium, as well as the behaviour of the system at different pressures and temperatures. The tool should allow the reduction of uncertainties in the predictions that are carried out. For example, it may answer questions about the benefits of injecting a solvent, which could be CO2, nitrogen, combustion gases, methane, etc., into a reservoir, and about the breakthrough times of the gases in the production wells. With these estimates, operators can take the necessary measures to mitigate their presence, and calculate the cost, the injection pressure, the injection volumes and, most importantly, where and for how long to inject. The same happens with more complex processes, such as those where fluids, air or steam are injected and interact with the rock, oil, water and gas present in the reservoir. The simulator should be capable of supporting the monitoring and preparation of measurement plans. 

In order to perform a simulation of an oil reservoir field, an initial model needs to be created. Using geophysical forward and inverse numerical techniques, the ENERXICO project evaluated novel, high-performance simulation packages for challenging seismic exploration cases characterized by extreme geometric complexity. We are now exploring high-order methods based on fully unstructured tetrahedral meshes, as well as tree-structured Cartesian meshes with adaptive mesh refinement (AMR), for better spatial resolution. Using this methodology, our packages (and some commercial packages), together with seismic and geophysical data of naturally fractured reservoir oil fields, are able to create the geometry (see Figure 4) and reproduce basic properties of the oil reservoir field we want to study. A number of numerical simulations are then performed, and from these, exploitation scenarios for the oil fields are generated.

 

Figure 4. A detail of the initial model for a SPH simulation of a porous medium.

 

More information about the ENERXICO Project can be found at https://enerxico-project.eu/

By: Jaime Klapp (ININ, México) and Isidoro Gitler (Cinvestav, México)

 

 

 

 

Webinar: Improving energy-efficiency of High-Performance Computing clusters
https://www.risc2-project.eu/events/webinar-7-improving-energy-efficiency-of-high-performance-computing-clusters/
Thu, 26 Jan 2023 13:37:07 +0000


Date: April 26, 2023 | 3 p.m. (UTC+1)

Speakers: Lubomir Riha and Ondřej Vysocký, IT4Innovations National Supercomputing Center

Moderator: Esteban Mocskos, Universidad de Buenos Aires

High-Performance Computing centers consume megawatts of electrical power, which is a limiting factor in building bigger systems on the path to exascale and post-exascale clusters. Such high power consumption leads to several challenges, including the need for a robust power supply and distribution network, enormous energy bills, and significant CO2 emissions. To increase power efficiency, vendors are adding various kinds of heterogeneous hardware, which users’ applications must fully utilize in order to run efficiently. Such requirements may be hard to fulfill, which opens up the possibility of limiting the available resources to obtain additional power and energy savings with little or no performance penalty.

The talk will present best practices on how to grant users the rights to control hardware parameters, how to measure the energy consumption of the hardware, and what can be expected from energy-saving activities based on hardware tuning.
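
As a small, hedged illustration of one way such measurements can be taken on Linux systems that expose Intel RAPL counters through the powercap interface (a generic sketch, not the tooling presented in the webinar; the sysfs path varies per system and reading it may require elevated permissions):

```python
# Sketch: estimate the energy consumed by a code region using the Linux powercap
# interface for Intel RAPL. Paths and availability vary between systems; reading
# energy_uj may require root on recent kernels. This is illustrative only.
import time

RAPL_DOMAIN = "/sys/class/powercap/intel-rapl:0"  # package 0; may differ per node

def read_energy_uj() -> int:
    with open(f"{RAPL_DOMAIN}/energy_uj") as f:
        return int(f.read().strip())

def read_max_range_uj() -> int:
    with open(f"{RAPL_DOMAIN}/max_energy_range_uj") as f:
        return int(f.read().strip())

def measure(workload):
    """Return (seconds, joules) consumed while running the given callable."""
    e0, t0 = read_energy_uj(), time.time()
    workload()
    e1, t1 = read_energy_uj(), time.time()
    delta = e1 - e0
    if delta < 0:                      # the cumulative counter wrapped around
        delta += read_max_range_uj()
    return t1 - t0, delta / 1e6        # microjoules -> joules

if __name__ == "__main__":
    seconds, joules = measure(lambda: sum(i * i for i in range(10_000_000)))
    print(f"{seconds:.2f} s, {joules:.1f} J, avg power {joules / seconds:.1f} W")
```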

About the speakers:

Lubomir Riha, Ph.D., is the Head of the Infrastructure Research Lab at IT4Innovations National Supercomputing Center. Previously, he was a research scientist in the High-Performance Computing Lab at George Washington University, ECE Department. He received his Ph.D. degree in Electrical Engineering from the Czech Technical University in Prague, Czech Republic, and a Ph.D. degree in Computer Science from Bowie State University, USA. Currently, he is a local principal investigator of two EuroHPC Centers of Excellence, MAX and SPACE, and two EuroHPC projects, SCALABLE and EUPEX (which designs a prototype of the European exascale machine). Previously, he was a local PI of the H2020 Center of Excellence POP2 and the H2020-FET HPC READEX project. His research interests are the optimization of HPC applications, energy-efficient computing, the acceleration of scientific and engineering applications using GPUs and many-core accelerators, and parallel and distributed rendering.

Ondrej Vysocky is a Ph.D. candidate at VSB – Technical University of Ostrava, Czech Republic, and at the same time works at IT4Innovations in the Infrastructure Research Lab. His research focuses on energy efficiency in high-performance computing. He was an investigator of the Horizon 2020 READEX project, which dealt with the energy efficiency of parallel applications using dynamic tuning. Since then, he has been developing the MERIC library, a runtime system for energy measurement and hardware-parameter tuning during a parallel application run. Using the library, he is an investigator in several H2020 projects, including Performance Optimisation and Productivity (POP2) and the European Pilot for Exascale (EUPEX). He is also a member of the PowerStack initiative, which works on a holistic, extensible, and scalable approach to power management.

RISC2 webinar series aims to benefit HPC research and industry in Europe and Latin America
https://www.risc2-project.eu/2023/01/26/risc2-webinar-season-is-back-for-season-2/
Thu, 26 Jan 2023 13:32:50 +0000

After the success of the first four webinars, the RISC2 Webinar Series “HPC System & Tools” is back for its second season. The webinars will run from February 22 until May 2023.

Each webinar will present the state of the art in methods and tools for setting up and maintaining HPC hardware and software infrastructures. Each talk will last around 30-40 minutes, followed by a 10-15-minute moderated discussion with the audience.

There are already 4 webinars scheduled:

 

 

 

Webinar: Addressing the challenges of scientific visualization in the exascale age
https://www.risc2-project.eu/events/webinar-addressing-the-challenges-of-scientific-visualization-in-the-exascale-age/
Tue, 24 Jan 2023 10:56:42 +0000


Date: May 31, 2023 | 4 p.m. (UTC+1)

Speaker: João Barbosa, INESC TEC & MACC

Moderator: Bernd Mohr, Jülich Supercomputing Centre (JSC)

In the coming age of exascale computing, traditional post-hoc scientific visualization and analysis face challenges similar to those of numerical simulation. This talk will cover new scientific visualization methodologies for high-performance computing systems, specially designed for large-scale scientific visualization, that provide greater scalability, flexibility, and detail to overcome some of these challenges.

About the speaker: João Barbosa joined the Minho Advanced Computing Center (MACC) in March 2020 as a full-time researcher in High-Performance Computing, specializing in Scientific Visualization. Previously, he was part of the Texas Advanced Computing Center (TACC) Scalable Visualization team. As a Research Associate at TACC, João worked on several Scientific Visualization (SciVis) projects, ranging from high-level applications such as oil and gas to low-level high-performance software packages, in partnership with leading hardware and software companies. His current research focuses on high-performance, real-time, in-situ photo-realistic ray tracing for SciVis.

 

 

JUPITER Ascending – First European Exascale Supercomputer Coming to Jülich
https://www.risc2-project.eu/2023/01/02/jupiter-ascending-first-european-exascale-supercomputer-coming-to-julich/
Mon, 02 Jan 2023 12:14:22 +0000

It was finally decided in 2022: Forschungszentrum Jülich will be home to Europe’s first exascale computer. The supercomputer is set to be the first in Europe to surpass the threshold of one quintillion (a “1” followed by 18 zeros) calculations per second. The system will be acquired by the European supercomputing initiative EuroHPC JU. The exascale computer should help to solve important and urgent scientific questions regarding, for example, climate change, how to combat pandemics, and sustainable energy production, while also enabling the intensive use of artificial intelligence and the analysis of large data volumes. The overall costs for the system amount to 500 million euros. Of this total, 250 million euros is being provided by EuroHPC JU and a further 250 million euros in equal parts by the German Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MKW NRW).

The computer named JUPITER (short for “Joint Undertaking Pioneer for Innovative and Transformative Exascale Research”) will be installed in 2023/2024 on the campus of Forschungszentrum Jülich. It is intended that the system will be operated by the Jülich Supercomputing Centre (JSC), whose supercomputers JUWELS and JURECA currently rank among the most powerful in the world. JSC participated in the application procedure for a high-end supercomputer as a member of the Gauss Centre for Supercomputing (GCS), an association of the three German national supercomputing centres: JSC in Jülich, the High-Performance Computing Center Stuttgart (HLRS), and the Leibniz Supercomputing Centre (LRZ) in Garching. The competition was organized by the European supercomputing initiative EuroHPC JU, which was formed by the European Union together with European countries and private companies. 

JUPITER is now set to become the first European supercomputer to make the leap into the exascale class. In terms of computing power, it will be more powerful than 5 million modern laptops or PCs. Just like Jülich’s current supercomputer JUWELS, JUPITER will be based on a dynamic, modular supercomputing architecture, which Forschungszentrum Jülich has developed together with European and international partners in the EU’s DEEP research projects.

In a modular supercomputer, various computing modules are coupled together. This enables program parts of complex simulations to be distributed over several modules, ensuring that the various hardware properties can be optimally utilized in each case. Its modular construction also means that the system is well prepared for integrating future technologies such as quantum computing or neuromorphic modules, which emulate the neural structure of a biological brain.

Figure: Modular Supercomputing Architecture. Computing and storage modules of the exascale computer in its base configuration (blue), as well as optional modules (green) and modules for future technologies (purple) as possible extensions. 

In its base configuration, JUPITER will have an enormously powerful booster module with highly efficient GPU-based computation accelerators. Massively parallel applications are accelerated by this booster in a similar way to a turbocharger, for example to calculate high-resolution climate models, develop new materials, simulate complex cell processes and energy systems, advance basic research, or train next-generation, computationally intensive machine-learning algorithms.

One major challenge is the energy required for such large computing power. The average power is anticipated to be up to 15 megawatts. JUPITER has been designed as a “green” supercomputer and will be powered by green electricity. The envisaged warm-water cooling system should help to ensure that JUPITER achieves the highest efficiency values. At the same time, the cooling technology opens up the possibility of intelligently using the waste heat that is produced. For example, just like its predecessor system JUWELS, JUPITER will be connected to the new low-temperature network on the Forschungszentrum Jülich campus. Further potential applications for the waste heat from JUPITER are currently being investigated by Forschungszentrum Jülich.

By Jülich Supercomputing Centre (JSC)

 

Image: Germany’s fastest supercomputer JUWELS at Forschungszentrum Jülich, which is funded in equal parts by the Federal Ministry of Education and Research (BMBF) and the Ministry of Culture and Science of the State of North Rhine-Westphalia (MKW NRW) via the Gauss Centre for Supercomputing (GCS). (Copyright: Forschungszentrum Jülich / Sascha Kreklau)

RISC2 receives honors in 2022 HPCwire Readers’ and Editors’ Choice Awards
https://www.risc2-project.eu/2022/11/17/risc2-receives-honors-in-2022-hpcwire-readers-and-editors-choice-awards/
Thu, 17 Nov 2022 11:17:39 +0000

The RISC2 project has been recognised in the annual HPCwire Readers’ and Editors’ Choice Awards, presented at the 2022 International Conference for High Performance Computing, Networking, Storage, and Analysis (SC22), in Dallas, Texas. RISC2 was selected for the Best HPC Collaboration (Academia/Government/Industry).

Fabrizio Gagliardi, director of the RISC2, says “I feel particularly honored by this recognition on behalf of all the project members who have worked so hard to achieve in such a short time, and with limited resources, a considerable impact in promoting HPC activities in Latin America in collaboration with Europe”.

Editors’ Choice: Best HPC Collaboration (Academia/Government/Industry)

The RISC2 project, which follows on from the earlier RISC project, aims to promote and improve the relationship between research and industrial communities, focusing on HPC application and infrastructure deployment, between Europe and Latin America. Led by the Barcelona Supercomputing Center (BSC), RISC2 brings together 16 partners from 12 different countries.

About the HPCwire Readers’ and Editors’ Choice Awards

The list of winners was revealed at the SC22 HPCwire booth and on the HPCwire website.

The coveted annual HPCwire Readers’ and Editors’ Choice Awards are determined through a nomination and voting process with the global HPCwire community, as well as selections from the HPCwire editors. The awards are an annual feature of the publication and constitute prestigious recognition from the HPC community. They are revealed each year to kick off the annual supercomputing conference, which showcases high-performance computing, networking, storage, and data analysis.

“The 2022 Readers’ and Editors’ Choice Awards are exceptional, indeed. Solutions developed with HPC led the world out of the Pandemic, and we officially broke the Exascale threshold – HPC has now reached a billion, billion operations per second!” said Tom Tabor, CEO of Tabor Communications, publishers of HPCwire. “Between our worldwide readership of HPC experts and the most renowned panel of editors in the industry, the Readers’ and Editors’ Choice Awards represent resounding recognition of HPC accomplishments throughout the world. Our sincerest gratitude and hearty congratulations go out to all of the winners.”

Using supercomputing for accelerating life science solutions
https://www.risc2-project.eu/2022/11/01/using-supercomputing-for-accelerating-life-science-solutions/
Tue, 01 Nov 2022 14:11:06 +0000

The world of High Performance Computing (HPC) is now moving towards exascale performance, i.e. the ability to perform 10^18 operations per second. A variety of applications will be improved to take advantage of this computing power, leading to better predictions and models in different fields, such as Environmental Sciences, Artificial Intelligence, Material Sciences and Life Sciences.

In Life Sciences, HPC advancements can improve different areas:

  • a reduced time to scientific discovery;
  • the ability to generate predictions necessary for precision medicine;
  • new healthcare and genomics-driven research approaches;
  • the processing of huge datasets for deep and machine learning;
  • the optimization of modeling, such as Computer-Aided Drug Design (CADD);
  • enhanced security and protection of healthcare data in HPC environments, in compliance with the European GDPR regulation;
  • the management of massive amounts of data, for example for clinical trials, drug development and genomics data analytics.

The outbreak of COVID-19 has further accelerated this progress from different points of view. Some European projects aim at repurposing known active ingredients into new drugs against COVID-19 [Exscalate4CoV, Ligate], while others focus on the management and monitoring of contagion clusters, providing an innovative approach to learning from the SARS-CoV-2 crisis and deriving recommendations for future waves and pandemics [Orchestra].

The ability to deal with massive amounts of data in HPC environments is also used to create databases of nucleic acid sequencing data and use them to detect allelic variant frequencies, as in the NIG project [Nig], a collaboration with the Network for Italian Genomes. Another example of the use of this capability is the set-up of a data-sharing platform based on novel federated learning schemes to advance research in personalised medicine for haematological diseases [Genomed4All].

Supercomputing is widely used in drug design (the process of finding medicines for diseases for which there are no or insufficient treatments), with many projects, RISC2 among them, active in this field.

Sometimes, when there is no previous knowledge of the biological target, as happened with COVID-19, discovering new drugs requires creating new molecules from scratch [Novartis]. This process involves billion-dollar investments to produce and test thousands of molecules, and it usually has a low success rate: only about 12% of potential drugs entering clinical development are approved [Engitix]. The whole process, from identifying a possible compound to the end of the clinical trial, can take up to 10 years. Nowadays there is an uneven coverage of diseases: most of the compounds target genetic conditions, while only a few antivirals and antibiotics have been found.

The search for candidate drugs occurs mainly through two different approaches: high-throughput screening and virtual screening. The first is more reliable but also very expensive and time consuming: it is usually applied to well-known targets, mainly by pharmaceutical companies. The second approach is a good compromise between cost and accuracy and is typically applied against relatively new targets, in academic laboratories, where it is also used to discover or better understand the mechanisms of these targets [Liu2016].

Candidate drugs are usually small molecules that bind to a specific protein, or part of it, inhibiting the usual activity of the protein itself. For example, binding the correct ligand to a viral enzyme may stop viral infection. In the process of virtual screening, millions of compounds are screened against the target protein at different levels: the most basic one simply takes into account the shape needed to fit correctly into the protein, while at higher levels other features are also considered, such as specific interactions, protein flexibility, solubility, human tolerance, and so on. A “score” is assigned to each docked ligand: the compounds with the highest scores are studied further. With massively parallel computers, we can rapidly filter extremely large molecule databases (e.g. billions of molecules).
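
As a toy illustration of that last ranking step (the compound names, scores and threshold below are invented, and real pipelines obtain scores from dedicated docking engines running on millions to billions of molecules in parallel):

```python
# Toy sketch of the ranking/filtering step of a virtual screening campaign.
# Compound IDs and scores are invented; real workflows obtain scores from a
# docking engine and screen far larger libraries in parallel.
from heapq import nlargest

docking_scores = {           # higher score = better predicted binding (by convention here)
    "CMP-0001": 6.2,
    "CMP-0002": 8.9,
    "CMP-0003": 5.1,
    "CMP-0004": 9.4,
    "CMP-0005": 7.8,
}

def top_candidates(scores: dict, keep: int = 3, min_score: float = 6.0) -> list:
    """Keep the `keep` best-scoring compounds above a minimum score threshold."""
    passing = {cid: s for cid, s in scores.items() if s >= min_score}
    return nlargest(keep, passing.items(), key=lambda item: item[1])

for compound, score in top_candidates(docking_scores):
    print(f"{compound}: {score:.1f}")   # CMP-0004, CMP-0002, CMP-0005
```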

The current computational power of HPC clusters allows us to analyze up to 3 million compounds per second [Exscalate]. Even though vaccines were developed remarkably quickly, effective drug treatments for people already suffering from COVID-19 were still very new at the beginning of the pandemic. At that time, supercomputers around the world were asked to help with drug design, a real-world example of the power of Urgent Computing. CINECA participates in Exscalate4CoV [Exscalate4CoV], currently the most advanced center of competence for fighting the coronavirus, combining the most powerful supercomputing resources and Artificial Intelligence with experimental facilities and clinical validation. 

 

References

[Engitix] https://engitix.com/technology/

[Exscalate] https://www.exscalate.eu/en/projects.html

[Exscalate4CoV] https://www.exscalate4cov.eu/

[Genomed4All] https://genomed4all.eu/

[Ligate] https://www.ligateproject.eu/

[Liu2016] T. Liu, D. Lu, H. Zhang, M. Zheng, H. Yang, Ye. Xu, C. Luo, W. Zhu, K. Yu, and H. Jiang, “Applying high-performance computing in drug discovery and molecular simulation” Natl Sci Rev. 2016 Mar; 3(1): 49–63.

[Nig] http://www.nig.cineca.it/

[Novartis] https://www.novartis.com/stories/art-drug-design-technological-age

[Orchestra] https://orchestra-cohort.eu/

 

By CINECA

HPC meets AI and Big Data
https://www.risc2-project.eu/2022/10/06/hpc-meets-ai-and-big-data/
Thu, 06 Oct 2022 08:23:34 +0000

HPC services are no longer solely targeted at highly parallel modelling and simulation tasks. Indeed, the computational power offered by these services is now being used to support data-centric Big Data and Artificial Intelligence (AI) applications. By combining both types of computational paradigms, HPC infrastructures will be key to improving the lives of citizens, speeding up scientific breakthroughs in different fields (e.g., health, IoT, biology, chemistry, physics), and increasing the competitiveness of companies [OG+15, NCR+18].

As the utility and usage of HPC infrastructures increase, more computational and storage power is required to efficiently handle the targeted applications. In fact, many HPC centers are now aiming at exascale supercomputers supporting at least one exaFLOP (10^18 operations per second), which represents a thousandfold increase in processing power over the first petascale computer, deployed in 2008 [RD+15]. Although this is a necessary requirement for handling the increasing number of HPC applications, several outstanding challenges still need to be tackled so that this extra computational power can be fully leveraged. 

Management of large infrastructures and heterogeneous workloads: By adding more compute and storage nodes, one also increases the complexity of the overall distributed HPC infrastructure, making it harder to monitor and manage. This complexity is increased by the need to support highly heterogeneous applications that translate into different workloads with specific data storage and processing needs [ECS+17]. For example, on the one hand, traditional scientific modeling and simulation tasks require large slices of computational time, are CPU-bound, and rely on iterative approaches (parametric/stochastic modeling). On the other hand, data-driven Big Data applications comprise shorter computational tasks that are I/O-bound and, in some cases, have real-time response requirements (i.e., they are latency-oriented). Also, many of these applications leverage AI and machine learning tools that require specific hardware (e.g., GPUs) in order to be efficient.

Support for general-purpose analytics: The increased heterogeneity also demands that HPC infrastructures be able to support general-purpose AI and Big Data applications that were not designed explicitly to run on specialised HPC hardware [KWG+13]. Developers should therefore not be required to significantly change their applications so that they can execute efficiently on HPC clusters.

Avoiding the storage bottleneck: Only increasing the computational power and improving the management of HPC infrastructures may still not be enough to fully harness the capabilities of these infrastructures. In fact, Big Data and AI applications are data-driven and require efficient data storage and retrieval from HPC clusters. With an increasing number of applications and heterogeneous workloads, the storage systems supporting HPC may easily become a bottleneck [YDI+16, ECS+17]. Indeed, as pointed out by several studies, storage access time is one of the major bottlenecks limiting the efficiency of current and next-generation HPC infrastructures. 

In order to address these challenges, RISC2 partners are exploring:

  • New monitoring and debugging tools that can aid in the analysis of complex AI and Big Data workloads, in order to pinpoint potential performance and efficiency bottlenecks while helping system administrators and developers troubleshoot them [ENO+21].
  • Emerging virtualization technologies, such as containers, that enable users to efficiently deploy and execute traditional AI and Big Data applications in an HPC environment without requiring any changes to their source code [FMP21].
  • The Software-Defined Storage paradigm, in order to improve the Quality-of-Service (QoS) of HPC storage services when supporting hundreds to thousands of data-intensive AI and Big Data applications [DLC+22, MTH+22].

To sum up, these three research goals, and their respective contributions, will enable the next generation of HPC infrastructures and services to efficiently meet the demands of Big Data and AI workloads. 

 

References

[DLC+22] Dantas, M., Leitão, D., Cui, P., Macedo, R., Liu, X., Xu, W., Paulo, J., 2022. Accelerating Deep Learning Training Through Transparent Storage Tiering. IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid)  

[ECS+17] Joseph, E., Conway, S., Sorensen, B., Thorp, M., 2017. Trends in the Worldwide HPC Market (Hyperion Presentation). HPC User Forum at HLRS.  

[FMP21] Faria, A., Macedo, R., Paulo, J., 2021. Pods-as-Volumes: Effortlessly Integrating Storage Systems and Middleware into Kubernetes. Workshop on Container Technologies and Container Clouds (WoC’21). 

[KWG+13] Katal, A., Wazid, M. and Goudar, R.H., 2013. Big data: issues, challenges, tools and good practices. International conference on contemporary computing (IC3). 

[NCR+18] Netto, M.A., Calheiros, R.N., Rodrigues, E.R., Cunha, R.L. and Buyya, R., 2018. HPC cloud for scientific and business applications: Taxonomy, vision, and research challenges. ACM Computing Surveys (CSUR). 

[MTH+22] Macedo, R., Tanimura, Y., Haga, J., Chidambaram, V., Pereira, J., Paulo, J., 2022. PAIO: General, Portable I/O Optimizations With Minor Application Modifications. USENIX Conference on File and Storage Technologies (FAST). 

[OG+15] Osseyran, A. and Giles, M. eds., 2015. Industrial applications of high-performance computing: best global practices. 

[RD+15] Reed, D.A. and Dongarra, J., 2015. Exascale computing and big data. Communications of the ACM. 

[ENO+21] Esteves, T., Neves, F., Oliveira, R., Paulo, J., 2021. CaT: Content-aware Tracing and Analysis for Distributed Systems. ACM/IFIP Middleware conference (Middleware). 

[YDI+16] Yildiz, O., Dorier, M., Ibrahim, S., Ross, R. and Antoniu, G., 2016, May. On the root causes of cross-application I/O interference in HPC storage systems. IEEE International Parallel and Distributed Processing Symposium (IPDPS). 

 

By INESC TEC

Webinar: HPC system and job monitoring with LLview
https://www.risc2-project.eu/events/webinar-4-hpc-system-and-job-monitoring-with-llview/
Tue, 26 Jul 2022 12:39:25 +0000


Date: December 7, 2022 | 4 p.m. (UTC)

Speakers: Vitor Silva and Filipe Guimarães, Jülich Supercomputer Centre

Moderator: Esteban Mocskos, Universidad de Buenos Aires

Check the speakers’ presentation slides here. 

LLview is a monitoring infrastructure developed by the Jülich Supercomputing Centre with the objective of providing an easy-to-use and adaptable software suite for monitoring High Performance Computing systems. With the emergence of large heterogeneous machines in the exascale range, the challenges of monitoring such huge systems increase significantly. To address this, LLview is under continuous development in order to work on a wide range of hardware systems and software interfaces with negligible overhead, while at the same time providing fast, reliable access to job reports, system-wide monitoring data, and real-time system information. That information is provided to system users, project advisors, support teams and system administrators, helping with the management of jobs and the identification of performance issues at many levels, and also helping system administrators find failures and system malfunctions. This webinar gives an overview of the different LLview components and their interaction with each other and with the system. Moreover, particular attention is paid to the system monitoring views and the job reporting features, as they allow the entire life cycle of a job to be traced and can help identify problems and bottlenecks at a very early stage.

 

About the Speakers:

Vitor Silva received his Computer Science degree from the Universidade Federal de Minas Gerais. He earned his M.Sc. in Systems and Computer Engineering from the Universidade Federal do Rio de Janeiro and later received his Ph.D. from the Universidade Federal de Minas Gerais, this time in Nuclear Engineering. He worked as a software developer in the digital image processing field, but most of his career was in the Nuclear Engineering field, mainly working with computer modeling and solving neutronics and thermal-hydraulics problems related to nuclear reactors. He was also the main administrator of a small cluster system installed from scratch. Since 2021 he has been working at the Jülich Supercomputing Centre on monitoring tools and simulation.

Filipe Guimarães is a computational physicist. He graduated in Physics and holds an M.Sc. and a Ph.D. in Physics from the Universidade Federal Fluminense. He has been working with High Performance Computing since 2014 – initially on the user’s side, moving to the support side in 2020. Since then, one of his focuses has been improving the monitoring tools used and developed at the Jülich Supercomputing Centre.

About the Moderator: Esteban Mocskos is a full-time professor at the Universidad de Buenos Aires (UBA) and a researcher at the Center for Computer Simulation (CSC-CONICET). He received his Ph.D. in Computer Science from UBA in 2008 and was a postdoc in the Protein Modelling group at UBA. His research interests include distributed systems & blockchain, computer networks, processor architecture, and parallel programming. He is part of the steering committee of the Latin-American HPC CARLA conference and one of the committee members of Argentina’s National HPC system.

Webinar: Application Benchmarking with JUBE: Lessons Learned
https://www.risc2-project.eu/events/webinar-3-application-benchmarking-with-jube-lessons-learned/
Tue, 26 Jul 2022 12:36:09 +0000


Date: October 19, 2022 | 4 p.m. (UTC+1)

Speaker: Marc-André Hermanns, RWTH Aachen

Moderator: Bernd Mohr, Jülich Supercomputer Centre

JUBE can help automate application benchmarking on a given platform. JUBE’s automatic sandboxing and parameter-space creation features make it easy to sweep build and runtime parameters for an application on a given platform in order to identify the best build and run configuration.

This talk provides some lessons learned in building a JUBE-based benchmark suite for the RWTH Aachen University job mix that reduces redundancy of information and allows for easy integration of future applications. It specifically addresses advanced features for parameter settings and parameter inheritance, as well as some tips and tricks to overcome some of JUBE’s limitations.
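
As a purely conceptual illustration of what such a build/run parameter sweep looks like (this sketch is not JUBE syntax and does not use JUBE itself; the parameter names and values are invented), the Cartesian parameter space that JUBE manages could be enumerated like this:

```python
# Conceptual sketch of a build/run parameter-space sweep, in the spirit of what
# JUBE automates (sandboxed directories per combination, result collection, ...).
# This is NOT JUBE syntax; parameter names and values are invented for illustration.
from itertools import product

build_params = {
    "compiler": ["gcc", "icc"],
    "opt_flag": ["-O2", "-O3"],
}
run_params = {
    "nodes": [1, 2, 4],
    "tasks_per_node": [16, 32],
}

def sweep(*param_sets):
    """Yield one dict per combination of all build and run parameters."""
    merged = {k: v for ps in param_sets for k, v in ps.items()}
    names, values = zip(*merged.items())
    for combo in product(*values):
        yield dict(zip(names, combo))

for i, cfg in enumerate(sweep(build_params, run_params)):
    # JUBE would create an isolated (sandboxed) working directory per combination
    # and run the build/execute/analyse steps inside it.
    print(f"run {i:02d}: {cfg}")
```

In JUBE itself, such parameter sets are declared in its configuration files and expanded automatically into sandboxed work directories.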

About the speaker: Marc-André Hermanns is a member of the HPC group at the IT Center of RWTH Aachen University. His research focuses on tools and interfaces for the performance analysis of parallel applications. He has been involved in the design and implementation of various courses on parallel programming for high-performance computing. Next to supporting HPC users as part of the competence network for high-performance computing in North Rhine-Westphalia (HPC.NRW), he also contributes to the development of online tutorials and courses within the competence network. He is a long-time user of and advocate for JUBE and has created configurations for various applications and benchmarks, both for classical system benchmarking and for the integration of performance analysis tools into such workflows.

About the moderator: Bernd Mohr started designing and developing tools for the performance analysis of parallel programs with his diploma thesis (1987) at the University of Erlangen in Germany, and continued this work in his Ph.D. (1987 to 1992). During a three-year postdoc position at the University of Oregon, he designed and implemented the original TAU performance analysis framework. Since 1996 he has been a senior scientist at Forschungszentrum Jülich, and since 2000 the team leader of the group “Programming Environments and Performance Analysis”. Besides being responsible for user support and training with regard to performance tools at the Jülich Supercomputing Centre (JSC), he leads the Scalasca performance tools efforts in collaboration with Prof. Felix Wolf of TU Darmstadt. Since 2007, he has also served as deputy head of the JSC division “Application Support”. He was an active member of the International Exascale Software Project (IESP/BDEC) and a work package leader in the European (EESI2) and Jülich (EIC, ECL) exascale efforts. He has served on the Steering Committee of the SC and ISC conference series. He is the author of several dozen conference and journal articles about performance analysis and tuning of parallel programs.

 

Registrations are now closed.
