hpc administrators - RISC2 Project https://www.risc2-project.eu Mon, 11 Sep 2023 15:02:49 +0000

HPC, Data & Architecture Week https://www.risc2-project.eu/events/hpc-data-architecture-week/ Mon, 30 Jan 2023 12:00:26 +0000

RISC2 is supporting the organization of the HPC, Data & Architecture Week, which will take place between March 13-17, 2023.

It is a comprehensive school covering High Performance Computing (HPC), new computing architectures and the processing of large volumes of data. Its five simultaneous tracks address the different training needs of undergraduate and postgraduate students, as well as cluster administrators.

The school aims to reach both users and administrators of HPC equipment in the country. The organizers plan to give specific technical courses for each audience, along with common activities that promote interaction between them.

Find more information about the school here.

The post HPC, Data & Architecture Week first appeared on RISC2 Project.

First School of HPC Administrators in Latin America and the Caribbean: A space for the formation of computational thinking https://www.risc2-project.eu/2022/10/31/first-school-of-hpc-administrators-in-latin-america-and-the-caribbean-a-space-for-the-formation-of-computational-thinking/ Mon, 31 Oct 2022 09:33:11 +0000

Of the world's top 500 high-performance computing systems, only 6 are located in Latin America. This makes evident the need to develop and pool technological efforts, which are often relegated to second place because of social and economic issues. HPC tools are used for economic, demographic, weather and social analysis, and even to save lives when applied to medicine, with a direct impact on science-based decision making.

The NLHPC staff have made it a fundamental pillar to focus their efforts on the scientific community and to show HPC as an essential tool for the country's development by attracting users from diverse scientific areas, industry and the public sector. This entails breaking down the barriers to accessing this kind of technology. NLHPC faces this challenge by providing training in the basic use of HPC and in scientific software optimization, which is key to making good use of resources.

The training was carried out within a framework of computational thinking, understood as the process by which individuals, through their professional experience and acquired knowledge, manage to face problems of different kinds. This was evident in our active participation in solving the proposed activities, which enhanced our abstraction and engineering thinking. We will certainly take this vision of education and collaborative work to our professional environment, in the different roles we play as HPC administrators, teachers and students.

The proper use of computing services involves efforts to perform monitoring, control and infrastructure management tasks. With the help of the tools reviewed during our visit, we will be able to provide our users with the highest standards of quality, security and accessibility.

The joint effort of the RISC2 and EU-CELAC ResInfra projects made it possible for engineers from Colombia, Mexico and Peru to participate in this HPC management course, learn about Chilean culture, and gain knowledge and valuable contacts for our profession.

After this great experience, we hope that in the near future other supercomputing centers will replicate this type of initiative in other parts of the world, building more bridges of communication between HPC administrators from different places and sharing knowledge and experiences.

We are left with the milestone of having been part of the First School of HPC Administrators of Latin America and the Caribbean, with experiences that made us grow professionally, academically, and personally, and with alliances among colleagues who are now friends: a support network of brothers from the same region.

We conclude by thanking Rafael Mayo of CIEMAT for the initiative; Ginés Guerrero, Pedro Schürmann, Eugenio Guerra, Pablo Flores, Angelo Guajardo, Esteban Osorio and José Morales for the knowledge and experiences shared; and RISC2 and EU-CELAC ResInfra for providing us with this learning opportunity and supporting the scholarship grant.

By:

Miguel Angel Barrera Arbelaez, Universidad de los Andes, Colombia

Carlos Enrique Mosquera Trujillo, Centro de bioinformática y biología computacional de Colombia BIOS, Colombia

César Alexander Bernal Díaz, Universidad Industrial de Santander, Colombia

Eduardo Romero Arzate, Universidad Autónoma Metropolitana, México

Ronald Darwin Apaza Veliz, Universidad Nacional de San Agustín, Perú

Joel Gonzalez Lara, Centro de Análisis de Datos y Supercómputo, México

The post First School of HPC Administrators in Latin America and the Caribbean: A space for the formation of computational thinking first appeared on RISC2 Project.

RISC2 supported the first school of HPC Administrators in Latin America and Caribe https://www.risc2-project.eu/2022/10/31/risc2-supported-the-first-school-of-hpc-administrators-in-latin-america-and-caribe/ Mon, 31 Oct 2022 09:15:05 +0000

The National Laboratory of High Performance Computing (NLHPC), our partner from Chile, was responsible for the first school of HPC Administrators in Latin America and the Caribbean. RISC2, in a joint effort with EU-CELAC ResInfra, supported the travel costs of 6 engineers to participate in the school, which took place between October 17 and 28, 2022, in Santiago de Chile.

The school aimed to train HPC sysadmins in the latest supercomputing technologies through a two-week program. It covered topics such as compilation, visualization and monitoring tools, networking, security tools, and the installation, configuration and use of SLURM and EasyBuild, among many others.

According to Ginés Guerrero, the Executive Director of the NLHPC and one of the organizers of this training, “the NLHPC team wanted to pass on the knowledge gained for more than a decade to other administrators, so they can benefit from our experience. This has involved a great effort by a team of 7 engineers, putting aside all their tasks for several weeks to prepare an intensive 64-hour school from scratch. In addition, this process has been tailor-made, since the students indicated their own interests through a form.”

In total, the event had 8 participants from various countries: 2 from Mexico, 3 from Colombia, 2 from Chile, and 1 from Peru. It leveraged international networking opportunities and promoted closer relations between the administrators of various supercomputing centers in Latin America, the main goal of the RISC2 project. A team of 7 engineers from NLHPC (Ginés Guerrero, Pedro Schürmann, Eugenio Guerra, Pablo Flores, Ángelo Guajardo, Esteban Osorio, and José Morales) was responsible for delivering all 35 lectures.

The post RISC2 supported the first school of HPC Administrators in Latin America and Caribe first appeared on RISC2 Project.

First School of HPC Administrators in Latin America and Caribe https://www.risc2-project.eu/events/first-school-of-hpc-administrators-in-latina-america-and-caribe/ Thu, 13 Oct 2022 09:24:18 +0000 https://www.risc2-project.eu/?post_type=mec-events&p=2475

The post First School of HPC Administrators in Latin America and Caribe first appeared on RISC2 Project.

HPC meets AI and Big Data https://www.risc2-project.eu/2022/10/06/hpc-meets-ai-and-big-data/ Thu, 06 Oct 2022 08:23:34 +0000

HPC services are no longer solely targeted at highly parallel modelling and simulation tasks. Indeed, the computational power offered by these services is now being used to support data-centric Big Data and Artificial Intelligence (AI) applications. By combining both types of computational paradigms, HPC infrastructures will be key for improving the lives of citizens, speeding up scientific breakthroughs in different fields (e.g., health, IoT, biology, chemistry, physics), and increasing the competitiveness of companies [OG+15, NCR+18].

As the utility and usage of HPC infrastructures increase, more computational and storage power is required to efficiently handle the growing number of targeted applications. In fact, many HPC centers are now aiming at exascale supercomputers supporting at least one exaFLOPS (10^18 operations per second), which represents a thousandfold increase in processing power over the first petascale computer deployed in 2008 [RD+15]. Although this is a necessary requirement for handling the increasing number of HPC applications, several outstanding challenges still need to be tackled so that this extra computational power can be fully leveraged.
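The thousandfold figure follows directly from the unit prefixes: exa denotes 10^18 and peta denotes 10^15. A minimal Python sketch of that arithmetic, in which the constant and function names are illustrative and not taken from any of the cited sources:

```python
# Peak performance orders of magnitude, in floating-point operations per second.
PETAFLOPS = 10**15  # petascale
EXAFLOPS = 10**18   # exascale


def speedup(target_flops: int, baseline_flops: int) -> float:
    """Return how many times faster the target is than the baseline."""
    return target_flops / baseline_flops


ratio = speedup(EXAFLOPS, PETAFLOPS)
print(f"exascale / petascale = {ratio:,.0f}x")  # prints "exascale / petascale = 1,000x"
```

This compares nominal peak rates only. It says nothing about the performance an application actually sustains, which is the subject of the challenges discussed next.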

Management of large infrastructures and heterogeneous workloads: Adding more compute and storage nodes also increases the complexity of the overall distributed HPC infrastructure, making it harder to monitor and manage. This complexity is compounded by the need to support highly heterogeneous applications that translate into different workloads with specific data storage and processing needs [ECS+17]. For example, on the one hand, traditional scientific modeling and simulation tasks require large slices of computational time, are CPU-bound, and rely on iterative approaches (parametric/stochastic modeling). On the other hand, data-driven Big Data applications involve shorter computational tasks that are I/O-bound and, in some cases, have real-time response requirements (i.e., they are latency-oriented). Also, many of these applications leverage AI and machine learning tools that require specific hardware (e.g., GPUs) in order to be efficient.

Support for general-purpose analytics: The increased heterogeneity also demands that HPC infrastructures support general-purpose AI and Big Data applications that were not designed explicitly to run on specialised HPC hardware [KWG+13]. Developers should therefore not be required to significantly change their applications in order to execute them efficiently on HPC clusters.

Avoiding the storage bottleneck: Increasing computational power and improving the management of HPC infrastructures may still not be enough to fully harness the capabilities of these infrastructures. In fact, Big Data and AI applications are data-driven and require efficient data storage and retrieval from HPC clusters. With an increasing number of applications and heterogeneous workloads, the storage systems supporting HPC may easily become a bottleneck [YDI+16, ECS+17]. Indeed, as pointed out by several studies, storage access time is one of the major bottlenecks limiting the efficiency of current and next-generation HPC infrastructures.

In order to address these challenges, RISC2 partners are exploring:

New monitoring and debugging tools that can aid in the analysis of complex AI and Big Data workloads in order to pinpoint potential performance and efficiency bottlenecks, while helping system administrators and developers troubleshoot them [ENO+21].

Emerging virtualization technologies, such as containers, that enable users to efficiently deploy and execute traditional AI and Big Data applications in an HPC environment without requiring any changes to their source code [FMP21].

The Software-Defined Storage paradigm, in order to improve the Quality-of-Service (QoS) of HPC storage services when supporting hundreds to thousands of data-intensive AI and Big Data applications [DLC+22, MTH+22].

To sum up, these three research goals, and respective contributions, will enable the next generation of HPC infrastructures and services that can efficiently meet the demands of Big Data and AI workloads. 

 

References

[DLC+22] Dantas, M., Leitão, D., Cui, P., Macedo, R., Liu, X., Xu, W., Paulo, J., 2022. Accelerating Deep Learning Training Through Transparent Storage Tiering. IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid)  

[ECS+17] Joseph, E., Conway, S., Sorensen, B., Thorp, M., 2017. Trends in the Worldwide HPC Market (Hyperion Presentation). HPC User Forum at HLRS.  

[FMP21] Faria, A., Macedo, R., Paulo, J., 2021. Pods-as-Volumes: Effortlessly Integrating Storage Systems and Middleware into Kubernetes. Workshop on Container Technologies and Container Clouds (WoC’21). 

[KWG+13] Katal, A., Wazid, M. and Goudar, R.H., 2013. Big data: issues, challenges, tools and good practices. International conference on contemporary computing (IC3). 

[NCR+18] Netto, M.A., Calheiros, R.N., Rodrigues, E.R., Cunha, R.L. and Buyya, R., 2018. HPC cloud for scientific and business applications: Taxonomy, vision, and research challenges. ACM Computing Surveys (CSUR). 

[MTH+22] Macedo, R., Tanimura, Y., Haga, J., Chidambaram, V., Pereira, J., Paulo, J., 2022. PAIO: General, Portable I/O Optimizations With Minor Application Modifications. USENIX Conference on File and Storage Technologies (FAST). 

[OG+15] Osseyran, A. and Giles, M. eds., 2015. Industrial applications of high-performance computing: best global practices. 

[RD+15] Reed, D.A. and Dongarra, J., 2015. Exascale computing and big data. Communications of the ACM. 

[ENO+21] Esteves, T., Neves, F., Oliveira, R., Paulo, J., 2021. CaT: Content-aware Tracing and Analysis for Distributed Systems. ACM/IFIP Middleware conference (Middleware). 

[YDI+16] Yildiz, O., Dorier, M., Ibrahim, S., Ross, R. and Antoniu, G., 2016, May. On the root causes of cross-application I/O interference in HPC storage systems. IEEE International Parallel and Distributed Processing Symposium (IPDPS). 

 

By INESC TEC

The post HPC meets AI and Big Data first appeared on RISC2 Project.

Webinar: HPC system and job monitoring with LLview https://www.risc2-project.eu/events/webinar-4-hpc-system-and-job-monitoring-with-llview/ Tue, 26 Jul 2022 12:39:25 +0000

Date: December 7, 2022 | 4 p.m. (UTC)

Speakers: Vitor Silva and Filipe Guimarães, Jülich Supercomputing Centre

Moderator: Esteban Mocskos, Universidad de Buenos Aires

Check the speakers’ presentation slides here. 

LLview is a monitoring infrastructure developed by the Jülich Supercomputing Centre with the aim of providing an easy-to-use and adaptable software suite for monitoring High Performance Computing systems. With the emergence of large heterogeneous machines in the exascale range, the challenges of monitoring such huge systems increase significantly. To address them, LLview is under continuous development so that it works across a wide range of hardware systems and software interfaces with negligible overhead, while providing fast, reliable access to job reports, system-wide monitoring data, and real-time system information. That information is provided to system users, project advisors, support teams and system administrators. It helps with managing jobs and identifying performance issues at many levels, and helps system administrators find failures and system malfunctions. This webinar gives an overview of the different LLview components and their interaction with each other and the system. Particular attention is given to the system monitoring views and the job reporting features, as they make it possible to trace the entire life cycle of a job and can help identify problems and bottlenecks at a very early stage.

 

About the Speakers:

Vitor Silva received his Computer Science degree from Universidade Federal de Minas Gerais. He earned his M.Sc. in Systems and Computer Engineering from Universidade Federal do Rio de Janeiro and later received his Ph.D. in Nuclear Engineering from Universidade Federal de Minas Gerais. He worked as a software developer in the digital image processing field, but spent most of his career in Nuclear Engineering, mainly working on computer modeling and solving neutronics and thermal-hydraulics problems related to nuclear reactors. He was also the main administrator of a small cluster system installed from scratch. Since 2021 he has been working at the Jülich Supercomputing Centre on monitoring tools and simulation.

Filipe Guimarães is a computational physicist. He holds a degree, an M.Sc. and a Ph.D. in Physics from the Universidade Federal Fluminense. He has been working with High Performance Computing since 2014, initially as a user, and moved to the support side in 2020. Since then, one of his focuses has been improving the monitoring tools used and developed at the Jülich Supercomputing Centre.

About the Moderator: Esteban Mocskos is a full-time professor at Universidad de Buenos Aires (UBA) and a researcher at the Center for Computer Simulation (CSC-CONICET). He received his Ph.D. in Computer Science from UBA in 2008 and was a postdoc in the Protein Modelling group at UBA. His research interests include distributed systems and blockchain, computer networks, processor architecture, and parallel programming. He is part of the steering committee of the Latin-American HPC CARLA conference and one of the committee members of Argentina’s National HPC system.

The post Webinar: HPC system and job monitoring with LLview first appeared on RISC2 Project.
