{"id":2508,"date":"2022-11-21T14:09:42","date_gmt":"2022-11-21T14:09:42","guid":{"rendered":"https:\/\/www.risc2-project.eu\/?p=2508"},"modified":"2022-12-12T10:49:09","modified_gmt":"2022-12-12T10:49:09","slug":"managing-data-and-machine-learning-models-in-hpc-applications","status":"publish","type":"post","link":"https:\/\/risc2-project.eu\/?p=2508","title":{"rendered":"Managing Data and Machine Learning Models in HPC Applications"},"content":{"rendered":"<p>The synergy of data science (including big data and machine learning) and HPC yields many benefits for data-intensive applications in terms of more accurate predictive data analysis and better decision making. For instance, in the context of the HPDaSc <a href=\"https:\/\/team.inria.fr\/zenith\/hpdasc\/\" target=\"_blank\" rel=\"noopener\">(High Performance Data Science)<\/a> project between Inria and Brazil, we have shown the importance of realtime analytics to make critical high-consequence decisions in HPC applications, e.g., preventing useless drilling based on a driller&#8217;s realtime data and realtime visualization of simulated data, or the effectiveness of ML to deal with scientific data, e.g., computing Probability Density Functions (PDFs) over simulated seismic data using Spark.<\/p>\n<p>However, to realize the full potential of this synergy, ML models (or models for short) must be built, combined and ensembled, which can be very complex as there can be many models to select from. 
Furthermore, models should be shared and reused, in particular across different execution environments such as HPC or Spark clusters.<\/p>\n<p>To address this problem, we proposed <span class=\"NormalTextRun SpellingErrorV2Themed SpellingErrorHighlight SCXW229846965 BCX9\">Gypscie<\/span><span class=\"NormalTextRun SCXW229846965 BCX9\"> [Porto 2022, <\/span><span class=\"NormalTextRun SCXW229846965 BCX9\">Zorrilla<\/span><span class=\"NormalTextRun SCXW229846965 BCX9\"> 2022], a new framework that supports the entire ML lifecycle and enables model reuse and import from other frameworks. The approach behind Gypscie is to combine, in a single framework, several rich capabilities for model and data management and for model execution, which are typically provided by different tools. Overall, Gypscie provides: a platform for supporting the complete model lifecycle, from model building to deployment, monitoring and policy enforcement; an environment for casual users to find ready-to-use models that best fit a particular prediction problem; an environment to optimize ML task scheduling and execution; an easy way for developers to benchmark their models against competing models and improve them; a central point of access to assess models&#8217; compliance with policies and ethics and to obtain and curate observational and predictive data; provenance information and model explainability. Finally, Gypscie interfaces with multiple execution environments to run ML tasks, e.g., an HPC system such as the Santos Dumont supercomputer at LNCC or a Spark cluster.\u00a0<\/span><\/p>\n<p>Gypscie comes with SAVIME [Silva 2020], a multidimensional array in-memory database system for importing, storing and querying model (tensor) data. The <a href=\"https:\/\/github.com\/hllustosa\/Savime\" target=\"_blank\" rel=\"noopener\">SAVIME open-source system<\/a> has been developed to support analytical queries over scientific data. 
It offers an extremely efficient ingestion procedure, which practically eliminates the waiting time to analyze incoming data. It also supports dense and sparse arrays and non-integer dimension indexing. It offers a functional query language processed by a query optimiser that generates efficient query execution plans.<\/p>\n<p>&nbsp;<\/p>\n<h5>References<\/h5>\n<p><span data-contrast=\"auto\">[Porto 2022] Fabio Porto, Patrick Valduriez: <\/span><a href=\"https:\/\/hal-lirmm.ccsd.cnrs.fr\/lirmm-03799097\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">Data and Machine Learning Model Management with Gypscie<\/span><\/a><span data-contrast=\"auto\">. CARLA 2022 &#8211; Workshop on HPC and Data Sciences meet Scientific Computing, SCALAC, Sep 2022, Porto Alegre, Brazil. pp.1-2.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">[Zorrilla 2022] Roc\u00edo Zorrilla, Eduardo Ogasawara, Patrick Valduriez, Fabio Porto: <\/span><a href=\"https:\/\/hal-lirmm.ccsd.cnrs.fr\/lirmm-03798483\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">A Data-Driven Model Selection Approach to Spatio-Temporal Prediction<\/span><\/a><span data-contrast=\"auto\">. SBBD 2022 &#8211; Brazilian Symposium on Databases, SBBD, Sep 2022, Buzios, Brazil. pp.1-12.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p><span data-contrast=\"auto\">[Silva 2020] A.C. Silva, H. Louren\u00e7o, D. Ramos, F. Porto, P. Valduriez. <\/span><a href=\"https:\/\/hal-lirmm.ccsd.cnrs.fr\/lirmm-03144324\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">Savime: An Array DBMS for Simulation Analysis and Prediction<\/span><\/a><span data-contrast=\"auto\">. 
Journal of Information Data Management 11(3), 2020.<\/span><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span data-contrast=\"auto\">By <\/span><a href=\"https:\/\/www.lncc.br\/\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">LNCC<\/span><\/a><span data-contrast=\"auto\"> and <\/span><a href=\"https:\/\/www.inria.fr\/en\" target=\"_blank\" rel=\"noopener\"><span data-contrast=\"none\">Inria<\/span><\/a><span data-ccp-props=\"{}\">\u00a0<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>The synergy of data science (including big data and machine learning) and HPC yields many benefits for data-intensive applications in terms of more accurate predictive data analysis and better decision making. For instance, in the context of the HPDaSc (High Performance Data Science) project between Inria and Brazil, we have shown the importance of realtime [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":2519,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[2],"tags":[191,227,48,226],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/risc2-project.eu\/index.php?rest_route=\/wp\/v2\/posts\/2508"}],"collection":[{"href":"https:\/\/risc2-project.eu\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/risc2-project.eu\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/risc2-project.eu\/index.php?rest_route=\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/risc2-project.eu\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2508"}],"version-history":[{"count":3,"href":"https:\/\/risc2-project.eu\/index.php?rest_route=\/wp\/v2\/posts\/2508\/revisions"}],"predecessor-version":[{"id":2616,"href":"https:\/\/risc2-project.eu\/index.php?rest_route=\/wp\/v2\/posts\/2508\/revisions\/2616"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/risc2-project.eu\/index.php?rest_r
oute=\/wp\/v2\/media\/2519"}],"wp:attachment":[{"href":"https:\/\/risc2-project.eu\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2508"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/risc2-project.eu\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2508"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/risc2-project.eu\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2508"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}