SC23 BoF: Enabling I/O and Computation Malleability in High-Performance Computing – Adaptative Multi-tier Intelligent data manager for Exascale

The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC23)

Nov 12–17, 2023 • Denver, CO.

BoF Proposal.

Description: The traditional focus on increasing the parallelism of individual jobs in HPC systems is being challenged by the variety and dynamicity of the resource demands that jobs, both applications and workflows, exhibit at runtime. Malleability techniques can help adapt resource usage dynamically to achieve maximum efficiency by adjusting the computation and storage needs of applications, on the one side, and the allocation of hardware resources to them, on the other, when applications enter execution phases requiring fewer, more, or different computational or storage resources than those currently allocated. Malleable HPC systems, however, face a series of fundamental research challenges, such as resource management, scheduling, malleability control, flexibilisation of application structures, and data movement. All of these issues will be addressed in the proposed Birds of a Feather session, which aims at building a community of developers and users around the topic of malleability in high-performance computing, networking, and storage.

Goal: The goal of this BoF session is to discuss malleability techniques and their impact on applications and systems. We will use the BoF to solicit input from interested parties to drive the development of future academic and commercial solutions aimed at supporting malleability in computing and I/O, with the final objective of including them in standards, such as MPI or PMIx.

Topics: Malleable systems face a series of fundamental research questions, including: Who initiates changes? How are they communicated to applications? How can the optimal usage of the available resources be determined? How can applications cope with dynamically changing resources? What should malleable programming models and abstractions look like? How should scalable resource management frameworks for malleable systems be designed? Which resources may benefit from malleability, and which (if any) should still be managed statically? To advance towards solutions to these challenges, the BoF session will focus on the following topics of discussion: system architecture considerations to enable efficient implementation of malleability; runtimes, parallel programming models and techniques, and libraries supporting malleability; malleable scheduling and load distribution considering multicriteria aspects, such as computing, I/O, fault tolerance, and energy efficiency; potential usage of AI techniques to steer malleability in systems and applications; and support for malleable applications in performance, debugging, and correctness tools.

Format: We plan to invite a small set of speakers to present their thoughts on the needs and challenges of malleability, following the questions posed above. We will give priority to inviting speakers from underrepresented minorities and/or underrepresented or minority-serving institutions. The presentation part will take up at most half of the BoF; the remaining time will be used for open, panel-style discussion between the speakers and the audience.

Expected HPC audience: To address the aforementioned challenges, this BoF session will bring together researchers from diverse areas of HPC, AI, and data processing who are impacted by or actively pursuing malleability and dynamic resource concepts, from application developers to system software researchers and system architects. The BoF will be announced directly through different channels, including all EuroHPC projects known to us that work on malleability and dynamic resources, the teams behind different in-production schedulers (SLURM/OAR), and the researchers involved in the planned extension of the Flux scheduler for dynamic resources. The BoF will provide a forum for lively discussion among researchers and vendors working in HPC and pursuing the concepts of and around malleability. Experiences and use cases applying malleability to HPC applications and runtimes are especially welcome in the discussion.

Expected outcome: Identify challenges and future perspectives for supporting I/O and computation malleability in HPC, and propose solutions to them. Strengthen existing international research collaborations on malleability topics and establish new ones. Push the definition of a roadmap for the adoption of standards specific to malleability. Raise awareness of this topic in the HPC community.

Organizing committee:

Prof. Jesus Carretero, Universidad Carlos III of Madrid, Spain
Prof. Martin Schreiber, Université Grenoble Alpes, France
Prof. Martin Schulz, Technical University Munich, Germany
Prof. Estela Suarez, Forschungszentrum Juelich & University of Bonn, Germany
Dr. Antonio J. Peña, Barcelona Supercomputing Center, Spain
Dr. Tapasya Patki, Lawrence Livermore National Laboratory, USA

Keywords:

Malleability
Resource management
Dynamic allocation
