1+1 is not 2 in I/O: Interference Mitigation

I/O scheduling is not an easy task. Some time ago, in the realm of HDD technology, our primary concern was sending data to (or receiving it from) the HDD in a well-organized manner, in large blocks, to minimize disk movements. Every disk movement was a wasted time slot. Numerous techniques and schedulers were developed through simulations. Accurate … Read more

Modeling I/O Performance With Extra-P

In high-performance computing (HPC), large scientific applications are usually executed on huge clusters with a vast number of resources. Twice a year, the TOP500 list presents the top performers in this field. As these systems become more complex and powerful, so do the applications across various domains (e.g., fluid dynamics, molecular dynamics, and … Read more

Deep Learning and Dynamic Resources

Applications: In the ADMIRE project, the dynamic allocation of resources to jobs is crucial. This is particularly true for applications that involve training Deep Learning (DL) models on large datasets. DL has unleashed advances in applications from various disciplines, such as physics or medicine, reaching unprecedented performance compared to traditional Machine Learning. Remote sensing: One … Read more

Quality of Service at Scale

The ADMIRE approach. In large HPC systems, the common assumption that resource sharing can result from self-organization in the name of a common good tends not to match observation. (Fig. 2: Resource sharing as hoped, and as observed on a large HPC system.) The inherent complexity of HPC systems, the difficulties for end-users to get … Read more

Towards I/O monitoring at scale

Designing a self-tuning I/O environment in HPC. I/O Challenges in HPC: In High-Performance Computing (HPC), data movement is one of the biggest challenges. Indeed, large computations necessarily lead to large datasets. Current HPC workflows favor a feed-forward way of launching programs, loading their dataset, and then storing the result in persistent … Read more

Closing the loop: from Observation to Action

Performance monitoring and observation are a requirement in the complex IT systems we are building nowadays. Exascale systems are digital factories operating with millions of cores and discrete components. Like any factory, these systems are instrumented and monitored. Performance observation faces three main challenges. Operating at scale: The ADMIRE monitoring infrastructure uses Prometheus as … Read more