Deep Learning and Dynamic Resources

Applications: In the ADMIRE project, the dynamic allocation of resources to jobs is crucial. This is particularly true for applications that train Deep Learning (DL) models on large datasets. DL has enabled advances in applications across disciplines such as physics and medicine, reaching unprecedented performance compared to traditional Machine Learning. Remote sensing: One …

Quality of Service at Scale

The ADMIRE approach: In large HPC systems, the common assumption that resource sharing can result from self-organization in the name of a common good tends not to match observation. Fig. 2: Resource sharing as hoped, and as observed on large HPC systems. The inherent complexity of HPC systems, the difficulties for end-users to get …

Towards I/O monitoring at scale

Designing a self-tuning I/O environment in HPC. I/O Challenges in HPC: In High-Performance Computing (HPC), data movement is one of the biggest challenges. Indeed, large computations necessarily produce large datasets. Current HPC workflows favor a feed-forward way of launching programs, loading their dataset, and then storing the result in persistent …

Closing the loop: from Observation to Action

Performance monitoring and observation are a requirement in the complex IT systems we are building nowadays. Exascale systems are digital factories operating with millions of cores and discrete components. Like any factory, these systems are instrumented and monitored. Performance observation faces three main challenges: Operating at scale: The ADMIRE monitoring infrastructure uses Prometheus as …