I/O Scheduling is not an easy task.
Some time ago, in the era of HDD technology, our primary concern was sending data to (or receiving it from) the HDD in a well-organized manner, in large blocks, to minimize disk head movements. Every unnecessary seek was wasted time.
Numerous techniques and schedulers were developed through simulation. Accurate simulators like DiskSim (https://www.pdl.cmu.edu/DiskSim/index.shtml) reproduced disk behavior faithfully. Within the Linux kernel, we had four local I/O schedulers, none of which proved universally superior: each application (or mix of applications performing I/O concurrently) obtained better results with a different scheduler.
Nowadays, these concerns seem to have faded into the background, as SSDs handle unordered transfers effectively (though not without issues). Moreover, Parallel File Systems (PFS), essentially collections of HDDs/SSDs accessed over a high-bandwidth network, seem to mitigate these concerns further. Simulations in such environments tend to be simpler. In practice, however, how we schedule and send I/O requests to the PFS still matters.
Nevertheless, we must handle transfers carefully. In the initial stages of ADMIRE, inside the I/O Scheduler, we have agents tasked with transferring data to and from the PFS. We refer to them as “cargo”; they move data to the PFS (or retrieve it from it) as quickly as possible.
Enabling this service does not eliminate interference. If we have n cargo instances, we have n sources of interference on the network and on the disks, decreasing I/O bandwidth and increasing latency. However, moving from unordered, uncontrolled transfers to the PFS towards large, ordered data movements helps mitigate these issues. This marks a significant improvement in this stage of the workflow (stage-in / stage-out).
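As a rough, hypothetical illustration of that idea (not cargo's actual implementation; the `Request` type and `coalesce` helper below are invented for this example), a few scattered requests can be sorted and merged into one large, ordered transfer before being handed to the PFS:

```python
from dataclasses import dataclass

@dataclass
class Request:
    offset: int   # byte offset within the file
    size: int     # bytes to transfer

def coalesce(requests, max_gap=0):
    """Sort requests by offset and merge contiguous (or near-contiguous)
    ones into large, ordered transfers."""
    merged = []
    for req in sorted(requests, key=lambda r: r.offset):
        if merged and req.offset <= merged[-1].offset + merged[-1].size + max_gap:
            last = merged[-1]
            end = max(last.offset + last.size, req.offset + req.size)
            last.size = end - last.offset
        else:
            merged.append(Request(req.offset, req.size))
    return merged

# Example: three scattered 4 KiB requests become one ordered 12 KiB transfer.
reqs = [Request(8192, 4096), Request(0, 4096), Request(4096, 4096)]
print(coalesce(reqs))   # [Request(offset=0, size=12288)]
```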
Simulating this behavior is a complex task, and a pending one if we aim to develop better models for malleable I/O. Moreover, current state-of-the-art devices pose challenges: unlike in the HDD era, SSDs and NVMe devices are closed systems with little accessible information, leading to inaccurate performance models. One potential solution is to build statistical models from real devices under various workloads. Instead of simulating disk movements or different zones, the simulator could read these models and use a statistical approximation of the time required to process n concurrent requests.
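A minimal sketch of what such a model could look like (hypothetical; the measurements and function names below are invented for illustration): measure the device's effective bandwidth at several concurrency levels on real hardware, then let the simulator interpolate from that table rather than modelling the device internals:

```python
import bisect

# Hypothetical measurements from a real NVMe device:
# (concurrent requests, effective bandwidth in MiB/s).
MEASURED = [(1, 900.0), (4, 2200.0), (16, 3100.0), (64, 3300.0), (256, 3200.0)]

def effective_bandwidth(n_requests):
    """Linearly interpolate the measured bandwidth for n concurrent requests."""
    xs = [x for x, _ in MEASURED]
    ys = [y for _, y in MEASURED]
    if n_requests <= xs[0]:
        return ys[0]
    if n_requests >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_left(xs, n_requests)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (n_requests - x0) / (x1 - x0)

def estimated_time(n_requests, request_size_mib):
    """Estimated time (seconds) to process n concurrent requests of a given size."""
    return n_requests * request_size_mib / effective_bandwidth(n_requests)

print(estimated_time(32, 4))  # roughly 0.04 s with the sample table above
```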
The ADMIRE consortium devised a set of scheduling algorithms, yet a vast territory remains unexplored. This is evident in the naive cargo scheduler algorithm: due to SLURM's limitations, we can only act on cargo transfers that have already been launched, slowing them down or trying to speed them up; the decision of when to start a transfer (or whether to pause it) remains out of our control.
The algorithm (cargo-simple-scheduler) was implemented in ElastiSim (https://elastisim.github.io/). As cargo jobs are potentially malleable, we adjust the number of cores they use; in the real implementation this translates into ensuring fairness and regulating transfer speeds. Although we miss the opportunity to accelerate transfers by adding more nodes or cores, we believe this approximation is reasonable given the constraints of SLURM. The sketch below gives an idea of the underlying logic.
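The following is a simplified, standalone sketch of that fairness idea (it is not the actual cargo-simple-scheduler code and does not use the real ElastiSim API; `fair_share` and its parameters are illustrative): available cores are split evenly among the transfers that are already running, and each core share maps to a per-transfer bandwidth cap:

```python
def fair_share(running_transfers, total_cores, max_bandwidth_mib_s):
    """Split cores evenly among running (malleable) cargo transfers and
    derive a per-transfer bandwidth cap from that share.

    running_transfers: identifiers of transfers already launched; we cannot
    start or pause them, only throttle or speed them up.
    """
    if not running_transfers:
        return {}
    cores_each = max(1, total_cores // len(running_transfers))
    share = cores_each / total_cores
    return {
        transfer: {
            "cores": cores_each,
            "bandwidth_cap_mib_s": max_bandwidth_mib_s * share,
        }
        for transfer in running_transfers
    }

# Example: three concurrent transfers on a 16-core staging node with a
# 4000 MiB/s link: each gets 5 cores and roughly a third of the bandwidth.
print(fair_share(["t1", "t2", "t3"], total_cores=16, max_bandwidth_mib_s=4000))
```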
However, the PFS model inside ElastiSim lacks complexity: interference between workloads is virtually non-existent (beyond shared network and PFS bandwidth), so we should be careful when interpreting the results. Nevertheless, the I/O scheduler, the ADMIRE infrastructure and cargo all strive to minimize this interference, so in overloaded scenarios we can expect better results using cargo and organized transfers than without cargo.
This sets the stage for future work, where a more detailed PFS simulation could help us build better algorithms.