DRAS: Deep reinforcement learning for cluster scheduling in high performance computing

Y Fan, B Li, D Favorite, N Singh… - … on Parallel and …, 2022 - ieeexplore.ieee.org
Cluster schedulers are crucial in high-performance computing (HPC). They determine when
and which user jobs should be allocated to available system resources. Existing cluster …

Improving HPC system performance by predicting job resources via supervised machine learning

M Tanash, B Dunn, D Andresen, W Hsu… - … and Experience in …, 2019 - dl.acm.org
High-Performance Computing (HPC) systems are resources utilized for data capture,
sharing, and analysis. The majority of our HPC users come from other disciplines than …

GRAP: group-level resource allocation policy for reconfigurable Dragonfly network in HPC

G Feng, D Dong, S Zhao, Y Lu - … of the 37th International conference on …, 2023 - dl.acm.org
Dragonfly is a highly scalable, low-diameter, and cost-efficient network topology, which has
been adopted in new exascale High Performance Computing (HPC) systems. However …

Exploring job running path to predict runtime on multiple production supercomputers

W Yang, X Liao, D Dong, J Yu - Journal of Parallel and Distributed …, 2023 - Elsevier
There are massive jobs submitted in the supercomputer, and the job management system is
typically deployed to schedule these jobs and allocate compute resources. FCFS (First …

Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning

Q Ding, P Zheng, S Kudari, S Venkataraman… - Proceedings of the …, 2023 - dl.acm.org
Accommodating long-running deep learning (DL) training and inference jobs is challenging
on GPU clusters that use traditional batch schedulers, such as Slurm. Given fixed wall clock …

Evaluating Slurm simulator with real-machine Slurm and vice versa

A Jokanovic, M D'Amico… - 2018 IEEE/ACM …, 2018 - ieeexplore.ieee.org
Having a precise and a fast job scheduler model that resembles the real-machine job
scheduling software behavior is extremely important in the field of job scheduling. The idea …

Alea–complex job scheduling simulator

D Klusáček, M Soysal, F Suter - International Conference on Parallel …, 2019 - Springer
Using large computer systems such as HPC clusters up to their full potential can be hard.
Many problems and inefficiencies relate to the interactions of user workloads and system …

Ensemble prediction of job resources to improve system performance for Slurm-based HPC systems

M Tanash, H Yang, D Andresen, W Hsu - Practice and Experience in …, 2021 - dl.acm.org
In this paper, we present a novel methodology for predicting job resources (memory and
time) for submitted jobs on HPC systems. Our methodology based on historical jobs data …

Optimizing Idle Power of HPC Systems: Practical Insights and Methods

T Ilsche, S Schrader, R Schöne - 2024 IEEE International …, 2024 - ieeexplore.ieee.org
Energy costs are a critical consideration for operating High-Performance Computing (HPC)
systems, with significant efforts dedicated to reducing the energy expenditure of active …

Slurm simulator: Improving Slurm scheduler performance on large HPC systems by utilization of multiple controllers and node sharing

NA Simakov, RL DeLeon, MD Innus… - Proceedings of the …, 2018 - dl.acm.org
A Slurm simulator was used to study the potential benefits of using multiple Slurm controllers
and node-sharing on the TACC Stampede 2 system. Splitting a large cluster into smaller sub …