- Academic Search

Dras: Deep reinforcement learning for cluster scheduling in high performance computing

Y Fan, B Li, D Favorite, N Singh… - … on Parallel and …, 2022 - ieeexplore.ieee.org

Cluster schedulers are crucial in high-performance computing (HPC). They determine when
and which user jobs should be allocated to available system resources. Existing cluster …

Cited by 25

Improving HPC system performance by predicting job resources via supervised machine learning

M Tanash, B Dunn, D Andresen, W Hsu… - … and Experience in …, 2019 - dl.acm.org

High-Performance Computing (HPC) systems are resources utilized for data capture,
sharing, and analysis. The majority of our HPC users come from other disciplines than …

Cited by 65

GRAP: group-level resource allocation policy for reconfigurable Dragonfly network in HPC

G Feng, D Dong, S Zhao, Y Lu - … of the 37th International conference on …, 2023 - dl.acm.org

Dragonfly is a highly scalable, low-diameter, and cost-efficient network topology, which has
been adopted in new exascale High Performance Computing (HPC) systems. However …

Cited by 7

Exploring job running path to predict runtime on multiple production supercomputers

W Yang, X Liao, D Dong, J Yu - Journal of Parallel and Distributed …, 2023 - Elsevier

There are massive jobs submitted in the supercomputer, and the job management system is
typically deployed to schedule these jobs and allocate compute resources. FCFS (First …

Cited by 8

Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning

Q Ding, P Zheng, S Kudari, S Venkataraman… - Proceedings of the …, 2023 - dl.acm.org

Accommodating long-running deep learning (DL) training and inference jobs is challenging
on GPU clusters that use traditional batch schedulers, such as Slurm. Given fixed wall clock …

Cited by 4

Evaluating slurm simulator with real-machine slurm and vice versa

A Jokanovic, M D'Amico… - 2018 IEEE/ACM …, 2018 - ieeexplore.ieee.org

Having a precise and a fast job scheduler model that resembles the real-machine job
scheduling software behavior is extremely important in the field of job scheduling. The idea …

Cited by 30

Alea–complex job scheduling simulator

D Klusáček, M Soysal, F Suter - International Conference on Parallel …, 2019 - Springer

Using large computer systems such as HPC clusters up to their full potential can be hard.
Many problems and inefficiencies relate to the interactions of user workloads and system …

Cited by 25

Ensemble prediction of job resources to improve system performance for slurm-based hpc systems

M Tanash, H Yang, D Andresen, W Hsu - Practice and Experience in …, 2021 - dl.acm.org

In this paper, we present a novel methodology for predicting job resources (memory and
time) for submitted jobs on HPC systems. Our methodology based on historical jobs data …

Cited by 15

Optimizing Idle Power of HPC Systems: Practical Insights and Methods

T Ilsche, S Schrader, R Schöne - 2024 IEEE International …, 2024 - ieeexplore.ieee.org

Energy costs are a critical consideration for operating High-Performance Computing (HPC)
systems, with significant efforts dedicated to reducing the energy expenditure of active …

Cited by 1

Slurm simulator: Improving slurm scheduler performance on large hpc systems by utilization of multiple controllers and node sharing

NA Simakov, RL DeLeon, MD Innus… - Proceedings of the …, 2018 - dl.acm.org

A Slurm simulator was used to study the potential benefits of using multiple Slurm controllers
and node-sharing on the TACC Stampede 2 system. Splitting a large cluster into smaller sub …

Cited by 23
