Design and performance characterization of RADICAL-Pilot on leadership-class platforms

A Merzky, M Turilli, M Titov… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Many extreme scale scientific applications have workloads comprised of a large number of
individual high-performance tasks. The Pilot abstraction decouples workload specification …

Machine learning assisted HPC workload trace generation for leadership scale storage systems

AK Paul, JY Choi, AM Karimi, F Wang - Proceedings of the 31st …, 2022 - dl.acm.org
Monitoring and analyzing a wide range of I/O activities in an HPC cluster is important in
maintaining mission-critical performance in a large-scale, multi-user, parallel storage …

Scheduling distributed I/O resources in HPC systems

A Bandet, F Boito, G Pallez - European Conference on Parallel Processing, 2024 - Springer
This paper presents a comprehensive investigation on optimizing I/O performance in the
access to distributed I/O resources in high-performance computing (HPC) environments. I/O …

Generic and ML Workloads in an HPC Datacenter: Node Energy, Job Failures, and Node-Job Analysis

X Chu, D Hofstätter, S Ilager, S Talluri… - 2024 IEEE 30th …, 2024 - ieeexplore.ieee.org
HPC datacenters offer a backbone to the modern digital society. Increasingly, they run
Machine Learning (ML) jobs next to generic, compute-intensive workloads, supporting …

Design and evaluation of a simple data interface for efficient data transfer across diverse storage

Z Liu, R Kettimuthu, J Chung… - ACM Transactions on …, 2021 - dl.acm.org
Modern science and engineering computing environments often feature storage systems of
different types, from parallel file systems in high-performance computing centers to object …

Hflow: A dynamic and elastic multi-layered I/O forwarder

J Cernuda, H Devarajan, L Logan… - 2021 IEEE …, 2021 - ieeexplore.ieee.org
Modern applications are highly data-intensive, leading to the well-known I/O bottleneck
problem. Scientists have proposed the placement of fast intermediate storage resources …

Mobilizing underutilized storage nodes via job path: A job-aware file striping approach

G Xian, W Yang, Y Tan, J Feng, Y Li, J Zhang, J Yu - Parallel Computing, 2024 - Elsevier
Users' limited understanding of the storage system architecture prevents them from fully
utilizing the parallel I/O capability of the storage system, leading to a negative impact on the …

FreeTrain: A Framework to Utilize Unused Supercomputer Nodes for Training Neural Networks

Z Liu, R Kettimuthu, ME Papka… - 2023 IEEE/ACM 23rd …, 2023 - ieeexplore.ieee.org
Supercomputer scheduling policies commonly result in many transient idle nodes, a
phenomenon that is only partially alleviated by backfill scheduling methods that promote …

I/O-signature-based feature analysis and classification of high-performance computing applications

JW Park, X Huang, JK Lee, T Hong - Cluster Computing, 2024 - Springer
The demand for high-performance computing (HPC) resources in computing fields such as
machine learning has increased significantly in recent years. Computing power has been …

Infrastructure Engineering: A Still Missing, Undervalued Role in the Research Ecosystem

V Sochat - arXiv preprint arXiv:2405.10473, 2024 - arxiv.org
Research has become increasingly reliant on software, serving as the driving force behind
bioinformatics, high performance computing, physics, machine learning and artificial …