Design and performance characterization of RADICAL-Pilot on leadership-class platforms
Many extreme scale scientific applications have workloads comprised of a large number of
individual high-performance tasks. The Pilot abstraction decouples workload specification …
Machine learning assisted HPC workload trace generation for leadership scale storage systems
Monitoring and analyzing a wide range of I/O activities in an HPC cluster is important in
maintaining mission-critical performance in a large-scale, multi-user, parallel storage …
Scheduling distributed I/O resources in HPC systems
This paper presents a comprehensive investigation on optimizing I/O performance in the
access to distributed I/O resources in high-performance computing (HPC) environments. I/O …
Generic and ML Workloads in an HPC Datacenter: Node Energy, Job Failures, and Node-Job Analysis
HPC datacenters offer a backbone to the modern digital society. Increasingly, they run
Machine Learning (ML) jobs next to generic, compute-intensive workloads, supporting …
Design and evaluation of a simple data interface for efficient data transfer across diverse storage
Modern science and engineering computing environments often feature storage systems of
different types, from parallel file systems in high-performance computing centers to object …
HFlow: A dynamic and elastic multi-layered I/O forwarder
Modern applications are highly data-intensive, leading to the well-known I/O bottleneck
problem. Scientists have proposed the placement of fast intermediate storage resources …
Mobilizing underutilized storage nodes via job path: A job-aware file striping approach
G **an, W Yang, Y Tan, J Feng, Y Li, J Zhang, J Yu - Parallel Computing, 2024 - Elsevier
Users' limited understanding of the storage system architecture prevents them from fully
utilizing the parallel I/O capability of the storage system, leading to a negative impact on the …
FreeTrain: A Framework to Utilize Unused Supercomputer Nodes for Training Neural Networks
Supercomputer scheduling policies commonly result in many transient idle nodes, a
phenomenon that is only partially alleviated by backfill scheduling methods that promote …
I/O-signature-based feature analysis and classification of high-performance computing applications
The demand for high-performance computing (HPC) resources in computing fields such as
machine learning has increased significantly in recent years. Computing power has been …
Infrastructure Engineering: A Still Missing, Undervalued Role in the Research Ecosystem
V Sochat - arXiv preprint arXiv:2405.10473, 2024 - arxiv.org
Research has become increasingly reliant on software, serving as the driving force behind
bioinformatics, high performance computing, physics, machine learning and artificial …