Accelerating collective communication in data parallel training across deep learning frameworks

J Romero, J Yin, N Laanait, B Xie, MT Young… - … USENIX Symposium on …, 2022 - usenix.org
This work develops new techniques within Horovod, a generic communication library
supporting data parallel training across deep learning frameworks. In particular, we improve …

High-Quality I/O Bandwidth Prediction with Minimal Data via Transfer Learning Workflow

D Povaliaiev, R Liem, J Kunkel… - 2024 IEEE 36th …, 2024 - ieeexplore.ieee.org
Providing a high-quality performance prediction has the potential to enhance various
aspects of a cluster, such as devising scheduling and provisioning policies, guiding …

Design and implementation of I/O performance prediction scheme on HPC systems through large-scale log analysis

S Kim, A Sim, K Wu, S Byna, Y Son - Journal of Big Data, 2023 - Springer
Large-scale high performance computing (HPC) systems typically consist of many
thousands of CPUs and storage units used by hundreds to thousands of users …

SCTuner: An autotuner addressing dynamic I/O needs on supercomputer I/O subsystems

H Tang, B Xie, S Byna, P Carns… - 2021 IEEE/ACM …, 2021 - ieeexplore.ieee.org
In high-performance computing (HPC), scientific applications often manage a massive
amount of data using I/O libraries. These libraries provide convenient data model …

AIIO: Using Artificial Intelligence for Job-Level and Automatic I/O Performance Bottleneck Diagnosis

B Dong, JL Bez, S Byna - … of the 32nd International Symposium on High …, 2023 - dl.acm.org
Manually diagnosing the I/O performance bottleneck for a single application (hereinafter
referred to as the "job level") is a tedious and error-prone procedure requiring domain …

Battle of the defaults: Extracting performance characteristics of HDF5 under production load

B Xie, H Tang, S Byna, J Hanley… - 2021 IEEE/ACM 21st …, 2021 - ieeexplore.ieee.org
Popular parallel I/O libraries, such as HDF5, provide tuning parameters to obtain superior
performance. However, the selection of effective parameters on production systems is …

I/O-signature-based feature analysis and classification of high-performance computing applications

JW Park, X Huang, JK Lee, T Hong - Cluster Computing, 2024 - Springer
The demand for high-performance computing (HPC) resources in computing fields such as
machine learning has increased significantly in recent years. Computing power has been …

I/O Behind the Scenes: Bandwidth Requirements of HPC Applications with Asynchronous I/O

A Tarraf, JF Muñoz, DE Singh, T Ozden… - 2024 IEEE …, 2024 - ieeexplore.ieee.org
I/O bandwidth is a critical resource in an HPC cluster. As with all shared resources, its
availability is impacted significantly by the users and the applications they execute. Without …

Report for the ASCR Workshop on the Management and Storage of Scientific Data

S Byna, S Idreos, T Jones, K Mohror, R Ross, F Rusu - 2022 - osti.gov
The purpose of this workshop is to identify priority research directions in the area of data
management for high-performance and scientific computing above and beyond HPC's …

Design and implementation of I/O performance prediction scheme on HPC systems

S Kim, A Sim, K Wu - Journal of Big Data, 10 (1), 2023 - escholarship.org
Large-scale high performance computing (HPC) systems typically consist of many
thousands of CPUs and storage units used by hundreds to thousands of users …