I/O access patterns in HPC applications: A 360-degree survey

JL Bez, S Byna, S Ibrahim - ACM Computing Surveys, 2023 - dl.acm.org
The high-performance computing I/O stack has been complex due to multiple software
layers, the inter-dependencies among these layers, and the different performance tuning …

Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product

M Zhao, N Agarwal, A Basant, B Gedik, S Pan… - Proceedings of the 49th …, 2022 - dl.acm.org
Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators
(DSA) are used to train increasingly complex deep learning models. These clusters rely on a …

Analyzing and mitigating data stalls in DNN training

J Mohan, A Phanishayee, A Raniwala… - arXiv preprint arXiv …, 2020 - arxiv.org
Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While
prior research has explored many different ways of reducing DNN training time, the impact of …

Fluid: Dataset abstraction and elastic acceleration for cloud-native deep learning training jobs

R Gu, K Zhang, Z Xu, Y Che, B Fan… - 2022 IEEE 38th …, 2022 - ieeexplore.ieee.org
Nowadays, it is prevalent to train deep learning (DL) models in cloud-native platforms that
actively leverage containerization and orchestration technologies for high elasticity, low and …

SHADE: Enable fundamental cacheability for distributed deep learning training

RIS Khan, AH Yazdani, Y Fu, AK Paul, B Ji… - … USENIX Conference on …, 2023 - usenix.org
Deep learning training (DLT) applications exhibit unique I/O workload behaviors that pose
new challenges for storage system design. DLT is I/O intensive since data samples need to …

Clairvoyant prefetching for distributed machine learning I/O

N Dryden, R Böhringer, T Ben-Nun… - Proceedings of the …, 2021 - dl.acm.org
I/O is emerging as a major bottleneck for machine learning training, especially in distributed
environments. Indeed, at large scale, I/O takes as much as 85% of training time. Addressing …

Quiver: An informed storage cache for deep learning

AV Kumar, M Sivathanu - 18th USENIX Conference on File and Storage …, 2020 - usenix.org
We introduce Quiver, an informed storage cache for deep learning training (DLT) jobs in a
cluster of GPUs. Quiver employs domain-specific intelligence within the caching layer, to …

I/O characterization and performance evaluation of BeeGFS for deep learning

F Chowdhury, Y Zhu, T Heer, S Paredes… - Proceedings of the 48th …, 2019 - dl.acm.org
Parallel File Systems (PFSs) are frequently deployed on leadership High Performance
Computing (HPC) systems to ensure efficient I/O, persistent storage and scalable …

DeepFreeze: Towards scalable asynchronous checkpointing of deep learning models

B Nicolae, J Li, JM Wozniak, G Bosilca… - 2020 20th IEEE/ACM …, 2020 - ieeexplore.ieee.org
In the age of big data, deep learning has emerged as a powerful tool to extract insight and
exploit its value, both in industry and scientific applications. One common pattern emerging …

Why globally re-shuffle? Revisiting data shuffling in large scale deep learning

TT Nguyen, F Trahay, J Domke, A Drozd… - 2022 IEEE …, 2022 - ieeexplore.ieee.org
Stochastic gradient descent (SGD) is the most prevalent algorithm for training Deep Neural
Networks (DNN). SGD iterates the input data set in each training epoch processing data …