DataStates-LLM: Lazy asynchronous checkpointing for large language models

A Maurya, R Underwood, MM Rafique… - Proceedings of the 33rd …, 2024 - dl.acm.org
LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-
performance computing (HPC) infrastructures and ingest massive amounts of input data …

Combining Compression and Prefetching to Improve Checkpointing for Inverse Seismic Problems in GPUs

T Maltempi, S Rigo, M Pereira, H Yviquel… - … Conference on Parallel …, 2024 - Springer
Inverse problems are crucial in various scientific and engineering fields requiring intricate
mathematical and computational modeling. An example of such a problem is the Full …

Scalable Access-Pattern Aware I/O Acceleration and Multi-Tiered Data Management for HPC and AI Workloads

A Maurya - 2024 - search.proquest.com
The exponential growth of data-intensive scientific simulations and deep learning workloads
presents significant challenges for high-performance computing (HPC) systems. These …