MegaScale: Scaling large language model training to more than 10,000 GPUs

Z Jiang, H Lin, Y Zhong, Q Huang, Y Chen… - … USENIX Symposium on …, 2024 - usenix.org
We present the design, implementation and engineering experience in building and
deploying MegaScale, a production system for training large language models (LLMs) at the …

Toward exascale resilience

F Cappello, A Geist, B Gropp, L Kale… - … Journal of High …, 2009 - journals.sagepub.com
Over the past few years resilience has become a major issue for high-performance
computing (HPC) systems, in particular in the perspective of large petascale systems and …

Proactive fault tolerance for HPC with Xen virtualization

AB Nagarajan, F Mueller, C Engelmann… - Proceedings of the 21st …, 2007 - dl.acm.org
Large-scale parallel computing is relying increasingly on clusters with thousands of
processors. At such large counts of compute nodes, faults are becoming commonplace …

Proactive process-level live migration in HPC environments

C Wang, F Mueller, C Engelmann… - SC'08: Proceedings of …, 2008 - ieeexplore.ieee.org
As the number of nodes in high-performance computing environments keeps increasing,
faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to …

PETS 2016: Dataset and challenge

L Patino, T Cane, A Vallee… - Proceedings of the IEEE …, 2016 - cv-foundation.org
This paper describes the datasets and computer vision challenges that form part of the PETS
2016 workshop. PETS 2016 addresses the application of on-board multi-sensor surveillance …

Performance evaluation of adaptive MPI

C Huang, G Zheng, L Kalé, S Kumar - Proceedings of the eleventh ACM …, 2006 - dl.acm.org
Processor virtualization via migratable objects is a powerful technique that enables the
runtime system to carry out intelligent adaptive optimizations like dynamic resource …

Proactive fault tolerance in MPI applications via task migration

S Chakravorty, CL Mendes, LV Kalé - International Conference on High …, 2006 - Springer
Failures are likely to be more frequent in systems with thousands of processors. Therefore,
schemes for dealing with faults become increasingly important. In this paper, we present a …

Canary: Fault-tolerant FaaS for stateful time-sensitive applications

M Arif, K Assogba, MM Rafique - … : International Conference for …, 2022 - ieeexplore.ieee.org
Function-as-a-Service (FaaS) platforms have recently gained rapid popularity. Many stateful
applications have been migrated to FaaS platforms due to their ease of deployment …

Exploit failure prediction for adaptive fault-tolerance in cluster computing

Y Li, Z Lan - Sixth IEEE International Symposium on Cluster …, 2006 - ieeexplore.ieee.org
As the scale of cluster computing grows, it is becoming hard for long-running applications to
complete without facing failures on large-scale clusters. To address this issue …

A framework for proactive fault tolerance

G Vallee, K Charoenpornwattana… - 2008 Third …, 2008 - ieeexplore.ieee.org
Fault tolerance is a major concern to guarantee availability of critical services as well as
application execution. Traditional approaches for fault tolerance include checkpoint/restart …