MegaScale: Scaling large language model training to more than 10,000 GPUs
We present the design, implementation and engineering experience in building and
deploying MegaScale, a production system for training large language models (LLMs) at the …
Toward exascale resilience
Over the past few years resilience has become a major issue for high-performance
computing (HPC) systems, in particular in the perspective of large petascale systems and …
Proactive fault tolerance for HPC with Xen virtualization
Large-scale parallel computing is relying increasingly on clusters with thousands of
processors. At such large counts of compute nodes, faults are becoming commonplace …
Proactive process-level live migration in HPC environments
As the number of nodes in high-performance computing environments keeps increasing,
faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to …
PETS 2016: Dataset and challenge
This paper describes the datasets and computer vision challenges that form part of the PETS
2016 workshop. PETS 2016 addresses the application of on-board multi sensor surveillance …
Performance evaluation of adaptive MPI
Processor virtualization via migratable objects is a powerful technique that enables the
runtime system to carry out intelligent adaptive optimizations like dynamic resource …
Proactive fault tolerance in MPI applications via task migration
Failures are likely to be more frequent in systems with thousands of processors. Therefore,
schemes for dealing with faults become increasingly important. In this paper, we present a …
Canary: fault-tolerant FaaS for stateful time-sensitive applications
Function-as-a-Service (FaaS) platforms have recently gained rapid popularity. Many stateful
applications have been migrated to FaaS platforms due to their ease of deployment …
Exploit failure prediction for adaptive fault-tolerance in cluster computing
Y Li, Z Lan - Sixth IEEE International Symposium on Cluster …, 2006 - ieeexplore.ieee.org
As the scale of cluster computing grows, it is becoming hard for long-running applications to
complete without facing failures on large-scale clusters. To address this issue …
A framework for proactive fault tolerance
G Vallee, K Charoenpornwattana… - 2008 Third …, 2008 - ieeexplore.ieee.org
Fault tolerance is a major concern to guarantee availability of critical services as well as
application execution. Traditional approaches for fault tolerance include checkpoint/restart …