- Academic Search

MegaScale: Scaling large language model training to more than 10,000 GPUs

Z Jiang, H Lin, Y Zhong, Q Huang, Y Chen… - … USENIX Symposium on …, 2024 - usenix.org

We present the design, implementation and engineering experience in building and
deploying MegaScale, a production system for training large language models (LLMs) at the …

Cited by 83

Toward exascale resilience

F Cappello, A Geist, B Gropp, L Kale… - … Journal of High …, 2009 - journals.sagepub.com

Over the past few years resilience has become a major issue for high-performance
computing (HPC) systems, in particular in the perspective of large petascale systems and …

Cited by 485

Proactive fault tolerance for HPC with Xen virtualization

AB Nagarajan, F Mueller, C Engelmann… - Proceedings of the 21st …, 2007 - dl.acm.org

Large-scale parallel computing is relying increasingly on clusters with thousands of
processors. At such large counts of compute nodes, faults are becoming commonplace …

Cited by 528

Proactive process-level live migration in HPC environments

C Wang, F Mueller, C Engelmann… - SC'08: Proceedings of …, 2008 - ieeexplore.ieee.org

As the number of nodes in high-performance computing environments keeps increasing,
faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to …

Cited by 251

PETS 2016: Dataset and challenge

L Patino, T Cane, A Vallee… - Proceedings of the IEEE …, 2016 - cv-foundation.org

This paper describes the datasets and computer vision challenges that form part of the PETS
2016 workshop. PETS 2016 addresses the application of on-board multi sensor surveillance …

Cited by 122

Performance evaluation of adaptive MPI

C Huang, G Zheng, L Kalé, S Kumar - Proceedings of the eleventh ACM …, 2006 - dl.acm.org

Processor virtualization via migratable objects is a powerful technique that enables the
runtime system to carry out intelligent adaptive optimizations like dynamic resource …

Cited by 183

Proactive fault tolerance in MPI applications via task migration

S Chakravorty, CL Mendes, LV Kalé - International Conference on High …, 2006 - Springer

Failures are likely to be more frequent in systems with thousands of processors. Therefore,
schemes for dealing with faults become increasingly important. In this paper, we present a …

Cited by 148

Canary: fault-tolerant FaaS for stateful time-sensitive applications

M Arif, K Assogba, MM Rafique - … : International Conference for …, 2022 - ieeexplore.ieee.org

Function-as-a-Service (FaaS) platforms have recently gained rapid popularity. Many stateful
applications have been migrated to FaaS platforms due to their ease of deployment …

Cited by 8

Exploit failure prediction for adaptive fault-tolerance in cluster computing

Y Li, Z Lan - Sixth IEEE International Symposium on Cluster …, 2006 - ieeexplore.ieee.org

As the scale of cluster computing grows, it is becoming hard for long-running applications to
complete without facing failures on large-scale clusters. To address this issue …

Cited by 114

A framework for proactive fault tolerance

G Vallee, K Charoenpornwattana… - 2008 Third …, 2008 - ieeexplore.ieee.org

Fault tolerance is a major concern to guarantee availability of critical services as well as
application execution. Traditional approaches for fault tolerance include checkpoint/restart …

Cited by 95

Proactive fault tolerance in large systems

MegaScale: Scaling large language model training to more than 10,000 GPUs