- Academic Search

Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale

W Won, T Heo, S Rashidi, S Sridharan… - … Analysis of Systems …, 2023 - ieeexplore.ieee.org

As deep learning models and input data continue to scale at an unprecedented rate, it has
become inevitable to move towards distributed training platforms to fit the models and …

Cited by 30

Fault and self-repair for high reliability in die-to-die interconnection of 2.5D/3D IC

R Song, J Zhang, Z Zhu, G Shan, Y Yang - Microelectronics Reliability, 2024 - Elsevier

Bringing dies closer by die-to-die interconnection is a way that reduces latency and energy
per bit transmitted, while increasing bandwidth per mm of chip. Heterogeneous integration …

Cited by 4

Heterogeneous Die-to-Die Interfaces: Enabling More Flexible Chiplet Interconnection Systems

Y Feng, D Xiang, K Ma - Proceedings of the 56th Annual IEEE/ACM …, 2023 - dl.acm.org

The chiplet architecture is one of the emerging methodologies and is believed to be scalable
and economical. However, most current multi-chiplet systems are based on one uniform die …

Cited by 11

A Survey on Performance Modeling and Prediction for Distributed DNN Training

Z Guo, Y Tang, J Zhai, T Yuan, J **… - … on Parallel and …, 2024 - ieeexplore.ieee.org

The recent breakthroughs in large-scale DNN attract significant attention from both
academia and industry toward distributed DNN training techniques. Due to the time …

Leveraging Memory Expansion to Accelerate Large-Scale DL Training

D Kadiyala, S Rashidi, T Heo… - … Analysis of Systems …, 2024 - ieeexplore.ieee.org

Modern Deep Learning (DL) models require massive clusters of specialized, high-end
nodes to train. Designing such clusters to maximize both performance and utilization is a …



COMET: A comprehensive cluster design methodology for distributed deep learning training
