ASTRA-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale

W Won, T Heo, S Rashidi, S Sridharan… - … Analysis of Systems …, 2023 - ieeexplore.ieee.org
As deep learning models and input data continue to scale at an unprecedented rate, it has
become inevitable to move towards distributed training platforms to fit the models and …

Fault and self-repair for high reliability in die-to-die interconnection of 2.5D/3D IC

R Song, J Zhang, Z Zhu, G Shan, Y Yang - Microelectronics Reliability, 2024 - Elsevier
Bringing dies closer by die-to-die interconnection is a way that reduces latency and energy
per bit transmitted, while increasing bandwidth per mm of chip. Heterogeneous integration …

Heterogeneous Die-to-Die Interfaces: Enabling More Flexible Chiplet Interconnection Systems

Y Feng, D Xiang, K Ma - Proceedings of the 56th Annual IEEE/ACM …, 2023 - dl.acm.org
The chiplet architecture is one of the emerging methodologies and is believed to be scalable
and economical. However, most current multi-chiplet systems are based on one uniform die …

A Survey on Performance Modeling and Prediction for Distributed DNN Training

Z Guo, Y Tang, J Zhai, T Yuan, J **… - … on Parallel and …, 2024 - ieeexplore.ieee.org
The recent breakthroughs in large-scale DNN attract significant attention from both
academia and industry toward distributed DNN training techniques. Due to the time …

Leveraging Memory Expansion to Accelerate Large-Scale DL Training

D Kadiyala, S Rashidi, T Heo… - … Analysis of Systems …, 2024 - ieeexplore.ieee.org
Modern Deep Learning (DL) models require massive clusters of specialized, high-end
nodes to train. Designing such clusters to maximize both performance and utilization is a …