ASTRA-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale
As deep learning models and input data continue to scale at an unprecedented rate, it has
become inevitable to move towards distributed training platforms to fit the models and …
Fault and self-repair for high reliability in die-to-die interconnection of 2.5D/3D IC
R Song, J Zhang, Z Zhu, G Shan, Y Yang - Microelectronics Reliability, 2024 - Elsevier
Bringing dies closer by die-to-die interconnection is a way that reduces latency and energy
per bit transmitted, while increasing bandwidth per mm of chip. Heterogeneous integration …
Heterogeneous Die-to-Die Interfaces: Enabling More Flexible Chiplet Interconnection Systems
The chiplet architecture is one of the emerging methodologies and is believed to be scalable
and economical. However, most current multi-chiplet systems are based on one uniform die …
A Survey on Performance Modeling and Prediction for Distributed DNN Training
Recent breakthroughs in large-scale DNNs have attracted significant attention from both
academia and industry toward distributed DNN training techniques. Due to the time …
Leveraging Memory Expansion to Accelerate Large-Scale DL Training
Modern Deep Learning (DL) models require massive clusters of specialized, high-end
nodes to train. Designing such clusters to maximize both performance and utilization is a …