LLMCompass: Enabling efficient hardware design for large language model inference

H Zhang, A Ning, RB Prabhakar… - 2024 ACM/IEEE 51st …, 2024 - ieeexplore.ieee.org
The past year has witnessed the increasing popularity of Large Language Models (LLMs).
Their unprecedented scale and associated high hardware cost have impeded their broader …

System technology co-optimization for advanced integration

S Pal, A Mallik, P Gupta - Nature Reviews Electrical Engineering, 2024 - nature.com
Advanced integration and packaging will drive the scaling of computing systems in the next
decade. Diversity in performance, cost and scale of the emerging systems implies that …

ASTRA-sim 2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale

W Won, T Heo, S Rashidi, S Sridharan… - … Analysis of Systems …, 2023 - ieeexplore.ieee.org
As deep learning models and input data continue to scale at an unprecedented rate, it has
become inevitable to move towards distributed training platforms to fit the models and …

DARL: Distributed reconfigurable accelerator for hyperdimensional reinforcement learning

H Chen, M Issa, Y Ni, M Imani - Proceedings of the 41st IEEE/ACM …, 2022 - dl.acm.org
Reinforcement Learning (RL) is a powerful technology for solving decision-making problems
such as robotics control. Modern RL algorithms, e.g., Deep Q-Learning, are based on costly …

Enabling compute-communication overlap in distributed deep learning training platforms

S Rashidi, M Denton, S Sridharan… - 2021 ACM/IEEE 48th …, 2021 - ieeexplore.ieee.org
Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators
(e.g., GPU/TPU) via fast, customized interconnects with 100s of gigabytes (GBs) of bandwidth …

Themis: A network bandwidth-aware collective scheduling policy for distributed training of DL models

S Rashidi, W Won, S Srinivasan, S Sridharan… - Proceedings of the 49th …, 2022 - dl.acm.org
Distributed training is a solution to reduce DNN training time by splitting the task across
multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead …

RoSÉ: A hardware-software co-simulation infrastructure enabling pre-silicon full-stack robotics SoC evaluation

D Nikiforov, SC Dong, CL Zhang, S Kim… - Proceedings of the 50th …, 2023 - dl.acm.org
Robotic systems, such as autonomous unmanned aerial vehicles (UAVs) and self-driving
cars, have been widely deployed in many scenarios and have the potential to revolutionize …

Demystifying BERT: System design implications

S Pati, S Aga, N Jayasena… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org
Transfer learning in natural language processing (NLP) uses increasingly large models that
tackle challenging problems. Consequently, these applications are driving the requirements …

Peta-scale embedded photonics architecture for distributed deep learning applications

Z Wu, LY Dai, A Novick, M Glick, Z Zhu… - Journal of Lightwave …, 2023 - ieeexplore.ieee.org
As Deep Learning (DL) models grow larger and more complex, training jobs are
increasingly distributed across multiple Computing Units (CU) such as GPUs and TPUs …

vTrain: A simulation framework for evaluating cost-effective and compute-optimal large language model training

J Bang, Y Choi, M Kim, Y Kim… - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
As large language models (LLMs) become widespread in various application domains, a
critical challenge facing the AI community is how to train these large AI models in a cost …