LLMCompass: Enabling efficient hardware design for large language model inference
The past year has witnessed the increasing popularity of Large Language Models (LLMs).
Their unprecedented scale and associated high hardware cost have impeded their broader …
System technology co-optimization for advanced integration
Advanced integration and packaging will drive the scaling of computing systems in the next
decade. Diversity in performance, cost and scale of the emerging systems implies that …
ASTRA-sim 2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale
As deep learning models and input data continue to scale at an unprecedented rate, it has
become inevitable to move towards distributed training platforms to fit the models and …
DARL: Distributed reconfigurable accelerator for hyperdimensional reinforcement learning
Reinforcement Learning (RL) is a powerful technology to solve decision-making problems
such as robotics control. Modern RL algorithms, i.e., Deep Q-Learning, are based on costly …
Enabling compute-communication overlap in distributed deep learning training platforms
Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators
(e.g., GPU/TPU) via fast, customized interconnects with 100s of gigabytes (GBs) of bandwidth …
Themis: A network bandwidth-aware collective scheduling policy for distributed training of DL models
Distributed training is a solution to reduce DNN training time by splitting the task across
multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead …
RoSÉ: A hardware-software co-simulation infrastructure enabling pre-silicon full-stack robotics SoC evaluation
Robotic systems, such as autonomous unmanned aerial vehicles (UAVs) and self-driving
cars, have been widely deployed in many scenarios and have the potential to revolutionize …
Demystifying BERT: System design implications
Transfer learning in natural language processing (NLP) uses increasingly large models that
tackle challenging problems. Consequently, these applications are driving the requirements …
Peta-scale embedded photonics architecture for distributed deep learning applications
As Deep Learning (DL) models grow larger and more complex, training jobs are
increasingly distributed across multiple Computing Units (CU) such as GPUs and TPUs …
vTrain: A simulation framework for evaluating cost-effective and compute-optimal large language model training
As large language models (LLMs) become widespread in various application domains, a
critical challenge the AI community is facing is how to train these large AI models in a cost …