Demystifying BERT: System design implications
Transfer learning in natural language processing (NLP) uses increasingly large models that
tackle challenging problems. Consequently, these applications are driving the requirements …
vTrain: A simulation framework for evaluating cost-effective and compute-optimal large language model training
As large language models (LLMs) become widespread in various application domains, a
critical challenge the AI community is facing is how to train these large AI models in a cost …
Principal kernel analysis: A tractable methodology to simulate scaled GPU workloads
C Avalos Baddouh, M Khairy, RN Green… - MICRO-54: 54th Annual …, 2021 - dl.acm.org
Simulating all threads in a scaled GPU workload results in prohibitive simulation cost. Cycle-level
simulation is orders of magnitude slower than native silicon; the only solution is to …
Demystifying BERT: Implications for accelerator design
Transfer learning in natural language processing (NLP), as realized using models like BERT
(Bi-directional Encoder Representation from Transformer), has significantly improved …
Sieve: Stratified GPU-compute workload sampling
To exploit the ever-increasing compute capabilities offered by GPU hardware, GPU-compute
workloads have evolved from simple computational kernels to large-scale programs with …
Tale of Two Cs: Computation vs. Communication Scaling for Future Transformers on Future Hardware
Scaling neural network models has delivered dramatic quality gains across ML problems.
However, this scaling also increased the reliance on efficient distributed training techniques …
Global Optimizations & Lightweight Dynamic Logic for Concurrency
Modern accelerators like GPUs are increasingly executing independent operations
concurrently to improve the device's compute utilization. However, effectively harnessing it …
[PDF][PDF] Simulating Machine Learning Models at Scale
V Ramadas, MD Sinclair - SRC TECHCON, 2024 - pages.cs.wisc.edu
In recent years, deep neural networks (DNNs) have emerged as an important application
domain driving the requirements for future systems. As DNNs get more sophisticated, their …
[PDF][PDF] Simulation Support for Fast and Accurate Large-Scale GPGPU & Accelerator Workloads
In recent years, deep neural networks (DNNs) have emerged as an important application
domain driving the requirements for future systems. As DNNs get more sophisticated, their …
TPUPoint: Automatic characterization of hardware-accelerated machine-learning behavior for cloud computing
With the share of machine learning (ML) workloads in data centers rapidly increasing, cloud
providers are beginning to incorporate accelerators such as tensor processing units (TPUs) …