Demystifying BERT: System design implications

S Pati, S Aga, N Jayasena… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org
Transfer learning in natural language processing (NLP) uses increasingly large models that
tackle challenging problems. Consequently, these applications are driving the requirements …

vTrain: A simulation framework for evaluating cost-effective and compute-optimal large language model training

J Bang, Y Choi, M Kim, Y Kim… - 2024 57th IEEE/ACM …, 2024 - ieeexplore.ieee.org
As large language models (LLMs) become widespread in various application domains, a
critical challenge facing the AI community is how to train these large AI models in a cost …

Principal kernel analysis: A tractable methodology to simulate scaled GPU workloads

C Avalos Baddouh, M Khairy, RN Green… - MICRO-54: 54th Annual …, 2021 - dl.acm.org
Simulating all threads in a scaled GPU workload results in prohibitive simulation cost. Cycle-
level simulation is orders of magnitude slower than native silicon; the only solution is to …

Demystifying BERT: Implications for accelerator design

S Pati, S Aga, N Jayasena, MD Sinclair - arXiv preprint arXiv:2104.08335, 2021 - arxiv.org
Transfer learning in natural language processing (NLP), as realized using models like BERT
(Bidirectional Encoder Representations from Transformers), has significantly improved …

Sieve: Stratified GPU-compute workload sampling

M Naderan-Tahan, H SeyyedAghaei… - … Analysis of Systems …, 2023 - ieeexplore.ieee.org
To exploit the ever-increasing compute capabilities offered by GPU hardware, GPU-compute
workloads have evolved from simple computational kernels to large-scale programs with …

Tale of Two Cs: Computation vs. Communication Scaling for Future Transformers on Future Hardware

S Pati, S Aga, M Islam, N Jayasena… - 2023 IEEE …, 2023 - ieeexplore.ieee.org
Scaling neural network models has delivered dramatic quality gains across ML problems.
However, this scaling also increased the reliance on efficient distributed training techniques …

Global Optimizations & Lightweight Dynamic Logic for Concurrency

S Pati, S Aga, N Jayasena, MD Sinclair - arXiv preprint arXiv:2409.02227, 2024 - arxiv.org
Modern accelerators like GPUs are increasingly executing independent operations
concurrently to improve the device's compute utilization. However, effectively harnessing it …

[PDF][PDF] Simulating Machine Learning Models at Scale

V Ramadas, MD Sinclair - SRC TECHCON, 2024 - pages.cs.wisc.edu
In recent years deep neural networks (DNNs) have emerged as an important application
domain driving the requirements for future systems. As DNNs get more sophisticated, their …

[PDF][PDF] Simulation Support for Fast and Accurate Large-Scale GPGPU & Accelerator Workloads

V Ramadas, M Poremba, B Beckmann… - Third Workshop on …, 2024 - pages.cs.wisc.edu
In recent years deep neural networks (DNNs) have emerged as an important application
domain driving the requirements for future systems. As DNNs get more sophisticated, their …

TPUPoint: Automatic characterization of hardware-accelerated machine-learning behavior for cloud computing

A Wudenhe, HW Tseng - 2021 IEEE International Symposium …, 2021 - ieeexplore.ieee.org
With the share of machine learning (ML) workloads in data centers rapidly increasing, cloud
providers are beginning to incorporate accelerators such as tensor processing units (TPUs) …