Randomness in neural network training: Characterizing the impact of tooling

D Zhuang, X Zhang, S Song… - Proceedings of Machine …, 2022 - proceedings.mlsys.org
The quest for determinism in machine learning has disproportionately focused on
characterizing the impact of noise introduced by algorithmic design choices. In this work, we …

Demystifying TensorRT: Characterizing neural network inference engine on NVIDIA edge devices

O Shafi, C Rai, R Sen… - 2021 IEEE International …, 2021 - ieeexplore.ieee.org
Edge devices are seeing tremendous growth in sensing and computational capabilities.
Running state-of-the-art deep neural network (NN) based data processing on multi-core …

A software-defined tensor streaming multiprocessor for large-scale machine learning

D Abts, G Kimmell, A Ling, J Kim, M Boyd… - Proceedings of the 49th …, 2022 - dl.acm.org
We describe our novel commercial software-defined approach for large-scale
interconnection networks of tensor streaming processing (TSP) elements. The system …

Not all GPUs are created equal: characterizing variability in large-scale, accelerator-rich systems

P Sinha, A Guliani, R Jain, B Tran… - … Conference for High …, 2022 - ieeexplore.ieee.org
Scientists are increasingly exploring and utilizing the massive parallelism of
general-purpose accelerators such as GPUs for scientific breakthroughs. As a result, datacenters …

Universal checkpointing: Efficient and flexible checkpointing for large scale distributed training

X Lian, SA Jacobs, L Kurilenko, M Tanaka… - arXiv preprint arXiv …, 2024 - arxiv.org
Existing checkpointing approaches seem ill-suited for distributed training even though
hardware limitations make model parallelism, i.e., sharding model state across multiple …

Reproducibility of machine learning: Terminology, recommendations and open issues

R Albertoni, S Colantonio, P Skrzypczyński… - arXiv preprint arXiv …, 2023 - arxiv.org
Reproducibility is one of the core dimensions that concur to deliver Trustworthy Artificial
Intelligence. Broadly speaking, reproducibility can be defined as the possibility to reproduce …

On The Fairness Impacts of Hardware Selection in Machine Learning

SH Nelaturu, NK Ravichandran, C Tran… - … on Machine Learning, 2023 - openreview.net
In the machine learning ecosystem, hardware selection is often regarded as a mere utility,
overshadowed by the spotlight on algorithms and data. This is especially relevant in …

DISTWAR: Fast Differentiable Rendering on Raster-based Rendering Pipelines

S Durvasula, A Zhao, F Chen, R Liang… - arXiv preprint arXiv …, 2023 - arxiv.org
Differentiable rendering is a technique used in an important emerging class of visual
computing applications that involves representing a 3D scene as a model that is trained from …

Only buffer when you need to: Reducing on-chip GPU traffic with reconfigurable local atomic buffers

P Dalmia, R Mahapatra… - 2022 IEEE International …, 2022 - ieeexplore.ieee.org
In recent years, due to their wide availability and ease of programming, GPUs have emerged
as the accelerator of choice for a wide variety of applications including graph analytics and …

Optimistic Verifiable Training by Controlling Hardware Nondeterminism

M Srivastava, S Arora, D Boneh - arXiv preprint arXiv:2403.09603, 2024 - arxiv.org
The increasing compute demands of AI systems have led to the emergence of services that
train models on behalf of clients lacking necessary resources. However, ensuring …