Sustainable AI: Environmental implications, challenges and opportunities

CJ Wu, R Raghavendra, U Gupta… - Proceedings of …, 2022 - proceedings.mlsys.org
This paper explores the environmental impact of the super-linear growth trends for AI from a
holistic perspective, spanning Data, Algorithms, and System Hardware. We characterize the …
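
A minimal sketch of the back-of-the-envelope operational-carbon arithmetic such characterizations build on, assuming the common energy × PUE × grid-intensity formula; all numbers below are illustrative placeholders, not figures from the paper:

```python
# Operational carbon of a training run (illustrative placeholders throughout):
# energy_kwh = power_kw * hours * PUE; carbon_kg = energy_kwh * grid intensity.

def operational_carbon_kg(power_kw: float, hours: float,
                          pue: float, grid_kg_per_kwh: float) -> float:
    """Wall energy (incl. datacenter overhead via PUE) times grid carbon intensity."""
    energy_kwh = power_kw * hours * pue
    return energy_kwh * grid_kg_per_kwh

# e.g., 512 accelerators at ~0.3 kW each, running for two weeks
kg = operational_carbon_kg(power_kw=512 * 0.3, hours=14 * 24,
                           pue=1.1, grid_kg_per_kwh=0.4)
print(f"~{kg / 1000:.1f} tCO2e (operational only; embodied carbon excluded)")
```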

Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product

M Zhao, N Agarwal, A Basant, B Gedik, S Pan… - Proceedings of the 49th …, 2022 - dl.acm.org
Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators
(DSAs) are used to train increasingly complex deep learning models. These clusters rely on a …
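
A minimal sketch of the producer-consumer pattern such ingestion pipelines build on: preprocessing runs ahead of the trainer behind a bounded prefetch queue, so storage and transform latency does not stall the accelerators. The reader function and queue depth are hypothetical stand-ins, not the paper's system:

```python
# Decoupling data ingestion from training with a bounded prefetch queue.
# `read_and_preprocess_batch` is a hypothetical stand-in for a real reader.
import queue
import threading

def read_and_preprocess_batch(i: int):
    return {"batch_id": i}  # placeholder for decode/transform work

def producer(q: "queue.Queue", num_batches: int) -> None:
    for i in range(num_batches):
        q.put(read_and_preprocess_batch(i))  # blocks when the queue is full
    q.put(None)  # sentinel: no more data

def trainer(q: "queue.Queue") -> None:
    while (batch := q.get()) is not None:
        pass  # train_step(batch) would run on the accelerator here

q = queue.Queue(maxsize=8)  # bounded buffer caps host memory use
threading.Thread(target=producer, args=(q, 100), daemon=True).start()
trainer(q)
```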

CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters

S Rajasekaran, M Ghobadi, A Akella - 21st USENIX Symposium on …, 2024 - usenix.org
We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters.
CASSINI introduces a novel geometric abstraction to consider the communication pattern of …
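
A toy illustration of the underlying intuition, assuming each job's link usage alternates periodically between compute and communication phases: shifting one job in time can interleave the bursts and reduce contention. This is a sketch of the idea, not CASSINI's actual abstraction or placement algorithm:

```python
# Each job's link usage over one iteration, discretized into slots:
# 1 = communication phase, 0 = compute phase. Shifting job b in time
# can make the two jobs' bursts interleave instead of collide.

def overlap(a, b, shift):
    """Slots in which both jobs communicate, with job b delayed by `shift`."""
    n = len(a)
    return sum(a[t] and b[(t - shift) % n] for t in range(n))

job_a = [1, 1, 0, 0, 0, 0]
job_b = [1, 1, 0, 0, 0, 0]

best = min(range(len(job_a)), key=lambda s: overlap(job_a, job_b, s))
print(f"best shift = {best} slots, overlap = {overlap(job_a, job_b, best)}")
```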

Estimating and penalizing induced preference shifts in recommender systems

MD Carroll, A Dragan, S Russell… - International …, 2022 - proceedings.mlr.press
The content that a recommender system (RS) shows to users influences them. Therefore,
when choosing a recommender to deploy, one is implicitly also choosing to induce specific …
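
A minimal sketch of one way such a penalty could enter recommender selection, assuming a score of predicted engagement minus λ times a divergence between predicted pre- and post-deployment preference distributions; the KL-based shift measure and all numbers are illustrative assumptions, not the paper's estimator:

```python
# Score candidate recommenders by engagement minus a penalty on the
# preference shift they are predicted to induce (illustrative only).
import math

def kl(p, q):
    """KL divergence between two discrete preference distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_score(engagement, prefs_before, prefs_after, lam=1.0):
    return engagement - lam * kl(prefs_after, prefs_before)

baseline = [0.5, 0.3, 0.2]            # current preference mix over topics
candidates = {
    "A": (1.10, [0.5, 0.3, 0.2]),     # engaging, predicted shift-free
    "B": (1.25, [0.8, 0.1, 0.1]),     # more engaging, large predicted shift
}
for name, (eng, post) in candidates.items():
    print(name, round(penalized_score(eng, baseline, post), 3))
# With lam=1.0, the shift penalty makes A preferable despite B's engagement.
```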

MTIA: First-generation silicon targeting Meta's recommendation systems

A Firoozshahian, J Coburn, R Levenstein… - Proceedings of the 50th …, 2023 - dl.acm.org
Meta has traditionally relied on CPU-based servers for running inference workloads,
specifically Deep Learning Recommendation Models (DLRM), but the increasing compute …
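
For context, a schematic of the DLRM-style workload such silicon targets: memory-bound sparse embedding lookups feeding a compute-bound dense MLP. Shapes, table sizes, and the two-feature layout are illustrative, not MTIA or production values:

```python
# Schematic DLRM-style forward pass (illustrative shapes only).
import numpy as np

rng = np.random.default_rng(0)
emb_table = rng.standard_normal((100_000, 64))   # sparse feature embeddings
w1 = rng.standard_normal((64 * 2 + 16, 128))
w2 = rng.standard_normal((128, 1))

def dlrm_forward(dense_x, sparse_ids_a, sparse_ids_b):
    # Memory-bound: gather + pool embedding rows per sparse feature
    pooled_a = emb_table[sparse_ids_a].sum(axis=0)
    pooled_b = emb_table[sparse_ids_b].sum(axis=0)
    # Compute-bound: dense MLP over concatenated features, sigmoid output
    x = np.concatenate([dense_x, pooled_a, pooled_b])
    return 1 / (1 + np.exp(-np.maximum(x @ w1, 0) @ w2))

score = dlrm_forward(rng.standard_normal(16), [3, 17, 42_000], [7, 99])
print(score.item())
```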

Congestion control in machine learning clusters

S Rajasekaran, M Ghobadi, G Kumar… - Proceedings of the 21st …, 2022 - dl.acm.org
This paper argues that fair-sharing, the holy grail of congestion control algorithms for
decades, is not necessarily a desirable property in Machine Learning (ML) training clusters …
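
A back-of-the-envelope illustration of the argument, assuming two jobs with periodic on/off traffic on a shared link: fair-sharing halves each burst's bandwidth, while interleaving the bursts in time lets each run at full line rate. All numbers are made up:

```python
# Two training jobs share a 100 Gbps link; each sends a gradient burst
# per iteration, separated by a compute phase (illustrative numbers).
link_gbps = 100.0
burst_gb = 12.5            # gradient exchange per iteration, per job
compute_s = 2.0            # compute phase between bursts

# Fair-sharing: both jobs burst together, each at half bandwidth
fair_comm = burst_gb * 8 / (link_gbps / 2)      # seconds per iteration
# Interleaved: each burst runs alone at full rate, hidden inside the
# other job's compute phase (feasible here since 1.0s <= 2.0s compute)
inter_comm = burst_gb * 8 / link_gbps

print(f"fair-sharing: iteration = {compute_s + fair_comm:.1f}s")
print(f"interleaved:  iteration = {compute_s + inter_comm:.1f}s")
```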

Better Together: Jointly Optimizing ML Collective Scheduling and Execution Planning Using SYNDICATE

K Mahajan, CH Chu, S Sridharan, A Akella - 20th USENIX Symposium …, 2023 - usenix.org
Emerging ML training deployments are trending towards larger models and hybrid-parallel
training that is not dominated solely by compute-intensive all-reduce for gradient aggregation …
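
A toy sketch of the joint-optimization intuition, assuming collectives can be broken into smaller chunks that a scheduler then balances across channels; the greedy least-loaded placement below is an illustration, not SYNDICATE's planner:

```python
# Splitting collectives into chunks widens the scheduling space, letting a
# scheduler balance load across channels (illustrative greedy placement).
import heapq

def makespan(chunks, num_channels=2):
    """Greedy: place each chunk (largest first) on the least-loaded channel."""
    heap = [(0.0, c) for c in range(num_channels)]
    loads = [0.0] * num_channels
    for chunk in sorted(chunks, reverse=True):
        load, c = heapq.heappop(heap)
        loads[c] = load + chunk
        heapq.heappush(heap, (loads[c], c))
    return max(loads)  # proxy for completion time: most-loaded channel

def chunked(sizes, chunk_mb):
    """Split each collective into chunk_mb-sized pieces (plus a remainder)."""
    out = []
    for s in sizes:
        n, r = divmod(s, chunk_mb)
        out += [chunk_mb] * n + ([r] if r else [])
    return out

sizes_mb = [512, 256, 128]                          # three collectives
print("whole collectives:", makespan(sizes_mb))             # 512 vs 384
print("64 MB chunks:     ", makespan(chunked(sizes_mb, 64)))  # 448 vs 448
```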

Training personalized recommendation systems from (GPU) scratch: Look forward not backwards

Y Kwon, M Rhu - Proceedings of the 49th Annual International …, 2022 - dl.acm.org
Personalized recommendation models (RecSys) are one of the most popular machine
learning workloads serviced by hyperscalers. A critical challenge of training RecSys is its …
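
A minimal sketch of the "look forward" idea, assuming the input pipeline can peek at upcoming batches: the embedding rows a future batch will touch are staged into a fast cache before the batch arrives, so training-time lookups always hit. The dict-based cache and lookahead window are stand-ins (eviction omitted):

```python
# Forward-looking embedding prefetch: stage rows for upcoming batches into
# a fast cache ahead of time (dict-based stand-ins, no eviction policy).
from collections import deque

cpu_table = {i: f"row{i}" for i in range(1000)}   # slow/large store (stand-in)
gpu_cache = {}                                    # fast/small cache (stand-in)

def prefetch(batch_ids):
    for i in batch_ids:
        gpu_cache.setdefault(i, cpu_table[i])     # stage row before it's needed

def train_step(batch_ids):
    assert all(i in gpu_cache for i in batch_ids) # every lookup hits the cache

batches = deque([[1, 5, 9], [5, 42, 7], [7, 8, 1]])
lookahead = 1
for k in range(min(lookahead + 1, len(batches))):
    prefetch(batches[k])                          # warm up the window
while batches:
    train_step(batches[0])
    batches.popleft()
    if len(batches) > lookahead:
        prefetch(batches[lookahead])              # look forward, not backwards
```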

Understanding RDMA microarchitecture resources for performance isolation

X Kong, J Chen, W Bai, Y Xu, M Elhaddad… - … USENIX Symposium on …, 2023 - usenix.org
Recent years have witnessed the wide adoption of RDMA in the cloud to accelerate first-
party workloads and achieve cost savings by freeing up CPU cycles. Now cloud providers …

Enabling compute-communication overlap in distributed deep learning training platforms

S Rashidi, M Denton, S Sridharan… - 2021 ACM/IEEE 48th …, 2021 - ieeexplore.ieee.org
Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators
(e.g., GPU/TPU) via fast, customized interconnects with hundreds of gigabytes per second (GB/s) of bandwidth …
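
A minimal sketch of the overlap such platforms aim for, assuming the gradient collective for each layer can be launched asynchronously as soon as that layer's backward pass finishes; time.sleep stands in for real compute and collective kernels:

```python
# Overlapping gradient communication with backward compute: layer i's
# all-reduce runs while earlier layers' backward passes continue.
import time
from concurrent.futures import ThreadPoolExecutor

def backward(layer):            # stand-in for one layer's backward compute
    time.sleep(0.05)

def allreduce(layer):           # stand-in for the gradient collective
    time.sleep(0.05)

layers = [3, 2, 1, 0]           # backward runs from last layer to first
pool = ThreadPoolExecutor(max_workers=1)   # models a single comm stream

start = time.time()
pending = []
for l in layers:
    backward(l)                                # compute on the main "stream"
    pending.append(pool.submit(allreduce, l))  # comm overlaps next backward
for f in pending:
    f.result()                                 # sync before the optimizer step
print(f"overlapped: {time.time() - start:.2f}s (vs ~0.40s fully serialized)")
```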