On the opportunities and risks of foundation models
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We …
PyTorch FSDP: experiences on scaling Fully Sharded Data Parallel
It is widely acknowledged that large models have the potential to deliver superior
performance across a broad range of domains. Despite the remarkable progress made in …
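Since this entry concerns FSDP, a minimal sketch of wrapping a model with torch.distributed.fsdp may help; the toy model, sizes, and launch assumptions (torchrun having set RANK/WORLD_SIZE) are illustrative, not taken from the paper.

```python
# Minimal sketch of sharded data-parallel training with PyTorch FSDP.
# Assumes launch via torchrun (RANK/WORLD_SIZE set); the toy model is illustrative.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.ReLU(),
        torch.nn.Linear(4096, 1024),
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # gathering full parameters only around each unit's forward/backward.
    model = FSDP(model)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device="cuda")
    loss = model(x).square().mean()
    loss.backward()
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```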
Random Offset Block Embedding (ROBE) for compressed embedding tables in deep learning recommendation systems
Deep learning for recommendation data is one of the most pervasive and challenging AI workloads in recent times. State-of-the-art recommendation models are one of the largest …
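A rough sketch of the block-hashing idea the title refers to: each row's d-dimensional embedding is read as a contiguous block of one shared array at a hashed offset, so memory is fixed regardless of vocabulary size. The array size, dimension, and hash constants below are assumptions for illustration.

```python
# Sketch of a ROBE-style compressed embedding lookup: every row's vector
# is read as a contiguous block from one shared circular array, so total
# memory is ROBE_SIZE floats no matter how large the vocabulary is.
# Sizes and hash constants here are illustrative assumptions.
import numpy as np

ROBE_SIZE = 1 << 20                       # shared array replacing the full table
DIM = 64                                  # embedding dimension
A, B, P = 1000003, 1009, (1 << 61) - 1    # hash parameters (assumed)

memory = (np.random.randn(ROBE_SIZE) * 0.01).astype(np.float32)

def robe_lookup(row_id: int) -> np.ndarray:
    """Return the DIM-dimensional embedding for row_id."""
    start = (A * row_id + B) % P % ROBE_SIZE    # random block offset
    idx = (start + np.arange(DIM)) % ROBE_SIZE  # circular block read
    return memory[idx]

# Two ids map to (usually) different blocks of the same shared array.
print(robe_lookup(42)[:4], robe_lookup(12345)[:4])
```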
Training personalized recommendation systems from (GPU) scratch: Look forward not backwards
Personalized recommendation models (RecSys) are one of the most popular machine learning workloads serviced by hyperscalers. A critical challenge of training RecSys is its …
A survey on auto-parallelism of large-scale deep learning training
P Liang, Y Tang, X Zhang, Y Bai, T Su… - … on Parallel and …, 2023 - ieeexplore.ieee.org
Deep learning (DL) has gained great success in recent years, leading to state-of-the-art performance in the research community and in industrial fields like computer vision and natural …
Enabling compute-communication overlap in distributed deep learning training platforms
Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPU/TPU) via fast, customized interconnects with 100s of gigabytes per second (GB/s) of bandwidth …
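The overlap in the title can be illustrated with PyTorch's asynchronous collectives: launch an all-reduce with async_op=True, keep computing on independent data, and wait on the returned handle before consuming the result. Tensor sizes and process-group setup below are illustrative assumptions, not the paper's platform.

```python
# Sketch of compute-communication overlap: kick off an asynchronous
# all-reduce on one gradient bucket, do independent compute, then wait.
# Assumes launch via torchrun; sizes are illustrative.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

grad_bucket = torch.randn(1 << 20, device="cuda")
next_layer_in = torch.randn(4096, 4096, device="cuda")

# Launch the collective without blocking the compute stream.
work = dist.all_reduce(grad_bucket, op=dist.ReduceOp.SUM, async_op=True)

# Independent computation (e.g., the next layer's backward) proceeds
# while NCCL moves the bucket over the interconnect.
out = next_layer_in @ next_layer_in

work.wait()                      # synchronize before using the reduced bucket
grad_bucket /= dist.get_world_size()
dist.destroy_process_group()
```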
GraphPipe: Improving performance and scalability of DNN training with graph pipeline parallelism
Deep neural networks (DNNs) continue to grow rapidly in size, making them infeasible to
train on a single device. Pipeline parallelism is commonly used in existing DNN systems to …
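As a toy illustration of the micro-batch pipelining this line of work builds on (a plain GPipe-style schedule, not GraphPipe's graph pipeline parallelism itself), the following prints which stage works on which micro-batch at each tick, making the fill/drain bubbles visible; stage and micro-batch counts are arbitrary.

```python
# Toy simulation of a GPipe-style pipeline-parallel forward schedule:
# with S stages and M micro-batches, at clock tick t stage s works on
# micro-batch t - s. Counts are illustrative assumptions.
S, M = 4, 6   # pipeline stages, micro-batches

for t in range(S + M - 1):                 # total ticks until the pipe drains
    busy = []
    for s in range(S):
        mb = t - s
        if 0 <= mb < M:
            busy.append(f"stage{s}:mb{mb}")
    fill = "bubble" if len(busy) < S else "full"
    print(f"t={t:2d} [{fill:6s}] " + "  ".join(busy))
```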
Persia: An open, hybrid system scaling deep learning-based recommenders up to 100 trillion parameters
Recent years have witnessed an exponential growth of model scale in deep learning-based
recommender systems---from Google's 2016 model with 1 billion parameters to the latest …
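To make the title's scale concrete, a quick check of weight storage alone, assuming 4 bytes per fp32 parameter (no optimizer state or activations):

```python
# Scale check for the title's numbers: bytes needed just to store weights.
# Assumes fp32 (4 bytes/parameter); optimizer state would multiply this.
for params in (1e9, 1e12, 1e14):   # 2016-era model -> trillion -> 100 trillion
    tb = params * 4 / 1e12
    print(f"{params:.0e} params -> {tb:8.1f} TB of fp32 weights")
# 100 trillion fp32 parameters alone occupy 400 TB, far beyond any single
# node, which is why such recommenders shard embeddings across a cluster.
```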
The trade-offs of model size in large recommendation models: 100 GB to 10 MB Criteo-TB DLRM model
A Desai, A Shrivastava - Advances in Neural Information …, 2022 - proceedings.neurips.cc
Embedding tables dominate industrial-scale recommendation model sizes, using up to
terabytes of memory. A popular and the largest publicly available machine learning MLPerf …
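The claim that embedding tables reach terabytes follows from simple arithmetic, rows x dimension x bytes per weight; the vocabulary sizes below are hypothetical, not the Criteo-TB figures.

```python
# Back-of-the-envelope embedding-table memory: rows x dim x bytes/weight.
# Vocabulary sizes are illustrative assumptions, not the Criteo-TB figures.
DIM, BYTES = 128, 4                      # fp32 embeddings

vocab_sizes = [10**6, 10**7, 10**8]      # hypothetical sparse-feature cardinalities
for rows in vocab_sizes:
    gib = rows * DIM * BYTES / 2**30
    print(f"{rows:>12,d} rows x {DIM} dims -> {gib:8.1f} GiB")
# A single 10^8-row table already needs ~47.7 GiB, so a handful of such
# features pushes a model into the hundreds-of-GB regime.
```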
EmbedX: A Versatile, Efficient and Scalable Platform to Embed Both Graphs and High-Dimensional Sparse Data
In modern online services, it is of growing importance to process web-scale graph data and
high-dimensional sparse data together into embeddings for downstream tasks, such as …