Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had immense success in the recent past, leading to state-of-the-
art results in various domains, such as image recognition and natural language processing …

Performance enhancement of artificial intelligence: A survey

M Krichen, MS Abdalzaher - Journal of Network and Computer Applications, 2024 - Elsevier
The advent of machine learning (ML) and artificial intelligence (AI) has brought about a
significant transformation across multiple industries, as it has facilitated the automation of …

Privacy preserving machine learning with homomorphic encryption and federated learning

H Fang, Q Qian - Future Internet, 2021 - mdpi.com
Privacy protection has been an important concern with the great success of machine
learning. This paper proposes a multi-party privacy-preserving machine learning …
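
The snippet above is truncated, so the paper's actual protocol is not visible here. As a rough, self-contained sketch of the federated-learning side of such a scheme, the following Python illustrates federated averaging in which each party shares only masked model updates; the pairwise additive masking is a simplified stand-in for homomorphic encryption, and all names and constants are illustrative rather than taken from the paper.

    import numpy as np

    # Minimal sketch of privacy-preserving federated averaging.
    # Additive pairwise masking stands in for homomorphic encryption:
    # the aggregator only ever sees masked updates, yet their sum equals
    # the sum of the true updates because the pairwise masks cancel.

    def local_update(weights, data, labels, lr=0.1):
        """One step of least-squares gradient descent on a client's private data."""
        grad = data.T @ (data @ weights - labels) / len(labels)
        return weights - lr * grad

    def mask_updates(updates, rng):
        """Add pairwise-cancelling random masks to each client's update."""
        n = len(updates)
        masked = [u.copy() for u in updates]
        for i in range(n):
            for j in range(i + 1, n):
                r = rng.normal(size=updates[0].shape)
                masked[i] += r          # client i adds +r_ij
                masked[j] -= r          # client j adds -r_ij
        return masked

    rng = np.random.default_rng(0)
    dim, clients = 5, 3
    global_w = np.zeros(dim)

    for rnd in range(10):
        # Each client trains locally on data that never leaves its machine.
        local_ws = []
        for _ in range(clients):
            X = rng.normal(size=(20, dim))
            y = X @ np.ones(dim) + 0.01 * rng.normal(size=20)
            local_ws.append(local_update(global_w.copy(), X, y))
        # The aggregator averages masked updates; individual updates stay hidden.
        masked = mask_updates(local_ws, rng)
        global_w = sum(masked) / clients

    print("aggregated weights:", np.round(global_w, 2))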

Optimus: an efficient dynamic resource scheduler for deep learning clusters

Y Peng, Y Bao, Y Chen, C Wu, C Guo - Proceedings of the Thirteenth …, 2018 - dl.acm.org
Deep learning workloads are common in today's production clusters due to the proliferation
of deep-learning-driven AI services (e.g., speech recognition, machine translation). A deep …

Gaia: Geo-distributed machine learning approaching LAN speeds

K Hsieh, A Harlap, N Vijaykumar, D Konomis… - … USENIX symposium on …, 2017 - usenix.org
Machine learning (ML) is widely used to derive useful information from large-scale data
(such as user activities, pictures, and videos) generated at increasingly rapid rates, all over …
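
The visible snippet is mostly motivational; the system itself targets training across geographically distributed data centers connected by slow WAN links. The sketch below illustrates the general idea of significance-filtered synchronization, where a site only ships a parameter update once its accumulated change crosses a relative threshold; the threshold value and the dense-array bookkeeping are illustrative assumptions, not Gaia's actual protocol.

    import numpy as np

    # Sketch of significance-filtered parameter synchronization between
    # data centers: each site accumulates local updates and only ships a
    # parameter over the (slow, expensive) WAN once its accumulated change
    # exceeds a relative significance threshold. Threshold value and the
    # dense-array bookkeeping here are illustrative only.

    THRESHOLD = 0.01  # ship an update once it changes the value by >1%

    class SiteReplica:
        def __init__(self, params):
            self.params = params.copy()           # local copy of the global model
            self.pending = np.zeros_like(params)  # updates not yet sent over WAN

        def apply_local(self, delta):
            """Apply a locally computed update and buffer it for the WAN."""
            self.params += delta
            self.pending += delta

        def flush_significant(self):
            """Return only the significant accumulated updates, then reset them."""
            denom = np.maximum(np.abs(self.params), 1e-8)
            significant = np.abs(self.pending) / denom > THRESHOLD
            out = np.where(significant, self.pending, 0.0)
            self.pending[significant] = 0.0
            return out

    rng = np.random.default_rng(1)
    site_a = SiteReplica(np.ones(8))
    site_b = SiteReplica(np.ones(8))

    for step in range(100):
        site_a.apply_local(0.001 * rng.normal(size=8))
        site_b.apply_local(0.001 * rng.normal(size=8))
        # Exchange only significant updates across the WAN.
        a_to_b, b_to_a = site_a.flush_significant(), site_b.flush_significant()
        site_b.params += a_to_b
        site_a.params += b_to_a

    print("max divergence between sites:", np.max(np.abs(site_a.params - site_b.params)))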

PipeDream: Fast and efficient pipeline parallel DNN training

A Harlap, D Narayanan, A Phanishayee… - arXiv preprint arXiv …, 2018 - arxiv.org
PipeDream is a Deep Neural Network (DNN) training system for GPUs that parallelizes
computation by pipelining execution across multiple machines. Its pipeline parallel …
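
Since the snippet describes parallelizing computation by pipelining execution across machines, a minimal forward-only sketch of pipeline parallelism may help: a two-stage model is split across two simulated workers, and micro-batches flow through so that both stages work concurrently on different micro-batches. This shows only the basic pipelining idea, not PipeDream's 1F1B schedule or weight versioning; all sizes are illustrative.

    import numpy as np

    # Minimal sketch of pipeline-parallel forward execution: a two-stage
    # model is split across two workers, and micro-batches flow through the
    # pipeline so that stage 0 processes micro-batch k+1 while stage 1
    # processes micro-batch k.

    rng = np.random.default_rng(0)
    W0 = rng.normal(size=(16, 32))   # parameters held by worker 0 (stage 0)
    W1 = rng.normal(size=(32, 10))   # parameters held by worker 1 (stage 1)

    def stage0(x):  # runs on worker 0
        return np.maximum(x @ W0, 0.0)

    def stage1(h):  # runs on worker 1
        return h @ W1

    micro_batches = [rng.normal(size=(4, 16)) for _ in range(6)]
    inflight = None          # activation travelling from stage 0 to stage 1
    outputs = []

    # On each "tick" both stages work concurrently on different micro-batches.
    for tick in range(len(micro_batches) + 1):
        done = stage1(inflight) if inflight is not None else None                          # worker 1
        inflight = stage0(micro_batches[tick]) if tick < len(micro_batches) else None      # worker 0
        if done is not None:
            outputs.append(done)
            print(f"tick {tick}: stage 1 finished micro-batch {len(outputs) - 1}")

    print("pipeline produced", len(outputs), "outputs of shape", outputs[0].shape)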

HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism

JH Park, G Yun, MY Chang, NT Nguyen, S Lee… - 2020 USENIX Annual …, 2020 - usenix.org
Deep Neural Network (DNN) models have continuously been growing in size in order to
improve the accuracy and quality of the models. Moreover, for training of large DNN models …
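
The title points at integrating pipelined model parallelism with data parallelism; the truncated snippet does not show the mechanism. As a rough structural sketch only (not HetPipe's actual synchronization scheme), the following code has each "virtual worker" hold the model split into stages and a parameter server average gradients across virtual workers; the sizes, learning rate, and toy task are assumptions.

    import numpy as np

    # Rough structural sketch of combining the two forms of parallelism the
    # title refers to: each "virtual worker" holds the model split into
    # pipeline stages (model parallelism within the worker), and a parameter
    # server averages gradients across virtual workers (data parallelism).

    rng = np.random.default_rng(0)
    global_stages = [rng.normal(size=(8, 8)) * 0.1, rng.normal(size=(8, 1)) * 0.1]

    def virtual_worker_grads(stages, X, y):
        """Forward through the local pipeline stages, then backprop both stages."""
        h = np.maximum(X @ stages[0], 0.0)      # stage 0 (e.g. on a slower GPU)
        pred = h @ stages[1]                    # stage 1 (e.g. on a faster GPU)
        err = pred - y
        g1 = h.T @ err / len(y)                                   # gradient for stage 1
        g0 = X.T @ ((err @ stages[1].T) * (h > 0)) / len(y)       # gradient for stage 0
        return [g0, g1]

    for step in range(50):
        grads_per_worker = []
        for _ in range(3):                      # three heterogeneous virtual workers
            X = rng.normal(size=(16, 8))
            y = (X.sum(axis=1, keepdims=True) > 0).astype(float)
            grads_per_worker.append(virtual_worker_grads(global_stages, X, y))
        # Parameter server: average gradients across virtual workers, update model.
        for s in range(len(global_stages)):
            avg = sum(g[s] for g in grads_per_worker) / len(grads_per_worker)
            global_stages[s] -= 0.1 * avg

    print("trained stage shapes:", [s.shape for s in global_stages])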

HET: scaling out huge embedding model training via cache-enabled distributed framework

X Miao, H Zhang, Y Shi, X Nie, Z Yang, Y Tao… - arXiv preprint arXiv …, 2021 - arxiv.org
Embedding models have been an effective learning paradigm for high-dimensional data.
However, one open issue of embedding models is that their representations (latent factors) …
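
The snippet introduces embedding models for high-dimensional data; the cache-enabled design named in the title is not visible in the excerpt. Below is a loose illustration of the general idea of caching hot embedding rows locally and synchronizing with a remote table only on misses or after a bounded number of local updates; the staleness bound, dict-based cache, and function names are illustrative assumptions, not HET's design.

    import numpy as np

    # Loose illustration of cache-enabled embedding training: a worker keeps
    # hot embedding rows in a local cache and only talks to the remote
    # embedding table when a row is missing or has gone stale.

    STALENESS_BOUND = 5
    DIM = 4

    remote_table = {}                    # stands in for the distributed embedding table

    def remote_fetch(key):
        return remote_table.setdefault(key, np.zeros(DIM))

    class CachedEmbedding:
        def __init__(self):
            self.cache = {}              # key -> [vector, local updates since last sync]

        def lookup(self, key):
            if key not in self.cache:
                self.cache[key] = [remote_fetch(key).copy(), 0]
            return self.cache[key][0]

        def update(self, key, grad, lr=0.1):
            vec, staleness = self.cache[key]
            vec -= lr * grad
            self.cache[key][1] = staleness + 1
            if self.cache[key][1] >= STALENESS_BOUND:
                remote_table[key] = vec.copy()   # write back, reset staleness
                self.cache[key][1] = 0

    rng = np.random.default_rng(0)
    emb = CachedEmbedding()
    for step in range(20):
        key = int(rng.integers(0, 3))    # skewed/hot keys benefit most from caching
        vec = emb.lookup(key)
        emb.update(key, rng.normal(size=DIM))

    print({k: np.round(v, 2) for k, v in remote_table.items()})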

Baechi: fast device placement of machine learning graphs

B Jeon, L Cai, P Srivastava, J Jiang, X Ke… - Proceedings of the 11th …, 2020 - dl.acm.org
Machine Learning graphs (or models) can be challenging or impossible to train when either
devices have limited memory, or the models are large. Splitting the model graph across …
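
The snippet describes splitting a model graph across devices when device memory is limited. The sketch below shows one simple memory-aware placement heuristic, greedy assignment in topological order preferring the device that already holds an operator's inputs; it is meant only to make the placement problem concrete and is not Baechi's actual algorithm. The example graph and memory figures are made up.

    # Minimal sketch of memory-aware device placement: operators of a model
    # graph are visited in topological order and greedily assigned to the
    # device that already holds the most of their inputs and still has
    # memory left.

    # (op name, memory needed in MB, list of input ops)
    graph = [
        ("embed",  900, []),
        ("lstm1",  600, ["embed"]),
        ("lstm2",  600, ["lstm1"]),
        ("attn",   400, ["lstm1", "lstm2"]),
        ("dense",  300, ["attn"]),
        ("softmax", 50, ["dense"]),
    ]

    device_free_mb = {"gpu:0": 1500, "gpu:1": 1500}
    placement = {}

    for op, mem, inputs in graph:                 # graph is already topologically ordered
        candidates = [d for d, free in device_free_mb.items() if free >= mem]
        if not candidates:
            raise MemoryError(f"no device can host {op} ({mem} MB)")
        # Prefer the device that holds the most inputs (fewer cross-device transfers),
        # breaking ties by the amount of free memory remaining.
        best = max(candidates,
                   key=lambda d: (sum(placement.get(i) == d for i in inputs),
                                  device_free_mb[d]))
        placement[op] = best
        device_free_mb[best] -= mem

    for op, dev in placement.items():
        print(f"{op:8s} -> {dev}")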

Supporting very large models using automatic dataflow graph partitioning

M Wang, C Huang, J Li - … of the Fourteenth EuroSys Conference 2019, 2019 - dl.acm.org
This paper presents Tofu, a system that partitions very large DNN models across multiple
GPU devices to reduce per-GPU memory footprint. Tofu is designed to partition a dataflow …
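
The snippet says Tofu partitions very large DNN models across GPU devices to reduce the per-GPU memory footprint. As a hedged illustration of why partitioning a single operator shrinks per-device memory, the following code splits a matrix multiply along its output dimension so each simulated device holds only a slice of the weights; the even split and the device simulation are assumptions, not Tofu's partitioning search.

    import numpy as np

    # Illustration of partitioning a single large operator across devices to
    # cut per-device memory: Y = X @ W is split along the output dimension of
    # W, so each (simulated) device stores and computes only one column slice;
    # the partial results are concatenated.

    rng = np.random.default_rng(0)
    batch, in_dim, out_dim, n_devices = 32, 1024, 4096, 2

    X = rng.normal(size=(batch, in_dim))
    W = rng.normal(size=(in_dim, out_dim))

    # Each device holds only out_dim / n_devices columns of W.
    W_shards = np.array_split(W, n_devices, axis=1)

    def device_matmul(device_id, x, w_shard):
        """Compute the local slice of the output on one device."""
        print(f"device {device_id}: weight shard {w_shard.shape}, "
              f"{w_shard.nbytes / 1e6:.1f} MB instead of {W.nbytes / 1e6:.1f} MB")
        return x @ w_shard

    partial = [device_matmul(i, X, w) for i, w in enumerate(W_shards)]
    Y_partitioned = np.concatenate(partial, axis=1)

    # Sanity check: partitioned execution matches the unpartitioned operator.
    assert np.allclose(Y_partitioned, X @ W)
    print("partitioned result matches, output shape:", Y_partitioned.shape)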