Demystifying parallel and distributed deep learning: An in-depth concurrency analysis

T Ben-Nun, T Hoefler - ACM Computing Surveys (CSUR), 2019 - dl.acm.org
Deep Neural Networks (DNNs) are becoming an important tool in modern computing
applications. Accelerating their training is a major challenge and techniques range from …

Scalable deep learning on distributed infrastructures: Challenges, techniques, and tools

R Mayer, HA Jacobsen - ACM Computing Surveys (CSUR), 2020 - dl.acm.org
Deep Learning (DL) has had an immense success in the recent past, leading to state-of-the-
art results in various domains, such as image recognition and natural language processing …

Deepchain: Auditable and privacy-preserving deep learning with blockchain-based incentive

J Weng, J Weng, J Zhang, M Li… - IEEE Transactions on …, 2019 - ieeexplore.ieee.org
Deep learning can achieve higher accuracy than traditional machine learning algorithms in
a variety of machine learning tasks. Recently, privacy-preserving deep learning has drawn …

Terngrad: Ternary gradients to reduce communication in distributed deep learning

W Wen, C Xu, F Yan, C Wu, Y Wang… - Advances in neural …, 2017 - proceedings.neurips.cc
High network communication cost for synchronizing gradients and parameters is the well-
known bottleneck of distributed training. In this work, we propose TernGrad that uses ternary …
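The snippet is cut off, but the core idea named in the title — replacing each gradient component with a ternary value to cut communication — can be sketched as follows. This is an illustrative NumPy reconstruction of stochastic ternarization (scale by the max magnitude, keep each component's sign with probability proportional to its magnitude), not the authors' code; the function name is ours.

```python
import numpy as np

def ternarize(grad, rng=None):
    """Stochastically quantize a gradient to {-s, 0, +s}, with s = max|grad|.

    Each component keeps its sign with probability |g_i| / s and is zeroed
    otherwise, which makes the quantizer unbiased: E[ternarize(g)] = g.
    """
    rng = rng or np.random.default_rng()
    s = np.max(np.abs(grad))
    if s == 0.0:
        return np.zeros_like(grad)
    keep = rng.random(grad.shape) < np.abs(grad) / s  # Bernoulli(|g_i|/s)
    return s * np.sign(grad) * keep
```

Because only a sign (2 bits) plus one shared scalar per tensor is transmitted, the per-step communication volume drops sharply versus 32-bit floats, at the cost of higher gradient variance.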

Optimus: an efficient dynamic resource scheduler for deep learning clusters

Y Peng, Y Bao, Y Chen, C Wu, C Guo - Proceedings of the Thirteenth …, 2018 - dl.acm.org
Deep learning workloads are common in today's production clusters due to the proliferation
of deep learning driven AI services (e.g., speech recognition, machine translation). A deep …

Error compensated quantized SGD and its applications to large-scale distributed optimization

J Wu, W Huang, J Huang… - … conference on machine …, 2018 - proceedings.mlr.press
Large-scale distributed optimization is of great importance in various applications. For data-
parallel based distributed learning, the inter-node gradient communication often becomes …
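The snippet is truncated, but the mechanism named in the title — error-compensated quantization — can be sketched: each step's quantization error is stored and added back to the next gradient before quantizing, so errors cancel over time instead of accumulating. The following is a minimal NumPy sketch with a simple uniform quantizer; the factory name, `levels` parameter, and quantizer choice are our assumptions, not the paper's exact scheme.

```python
import numpy as np

def make_ec_quantizer(shape, levels=4):
    """Error-compensated gradient quantizer (illustrative sketch).

    Keeps a per-tensor residual: the error left over from quantizing one
    step is fed back into the next step's input.
    """
    residual = np.zeros(shape)

    def quantize(grad):
        nonlocal residual
        v = grad + residual                        # compensate last step's error
        s = np.max(np.abs(v))
        if s == 0.0:
            residual[:] = 0.0
            return np.zeros_like(v)
        q = np.round(v / s * levels) / levels * s  # 2*levels+1 uniform levels
        residual = v - q                           # carry this step's error forward
        return q

    return quantize
```

By telescoping, the sum of quantized gradients over T steps equals the sum of true gradients minus one bounded residual, so the long-run bias stays bounded even at aggressive quantization levels.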

Combination of short-term load forecasting models based on a stacking ensemble approach

J Moon, S Jung, J Rew, S Rho, E Hwang - Energy and Buildings, 2020 - Elsevier
Building electric energy consumption forecasting is essential in establishing an energy
operation strategy for building energy management systems. Because of recent …

Autofl: Enabling heterogeneity-aware energy efficient federated learning

YG Kim, CJ Wu - MICRO-54: 54th Annual IEEE/ACM International …, 2021 - dl.acm.org
Federated learning enables a cluster of decentralized mobile devices at the edge to
collaboratively train a shared machine learning model, while keeping all the raw training …

Online job scheduling in distributed machine learning clusters

Y Bao, Y Peng, C Wu, Z Li - IEEE INFOCOM 2018-IEEE …, 2018 - ieeexplore.ieee.org
Nowadays large-scale distributed machine learning systems have been deployed to support
various analytics and intelligence services in IT firms. To train a large dataset and derive the …

λDNN: Achieving Predictable Distributed DNN Training With Serverless Architectures

F Xu, Y Qin, L Chen, Z Zhou… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Serverless computing is becoming a promising paradigm for Distributed Deep Neural
Network (DDNN) training in the cloud, as it allows users to decompose complex model …