Communication-efficient distributed deep learning: A comprehensive survey

Z Tang, S Shi, W Wang, B Li, X Chu - arXiv preprint arXiv:2003.06307, 2020 - arxiv.org
Distributed deep learning (DL) has become prevalent in recent years to reduce training time
by leveraging multiple computing devices (e.g., GPUs/TPUs) due to larger models and …

Deep gradient compression: Reducing the communication bandwidth for distributed training

Y Lin, S Han, H Mao, Y Wang, WJ Dally - arXiv preprint arXiv:1712.01887, 2017 - arxiv.org
Large-scale distributed training requires significant communication bandwidth for gradient
exchange that limits the scalability of multi-node training, and requires expensive high …

Tutel: Adaptive mixture-of-experts at scale

C Hwang, W Cui, Y Xiong, Z Yang… - Proceedings of …, 2023 - proceedings.mlsys.org
Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep learning
models to trillion-plus parameters with fixed computational cost. The algorithmic …

The pyramid match kernel: Discriminative classification with sets of image features

K Grauman, T Darrell - … on Computer Vision (ICCV'05) Volume …, 2005 - ieeexplore.ieee.org
Discriminative learning is challenging when examples are sets of features, and the sets vary
in cardinality and lack any sort of meaningful ordering. Kernel-based classification methods …

Optimization of collective communication operations in MPICH

R Thakur, R Rabenseifner… - The International Journal …, 2005 - journals.sagepub.com
We describe our work on improving the performance of collective communication operations
in MPICH for clusters connected by switched networks. For each collective operation, we …

The analysis of a plane wave pseudopotential density functional theory code on a GPU machine

W Jia, Z Cao, L Wang, J Fu, X Chi, W Gao… - Computer Physics …, 2013 - Elsevier
Plane wave pseudopotential (PWP) density functional theory (DFT) calculation is the most
widely used material science simulation, and the PWP DFT codes are arguably the most …

[BOOK][B] High performance visualization: Enabling extreme-scale scientific insight

EW Bethel, H Childs, C Hansen - 2012 - books.google.com
Visualization and analysis tools, techniques, and algorithms have undergone a rapid
evolution in recent decades to accommodate explosive growth in data size and complexity …

Performance analysis of MPI collective operations

J Pješivac-Grbović, T Angskun, G Bosilca, GE Fagg… - Cluster …, 2007 - Springer
Previous studies of application usage show that the performance of collective
communications is critical for high-performance computing. Despite active research in the …

Optimization of collective reduction operations

R Rabenseifner - Computational Science-ICCS 2004: 4th International …, 2004 - Springer
A 5-year profiling in production mode at the University of Stuttgart has shown that more than
40% of the execution time of Message Passing Interface (MPI) routines is spent in the …

A unified coded deep neural network training strategy based on generalized polydot codes

S Dutta, Z Bai, H Jeong, TM Low… - 2018 IEEE International …, 2018 - ieeexplore.ieee.org
This paper has two main contributions. First, we propose a novel coding technique-
Generalized PolyDot-for matrix-vector products that advances on existing techniques for …