Communication-efficient distributed deep learning: A comprehensive survey
Distributed deep learning (DL) has become prevalent in recent years to reduce training time
by leveraging multiple computing devices (e.g., GPUs/TPUs) due to larger models and …
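To make the baseline concrete, here is a minimal sketch of the synchronous data-parallel pattern such surveys analyze: every worker computes local gradients, then gradients are averaged with an allreduce before the optimizer step. The torch.distributed calls are real PyTorch APIs, but the training-loop names are illustrative and assume the process group is already initialized.

```python
# Minimal sketch of synchronous data-parallel SGD: each worker computes
# local gradients, then gradients are averaged across workers before the
# optimizer step. Assumes torch.distributed is already initialized (e.g.
# via torchrun); model/data names are illustrative.
import torch
import torch.distributed as dist

def train_step(model, optimizer, loss_fn, inputs, targets):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    # Communication step: average gradients over all workers.
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
    optimizer.step()
    return loss.item()
```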
Deep gradient compression: Reducing the communication bandwidth for distributed training
Large-scale distributed training requires significant communication bandwidth for gradient exchange, which limits the scalability of multi-node training and requires expensive high …
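The snippet names the bottleneck (gradient exchange); a core mechanism the DGC paper builds on is top-k sparsification with local residual accumulation. The sketch below shows only that idea under assumed names (TopKCompressor, ratio); it omits DGC's momentum correction and warm-up refinements.

```python
# Sketch of the core idea behind gradient compression schemes like DGC:
# transmit only the largest-magnitude gradient entries and accumulate the
# rest locally as a residual for later rounds.
import torch

class TopKCompressor:
    def __init__(self, ratio=0.01):
        self.ratio = ratio          # fraction of entries transmitted
        self.residual = None        # locally accumulated, unsent gradient

    def compress(self, grad):
        flat = grad.flatten()
        if self.residual is None:
            self.residual = torch.zeros_like(flat)
        acc = flat + self.residual
        k = max(1, int(self.ratio * acc.numel()))
        _, indices = torch.topk(acc.abs(), k)
        values = acc[indices]
        self.residual = acc.clone()
        self.residual[indices] = 0.0  # keep only what was not sent
        return indices, values        # sparse payload to communicate
```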
Tutel: Adaptive mixture-of-experts at scale
Sparsely-gated mixture-of-experts (MoE) has been widely adopted to scale deep learning
models to trillion-plus parameters with fixed computational cost. The algorithmic …
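A minimal sketch of the sparsely-gated routing that gives MoE its fixed computational cost: each token is dispatched to only k of E experts, so parameters grow with E while per-token compute does not. Names and shapes are illustrative assumptions, not Tutel's implementation.

```python
# Top-k expert routing: score each token against E experts, keep the k
# best, and combine their outputs weighted by the renormalized gates.
import torch
import torch.nn.functional as F

def topk_gate(tokens, gate_weight, experts, k=2):
    # tokens: (n, d); gate_weight: (d, E); experts: list of E modules.
    logits = tokens @ gate_weight                   # (n, E) routing scores
    probs = F.softmax(logits, dim=-1)
    topv, topi = torch.topk(probs, k, dim=-1)       # k experts per token
    topv = topv / topv.sum(dim=-1, keepdim=True)    # renormalize weights
    out = torch.zeros_like(tokens)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = topi[:, slot] == e
            if mask.any():
                out[mask] += topv[mask, slot:slot+1] * expert(tokens[mask])
    return out
```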
The pyramid match kernel: Discriminative classification with sets of image features
Discriminative learning is challenging when examples are sets of features, and the sets vary
in cardinality and lack any sort of meaningful ordering. Kernel-based classification methods …
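For intuition, here is a sketch of the pyramid match idea on 1-D feature sets: histogram both sets at several resolutions, count matches via histogram intersection, and weight matches found at finer levels more heavily. The actual kernel handles d-dimensional features; the function name and parameters here are assumptions for illustration.

```python
# Pyramid match over 1-D feature sets: bins double in width at each
# coarser level, and only matches that first appear at a level are
# credited, with weight halving as resolution coarsens.
import numpy as np

def pyramid_match(x, y, levels=4, lo=0.0, hi=1.0):
    score, prev = 0.0, 0.0
    for i in range(levels):                 # finest level first
        bins = 2 ** (levels - i)            # bin count halves each level
        hx, _ = np.histogram(x, bins=bins, range=(lo, hi))
        hy, _ = np.histogram(y, bins=bins, range=(lo, hi))
        inter = np.minimum(hx, hy).sum()    # matches at this resolution
        score += (inter - prev) / (2 ** i)  # credit only new matches
        prev = inter
    return score
```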
Optimization of collective communication operations in MPICH
R Thakur, R Rabenseifner… - The International Journal …, 2005 - journals.sagepub.com
We describe our work on improving the performance of collective communication operations
in MPICH for clusters connected by switched networks. For each collective operation, we …
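One algorithm this line of work uses for short-message allreduce is recursive doubling: in round r, rank p exchanges its partial sum with rank p XOR 2^r, finishing in log2(P) rounds. The sketch below simulates it over in-memory buffers for a power-of-two rank count; the function name and test values are illustrative.

```python
import numpy as np

def allreduce_recursive_doubling(bufs):
    P = len(bufs)                       # number of ranks, power of two
    data = [b.copy() for b in bufs]
    step = 1
    while step < P:
        new = [None] * P
        for p in range(P):
            partner = p ^ step          # pairwise exchange partner
            new[p] = data[p] + data[partner]
        data, step = new, step * 2
    return data                         # every rank holds the full sum

sums = allreduce_recursive_doubling([np.ones(4) * r for r in range(4)])
assert all(np.array_equal(s, np.full(4, 6.0)) for s in sums)
```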
The analysis of a plane wave pseudopotential density functional theory code on a GPU machine
Plane wave pseudopotential (PWP) density functional theory (DFT) calculation is the most
widely used material science simulation, and the PWP DFT codes are arguably the most …
[BOOK][B] High performance visualization: Enabling extreme-scale scientific insight
Visualization and analysis tools, techniques, and algorithms have undergone a rapid
evolution in recent decades to accommodate explosive growth in data size and complexity …
Performance analysis of MPI collective operations
Previous studies of application usage show that the performance of collective
communications is critical for high-performance computing. Despite active research in the …
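As a rough illustration of how such studies measure collectives, here is a minimal timing sketch using mpi4py (a real binding, though this particular benchmark layout and its message sizes are assumptions):

```python
# Time MPI_Allreduce over a range of message sizes; run with e.g.
# `mpiexec -n 8 python bench.py`. Reports the max time over ranks.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
for n in (2**10, 2**15, 2**20):         # message sizes in doubles
    send = np.ones(n)
    recv = np.empty(n)
    comm.Barrier()                      # align ranks before timing
    t0 = MPI.Wtime()
    comm.Allreduce(send, recv, op=MPI.SUM)
    elapsed = comm.allreduce(MPI.Wtime() - t0, op=MPI.MAX)
    if comm.rank == 0:
        print(f"{n:>8} doubles: {elapsed * 1e6:.1f} us (max over ranks)")
```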
Optimization of collective reduction operations
R Rabenseifner - Computational Science-ICCS 2004: 4th International …, 2004 - Springer
A five-year profiling study in production mode at the University of Stuttgart has shown that more than
40% of the execution time of Message Passing Interface (MPI) routines is spent in the …
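The reduction scheme associated with this work combines a reduce-scatter phase with an allgather, which is bandwidth-optimal for long vectors. The sketch below simulates the two phases over in-memory buffers; a real implementation performs each phase with log2(P) pairwise exchanges, and all names here are illustrative.

```python
import numpy as np

def allreduce_reduce_scatter(bufs):
    P = len(bufs)
    chunks = [np.array_split(b, P) for b in bufs]   # P slices per rank
    # Phase 1 (reduce-scatter): rank p ends up owning the sum of slice p.
    owned = [sum(chunks[q][p] for q in range(P)) for p in range(P)]
    # Phase 2 (allgather): every rank collects all reduced slices.
    full = np.concatenate(owned)
    return [full.copy() for _ in range(P)]

out = allreduce_reduce_scatter([np.full(8, float(r)) for r in range(4)])
assert np.array_equal(out[0], np.full(8, 6.0))
```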
A unified coded deep neural network training strategy based on generalized polydot codes
This paper has two main contributions. First, we propose a novel coding technique, Generalized PolyDot, for matrix-vector products that advances on existing techniques for …
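To illustrate coded computation in this spirit, here is a simplified polynomial-coded matrix-vector product: row blocks of A are encoded as evaluations of a matrix polynomial, so A @ b can be recovered from any m of n worker results, tolerating n - m stragglers. This is the basic polynomial-code idea, not the paper's Generalized PolyDot construction; all names and parameters are assumptions.

```python
import numpy as np

def coded_matvec(A, b, m=2, n=4):
    blocks = np.split(A, m)                       # m row blocks of A
    xs = np.arange(1, n + 1, dtype=float)         # evaluation points
    # Encode: worker i holds sum_j blocks[j] * xs[i]**j.
    encoded = [sum(B * x**j for j, B in enumerate(blocks)) for x in xs]
    results = {i: encoded[i] @ b for i in range(n)}   # workers compute
    # Decode from any m results (here: the first m "fast" workers).
    fast = sorted(results)[:m]
    V = np.vander(xs[fast], m, increasing=True)   # Vandermonde system
    Y = np.stack([results[i] for i in fast])      # (m, rows_per_block)
    decoded = np.linalg.solve(V, Y)               # row j = blocks[j] @ b
    return decoded.reshape(-1)

A = np.arange(16.0).reshape(4, 4)
b = np.ones(4)
assert np.allclose(coded_matvec(A, b), A @ b)
```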