A survey of power and energy predictive models in HPC systems and applications
Power and energy efficiency are now critical concerns in extreme-scale high-performance
scientific computing. Many extreme-scale computing systems today (for example: Top500) …
Performance analysis of MPI collective operations
Previous studies of application usage show that the performance of collective
communications is critical for high-performance computing. Despite active research in the …
SparCML: High-performance sparse communication for machine learning
Applying machine learning techniques to the quickly growing data in science and industry
requires highly-scalable algorithms. Large datasets are most commonly processed "data …
Optimization of collective reduction operations
R Rabenseifner - Computational Science-ICCS 2004: 4th International …, 2004 - Springer
A 5-year-profiling in production mode at the University of Stuttgart has shown that more than
40% of the execution time of Message Passing Interface (MPI) routines is spent in the …
Parallel matrix factorization for recommender systems
Matrix factorization, when the matrix has missing values, has become one of the leading
techniques for recommender systems. To handle web-scale datasets with millions of users …
Optimization of MPI collective communication on BlueGene/L systems
G Almási, P Heidelberger, CJ Archer… - Proceedings of the 19th …, 2005 - dl.acm.org
BlueGene/L is currently the world's fastest supercomputer. It consists of a large number of
low power dual-processor compute nodes interconnected by high speed torus and collective …
Spectral embedded generalized mean based k-nearest neighbors clustering with s-distance
The spectral clustering algorithm is extensively employed in different aspects, especially in
the field of pattern recognition. However, the efficient construction of the neighborhood …
Dear: Accelerating distributed deep learning with fine-grained all-reduce pipelining
Communication scheduling has been shown to be effective in accelerating distributed
training, which enables all-reduce communications to be overlapped with backpropagation …
NUMA-aware shared-memory collective communication for MPI
As the number of cores per node keeps increasing, it becomes increasingly important for
MPI to leverage shared memory for intranode communication. This paper investigates the …
MPI support for multi-core architectures: Optimized shared memory collectives
RL Graham, G Shipman - Recent Advances in Parallel Virtual Machine …, 2008 - Springer
With local core counts on the rise, taking advantage of shared-memory to optimize collective
operations can improve performance. We study several on-host shared memory optimized …