A survey of power and energy predictive models in HPC systems and applications

K O'Brien, I Pietri, R Reddy, A Lastovetsky… - ACM Computing …, 2017 - dl.acm.org
Power and energy efficiency are now critical concerns in extreme-scale high-performance
scientific computing. Many extreme-scale computing systems today (for example: Top500) …

Performance analysis of MPI collective operations

J Pješivac-Grbović, T Angskun, G Bosilca, GE Fagg… - Cluster …, 2007 - Springer
Previous studies of application usage show that the performance of collective
communications are critical for high-performance computing. Despite active research in the …

SparCML: High-performance sparse communication for machine learning

C Renggli, S Ashkboos, M Aghagolzadeh… - Proceedings of the …, 2019 - dl.acm.org
Applying machine learning techniques to the quickly growing data in science and industry
requires highly-scalable algorithms. Large datasets are most commonly processed "data …

Optimization of collective reduction operations

R Rabenseifner - Computational Science-ICCS 2004: 4th International …, 2004 - Springer
A 5-year-profiling in production mode at the University of Stuttgart has shown that more than
40% of the execution time of Message Passing Interface (MPI) routines is spent in the …

Parallel matrix factorization for recommender systems

HF Yu, CJ Hsieh, S Si, IS Dhillon - Knowledge and Information Systems, 2014 - Springer
Matrix factorization, when the matrix has missing values, has become one of the leading
techniques for recommender systems. To handle web-scale datasets with millions of users …

Optimization of MPI collective communication on BlueGene/L systems

G Almási, P Heidelberger, CJ Archer… - Proceedings of the 19th …, 2005 - dl.acm.org
BlueGene/L is currently the world's fastest supercomputer. It consists of a large number of
low power dual-processor compute nodes interconnected by high speed torus and collective …

Spectral embedded generalized mean based k-nearest neighbors clustering with s-distance

KK Sharma, A Seal - Expert Systems with Applications, 2021 - Elsevier
The spectral clustering algorithm is extensively employed in different aspects, especially in
the field of pattern recognition. However, the efficient construction of the neighborhood …

DeAR: Accelerating distributed deep learning with fine-grained all-reduce pipelining

L Zhang, S Shi, X Chu, W Wang, B Li… - 2023 IEEE 43rd …, 2023 - ieeexplore.ieee.org
Communication scheduling has been shown to be effective in accelerating distributed
training, which enables all-reduce communications to be overlapped with backpropagation …

NUMA-aware shared-memory collective communication for MPI

S Li, T Hoefler, M Snir - … of the 22nd international symposium on High …, 2013 - dl.acm.org
As the number of cores per node keeps increasing, it becomes increasingly important for
MPI to leverage shared memory for intranode communication. This paper investigates the …

MPI support for multi-core architectures: Optimized shared memory collectives

RL Graham, G Shipman - Recent Advances in Parallel Virtual Machine …, 2008 - Springer
With local core counts on the rise, taking advantage of shared-memory to optimize collective
operations can improve performance. We study several on-host shared memory optimized …