An in-depth analysis of the Slingshot interconnect

D De Sensi, S Di Girolamo… - … Conference for High …, 2020 - ieeexplore.ieee.org
The interconnect is one of the most critical components in large scale computing systems,
and its impact on the performance of applications is going to increase with the system size …

The lightweight distributed metric service: a scalable infrastructure for continuous monitoring of large scale computing systems and applications

A Agelastos, B Allan, J Brandt… - SC'14: Proceedings …, 2014 - ieeexplore.ieee.org
Understanding how resources of High Performance Compute platforms are utilized by
applications both individually and as a composite is key to application and platform …

Diagnosing performance variations in HPC applications using machine learning

O Tuncer, E Ates, Y Zhang, A Turk, J Brandt… - … Conference, ISC High …, 2017 - Springer
With the growing complexity and scale of high performance computing (HPC) systems,
application performance variation has become a significant challenge in efficient and …

An integrated tutorial on InfiniBand, verbs, and MPI

P MacArthur, Q Liu, RD Russell… - … Surveys & Tutorials, 2017 - ieeexplore.ieee.org
This tutorial presents the details of the interconnection network utilized in many high
performance computing (HPC) systems today. “InfiniBand” is the hardware interconnect …

GossipGraD: Scalable deep learning using gossip communication based asynchronous gradient descent

J Daily, A Vishnu, C Siegel, T Warfel… - arXiv preprint arXiv …, 2018 - arxiv.org
In this paper, we present GossipGraD, a gossip communication protocol based Stochastic
Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) algorithms on large-scale …

Online diagnosis of performance variation in HPC systems using machine learning

O Tuncer, E Ates, Y Zhang, A Turk… - … on Parallel and …, 2018 - ieeexplore.ieee.org
As the size and complexity of high performance computing (HPC) systems grow in line with
advancements in hardware and software technology, HPC systems increasingly suffer from …

Is big data performance reproducible in modern cloud networks?

A Uta, A Custura, D Duplyakin, I Jimenez… - … USENIX symposium on …, 2020 - usenix.org
Performance variability has been acknowledged as a problem for over a decade by cloud
practitioners and performance engineers. Yet, our survey of top systems conferences …

Watch out for the bully! Job interference study on dragonfly network

X Yang, J Jenkins, M Mubarak… - SC'16: Proceedings of …, 2016 - ieeexplore.ieee.org
High-radix, low-diameter dragonfly networks will be a common choice in next-generation
supercomputers. Preliminary studies show that random job placement with adaptive routing …

Run-to-run variability on Xeon Phi based Cray XC systems

S Chunduri, K Harms, S Parker, V Morozov… - Proceedings of the …, 2017 - dl.acm.org
The increasing complexity of HPC systems has introduced new sources of variability, which
can contribute to significant differences in run-to-run performance of applications. With …

Analyzing network health and congestion in dragonfly-based supercomputers

A Bhatele, N Jain, Y Livnat, V Pascucci… - 2016 IEEE …, 2016 - ieeexplore.ieee.org
The dragonfly topology is a popular choice for building high-radix, low-diameter, hierarchical
networks with high-bandwidth links. On Cray installations of the dragonfly network, job …