Communication lower bounds and optimal algorithms for numerical linear algebra

G Ballard, E Carson, J Demmel, M Hoemmen… - Acta Numerica, 2014 - cambridge.org
The traditional metric for the efficiency of a numerical algorithm has been the number of
arithmetic operations it performs. Technological trends have long been reducing the time to …

Faster all-pairs shortest paths via circuit complexity

R Williams - Proceedings of the forty-sixth annual ACM symposium …, 2014 - dl.acm.org
We present a new randomized method for computing the min-plus product (aka, tropical
product) of two n × n matrices, yielding a faster algorithm for solving the all-pairs shortest …

AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs

Q Wang, X Zhang, Y Zhang, Q Yi - Proceedings of the international …, 2013 - dl.acm.org
Basic Linear Algebra Subprograms (BLAS) is a fundamental library in scientific computing. In
this paper, we present a template-based optimization framework, AUGEM, which can …

Algebraic methods in the congested clique

K Censor-Hillel, P Kaski, JH Korhonen… - Proceedings of the …, 2015 - dl.acm.org
In this work, we use algebraic methods for studying distance computation and subgraph
detection tasks in the congested clique model. Specifically, we adapt parallel matrix …

Communication-optimal parallel recursive rectangular matrix multiplication

J Demmel, D Eliahu, A Fox, S Kamil… - 2013 IEEE 27th …, 2013 - ieeexplore.ieee.org
Communication-optimal algorithms are known for square matrix multiplication. Here, we
obtain the first communication-optimal algorithm for all dimensions of rectangular matrices …

A framework for practical parallel fast matrix multiplication

AR Benson, G Ballard - ACM SIGPLAN Notices, 2015 - dl.acm.org
Matrix multiplication is a fundamental computation in many scientific disciplines. In this
paper, we show that novel fast matrix multiplication algorithms can significantly outperform …

Graph expansion and communication costs of fast matrix multiplication

G Ballard, J Demmel, O Holtz, O Schwartz - Journal of the ACM (JACM), 2013 - dl.acm.org
The communication cost of algorithms (also known as I/O-complexity) is shown to be closely
related to the expansion properties of the corresponding computation graphs. We …

Communication optimal parallel multiplication of sparse random matrices

G Ballard, A Buluc, J Demmel, L Grigori… - Proceedings of the …, 2013 - dl.acm.org
Parallel algorithms for sparse matrix-matrix multiplication typically spend most of their time
on inter-processor communication rather than on computation, and hardware trends predict …

Pebbling Game and Alternative Basis for High Performance Matrix Multiplication

O Schwartz, N Vaknin - SIAM Journal on Scientific Computing, 2023 - SIAM
Matrix multiplication is one of the most extensively used kernels in scientific computing.
Although subcubic algorithms exist, most high performance implementations are based on …

Scalable graph convolutional network training on distributed-memory systems

GV Demirci, A Haldar, H Ferhatosmanoglu - arXiv preprint arXiv …, 2022 - arxiv.org
Graph Convolutional Networks (GCNs) are extensively utilized for deep learning on graphs.
The large data sizes of graphs and their vertex features make scalable training algorithms …