The evolution of distributed systems for graph neural networks and their origin in graph processing and deep learning: A survey
Graph neural networks (GNNs) are an emerging research field. This specialized deep
neural network architecture is capable of processing graph-structured data and bridges the …
Efficient sparse collective communication and its application to accelerate distributed deep learning
Efficient collective communication is crucial to parallel-computing applications such as
distributed training of large-scale recommendation systems and natural language …
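To make the traffic saving concrete: gradients in workloads such as recommendation-model embeddings are mostly zero, so a sparse collective can ship (index, value) pairs instead of full dense tensors. Below is a minimal illustrative sketch of the aggregation such a collective has to perform and the per-worker byte counts involved; the function names, worker count, and sizes are assumptions for illustration, not the protocol from the paper.

```python
import numpy as np

def dense_bytes(dim: int, dtype_bytes: int = 4) -> int:
    return dim * dtype_bytes

def sparse_bytes(nnz: int, dtype_bytes: int = 4, index_bytes: int = 4) -> int:
    return nnz * (dtype_bytes + index_bytes)

def aggregate(contributions, dim: int) -> np.ndarray:
    """Sum per-worker (indices, values) pairs into one dense tensor.
    A real sparse collective would do this in-network or in a streaming
    fashion; this only shows the arithmetic it has to perform."""
    out = np.zeros(dim, dtype=np.float32)
    for idx, vals in contributions:
        np.add.at(out, idx, vals)  # overlapping indices simply accumulate
    return out

# Toy example: embedding-style gradients where <1% of a 2M-entry table is touched.
dim, nnz, workers = 2_000_000, 10_000, 8
contribs = []
for _ in range(workers):
    idx = np.random.choice(dim, size=nnz, replace=False)
    contribs.append((idx, np.random.randn(nnz).astype(np.float32)))
total = aggregate(contribs, dim)
print(f"dense: {dense_bytes(dim):,} B/worker, sparse: {sparse_bytes(nnz):,} B/worker")
```

With these toy sizes each worker ships roughly 80 KB instead of 8 MB per tensor.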
Unlocking the power of inline floating-point operations on programmable switches
The advent of switches with programmable dataplanes has enabled the rapid development
of new network functionality, as well as providing a platform for acceleration of a broad …
Distributed artificial intelligence: Taxonomy, review, framework, and reference architecture
Artificial intelligence (AI) research and market have grown rapidly in the last few years, and
this trend is expected to continue with many potential advancements and innovations in this …
Time-correlated sparsification for communication-efficient federated learning
Federated learning (FL) enables multiple clients to collaboratively train a shared model, with
the help of a parameter server (PS), without disclosing their local datasets. However, due to …
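As a rough illustration of communication-efficient uploads in this setting, the sketch below shows a client that sends only the top-k coordinates of its update to the parameter server each round and carries the unsent residual forward (error feedback). This is a generic baseline under assumed names and sizes, not the time-correlated sparsification scheme the paper proposes.

```python
import numpy as np

class Client:
    """Minimal sketch of a client that uploads sparse updates each FL round,
    carrying unsent coordinates forward as a residual (error feedback)."""

    def __init__(self, dim: int, k: int):
        self.k, self.residual = k, np.zeros(dim)

    def upload(self, local_update: np.ndarray):
        corrected = local_update + self.residual         # re-inject unsent mass
        keep = np.argpartition(np.abs(corrected), -self.k)[-self.k:]
        payload = (keep, corrected[keep])                # what goes to the PS
        self.residual = corrected
        self.residual[keep] = 0.0                        # everything else waits
        return payload

# One round with 3 clients; the PS averages the sparse uploads into a dense delta.
dim, k = 50_000, 500
ps_delta, clients = np.zeros(dim), [Client(dim, k) for _ in range(3)]
for c in clients:
    idx, vals = c.upload(np.random.randn(dim) * 0.01)
    np.add.at(ps_delta, idx, vals)
ps_delta /= len(clients)
```

The residual is what gives the scheme memory across rounds: coordinates that keep losing the top-k selection accumulate until they are eventually transmitted.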
Gemini: Fast failure recovery in distributed training with in-memory checkpoints
Large deep learning models have recently garnered substantial attention from both
academia and industry. Nonetheless, frequent failures are observed during large model …
Orion: Interference-aware, Fine-grained GPU Sharing for ML Applications
GPUs are critical for maximizing the throughput-per-Watt of deep neural network (DNN)
applications. However, DNN applications often underutilize GPUs, even when using large …
PipeSwitch: Fast pipelined context switching for deep learning applications
Deep learning (DL) workloads include throughput-intensive training tasks and latency-
sensitive inference tasks. The dominant practice today is to provision dedicated GPU …
Dynamic and adaptive fault-tolerant asynchronous federated learning using volunteer edge devices
The number of devices, from smartphones to IoT hardware, interconnected via the Internet is
growing all the time. These devices produce a large amount of data that cannot be analyzed …
On the utility of gradient compression in distributed training systems
A rich body of prior work has highlighted the existence of communication bottlenecks in
synchronous data-parallel training. To alleviate these bottlenecks, a long line of recent …
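Whether compression helps depends on how the saved communication time compares with the encode/decode overhead and with the compute it can overlap. Below is a hedged back-of-envelope model; all numbers, and the `step_time` helper, are illustrative assumptions rather than measurements from the paper.

```python
def step_time(model_bytes: float, bandwidth_gbps: float,
              compression_ratio: float = 1.0, codec_overhead_s: float = 0.0,
              compute_s: float = 0.0, overlap: bool = True) -> float:
    """Back-of-envelope per-step time for data-parallel training."""
    comm_s = (model_bytes / compression_ratio) / (bandwidth_gbps * 1e9 / 8)
    comm_s += codec_overhead_s                       # encode/decode cost
    return max(compute_s, comm_s) if overlap else compute_s + comm_s

# Hypothetical 1.2 GB of gradients per step and 100 ms of backprop to overlap with.
dense = step_time(1.2e9, bandwidth_gbps=25, compute_s=0.100)
sparse = step_time(1.2e9, bandwidth_gbps=25, compression_ratio=100,
                   codec_overhead_s=0.080, compute_s=0.100)
print(f"no compression: {dense*1e3:.0f} ms/step, 100x compression: {sparse*1e3:.0f} ms/step")
```

With the assumed 25 Gbps link, the dense exchange dominates (~384 ms per step) while 100x compression hides communication behind compute; on a faster interconnect the codec overhead alone could erase the benefit, which is the kind of trade-off this line of work examines.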