Distributed artificial intelligence empowered by end-edge-cloud computing: A survey
As the computing paradigm shifts from cloud computing to end-edge-cloud computing, it
also enables artificial intelligence to evolve from a centralized into a distributed paradigm …
Edge-cloud polarization and collaboration: A comprehensive survey for AI
Influenced by the great success of deep learning via cloud computing and the rapid
development of edge chips, research in artificial intelligence (AI) has shifted to both of the …
Pre-trained models: Past, present and future
Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved
great success and become a milestone in the field of artificial intelligence (AI). Owing to …
Scaling distributed machine learning with in-network aggregation
Training machine learning models in parallel is an increasingly important workload. We
accelerate distributed parallel training by designing a communication primitive that uses a …
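
The core idea here is to aggregate gradients inside the network instead of at end hosts. Below is a minimal single-process sketch of that aggregation primitive; the names (SwitchAggregator, contribute) and the per-slot sum/count state are illustrative assumptions, not the paper's actual interface, and a real design runs on programmable switch hardware with fixed-point arithmetic and scarce slot memory.

# Hypothetical single-process emulation of switch-style aggregation.
class SwitchAggregator:
    def __init__(self, num_workers: int, num_slots: int):
        self.num_workers = num_workers
        # Each slot keeps only a partial sum and a contribution count,
        # mirroring the limited per-slot state a switch can hold.
        self.sums = [0] * num_slots
        self.counts = [0] * num_slots

    def contribute(self, slot: int, value: int):
        """Add one worker's value; return the full sum once all arrive."""
        self.sums[slot] += value
        self.counts[slot] += 1
        if self.counts[slot] == self.num_workers:
            result = self.sums[slot]
            # Reset the slot for reuse, since switch memory is far
            # smaller than a full gradient.
            self.sums[slot], self.counts[slot] = 0, 0
            return result
        return None

# Three workers push one gradient element each; the "switch" emits
# the aggregate only after the last contribution arrives.
agg = SwitchAggregator(num_workers=3, num_slots=1)
for grad in (2, 5, 7):
    total = agg.contribute(0, grad)
print(total)  # 14
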
Decentralized training of foundation models in heterogeneous environments
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive, often
involving tens of thousands of GPUs running continuously for months. These models are …
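
One recurring sub-problem in this setting is placing pipeline stages on devices of unequal speed. The sketch below shows a simple greedy placement that minimizes the bottleneck stage time; the device names, relative speeds, and the greedy rule itself are assumptions for illustration, not the paper's actual allocation algorithm.

def assign_stages(stage_costs, device_speeds):
    """Greedily give the most expensive stage to the fastest device."""
    stages = sorted(range(len(stage_costs)), key=lambda s: -stage_costs[s])
    devices = sorted(device_speeds, key=device_speeds.get, reverse=True)
    placement = dict(zip(stages, devices))
    # Pipeline throughput is limited by the slowest stage.
    bottleneck = max(stage_costs[s] / device_speeds[d]
                     for s, d in placement.items())
    return placement, bottleneck

# Four transformer stages (relative cost) over heterogeneous GPUs.
placement, t = assign_stages(
    stage_costs=[4.0, 2.0, 2.0, 1.0],
    device_speeds={"A100": 4.0, "V100": 2.0, "T4": 1.0, "T4b": 1.0},
)
print(placement, round(t, 2))  # stage 2 on a T4 is the bottleneck
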
ATP: In-network aggregation for multi-tenant learning
Distributed deep neural network training (DT) systems are widely deployed in clusters where
the network is shared across multiple tenants, i.e., multiple DT jobs. Each DT job computes …
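
ATP's distinguishing constraint is that switch aggregation resources are shared across tenants, so a job must make progress whether or not it wins a slot. A toy sketch of that best-effort pattern follows, with hypothetical names (SharedSwitch, try_reserve) standing in for the real protocol.

class SharedSwitch:
    """Scarce in-switch aggregator slots shared by all tenants."""
    def __init__(self, num_slots: int):
        self.free_slots = num_slots

    def try_reserve(self) -> bool:
        if self.free_slots > 0:
            self.free_slots -= 1
            return True
        return False  # slots held by other tenants' DT jobs

def aggregate(grads, switch):
    # Fast path: sum on the switch; otherwise fall back to host-side
    # aggregation, which preserves correctness at lower speed.
    path = "in-switch" if switch.try_reserve() else "host-fallback"
    return path, sum(grads)

switch = SharedSwitch(num_slots=1)
print(aggregate([1, 2, 3], switch))  # ('in-switch', 6)
print(aggregate([4, 5, 6], switch))  # ('host-fallback', 15)
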
RDMA over Ethernet for distributed training at Meta scale
The rapid growth in both computational density and scale in AI models in recent years
motivates the construction of an efficient and reliable dedicated network infrastructure. This …
Power-aware Deep Learning Model Serving with μ-Serve
With the increasing popularity of large deep learning model-serving workloads, there is a
pressing need to reduce the energy consumption of a model-serving cluster while …
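
The underlying lever is that serving latency often has slack against its SLO, which can be traded for a lower GPU clock and thus lower power. A hedged sketch of that trade-off, assuming latency scales inversely with frequency; the frequency table and latency model are made-up illustrations, not μ-Serve's measured profiles.

def pick_frequency(freqs_mhz, base_latency_ms, slo_ms):
    """Choose the lowest clock whose predicted latency meets the SLO."""
    for f in sorted(freqs_mhz):  # try the lowest-power setting first
        latency = base_latency_ms * max(freqs_mhz) / f
        if latency <= slo_ms:
            return f, latency
    return max(freqs_mhz), base_latency_ms  # SLO demands full speed

freq, lat = pick_frequency(
    freqs_mhz=[900, 1200, 1500], base_latency_ms=40.0, slo_ms=60.0)
print(freq, lat)  # 1200 50.0: meets the 60 ms SLO below peak clock
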
SRNIC: A scalable architecture for RDMA NICs
RDMA is expected to be highly scalable: to perform well in large-scale data center networks
where packet losses are inevitable (i.e., high network scalability), and to support a large …
MAST: Global scheduling of ML training across geo-distributed datacenters at hyperscale
In public clouds, users must manually select a datacenter region to upload their ML training
data and launch ML training workloads in the same region to ensure data and computation …
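
A toy sketch of the placement decision MAST automates: route each training job to a region that already holds its data and has spare accelerators, rather than having users pin a region by hand. The region names, capacities, and scoring rule are assumptions for illustration only.

def place_job(job, regions):
    """Pick a region with enough free GPUs, preferring data locality."""
    candidates = [r for r, info in regions.items()
                  if info["free_gpus"] >= job["gpus"]]
    if not candidates:
        return None  # no region can host the job right now
    # Data locality (zero transfer cost) outranks raw free capacity.
    best = max(candidates,
               key=lambda r: (r == job["data_region"],
                              regions[r]["free_gpus"]))
    regions[best]["free_gpus"] -= job["gpus"]
    return best

regions = {"us-east": {"free_gpus": 128}, "eu-west": {"free_gpus": 512}}
job = {"gpus": 100, "data_region": "us-east"}
print(place_job(job, regions))  # us-east: locality beats spare capacity
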