FedScale: Benchmarking model and system performance of federated learning at scale

F Lai, Y Dai, S Singapuram, J Liu… - International …, 2022 - proceedings.mlr.press
We present FedScale, a federated learning (FL) benchmarking suite with realistic datasets
and a scalable runtime to enable reproducible FL research. FedScale datasets encompass …

Oort: Efficient federated learning via guided participant selection

F Lai, X Zhu, HV Madhyastha… - 15th USENIX Symposium …, 2021 - usenix.org
Federated Learning (FL) is an emerging direction in distributed machine learning (ML) that
enables in-situ model training and testing on edge data. Despite having the same end goals …

Skyplane: Optimizing transfer cost and throughput using cloud-aware overlays

P Jain, S Kumar, S Wooders, SG Patil… - … USENIX Symposium on …, 2023 - usenix.org
Cloud applications are increasingly distributing data across multiple regions and cloud
providers. Unfortunately, wide-area bulk data transfers are often slow, bottlenecking …

Auxo: Efficient federated learning via scalable client clustering

J Liu, F Lai, Y Dai, A Akella, HV Madhyastha… - Proceedings of the …, 2023 - dl.acm.org
Federated learning (FL) is an emerging machine learning (ML) paradigm that enables
heterogeneous edge devices to collaboratively train ML models without revealing their raw …

Fault-tolerant replication with pull-based consensus in MongoDB

S Zhou, S Mu - 18th USENIX Symposium on Networked Systems …, 2021 - usenix.org
In this paper, we present the design and implementation of strongly consistent replication in
MongoDB. MongoDB provides linearizability and tolerates any minority of failures through a …

Network cost-aware geo-distributed data analytics system

K Oh, M Zhang, A Chandra… - IEEE Transactions on …, 2021 - ieeexplore.ieee.org
Many geo-distributed data analytics (GDA) systems have focused on the network
performance-bottleneck: inter-data center network bandwidth to improve performance …

WASP: Wide-area adaptive stream processing

A Jonathan, A Chandra, J Weissman - Proceedings of the 21st …, 2020 - dl.acm.org
Adaptability is critical for stream processing systems to ensure stable, low-latency, and high-
throughput processing of long-running queries. Such adaptability is particularly challenging …

Sol: Fast distributed computation over slow networks

F Lai, J You, X Zhu, HV Madhyastha… - … USENIX Symposium on …, 2020 - usenix.org
The popularity of big data and AI has led to many optimizations at different layers of
distributed computation stacks. Despite, or perhaps because of, its role as the narrow waist …

Efficient inter-datacenter AllReduce with multiple trees

S Luo, R Wang, H Xing - IEEE Transactions on Network …, 2024 - ieeexplore.ieee.org
In this paper, we look into the problem of achieving efficient inter-datacenter AllReduce
operations for geo-distributed machine learning (Geo-DML). Compared with intra-datacenter …

AggNet: Cost-aware aggregation networks for geo-distributed streaming analytics

D Kumar, S Ahmad, A Chandra… - 2021 IEEE/ACM …, 2021 - ieeexplore.ieee.org
Large-scale real-time analytics services continuously collect and analyze data from end-
user applications and devices distributed around the globe. Such analytics requires data to …