A comprehensive survey on coded distributed computing: Fundamentals, challenges, and networking applications

JS Ng, WYB Lim, NC Luong, Z Xiong… - … Surveys & Tutorials, 2021 - ieeexplore.ieee.org
Distributed computing has become a common approach for large-scale computation tasks
due to benefits such as high reliability, scalability, computation speed, and cost …

Distributed data management using MapReduce

F Li, BC Ooi, MT Özsu, S Wu - ACM Computing Surveys (CSUR), 2014 - dl.acm.org
MapReduce is a framework for processing and managing large-scale datasets in a
distributed cluster, which has been used for applications such as generating search indexes …

Ray: A distributed framework for emerging AI applications

P Moritz, R Nishihara, S Wang, A Tumanov… - … USENIX symposium on …, 2018 - usenix.org
The next generation of AI applications will continuously interact with the environment and
learn from these interactions. These applications impose new and demanding systems …

Resource management with deep reinforcement learning

H Mao, M Alizadeh, I Menache, S Kandula - Proceedings of the 15th …, 2016 - dl.acm.org
Resource management problems in systems and networking often manifest as difficult
online decision making tasks where appropriate solutions depend on understanding the …

Speeding up distributed machine learning using codes

K Lee, M Lam, R Pedarsani… - IEEE Transactions …, 2017 - ieeexplore.ieee.org
Codes are widely used in many engineering applications to offer robustness against noise.
In large-scale systems, there are several types of noise that can affect the performance of …

Shuffling, fast and slow: Scalable analytics on serverless infrastructure

Q Pu, S Venkataraman, I Stoica - 16th USENIX symposium on networked …, 2019 - usenix.org
Serverless computing is poised to fulfill the long-held promise of transparent elasticity and
millisecond-level pricing. To achieve this goal, service providers impose a fine-grained …

Ernest: Efficient performance prediction for large-scale advanced analytics

S Venkataraman, Z Yang, M Franklin, B Recht… - … USENIX Symposium on …, 2016 - usenix.org
Recent workload trends indicate rapid growth in the deployment of machine learning,
genomics and scientific workloads on cloud computing infrastructure. However, efficiently …

Quasar: Resource-efficient and QoS-aware cluster management

C Delimitrou, C Kozyrakis - ACM Sigplan Notices, 2014 - dl.acm.org
Cloud computing promises flexibility and high performance for users and high cost-efficiency
for operators. Nevertheless, most cloud facilities operate at very low utilization, hurting both …

Efficient coflow scheduling with Varys

M Chowdhury, Y Zhong, I Stoica - … of the 2014 ACM conference on …, 2014 - dl.acm.org
Communication in data-parallel applications often involves a collection of parallel flows.
Traditional techniques to optimize flow-level metrics do not perform well in optimizing such …

Sparrow: Distributed, low latency scheduling

K Ousterhout, P Wendell, M Zaharia… - Proceedings of the twenty …, 2013 - dl.acm.org
Large-scale data analytics frameworks are shifting towards shorter task durations and larger
degrees of parallelism to provide low latency. Scheduling highly parallel jobs that complete …