Slow and stale gradients can win the race
Distributed Stochastic Gradient Descent (SGD), when run in a synchronous manner, suffers
from delays in runtime as it waits for the slowest workers (stragglers). Asynchronous …
Optimal server selection for straggler mitigation
The performance of large-scale distributed compute systems is adversely impacted by
stragglers when the execution time of a job is uncertain. To manage stragglers, we consider …
Vision paper: Grand challenges in resilience: Autonomous system resilience through design and runtime measures
In this article, we put forward the substantial challenges in cyber resilience in the domain of
autonomous systems and outline foundational solutions to address these challenges. These …
Straggler mitigation with tiered gradient codes
Coding theoretic techniques have been proposed for synchronous Gradient Descent (GD)
on multiple servers to mitigate stragglers. These techniques provide the flexibility that the job …
Single-forking of coded subtasks for straggler mitigation
Given the unpredictable nature of the nodes in distributed computing systems, some of the
tasks can be significantly delayed. Such delayed tasks are called stragglers. Straggler …
Low latency replication coded storage over memory-constrained servers
We consider a distributed storage system storing a single file, where the file is divided into
equal sized fragments. The fragments are replicated with a common replication factor, and …
Modeling and optimization of latency in erasure-coded storage systems
V Aggarwal, T Lan - arXiv preprint arXiv:2005.10855, 2020 - arxiv.org
As consumers are increasingly engaged in social networking and E-commerce activities,
businesses grow to rely on Big Data analytics for intelligence, and traditional IT …
VidCloud: Joint Stall and Quality Optimization for Video Streaming over Cloud
AO Al-Abbasi, V Aggarwal - … on Modeling and Performance Evaluation of …, 2021 - dl.acm.org
As video-streaming services have expanded and improved, cloud-based video has evolved
into a necessary feature of any successful business for reaching internal and external …
Latency optimal storage and scheduling of replicated fragments for memory constrained servers
We consider the setting of a distributed storage system where a single file is subdivided into
smaller fragments of same size which are then replicated with a common replication factor …
Detection of stragglers and optimal rescheduling of slow running tasks in big data environment using LFCSO-LVQ classifier and enhanced PSO algorithm
This paper plans to implement intelligent techniques in finding straggler tasks along with
speculating their way of execution. Here, the LFCSO-LVQ is proposed to effectively identify …