Slow and stale gradients can win the race

S Dutta, J Wang, G Joshi - IEEE Journal on Selected Areas in …, 2021 - ieeexplore.ieee.org
Distributed Stochastic Gradient Descent (SGD), when run in a synchronous manner, suffers
from delays in runtime as it waits for the slowest workers (stragglers). Asynchronous …

Optimal server selection for straggler mitigation

A Badita, P Parag, V Aggarwal - IEEE/ACM Transactions on …, 2020 - ieeexplore.ieee.org
The performance of large-scale distributed compute systems is adversely impacted by
stragglers when the execution time of a job is uncertain. To manage stragglers, we consider …

Vision paper: Grand challenges in resilience: Autonomous system resilience through design and runtime measures

S Bagchi, V Aggarwal, S Chaterji… - IEEE Open Journal …, 2020 - ieeexplore.ieee.org
In this article, we put forward the substantial challenges in cyber resilience in the domain of
autonomous systems and outline foundational solutions to address these challenges. These …

Straggler mitigation with tiered gradient codes

S Sasi, V Lalitha, V Aggarwal… - IEEE Transactions on …, 2020 - ieeexplore.ieee.org
Coding theoretic techniques have been proposed for synchronous Gradient Descent (GD)
on multiple servers to mitigate stragglers. These techniques provide the flexibility that the job …

Single-forking of coded subtasks for straggler mitigation

A Badita, P Parag, V Aggarwal - IEEE/ACM Transactions on …, 2021 - ieeexplore.ieee.org
Given the unpredictable nature of the nodes in distributed computing systems, some of the
tasks can be significantly delayed. Such delayed tasks are called stragglers. Straggler …

Low latency replication coded storage over memory-constrained servers

R Jinan, A Badita, P Sarvepalli… - 2021 IEEE International …, 2021 - ieeexplore.ieee.org
We consider a distributed storage system storing a single file, where the file is divided into
equal sized fragments. The fragments are replicated with a common replication factor, and …

Modeling and optimization of latency in erasure-coded storage systems

V Aggarwal, T Lan - arXiv preprint arXiv:2005.10855, 2020 - arxiv.org
As consumers are increasingly engaged in social networking and E-commerce activities,
businesses grow to rely on Big Data analytics for intelligence, and traditional IT …

VidCloud: Joint Stall and Quality Optimization for Video Streaming over Cloud

AO Al-Abbasi, V Aggarwal - … on Modeling and Performance Evaluation of …, 2021 - dl.acm.org
As video-streaming services have expanded and improved, cloud-based video has evolved
into a necessary feature of any successful business for reaching internal and external …

Latency optimal storage and scheduling of replicated fragments for memory constrained servers

R Jinan, A Badita, PK Sarvepalli… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
We consider the setting of a distributed storage system where a single file is subdivided into
smaller fragments of same size which are then replicated with a common replication factor …

Detection of stragglers and optimal rescheduling of slow running tasks in big data environment using LFCSO-LVQ classifier and enhanced PSO algorithm

HA Joshiara, CS Thaker, SM Shah… - … Journal of Data …, 2022 - inderscienceonline.com
This paper plans to implement intelligent techniques in finding straggler tasks along with
speculating their way of execution. Here, the LFCSO-LVQ is proposed to effectively identify …