Using machine learning to optimize parallelism in big data applications

ÁB Hernández, MS Perez, S Gupta… - Future Generation …, 2018 - Elsevier
In-memory cluster computing platforms have gained momentum in the last years, due to their
ability to analyse big amounts of data in parallel. These platforms are complex and difficult-to …

High-performance design of Apache Spark with RDMA and its benefits on various workloads

X Lu, D Shankar, S Gugnani… - 2016 IEEE International …, 2016 - ieeexplore.ieee.org
The in-memory data processing framework, Apache Spark, has been stealing the limelight
for low-latency interactive applications, iterative and batch computations. Our early …

Kosmo: Efficient online miss ratio curve generation for eviction policy evaluation

K Shakiba, S Sultan, M Stumm - 22nd USENIX Conference on File and …, 2024 - usenix.org
In-memory caches play an important role in reducing the load on backend storage servers
for many workloads. Miss ratio curves (MRCs) are an important tool for configuring these …

ExtMem: Enabling Application-Aware Virtual Memory Management for Data-Intensive Applications

S Jalalian, S Patel, MR Hajidehi, M Seltzer… - 2024 USENIX annual …, 2024 - usenix.org
For over forty years, researchers have demonstrated that operating system memory
managers often fall short in supporting memory-hungry applications. The problem is even …

LRC: Dependency-aware cache management for data analytics clusters

Y Yu, W Wang, J Zhang… - IEEE INFOCOM 2017-IEEE …, 2017 - ieeexplore.ieee.org
Memory caches are being aggressively used in today's data-parallel systems such as Spark,
Tez, and Piccolo. However, prevalent systems employ rather simple cache management …

Agile-Ant: Self-managing Distributed Cache Management for Cost Optimization of Big Data Applications

H Al-Sayeh, MA Jibril, KU Sattler - Proceedings of the VLDB Endowment, 2024 - dl.acm.org
Distributed in-memory processing frameworks accelerate application runs by caching
important datasets in memory. Allocating a suitable cluster configuration for caching these …

Improving Spark application throughput via memory aware task co-location: A mixture of experts approach

VS Marco, B Taylor, B Porter, Z Wang - … of the 18th ACM/IFIP/USENIX …, 2017 - dl.acm.org
Data analytic applications built upon big data processing frameworks such as Apache Spark
are an important class of applications. Many of these applications are not latency-sensitive …

Dynamic memory-aware scheduling in Spark computing environment

Z Tang, A Zeng, X Zhang, L Yang, K Li - Journal of Parallel and Distributed …, 2020 - Elsevier
Scheduling plays an important role in improving the performance of big data-parallel
processing. Spark is an in-memory parallel computing framework that uses a multi-threaded …

Reference-distance eviction and prefetching for cache management in Spark

TBG Perez, X Zhou, D Cheng - … of the 47th International Conference on …, 2018 - dl.acm.org
Optimizing memory cache usage is vital for performance of in-memory data-parallel
frameworks such as Spark. Current data-analytic frameworks utilize the popular Least …

Intermediate data caching optimization for multi-stage and parallel big data frameworks

Z Yang, D Jia, S Ioannidis, N Mi… - 2018 IEEE 11th …, 2018 - ieeexplore.ieee.org
In the era of big data and cloud computing, large amounts of data are generated from user
applications and need to be processed in the datacenter. Data-parallel computing …