Distributed data management using MapReduce

F Li, BC Ooi, MT Özsu, S Wu - ACM Computing Surveys (CSUR), 2014 - dl.acm.org
MapReduce is a framework for processing and managing large-scale datasets in a
distributed cluster, which has been used for applications such as generating search indexes …

A comprehensive view of Hadoop research—A systematic literature review

I Polato, R Ré, A Goldman, F Kon - Journal of Network and Computer …, 2014 - Elsevier
Context: In recent years, the valuable knowledge that can be retrieved from petabyte scale
datasets–known as Big Data–led to the development of solutions to process information …

Making sense of performance in data analytics frameworks

K Ousterhout, R Rasti, S Ratnasamy… - … USENIX Symposium on …, 2015 - usenix.org
There has been much research devoted to improving the performance of data analytics
frameworks, but comparatively little effort has been spent systematically identifying the …

Sprocket: A serverless video processing framework

L Ao, L Izhikevich, GM Voelker, G Porter - Proceedings of the ACM …, 2018 - dl.acm.org
Sprocket is a highly configurable, stage-based, scalable, serverless video processing
framework that exploits intra-video parallelism to achieve low latency. Sprocket enables …

[BOOK][B] Magellan: Toward building entity matching management systems

PV Konda - 2018 - search.proquest.com
Entity matching (EM) identifies data instances that refer to the same real-world entity, such
as (David Smith, UWMadison) and (DM Smith, UWM). This problem has been a long …

Neural acceleration for general-purpose approximate programs

H Esmaeilzadeh, A Sampson, L Ceze… - 2012 45th annual …, 2012 - ieeexplore.ieee.org
This paper describes a learning-based approach to the acceleration of approximate
programs. We describe the Parrot transformation, a program transformation that selects and …

Shark: SQL and rich analytics at scale

RS Xin, J Rosen, M Zaharia, MJ Franklin… - Proceedings of the …, 2013 - dl.acm.org
Shark is a new data analysis system that marries query processing with complex analytics
on large clusters. It leverages a novel distributed memory abstraction to provide a unified …

Communication steps for parallel query processing

P Beame, P Koutris, D Suciu - Journal of the ACM (JACM), 2017 - dl.acm.org
We study the problem of computing conjunctive queries over large databases on parallel
architectures without shared storage. Using the structure of such a query q and the skew in …

LocationSpark: A distributed in-memory data management system for big spatial data

M Tang, Y Yu, QM Malluhi, M Ouzzani… - Proceedings of the VLDB …, 2016 - dl.acm.org
We present LocationSpark, a spatial data processing system built on top of Apache Spark, a
widely used distributed data processing system. LocationSpark offers a rich set of spatial …

Three steps is all you need: fast, accurate, automatic scaling decisions for distributed streaming dataflows

V Kalavri, J Liagouris, M Hoffmann… - … USENIX Symposium on …, 2018 - usenix.org
Streaming computations are by nature long-running, and their workloads can change in
unpredictable ways. This in turn means that maintaining performance may require …