Data lake management: challenges and opportunities

F Nargesian, E Zhu, RJ Miller, KQ Pu… - Proceedings of the VLDB …, 2019 - dl.acm.org
The ubiquity of data lakes has created fascinating new challenges for data management
research. In this tutorial, we review the state-of-the-art in data management for data lakes …

Data Integration: The Current Status and the Way Forward.

M Stonebraker, IF Ilyas - IEEE Data Eng. Bull., 2018 - cs.uwaterloo.ca
We discuss scalable data integration challenges in the enterprise inspired by our
experience at Tamr. We use multiple real customer examples to highlight the technical …

A survey on data collection for machine learning: a big data-AI integration perspective

Y Roh, G Heo, SE Whang - IEEE Transactions on Knowledge …, 2019 - ieeexplore.ieee.org
Data collection is a major bottleneck in machine learning and an active research topic in
multiple communities. There are largely two reasons data collection has recently become a …

[BOOK] Magellan: Toward building entity matching management systems

PV Konda - 2018 - search.proquest.com
Entity matching (EM) identifies data instances that refer to the same real-world entity, such
as (David Smith, UWMadison) and (DM Smith, UWM). This problem has been a long …

DeepEye: Towards automatic data visualization

Y Luo, X Qin, N Tang, G Li - 2018 IEEE 34th international …, 2018 - ieeexplore.ieee.org
Data visualization is invaluable for explaining the significance of data to people who are
visually oriented. The central task of automatic data visualization is, given a dataset, to …

Annotating columns with pre-trained language models

Y Suhara, J Li, Y Li, D Zhang, Ç Demiralp… - Proceedings of the …, 2022 - dl.acm.org
Inferring meta information about tables, such as column headers or relationships between
columns, is an active research topic in data management as we find many tables are …

JOSIE: Overlap set similarity search for finding joinable tables in data lakes

E Zhu, D Deng, F Nargesian, RJ Miller - Proceedings of the 2019 …, 2019 - dl.acm.org
We present a new solution for finding joinable tables in massive data lakes: given a table
and one join column, find tables that can be joined with the given table on the largest …

Data market platforms: Trading data assets to solve data problems

RC Fernandez, P Subramaniam… - arXiv preprint arXiv …, 2020 - arxiv.org
Data only generates value for a few organizations with expertise and resources to make
data shareable, discoverable, and easy to integrate. Sharing data that is easy to discover …

Finding related tables in data lakes for interactive data science

Y Zhang, ZG Ives - Proceedings of the 2020 ACM SIGMOD International …, 2020 - dl.acm.org
Many modern data science applications build on data lakes, schema-agnostic repositories
of data files and data products that offer limited organization and management capabilities …

Raha: A configuration-free error detection system

M Mahdavi, Z Abedjan, R Castro Fernandez… - Proceedings of the …, 2019 - dl.acm.org
Detecting erroneous values is a key step in data cleaning. Error detection algorithms usually
require a user to provide input configurations in the form of rules or statistical parameters …