Data-centric artificial intelligence: A survey

D Zha, ZP Bhat, KH Lai, F Yang, Z Jiang… - ACM Computing …, 2025 - dl.acm.org
Artificial Intelligence (AI) is making a profound impact in almost every domain. A vital enabler
of its great success is the availability of abundant and high-quality data for building machine …

Data lake management: challenges and opportunities

F Nargesian, E Zhu, RJ Miller, KQ Pu… - Proceedings of the VLDB …, 2019 - dl.acm.org
The ubiquity of data lakes has created fascinating new challenges for data management
research. In this tutorial, we review the state-of-the-art in data management for data lakes …

Transtab: Learning transferable tabular transformers across tables

Z Wang, J Sun - Advances in Neural Information Processing …, 2022 - proceedings.neurips.cc
Tabular data (or tables) are the most widely used data format in machine learning (ML).
However, ML models often assume the table structure keeps fixed in training and testing …

Santos: Relationship-based semantic table union search

A Khatiwada, G Fan, R Shraga, Z Chen… - Proceedings of the …, 2023 - dl.acm.org
Existing techniques for unionable table search define unionability using metadata (tables
must have the same or similar schemas) or column-based metrics (for example, the values …

Semantics-aware dataset discovery from data lakes with contextualized column-based representation learning

G Fan, J Wang, Y Li, D Zhang, R Miller - arxiv preprint arxiv:2210.01922, 2022 - arxiv.org
Dataset discovery from data lakes is essential in many real application scenarios. In this
paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes …

Josie: Overlap set similarity search for finding joinable tables in data lakes

E Zhu, D Deng, F Nargesian, RJ Miller - Proceedings of the 2019 …, 2019 - dl.acm.org
We present a new solution for finding joinable tables in massive data lakes: given a table
and one join column, find tables that can be joined with the given table on the largest …

Dataset discovery in data lakes

A Bogatu, AAA Fernandes, NW Paton… - 2020 ieee 36th …, 2020 - ieeexplore.ieee.org
Data analytics stands to benefit from the increasing availability of datasets that are held
without their conceptual relationships being explicitly known. When collected, these datasets …

Integrating data lake tables

A Khatiwada, R Shraga, W Gatterbauer… - Proceedings of the VLDB …, 2022 - dl.acm.org
We have made tremendous strides in providing tools for data scientists to discover new
tables useful for their analyses. But despite these advances, the proper integration of …

Data lakes: A survey of functions and systems

R Hai, C Koutras, C Quix… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Data lakes are becoming increasingly prevalent for Big Data management and data
analytics. In contrast to traditional 'schema-on-write'approaches such as data warehouses …

Data management for machine learning: A survey

C Chai, J Wang, Y Luo, Z Niu… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Machine learning (ML) has widespread applications and has revolutionized many
industries, but suffers from several challenges. First, sufficient high-quality training data is …