Data lake management: challenges and opportunities

F Nargesian, E Zhu, RJ Miller, KQ Pu… - Proceedings of the VLDB …, 2019 - dl.acm.org
The ubiquity of data lakes has created fascinating new challenges for data management
research. In this tutorial, we review the state of the art in data management for data lakes …

Dataset discovery and exploration: A survey

NW Paton, J Chen, Z Wu - ACM Computing Surveys, 2023 - dl.acm.org
Data scientists are tasked with obtaining insights from data. However, suitable data is often
not immediately at hand, and there may be many potentially relevant datasets in a data lake …

A survey on data collection for machine learning: a big data-AI integration perspective

Y Roh, G Heo, SE Whang - IEEE Transactions on Knowledge …, 2019 - ieeexplore.ieee.org
Data collection is a major bottleneck in machine learning and an active research topic in
multiple communities. There are largely two reasons data collection has recently become a …

Santos: Relationship-based semantic table union search

A Khatiwada, G Fan, R Shraga, Z Chen… - Proceedings of the …, 2023 - dl.acm.org
Existing techniques for unionable table search define unionability using metadata (tables
must have the same or similar schemas) or column-based metrics (for example, the values …

Sherlock: A deep learning approach to semantic data type detection

M Hulsebos, K Hu, M Bakker, E Zgraggen… - Proceedings of the 25th …, 2019 - dl.acm.org
Correctly detecting the semantic type of data columns is crucial for data science tasks such
as automated data cleaning, schema matching, and data discovery. Existing data …

Creating embeddings of heterogeneous relational datasets for data integration tasks

R Cappuzzo, P Papotti… - Proceedings of the 2020 …, 2020 - dl.acm.org
Deep learning based techniques have been recently used with promising results for data
integration problems. Some methods directly use pre-trained embeddings that were trained …

Semantics-aware dataset discovery from data lakes with contextualized column-based representation learning

G Fan, J Wang, Y Li, D Zhang, R Miller - arXiv preprint arXiv:2210.01922, 2022 - arxiv.org
Dataset discovery from data lakes is essential in many real application scenarios. In this
paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes …

Dataset discovery in data lakes

A Bogatu, AAA Fernandes, NW Paton… - 2020 IEEE 36th …, 2020 - ieeexplore.ieee.org
Data analytics stands to benefit from the increasing availability of datasets that are held
without their conceptual relationships being explicitly known. When collected, these datasets …

Data management for machine learning: A survey

C Chai, J Wang, Y Luo, Z Niu… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Machine learning (ML) has widespread applications and has revolutionized many
industries, but suffers from several challenges. First, sufficient high-quality training data is …

Sato: Contextual semantic type detection in tables

D Zhang, Y Suhara, J Li, M Hulsebos… - arXiv preprint arXiv …, 2019 - arxiv.org
Detecting the semantic types of data columns in relational tables is important for various
data preparation and information retrieval tasks such as data cleaning, schema matching …