Metam: Goal-oriented data discovery

S Galhotra, Y Gong… - 2023 IEEE 39th …, 2023 - ieeexplore.ieee.org
Data is a central component of machine learning and causal inference tasks. The availability
of large amounts of data from sources such as open data repositories, data lakes and data …

Observatory: Characterizing embeddings of relational tables

T Cong, M Hulsebos, Z Sun, P Groth… - arXiv preprint arXiv …, 2023 - arxiv.org
Language models and specialized table embedding models have recently demonstrated
strong performance on many tasks over tabular data. Researchers and practitioners are …

Retrieve, merge, predict: Augmenting tables with data lakes

R Cappuzzo, A Coelho, F Lefebvre, P Papotti… - arXiv preprint arXiv …, 2024 - arxiv.org
Machine-learning from a disparate set of tables, a data lake, requires assembling features
by merging and aggregating tables. Data discovery can extend autoML to data tables by …

Warpgate: A semantic join discovery system for cloud data warehouses

T Cong, J Gale, J Frantz, HV Jagadish… - arXiv preprint arXiv …, 2022 - arxiv.org
Data discovery is a major challenge in enterprise data analysis: users often struggle to find
data relevant to their analysis goals or even to navigate through data across data sources …

UniDM: A Unified Framework for Data Manipulation with Large Language Models

Y Qian, Y He, R Zhu, J Huang, Z Ma… - Proceedings of …, 2024 - proceedings.mlsys.org
Designing effective data manipulation methods is a long-standing problem in data lakes.
Traditional methods, which rely on rules or machine learning models, require extensive …

Towards an architecture to support data access in research data spaces

J Möller, D Jankowski, A Hahn - 2021 IEEE 22nd International …, 2021 - ieeexplore.ieee.org
Using data from different data sources is a common procedure in data-driven research. As
required data is often not available from centrally managed sources, the concept of data …

Suggesting assess queries for interactive analysis of multidimensional data

M Francia, M Golfarelli, P Marcel, S Rizzi… - … on Knowledge and …, 2022 - ieeexplore.ieee.org
Assessment is the process of comparing the actual to the expected behavior of a business
phenomenon and judging the outcome of the comparison. The querying operator has been …

FREYJA: Efficient Join Discovery in Data Lakes

M Maynou, S Nadal, R Panadero, J Flores… - arXiv preprint arXiv …, 2024 - arxiv.org
Data lakes are massive repositories of raw and heterogeneous data, designed to meet the
requirements of modern data storage. Nonetheless, this same philosophy increases the …

It Took Longer than I was Expecting: Why is Dataset Search Still so Hard?

M Hulsebos, W Lin, S Shankar… - Proceedings of the 2024 …, 2024 - dl.acm.org
Dataset search is a long-standing problem across both industry and academia. While most
industry tools focus on identifying one or more datasets matching a user-specified query …

[BOOK][B] Table Representation Learning

M Hulsebos - 2024 - pure.uva.nl
The increasing amount of data being collected, stored, and analyzed, induces a need for
efficient, scalable, and robust methods to handle this data. Representation learning, i.e., the …