Tabular data augmentation for machine learning: Progress and prospects of embracing generative AI

L Cui, H Li, K Chen, L Shou, G Chen - arXiv preprint arXiv:2407.21523, 2024 - arxiv.org
Machine learning (ML) on tabular data is ubiquitous, yet obtaining abundant high-quality
tabular data for model training remains a significant obstacle. Numerous works have …

CHORUS: foundation models for unified data discovery and exploration

M Kayali, A Lykov, I Fountalis, N Vasiloglou… - arXiv preprint arXiv …, 2023 - arxiv.org
We explore the application of foundation models to data discovery and exploration tasks.
Foundation models are large language models (LLMs) that show promising performance on …

TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes

A Khatiwada, H Kokel, I Abdelaziz… - arXiv preprint arXiv …, 2024 - arxiv.org
Enterprises have a growing need to identify relevant tables in data lakes; e.g., tables that are
unionable, joinable, or subsets of each other. Tabular neural models can be helpful for such …

Retrieve, merge, predict: Augmenting tables with data lakes

R Cappuzzo, A Coelho, F Lefebvre, P Papotti… - arXiv preprint arXiv …, 2024 - arxiv.org
Machine-learning from a disparate set of tables, a data lake, requires assembling features
by merging and aggregating tables. Data discovery can extend autoML to data tables by …

A Survey on Data Markets

J Zhang, Y Bi, M Cheng, J Liu, K Ren, Q Sun… - arXiv preprint arXiv …, 2024 - arxiv.org
Data is the new oil of the 21st century. The growing trend of trading data for greater welfare
has led to the emergence of data markets. A data market is any mechanism whereby the …

Towards Accurate and Efficient Document Analytics with Large Language Models

Y Lin, M Hulsebos, R Ma, S Shankar… - arXiv preprint arXiv …, 2024 - arxiv.org
Unstructured data formats account for over 80% of the data currently stored, and extracting
value from such formats remains a considerable challenge. In particular, current approaches …

LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes

Y Deng, C Chai, L Cao, Q Yuan, S Chen, Y Yu… - Proceedings of the …, 2024 - dl.acm.org
Discovering tables from poorly maintained data lakes is a significant challenge in data
management. Two key tasks are identifying joinable and unionable tables, crucial for data …

Searching Data Lakes for Nested and Joined Data

Y Zhang, PB Chen, ZG Ives - Proceedings of the VLDB Endowment, 2024 - dl.acm.org
Exploratory data science is driving new platforms that assist data scientists with everyday
tasks, such as integration and wrangling, to assemble training datasets. Such tools take …

Graph Machine Learning Meets Multi-Table Relational Data

Q Gan, M Wang, D Wipf, C Faloutsos - Proceedings of the 30th ACM …, 2024 - dl.acm.org
While graph machine learning, and notably graph neural networks (GNNs), have gained
immense traction in recent years, application is predicated on access to a known input graph …

NumJoin: Discovering Numeric Joinable Tables with Semantically Related Columns

P Subramaniam, U Khurana, K Srinivas… - Proceedings of the …, 2023 - dl.acm.org
Join discovery is a crucial part of exploration on data lakes. It often involves finding joinable
tables that are semantically relevant. However, data lakes often contain numeric tables with …