Tabular data augmentation for machine learning: Progress and prospects of embracing generative AI
Machine learning (ML) on tabular data is ubiquitous, yet obtaining abundant high-quality
tabular data for model training remains a significant obstacle. Numerous works have …
CHORUS: foundation models for unified data discovery and exploration
We explore the application of foundation models to data discovery and exploration tasks.
Foundation models are large language models (LLMs) that show promising performance on …
TabSketchFM: Sketch-based Tabular Representation Learning for Data Discovery over Data Lakes
Enterprises have a growing need to identify relevant tables in data lakes; e.g., tables that are
unionable, joinable, or subsets of each other. Tabular neural models can be helpful for such …
Retrieve, merge, predict: Augmenting tables with data lakes
Machine-learning from a disparate set of tables, a data lake, requires assembling features
by merging and aggregating tables. Data discovery can extend autoML to data tables by …
A Survey on Data Markets
Data is the new oil of the 21st century. The growing trend of trading data for greater welfare
has led to the emergence of data markets. A data market is any mechanism whereby the …
Towards Accurate and Efficient Document Analytics with Large Language Models
Unstructured data formats account for over 80% of the data currently stored, and extracting
value from such formats remains a considerable challenge. In particular, current approaches …
LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes
Discovering tables from poorly maintained data lakes is a significant challenge in data
management. Two key tasks are identifying joinable and unionable tables, crucial for data …
Searching Data Lakes for Nested and Joined Data
Exploratory data science is driving new platforms that assist data scientists with everyday
tasks, such as integration and wrangling, to assemble training datasets. Such tools take …
Graph Machine Learning Meets Multi-Table Relational Data
While graph machine learning, and notably graph neural networks (GNNs), have gained
immense traction in recent years, application is predicated on access to a known input graph …
NumJoin: Discovering Numeric Joinable Tables with Semantically Related Columns
Join discovery is a crucial part of exploration on data lakes. It often involves finding joinable
tables that are semantically relevant. However, data lakes often contain numeric tables with …