Web table extraction, retrieval, and augmentation: A survey

S Zhang, K Balog - ACM Transactions on Intelligent Systems and …, 2020 - dl.acm.org
Tables are powerful and popular tools for organizing and manipulating data. A vast number
of tables can be found on the Web, which represent a valuable knowledge resource. The …

Jigsaw: Large language models meet program synthesis

N Jain, S Vaidyanath, A Iyer, N Natarajan… - Proceedings of the 44th …, 2022 - dl.acm.org
Large pre-trained language models such as GPT-3 [10], Codex [11], and Google's language
model [7] are now capable of generating code from natural language specifications of …

Can foundation models wrangle your data?

A Narayan, I Chami, L Orr, S Arora, C Ré - arXiv preprint arXiv:2205.09911, 2022 - arxiv.org
Foundation Models (FMs) are models trained on large corpora of data that, at very large
scale, can generalize to new tasks without any task-specific finetuning. As these models …

Table-GPT: Table-tuned GPT for diverse table tasks

P Li, Y He, D Yashar, W Cui, S Ge, H Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Language models, such as GPT-3.5 and ChatGPT, demonstrate remarkable abilities to
follow diverse human instructions and perform a wide range of tasks. However, when …

Auto-EM: End-to-end fuzzy entity-matching using pre-trained deep models and transfer learning

C Zhao, Y He - The World Wide Web Conference, 2019 - dl.acm.org
Entity matching (EM), also known as entity resolution, fuzzy join, and record linkage, refers to
the process of identifying records corresponding to the same real-world entities from …

Applications and challenges for large language models: From data management perspective

M Zhang, Z Ji, Z Luo, Y Wu… - 2024 IEEE 40th …, 2024 - ieeexplore.ieee.org
Data management is indispensable for informed decision-making in the big data era. In the
meantime, Large Language Models (LLMs), equipped with billions of model parameters and …

AutoPandas: neural-backed generators for program synthesis

R Bavishi, C Lemieux, R Fox, K Sen… - Proceedings of the ACM on …, 2019 - dl.acm.org
Developers nowadays have to contend with a growing number of APIs. While in the long
term they are very useful to developers, many modern APIs have an incredibly steep …

Auto-Suggest: Learning-to-recommend data preparation steps using data science notebooks

C Yan, Y He - Proceedings of the 2020 ACM SIGMOD International …, 2020 - dl.acm.org
Data preparation is widely recognized as the most time-consuming process in modern
business intelligence (BI) and machine learning (ML) projects. Automating complex data …

CleanML: A study for evaluating the impact of data cleaning on ML classification tasks

P Li, X Rao, J Blase, Y Zhang, X Chu… - 2021 IEEE 37th …, 2021 - ieeexplore.ieee.org
Data quality affects machine learning (ML) model performances, and data scientists spend a
considerable amount of time on data cleaning before model training. However, to date, there …

Jellyfish: Instruction-tuning local large language models for data preprocessing

H Zhang, Y Dong, C Xiao… - Proceedings of the 2024 …, 2024 - aclanthology.org
This paper explores the utilization of LLMs for data preprocessing (DP), a crucial step in the
data mining pipeline that transforms raw data into a clean format. We instruction-tune local …