Web table extraction, retrieval, and augmentation: A survey
Tables are powerful and popular tools for organizing and manipulating data. A vast number
of tables can be found on the Web, which represent a valuable knowledge resource. The …
Jigsaw: Large language models meet program synthesis
Large pre-trained language models such as GPT-3 [10], Codex [11], and Google's language
model [7] are now capable of generating code from natural language specifications of …
Can foundation models wrangle your data?
Foundation Models (FMs) are models trained on large corpora of data that, at very large
scale, can generalize to new tasks without any task-specific finetuning. As these models …
Table-gpt: Table-tuned gpt for diverse table tasks
Language models, such as GPT-3.5 and ChatGPT, demonstrate remarkable abilities to
follow diverse human instructions and perform a wide range of tasks. However, when …
Auto-em: End-to-end fuzzy entity-matching using pre-trained deep models and transfer learning
Entity matching (EM), also known as entity resolution, fuzzy join, and record linkage, refers to
the process of identifying records corresponding to the same real-world entities from …
Applications and challenges for large language models: From data management perspective
Data management is indispensable for informed decision-making in the big data era. In the
meantime, Large Language Models (LLMs), equipped with billions of model parameters and …
AutoPandas: neural-backed generators for program synthesis
Developers nowadays have to contend with a growing number of APIs. While in the long-
term they are very useful to developers, many modern APIs have an incredibly steep …
Auto-suggest: Learning-to-recommend data preparation steps using data science notebooks
Data preparation is widely recognized as the most time-consuming process in modern
business intelligence (BI) and machine learning (ML) projects. Automating complex data …
Cleanml: A study for evaluating the impact of data cleaning on ml classification tasks
Data quality affects machine learning (ML) model performances, and data scientists spend
a considerable amount of time on data cleaning before model training. However, to date, there …
Jellyfish: Instruction-tuning local large language models for data preprocessing
This paper explores the utilization of LLMs for data preprocessing (DP), a crucial step in the
data mining pipeline that transforms raw data into a clean format. We instruction-tune local …