Detecting data errors: Where are we and what needs to be done?

Z Abedjan, X Chu, D Deng, RC Fernandez… - Proceedings of the …, 2016 - dl.acm.org
Data cleaning has played a critical role in ensuring data quality for enterprise applications.
Naturally, there has been extensive research in this area, and many data cleaning …

Data cleansing mechanisms and approaches for big data analytics: a systematic study

M Hosseinzadeh, E Azhir, OH Ahmed… - Journal of Ambient …, 2023 - Springer
With the evolution of new technologies, the production of digital data is constantly growing. It
is thus necessary to develop data management strategies in order to handle the large-scale …

Steering data quality with visual analytics: The complexity challenge

S Liu, G Andrienko, Y Wu, N Cao, L Jiang, C Shi… - Visual Informatics, 2018 - Elsevier
Data quality management, especially data cleansing, has been extensively studied for many
years in the areas of data management and visual analytics. In the paper, we first review and …

HoloDetect: Few-shot learning for error detection

A Heidari, J McGrath, IF Ilyas… - Proceedings of the 2019 …, 2019 - dl.acm.org
We introduce a few-shot learning framework for error detection. We show that data
augmentation (a form of weak supervision) is key to training high-quality, ML-based error …

Rotom: A meta-learned data augmentation framework for entity matching, data cleaning, text classification, and beyond

Z Miao, Y Li, X Wang - … of the 2021 International Conference on …, 2021 - dl.acm.org
Deep Learning revolutionizes almost all fields of computer science including data
management. However, the demand for high-quality training data is slowing down deep …

Sudowoodo: Contrastive self-supervised learning for multi-purpose data integration and preparation

R Wang, Y Li, J Wang - 2023 IEEE 39th International …, 2023 - ieeexplore.ieee.org
Machine learning (ML) is playing an increasingly important role in data management tasks,
particularly in Data Integration and Preparation (DI&P). The success of ML-based …

Robust discovery of positive and negative rules in knowledge bases

S Ortona, VV Meduri, P Papotti - 2018 IEEE 34th International …, 2018 - ieeexplore.ieee.org
We present RUDIK, a system for the discovery of declarative rules over knowledge-bases
(KBs). RUDIK discovers rules that express positive relationships between entities, such as "if …

SLiMFast: Guaranteed results for data fusion and source reliability

T Rekatsinas, M Joglekar, H Garcia-Molina… - Proceedings of the …, 2017 - dl.acm.org
We focus on data fusion, i.e., the problem of unifying conflicting data from data sources into a
single representation by estimating the source accuracies. We propose SLiMFast, a …

Data profiling: A tutorial

Z Abedjan, L Golab, F Naumann - Proceedings of the 2017 ACM …, 2017 - dl.acm.org
… is to understand the dataset at hand and its metadata. The process of metadata discovery is
known as data profiling. Profiling activities range from ad-hoc approaches, such as eye …

Pattern functional dependencies for data cleaning

A Qahtan, N Tang, M Ouzzani, Y Cao… - Proceedings of the …, 2020 - research.ed.ac.uk
Patterns (or regex-based expressions) are widely used to constrain the format of a domain
(or a column), e.g., a Year column should contain only four digits, and thus a value like “1980 …