Towards CRISP-ML (Q): a machine learning process model with quality assurance methodology

S Studer, TB Bui, C Drescher, A Hanuschkin… - Machine learning and …, 2021 - mdpi.com
Machine learning is an established and frequently used technique in industry and
academia, but a standard process model to improve success and efficiency of machine …

Machine learning and data cleaning: Which serves the other?

IF Ilyas, T Rekatsinas - ACM Journal of Data and Information Quality …, 2022 - dl.acm.org
The last few years witnessed significant advances in building automated or semi-automated
data quality, data cleaning and data integration systems powered by machine learning (ML) …

From Cleaning before ML to Cleaning for ML.

F Neutatz, B Chen, Z Abedjan, E Wu - IEEE Data Eng. Bull., 2021 - scholar.archive.org
Data cleaning is widely regarded as a critical piece of machine learning (ML) applications,
as data errors can corrupt models in ways that cause the application to operate incorrectly …

Angler: Helping machine translation practitioners prioritize model improvements

S Robertson, ZJ Wang, D Moritz, MB Kery… - Proceedings of the 2023 …, 2023 - dl.acm.org
Machine learning (ML) models can fail in unexpected ways in the real world, but not all
model failures are equal. With finite time and resources, ML practitioners are forced to …

SAGA: a scalable framework for optimizing data cleaning pipelines for machine learning applications

S Siddiqi, R Kern, M Boehm - Proceedings of the ACM on Management …, 2023 - dl.acm.org
In the exploratory data science lifecycle, data scientists often spent the majority of their time
finding, integrating, validating and cleaning relevant datasets. Despite recent work on data …

Automating Data Quality Validation for Dynamic Data Ingestion.

S Redyuk, Z Kaoudi, V Markl, S Schelter - EDBT, 2021 - sergred.github.io
Data quality validation is a crucial step in modern data-driven applications. Errors in the data
lead to unexpected behavior of production pipelines and downstream services, such as …

Picket: guarding against corrupted data in tabular data during learning and inference

Z Liu, Z Zhou, T Rekatsinas - The VLDB Journal, 2022 - Springer
Data corruption is an impediment to modern machine learning deployments. Corrupted data
can severely bias the learned model and can also lead to invalid inferences. We present …

SEDAR: a semantic data reservoir for heterogeneous datasets

S Hoseini, A Ali, H Shaker, C Quix - Proceedings of the 32nd ACM …, 2023 - dl.acm.org
Data lakes have emerged as a solution for managing vast and diverse datasets for modern
data analytics. To prevent them from becoming ungoverned, semantic data management …

Auto-validate: Unsupervised data validation using data-domain patterns inferred from data lakes

J Song, Y He - Proceedings of the 2021 International Conference on …, 2021 - dl.acm.org
Complex data pipelines are increasingly common in diverse applications such as BI
reporting and ML modeling. These pipelines often recur regularly (e.g., daily or weekly), as BI …