Automated data processing and feature engineering for deep learning and big data applications: a survey

A Mumuni, F Mumuni - Journal of Information and Intelligence, 2024 - Elsevier
Modern approach to artificial intelligence (AI) aims to design algorithms that learn directly
from data. This approach has achieved impressive results and has contributed significantly …

Learning from data with structured missingness

R Mitra, SF McGough, T Chakraborti… - Nature Machine …, 2023 - nature.com
Missing data are an unavoidable complication in many machine learning tasks. When data
are 'missing at random' there exist a range of tools and techniques to deal with the issue …

The effects of data quality on machine learning performance

L Budach, M Feuerpfeil, N Ihde, A Nathansen… - arXiv preprint arXiv …, 2022 - arxiv.org
Modern artificial intelligence (AI) applications require large quantities of training and test
data. This need creates critical challenges not only concerning the availability of such data …

Pervasive label errors in test sets destabilize machine learning benchmarks

CG Northcutt, A Athalye, J Mueller - arXiv preprint arXiv:2103.14749, 2021 - arxiv.org
We identify label errors in the test sets of 10 of the most commonly-used computer vision,
natural language, and audio datasets, and subsequently study the potential for these label …

DataPerf: Benchmarks for data-centric AI development

M Mazumder, C Banbury, X Yao… - Advances in …, 2023 - proceedings.neurips.cc
Machine learning research has long focused on models rather than datasets, and
prominent datasets are used for common ML tasks without regard to the breadth, difficulty …

A procedure for anomaly detection and analysis

O Koren, M Koren, O Peretz - Engineering Applications of Artificial …, 2023 - Elsevier
Anomaly detection is often used to identify and remove outliers in datasets. However,
detecting and analyzing the pattern of outliers can contribute to future business decisions or …

UniDM: a Unified framework for data manipulation with large language models

Y Qian, Y He, R Zhu, J Huang, Z Ma… - Proceedings of …, 2024 - proceedings.mlsys.org
Designing effective data manipulation methods is a long standing problem in data lakes.
Traditional methods, which rely on rules or machine learning models, require extensive …

Sudowoodo: Contrastive self-supervised learning for multi-purpose data integration and preparation

R Wang, Y Li, J Wang - 2023 IEEE 39th International …, 2023 - ieeexplore.ieee.org
Machine learning (ML) is playing an increasingly important role in data management tasks,
particularly in Data Integration and Preparation (DI&P). The success of ML-based …

Navigating data-centric artificial intelligence with DC-Check: Advances, challenges, and opportunities

N Seedat, F Imrie… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Data-centric artificial intelligence (AI) is an emerging paradigm that emphasizes the critical
role of data in real-world machine learning (ML) systems—as a complement to model …

Machine learning-assisted data filtering and QSAR models for prediction of chemical acute toxicity on rat and mouse

T Bo, Y Lin, J Han, Z Hao, J Liu - Journal of Hazardous Materials, 2023 - Elsevier
Machine learning (ML) methods provide a new opportunity to build quantitative
structure-activity relationship (QSAR) models for predicting chemicals' toxicity based on …