Data-centric AI: Perspectives and challenges

D Zha, ZP Bhat, KH Lai, F Yang, X Hu - Proceedings of the 2023 SIAM International Conference on Data Mining (SDM), 2023 - SIAM
The role of data in building AI systems has recently been significantly magnified by the
emerging concept of data-centric AI (DCAI), which advocates a fundamental shift from model …

Self-training: A survey

MR Amini, V Feofanov, L Pauletto, L Hadjadj… - Neurocomputing, 2025 - Elsevier
Self-training methods have gained significant attention in recent years due to their
effectiveness in leveraging small labeled datasets and large unlabeled observations for …
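
The core loop behind most self-training methods can be sketched in a few lines: train on the labeled set, pseudo-label the unlabeled pool, keep only confident predictions, and repeat. A minimal Python sketch follows; the logistic-regression base learner and the 0.9 confidence threshold are illustrative assumptions, not choices from the survey.

```python
# Minimal self-training sketch (illustrative, not the survey's method):
# repeatedly pseudo-label confident unlabeled points and retrain.
# Inputs are numpy arrays; the threshold and base learner are assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.9, rounds=5):
    model = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        model.fit(X_lab, y_lab)
        if len(X_unlab) == 0:
            break
        proba = model.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold  # trust only confident predictions
        if not confident.any():
            break
        # Promote confident pseudo-labeled points into the labeled set.
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
        X_unlab = X_unlab[~confident]
    return model
```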

A survey on programmatic weak supervision

J Zhang, CY Hsieh, Y Yu, C Zhang, A Ratner - arXiv preprint, 2022 - arxiv.org
Labeling training data has become one of the major roadblocks to using machine learning.
Among various weak supervision paradigms, programmatic weak supervision (PWS) has …
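
In PWS, users encode domain heuristics as labeling functions (LFs) that vote on a label or abstain, and an aggregator denoises their votes. The sketch below uses toy spam-detection LFs and simple majority voting as stand-ins; production systems such as Snorkel instead fit a probabilistic label model.

```python
# Programmatic weak supervision sketch: labeling functions vote on a
# label or abstain (-1); a simple majority vote aggregates the votes.
# The LFs below are toy spam heuristics, not from the survey.
from collections import Counter

ABSTAIN, HAM, SPAM = -1, 0, 1

def lf_contains_link(text):
    return SPAM if "http://" in text or "https://" in text else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_reply(text):
    return HAM if len(text.split()) < 5 else ABSTAIN

LFS = [lf_contains_link, lf_mentions_prize, lf_short_reply]

def majority_vote(text):
    votes = [v for v in (lf(text) for lf in LFS) if v != ABSTAIN]
    if not votes:
        return ABSTAIN                    # no LF fired: leave unlabeled
    return Counter(votes).most_common(1)[0][0]

print(majority_vote("Claim your prize at https://example.com"))  # -> 1 (SPAM)
```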

WRENCH: A comprehensive benchmark for weak supervision

J Zhang, Y Yu, Y Li, Y Wang, Y Yang, M Yang… - arXiv preprint, 2021 - arxiv.org
Recent Weak Supervision (WS) approaches have had widespread success in easing the
bottleneck of labeling training data for machine learning by synthesizing labels from multiple …

Theoretical analysis of weak-to-strong generalization

H Lang, D Sontag… - Advances in Neural Information Processing Systems, 2025 - proceedings.neurips.cc
Strong student models can learn from weaker teachers: when trained on the predictions of a
weaker model, a strong pretrained student can learn to correct the weak model's errors and …
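
The weak-to-strong setup itself is simple to sketch: fit a weak teacher on a small labeled set, use its predictions to label a larger pool, and train a stronger student on those noisy labels. The two scikit-learn models below are illustrative stand-ins for "weak" and "strong", not the paper's experimental setup.

```python
# Weak-to-strong sketch: a strong student is trained on a weak
# teacher's predictions rather than on ground truth. The two models
# are illustrative stand-ins for "weak" and "strong".
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

def weak_to_strong(X_small, y_small, X_pool):
    weak = LogisticRegression(max_iter=1000).fit(X_small, y_small)
    pseudo = weak.predict(X_pool)               # weak supervision signal
    strong = GradientBoostingClassifier().fit(X_pool, pseudo)
    return weak, strong                         # compare both on held-out data
```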

Meta self-training for few-shot neural sequence labeling

Y Wang, S Mukherjee, H Chu, Y Tu, M Wu… - Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, 2021 - dl.acm.org
Neural sequence labeling is widely adopted for many Natural Language Processing (NLP)
tasks, such as Named Entity Recognition (NER) and slot tagging for dialog systems and …

Language models in the loop: Incorporating prompting into weak supervision

R Smith, JA Fries, B Hancock, SH Bach - ACM/IMS Journal of Data Science, 2024 - dl.acm.org
We propose a new strategy for applying large pre-trained language models to novel tasks
when labeled training data is limited. Rather than apply the model in a typical zero-shot or …
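
One way to picture the strategy: a prompted language model acts as one more noisy labeling function inside a weak-supervision pipeline. In the sketch below, query_llm is a hypothetical stand-in for whatever LLM client you use; the sentiment prompt and vote mapping are illustrative, not the paper's.

```python
# Sketch: a prompted LLM used as a labeling function. query_llm is a
# hypothetical stand-in for a real LLM client call.
ABSTAIN, NEG, POS = -1, 0, 1

def query_llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to an LLM, return its text reply."""
    raise NotImplementedError("wire up your LLM client here")

def lf_llm_sentiment(text):
    reply = query_llm(
        "Is the sentiment of this review positive or negative? "
        f"Answer 'positive' or 'negative'.\n\nReview: {text}"
    ).strip().lower()
    if reply.startswith("positive"):
        return POS
    if reply.startswith("negative"):
        return NEG
    return ABSTAIN   # unparseable answer: abstain like any noisy LF
```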

Characterizing the impacts of semi-supervised learning for weak supervision

J Li, J Zhang, L Schmidt… - Advances in Neural Information Processing Systems, 2023 - proceedings.neurips.cc
Labeling training data is a critical and expensive step in producing high-accuracy ML
models, whether training from scratch or fine-tuning. To make labeling more efficient, two …

MLReal: Bridging the gap between training on synthetic data and real data applications in machine learning

T Alkhalifah, H Wang, O Ovcharenko - Artificial Intelligence in Geosciences, 2022 - Elsevier
Among the biggest challenges we face in utilizing neural networks trained on waveform (i.e.,
seismic, electromagnetic, or ultrasound) data is their application to real data. The requirement …

Training subset selection for weak supervision

H Lang, A Vijayaraghavan… - Advances in Neural Information Processing Systems, 2022 - proceedings.neurips.cc
Existing weak supervision approaches use all the data covered by weak signals to train a
classifier. We show both theoretically and empirically that this is not always optimal …
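
In sketch form, the idea is to train only on the weakly labeled examples the label model is most confident about, rather than everything the weak signals cover. The fixed keep fraction below is an arbitrary illustration; the paper derives principled selection rules with guarantees.

```python
# Subset-selection sketch: keep only the weakly labeled examples the
# label model is most confident about. keep_frac is an illustrative
# assumption, not the paper's selection rule.
import numpy as np

def select_confident_subset(X, soft_labels, keep_frac=0.5):
    """soft_labels: (n, k) array of label-model posteriors per example."""
    conf = soft_labels.max(axis=1)             # label-model confidence
    n_keep = int(len(X) * keep_frac)
    idx = np.argsort(conf)[::-1][:n_keep]      # most confident first
    return X[idx], soft_labels[idx].argmax(axis=1)
```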