Deep learning on a data diet: Finding important examples early in training

M Paul, S Ganguli… - Advances in Neural Information Processing Systems, 2021 - proceedings.neurips.cc
Recent success in deep learning has partially been driven by training increasingly
overparametrized networks on ever larger datasets. It is therefore natural to ask: how much …

Unleashing the power of data tsunami: A comprehensive survey on data assessment and selection for instruction tuning of language models

Y Qin, Y Yang, P Guo, G Li, H Shao, Y Shi, Z Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Instruction tuning plays a critical role in aligning large language models (LLMs) with human
preferences. Despite the vast number of open instruction datasets, naively training an LLM on …

Modyn: A platform for model training on dynamic datasets with sample-level data selection

M Böther, V Gsteiger, T Robroek, A Klimovic - arXiv preprint arXiv …, 2023 - arxiv.org
Machine learning training data is often dynamic in real-world use cases, i.e., data is added or
removed and may experience distribution shifts over time. Models must incorporate this …

Advancing deep active learning & data subset selection: Unifying principles with information-theory intuitions

A Kirsch - arXiv preprint arXiv:2401.04305, 2024 - arxiv.org
At its core, this thesis aims to enhance the practicality of deep learning by improving the
label and training efficiency of deep learning models. To this end, we investigate data subset …

Efficient and Robust Quantization-aware Training via Adaptive Coreset Selection

X Huang, Z Liu, SY Liu, KT Cheng - arXiv preprint arXiv:2306.07215, 2023 - arxiv.org
Quantization-aware training (QAT) is a representative model compression method to reduce
redundancy in weights and activations. However, most existing QAT methods require end-to …

Modyn: Data-Centric Machine Learning Pipeline Orchestration

M Böther, T Robroek, V Gsteiger, R Holzinger… - Proceedings of the …, 2025 - dl.acm.org
In real-world machine learning (ML) pipelines, datasets are continuously growing. Models
must incorporate this new training data to improve generalization and adapt to potential …

Robust and Efficient Quantization-aware Training via Coreset Selection

X Huang, Z Liu, SY Liu, KT Cheng - Transactions on Machine Learning Research - openreview.net
Quantization-aware training (QAT) is a representative model compression method to reduce
redundancy in weights and activations. However, most existing QAT methods require end-to …