A survey on data selection for language models

A Albalak, Y Elazar, SM Xie, S Longpre… - arXiv preprint arXiv …, 2024 - arxiv.org
A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …

Dataset distillation with convexified implicit gradients

N Loo, R Hasani, M Lechner… - … Conference on Machine …, 2023 - proceedings.mlr.press
We propose a new dataset distillation algorithm using reparameterization and
convexification of implicit gradients (RCIG) that substantially improves the state-of-the-art …

You only condense once: Two rules for pruning condensed datasets

Y He, L Xiao, JT Zhou - Advances in Neural Information …, 2023 - proceedings.neurips.cc
Dataset condensation is a crucial tool for enhancing training efficiency by reducing the size
of the training dataset, particularly in on-device scenarios. However, these scenarios have …

Towards sustainable learning: Coresets for data-efficient deep learning

Y Yang, H Kang… - … Conference on Machine …, 2023 - proceedings.mlr.press
To improve the efficiency and sustainability of learning deep models, we propose CREST,
the first scalable framework with rigorous theoretical guarantees to identify the most valuable …

A bounded ability estimation for computerized adaptive testing

Y Zhuang, Q Liu, GH Zhao, Z Huang… - Advances in …, 2024 - proceedings.neurips.cc
Computerized adaptive testing (CAT), as a tool that can efficiently measure students' ability,
has been widely used in various standardized tests (e.g., the GMAT and GRE). The adaptivity of …

Loss-curvature matching for dataset selection and condensation

S Shin, H Bae, D Shin, W Joo… - … Conference on Artificial …, 2023 - proceedings.mlr.press
Training neural networks on a large dataset requires substantial computational costs.
Dataset reduction selects or synthesizes data instances based on the large dataset, while …

D2 pruning: Message passing for balancing diversity and difficulty in data pruning

A Maharana, P Yadav, M Bansal - arXiv preprint arXiv:2310.07931, 2023 - arxiv.org
Analytical theories suggest that higher-quality data can lead to lower test errors in models
trained on a fixed data budget. Moreover, a model can be trained on a lower compute …
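The snippet above describes pruning that balances example difficulty against diversity via message passing on a data graph. As a rough, hypothetical illustration only (not the authors' exact D2 algorithm), one round of such propagation on a k-nearest-neighbor graph might look like this; `d2_style_scores`, `gamma`, and the score update rule are all assumptions for the sketch:

```python
import numpy as np

def d2_style_scores(feats, difficulty, k=1, gamma=0.5):
    """Hypothetical one-round message-passing sketch: each example's
    difficulty score absorbs a gamma-weighted share of its k nearest
    neighbors' difficulty, so dense, hard regions of the data graph
    accumulate high scores before selection/pruning."""
    n = len(feats)
    # Pairwise squared Euclidean distances between feature rows.
    dist2 = np.sum((feats[:, None, :] - feats[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(dist2, np.inf)  # exclude self-neighbors
    nbrs = np.argsort(dist2, axis=1)[:, :k]  # k nearest neighbors per node
    scores = difficulty.astype(float).copy()
    for i in range(n):
        scores[i] += gamma * difficulty[nbrs[i]].sum()
    return scores
```

A pruning step would then keep the top-scoring examples; the full method also passes reverse messages to down-weight neighbors of already-selected points, which this sketch omits.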

Data-centric green artificial intelligence: A survey

S Salehi, A Schmeink - IEEE Transactions on Artificial …, 2023 - ieeexplore.ieee.org
With the exponential growth of computational power and the availability of large-scale
datasets in recent years, remarkable advancements have been made in the field of artificial …

A survey of dataset refinement for problems in computer vision datasets

Z Wan, Z Wang, CT Chung, Z Wang - ACM computing surveys, 2024 - dl.acm.org
Large-scale datasets have played a crucial role in the advancement of computer vision.
However, they often suffer from problems such as class imbalance, noisy labels, dataset …

M3D: Dataset condensation by minimizing maximum mean discrepancy

H Zhang, S Li, P Wang, D Zeng, S Ge - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Training state-of-the-art (SOTA) deep models often requires extensive data, resulting in
substantial training and storage costs. To address these challenges, dataset condensation …
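The entry above condenses a dataset by minimizing the maximum mean discrepancy (MMD) between real and synthetic samples. For reference, a standard biased estimator of squared MMD with a Gaussian kernel can be computed as below; the function names and the choice of kernel bandwidth `sigma` are illustrative, not taken from the paper:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between rows of X and rows of Y."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of squared MMD between samples X and Y:
    E[k(x,x')] - 2 E[k(x,y)] + E[k(y,y')]."""
    return (gaussian_kernel(X, X, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean())
```

A condensation method in this family would treat the synthetic set as learnable parameters and minimize such an MMD estimate (typically in a feature space) with gradient descent.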