A survey on data selection for language models
A major factor in the recent success of large language models is the use of enormous and
ever-growing text datasets for unsupervised pre-training. However, naively training a model …
Dataset distillation with convexified implicit gradients
We propose a new dataset distillation algorithm using reparameterization and
convexification of implicit gradients (RCIG) that substantially improves the state-of-the-art …
You only condense once: Two rules for pruning condensed datasets
Dataset condensation is a crucial tool for enhancing training efficiency by reducing the size
of the training dataset, particularly in on-device scenarios. However, these scenarios have …
Towards sustainable learning: Coresets for data-efficient deep learning
To improve the efficiency and sustainability of learning deep models, we propose CREST,
the first scalable framework with rigorous theoretical guarantees to identify the most valuable …
A bounded ability estimation for computerized adaptive testing
Computerized adaptive testing (CAT), as a tool that can efficiently measure a student's ability,
has been widely used in various standardized tests (e.g., GMAT and GRE). The adaptivity of …
Loss-curvature matching for dataset selection and condensation
Training neural networks on a large dataset incurs substantial computational costs.
Dataset reduction selects or synthesizes data instances based on the large dataset, while …
D2 pruning: Message passing for balancing diversity and difficulty in data pruning
Analytical theories suggest that higher-quality data can lead to lower test errors in models
trained on a fixed data budget. Moreover, a model can be trained on a lower compute …
Data-centric green artificial intelligence: A survey
With the exponential growth of computational power and the availability of large-scale
datasets in recent years, remarkable advancements have been made in the field of artificial …
A survey of dataset refinement for problems in computer vision datasets
Large-scale datasets have played a crucial role in the advancement of computer vision.
However, they often suffer from problems such as class imbalance, noisy labels, dataset …
M3D: Dataset condensation by minimizing maximum mean discrepancy
Training state-of-the-art (SOTA) deep models often requires extensive data, resulting in
substantial training and storage costs. To address these challenges, dataset condensation …