Unleashing the power of data tsunami: A comprehensive survey on data assessment and selection for instruction tuning of language models

Y Qin, Y Yang, P Guo, G Li, H Shao, Y Shi, Z Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Instruction tuning plays a critical role in aligning large language models (LLMs) with human
preference. Despite the vast amount of open instruction datasets, naively training an LLM on …

Data Valuation and Detections in Federated Learning

W Li, S Fu, F Zhang, Y Pang - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
Federated Learning (FL) enables collaborative model training while preserving the privacy
of raw data. A challenge in this framework is the fair and efficient valuation of data which is …

Data acquisition: A new frontier in data-centric AI

L Chen, B Acun, N Ardalani, Y Sun, F Kang… - arXiv preprint arXiv …, 2023 - arxiv.org
As Machine Learning (ML) systems continue to grow, the demand for relevant and
comprehensive datasets becomes imperative. There is limited study on the challenges of …

A Survey on Data Markets

J Zhang, Y Bi, M Cheng, J Liu, K Ren, Q Sun… - arXiv preprint arXiv …, 2024 - arxiv.org
Data is the new oil of the 21st century. The growing trend of trading data for greater welfare
has led to the emergence of data markets. A data market is any mechanism whereby the …

AutoScale: Automatic prediction of compute-optimal data composition for training LLMs

F Kang, Y Sun, B Wen, S Chen, D Song… - arXiv preprint arXiv …, 2024 - arxiv.org
Domain reweighting is an emerging research area aimed at adjusting the relative weights of
different data sources to improve the effectiveness and efficiency of language model pre …

An Adaptive Pricing Framework for Real-Time AI Model Service Exchange

J Gao, Z Wang, X Wei - IEEE Transactions on Network Science …, 2024 - ieeexplore.ieee.org
Artificial intelligence (AI) model services offer remarkable efficiency and automation,
engaging customers across various tasks. However, not all AI consumers possess sufficient …

TAROT: Targeted Data Selection via Optimal Transport

L Feng, F Nie, Y Liu, A Alahi - arXiv preprint arXiv:2412.00420, 2024 - arxiv.org
We propose TAROT, a targeted data selection framework grounded in optimal transport
theory. Previous targeted data selection methods primarily rely on influence-based greedy …

MYCROFT: Towards Effective and Efficient External Data Augmentation

Z Sarwar, V Tran, AN Bhagoji, N Feamster… - arXiv preprint arXiv …, 2024 - arxiv.org
Machine learning (ML) models often require large amounts of data to perform well. When the
available data is limited, model trainers may need to acquire more data from external …

Fair Classification with Partial Feedback: An Exploration-Based Data-Collection Approach

V Keswani, A Mehrotra, LE Celis - arXiv preprint arXiv:2402.11338, 2024 - arxiv.org
In many predictive contexts (e.g., credit lending), true outcomes are only observed for
samples that were positively classified in the past. These past observations, in turn, form …

Private Wasserstein Distance with Random Noises

W Li, H Wang, Z Huang, Y Pang - arXiv preprint arXiv:2404.06787, 2024 - arxiv.org
Wasserstein distance is a principal measure of data divergence from a distributional
standpoint. However, its application becomes challenging in the context of data privacy …