Unleashing the power of data tsunami: A comprehensive survey on data assessment and selection for instruction tuning of language models
Instruction tuning plays a critical role in aligning large language models (LLMs) with human
preference. Despite the vast amount of open instruction datasets, naively training an LLM on …
Data Valuation and Detections in Federated Learning
Federated Learning (FL) enables collaborative model training while preserving the privacy
of raw data. A challenge in this framework is the fair and efficient valuation of data which is …
Data acquisition: A new frontier in data-centric AI
As Machine Learning (ML) systems continue to grow, the demand for relevant and
comprehensive datasets becomes imperative. There is limited study on the challenges of …
A Survey on Data Markets
Data is the new oil of the 21st century. The growing trend of trading data for greater welfare
has led to the emergence of data markets. A data market is any mechanism whereby the …
AutoScale: Automatic prediction of compute-optimal data composition for training LLMs
Domain reweighting is an emerging research area aimed at adjusting the relative weights of
different data sources to improve the effectiveness and efficiency of language model pre …
An Adaptive Pricing Framework for Real-Time AI Model Service Exchange
J Gao, Z Wang, X Wei - IEEE Transactions on Network Science …, 2024 - ieeexplore.ieee.org
Artificial intelligence (AI) model services offer remarkable efficiency and automation,
engaging customers across various tasks. However, not all AI consumers possess sufficient …
TAROT: Targeted Data Selection via Optimal Transport
We propose TAROT, a targeted data selection framework grounded in optimal transport
theory. Previous targeted data selection methods primarily rely on influence-based greedy …
MYCROFT: Towards Effective and Efficient External Data Augmentation
Machine learning (ML) models often require large amounts of data to perform well. When the
available data is limited, model trainers may need to acquire more data from external …
Fair Classification with Partial Feedback: An Exploration-Based Data-Collection Approach
In many predictive contexts (e.g., credit lending), true outcomes are only observed for
samples that were positively classified in the past. These past observations, in turn, form …
Private Wasserstein Distance with Random Noises
W Li, H Wang, Z Huang, Y Pang - arXiv preprint arXiv:2404.06787, 2024 - arxiv.org
Wasserstein distance is a principal measure of data divergence from a distributional
standpoint. However, its application becomes challenging in the context of data privacy …