OpenDataVal: a unified benchmark for data valuation

K Jiang, W Liang, JY Zou… - Advances in Neural …, 2023 - proceedings.neurips.cc
Assessing the quality and impact of individual data points is critical for improving model
performance and mitigating undesirable biases within the training dataset. Several data …

Performance scaling via optimal transport: Enabling data selection from partially revealed sources

F Kang, HA Just, AK Sahu, R Jia - Advances in Neural …, 2023 - proceedings.neurips.cc
Traditionally, data selection has been studied in settings where all samples from prospective
sources are fully revealed to a machine learning developer. However, in practical data …

TRIAGE: Characterizing and auditing training data for improved regression

N Seedat, J Crabbé, Z Qian… - Advances in Neural …, 2023 - proceedings.neurips.cc
Data quality is crucial for robust machine learning algorithms, with the recent interest in
data-centric AI emphasizing the importance of training data characterization. However, current …

Data Shapley in one training run

JT Wang, P Mittal, D Song, R Jia - arXiv preprint arXiv:2406.11011, 2024 - arxiv.org
Data Shapley provides a principled framework for attributing data's contribution within
machine learning contexts. However, existing approaches require re-training models on …

Rethinking Data Shapley for data selection tasks: Misleads and merits

JT Wang, T Yang, J Zou, Y Kwon, R Jia - arXiv preprint arXiv:2405.03875, 2024 - arxiv.org
Data Shapley provides a principled approach to data valuation and plays a crucial role in
data-centric machine learning (ML) research. Data selection is considered a standard …

Selectivity drives productivity: efficient dataset pruning for enhanced transfer learning

Y Zhang, Y Zhang, A Chen, J Jia, J Liu… - Advances in …, 2023 - proceedings.neurips.cc
Massive data is often considered essential for deep learning applications, but it also incurs
significant computational and infrastructural costs. Therefore, dataset pruning (DP) has …

Get more for less: Principled data selection for warming up fine-tuning in LLMs

F Kang, HA Just, Y Sun, H Jahagirdar, Y Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
This work focuses on leveraging and selecting from vast, unlabeled, open data to
pre-fine-tune a pre-trained language model. The goal is to minimize the need for costly domain …

Distributionally robust data valuation

X Lin, X Xu, Z Wu, SK Ng, BKH Low - Forty-first International …, 2024 - openreview.net
Data valuation quantifies the contribution of each data point to the performance of a machine
learning model. Existing works typically define the value of data by its improvement of the …

Data valuation and detections in federated learning

W Li, S Fu, F Zhang, Y Pang - Proceedings of the IEEE/CVF …, 2024 - openaccess.thecvf.com
Federated Learning (FL) enables collaborative model training while preserving the privacy
of raw data. A challenge in this framework is the fair and efficient valuation of data which is …

Finding needles in a haystack: A black-box approach to invisible watermark detection

M Pan, Z Wang, X Dong, V Sehwag, L Lyu… - European Conference on …, 2024 - Springer
In this paper, we propose WaterMark Detector (WMD), the first invisible watermark detection
method under a black-box and annotation-free setting. WMD is capable of detecting arbitrary …