ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods

R Xie, J Wang, R Huang, M Zhang, R Ge, J Pei… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid scaling of large language models (LLMs) has raised concerns about the
transparency and fair use of the data used to pretrain them. Detecting such …
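
The snippet above does not describe the method itself, so the following is only a rough illustration, inferred from the title, of how a membership score based on relative conditional log-likelihoods might be computed: compare the log-likelihood of a candidate passage on its own with its log-likelihood when conditioned on a prefix known not to be in the training set. The model name, the choice of non-member prefix, and the helper functions below are illustrative assumptions, not the paper's implementation.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "gpt2"  # placeholder; the papers above target much larger LLMs
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    model.eval()

    def log_likelihood(text: str, prefix: str = "") -> float:
        """Sum of token log-probabilities of `text`, optionally conditioned on `prefix`."""
        prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids if prefix else None
        text_ids = tokenizer(text, return_tensors="pt").input_ids
        if prefix_ids is not None:
            input_ids = torch.cat([prefix_ids, text_ids], dim=1)
            n_prefix = prefix_ids.shape[1]
        else:
            input_ids = text_ids
            n_prefix = 0
        with torch.no_grad():
            logits = model(input_ids).logits
        # Position i predicts token i+1, so shift logits and targets by one.
        log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
        targets = input_ids[:, 1:]
        token_ll = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
        # Keep only the log-probabilities of the `text` tokens (skip prefix positions).
        return token_ll[:, max(n_prefix - 1, 0):].sum().item()

    def relative_conditional_score(candidate: str, nonmember_prefix: str) -> float:
        # Intuition from the title: conditioning on non-member context is expected to
        # shift the likelihood of unseen text more than that of training-set text.
        return log_likelihood(candidate, prefix=nonmember_prefix) / log_likelihood(candidate)

In practice one would likely average per-token log-probabilities and calibrate a decision threshold on held-out non-member text; the sketch only shows the direction of the comparison.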

Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data?

MA Panaitescu-Liess, Z Che, B An, Y Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated impressive capabilities in generating
diverse and contextually rich text. However, concerns regarding copyright infringement arise …

Membership inference attacks cannot prove that a model was trained on your data

J Zhang, D Das, G Kamath, F Tramèr - arXiv preprint arXiv:2409.19798, 2024 - arxiv.org
We consider the problem of a training data proof, where a data creator or owner wants to
demonstrate to a third party that some machine learning model was trained on their data …

Pretraining Data Detection for Large Language Models: A Divergence-Based Calibration Method

W Zhang, R Zhang, J Guo, M de Rijke, Y Fan… - arXiv preprint arXiv …, 2024 - arxiv.org
As the scale of training corpora for large language models (LLMs) grows, model developers
become increasingly reluctant to disclose details on their data. This lack of transparency …

Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench

Z Liu, G Dou, M Jia, Z Tan, Q Zeng, Y Yuan… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative models such as Large Language Models (LLMs) and Multimodal Large
Language Models (MLLMs) trained on massive web corpora can memorize and disclose …

Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding

C Wang, Y Wang, B Hooi, Y Cai, N Peng… - arXiv preprint arXiv …, 2024 - arxiv.org
The training data in large language models is key to their success, but it also presents
privacy and security risks, as it may contain sensitive information. Detecting pre-training data …

Detecting Training Data of Large Language Models via Expectation Maximization

G Kim, Y Li, E Spiliopoulou, J Ma, M Ballesteros… - arXiv preprint arXiv …, 2024 - arxiv.org
The widespread deployment of large language models (LLMs) has led to impressive
advancements, yet information about their training data, a critical factor in their performance …

Benchmark Data Contamination of Large Language Models: A Survey

C Xu, S Guan, D Greene, M Kechadi - arXiv preprint arXiv:2406.04244, 2024 - arxiv.org
The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and
Gemini has transformed the field of natural language processing. However, it has also …

Mosaic Memory: Fuzzy Duplication in Copyright Traps for Large Language Models

I Shilov, M Meeus, YA de Montjoye - arXiv preprint arXiv:2405.15523, 2024 - arxiv.org
The immense datasets used to develop Large Language Models (LLMs) often include
copyright-protected content, typically without the content creator's consent. Copyright traps …

Investigating the Feasibility of Mitigating Potential Copyright Infringement via Large Language Model Unlearning

G Dou - arXiv preprint arXiv:2412.18621, 2024 - arxiv.org
Pre-trained Large Language Models (LLMs) have demonstrated remarkable capabilities but
also pose risks by learning and generating copyrighted material, leading to significant legal …