ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods
The rapid scaling of large language models (LLMs) has raised concerns about the
transparency and fair use of their pretraining data. Detecting such …
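The title above names a concrete scoring idea: compare a passage's log-likelihood under the model conditioned on a known non-member prefix against its unconditional log-likelihood, and treat the relative change as a membership signal. A minimal illustrative sketch follows; the function names, the toy per-token log-probabilities, and the decision direction are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of a relative-conditional-log-likelihood membership
# score. We assume per-token log-probabilities of a target passage are
# already available from an LLM, both unconditionally and conditioned on
# a known non-member prefix. All numbers below are toy values.

def avg_log_likelihood(token_logprobs):
    """Mean per-token log-probability of a passage."""
    return sum(token_logprobs) / len(token_logprobs)

def relative_cond_ll(conditional_logprobs, unconditional_logprobs):
    """Ratio of conditional to unconditional average log-likelihood:
    how much conditioning on a non-member prefix shifts the passage's
    likelihood under the model."""
    return (avg_log_likelihood(conditional_logprobs)
            / avg_log_likelihood(unconditional_logprobs))

# Toy intuition: a memorized (member) passage already has a high
# unconditional likelihood, so the prefix shifts it relatively little.
member = relative_cond_ll([-1.1, -0.9, -1.0], [-1.0, -1.0, -1.0])
non_member = relative_cond_ll([-2.8, -3.1, -3.1], [-2.0, -2.1, -1.9])
print(member < non_member)  # a smaller relative shift suggests membership
```

In practice the per-token log-probabilities would come from a single forward pass over the prefix-plus-target sequence, and the threshold separating members from non-members would be calibrated on held-out data.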
Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data?
Large Language Models (LLMs) have demonstrated impressive capabilities in generating
diverse and contextually rich text. However, concerns regarding copyright infringement arise …
Membership inference attacks cannot prove that a model was trained on your data
We consider the problem of a training data proof, where a data creator or owner wants to
demonstrate to a third party that some machine learning model was trained on their data …
Pretraining data detection for large language models: A divergence-based calibration method
As the scale of training corpora for large language models (LLMs) grows, model developers
become increasingly reluctant to disclose details on their data. This lack of transparency …
Protecting privacy in multimodal large language models with MLLMU-Bench
Generative models such as Large Language Models (LLMs) and Multimodal Large
Language Models (MLLMs) trained on massive web corpora can memorize and disclose …
Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding
The training data in large language models is key to their success, but it also presents
privacy and security risks, as it may contain sensitive information. Detecting pre-training data …
Detecting Training Data of Large Language Models via Expectation Maximization
The widespread deployment of large language models (LLMs) has led to impressive
advancements, yet information about their training data, a critical factor in their performance …
Benchmark Data Contamination of Large Language Models: A Survey
The rapid development of Large Language Models (LLMs) like GPT-4, Claude-3, and
Gemini has transformed the field of natural language processing. However, it has also …
Mosaic Memory: Fuzzy Duplication in Copyright Traps for Large Language Models
The immense datasets used to develop Large Language Models (LLMs) often include
copyright-protected content, typically without the content creator's consent. Copyright traps …
Investigating the Feasibility of Mitigating Potential Copyright Infringement via Large Language Model Unlearning
G Dou - arXiv preprint arXiv:2412.18621, 2024 - arxiv.org
Pre-trained Large Language Models (LLMs) have demonstrated remarkable capabilities but
also pose risks by learning and generating copyrighted material, leading to significant legal …