Agent Smith: A single image can jailbreak one million multimodal LLM agents exponentially fast

X Gu, X Zheng, T Pang, C Du, Q Liu, Y Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
A multimodal large language model (MLLM) agent can receive instructions, capture images,
retrieve histories from memory, and decide which tools to use. Nonetheless, red-teaming …

Decomposing and editing predictions by modeling model computation

H Shah, A Ilyas, A Madry - arXiv preprint arXiv:2404.11534, 2024 - arxiv.org
How does the internal computation of a machine learning model transform inputs into
predictions? In this paper, we introduce a task called component modeling that aims to …

Data attribution for text-to-image models by unlearning synthesized images

SY Wang, A Hertzmann, A Efros… - Advances in Neural …, 2025 - proceedings.neurips.cc
The goal of data attribution for text-to-image models is to identify the training images that
most influence the generation of a new image. Influence is defined such that, for a given …

Finding NeMo: Localizing neurons responsible for memorization in diffusion models

D Hintersdorf, L Struppek, K Kersting… - Advances in …, 2025 - proceedings.neurips.cc
Diffusion models (DMs) produce very detailed and high-quality images. Their power results
from extensive training on large amounts of data, usually scraped from the internet without …

Training data attribution via approximate unrolled differentiation

J Bae, W Lin, J Lorraine, R Grosse - arXiv preprint arXiv:2405.12186, 2024 - arxiv.org
Many training data attribution (TDA) methods aim to estimate how a model's behavior would
change if one or more data points were removed from the training set. Methods based on …
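For context, the counterfactual quantity that such removal-based TDA methods target is commonly written as follows; this is a standard leave-one-out definition, not quoted from the cited paper:

\[
\tau(z_i, z) \;=\; L\big(z;\, \theta^{\star}_{-i}\big) - L\big(z;\, \theta^{\star}\big),
\]

where \theta^{\star} denotes the model trained on the full dataset, \theta^{\star}_{-i} the model retrained with training point z_i removed, and L(z; \theta) the loss on a query example z.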

An economic solution to copyright challenges of generative AI

JT Wang, Z Deng, H Chiba-Okabe, B Barak… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative artificial intelligence (AI) systems are trained on large data corpora to generate
new pieces of text, images, videos, and other media. There is growing concern that such …

DiffusionPID: Interpreting Diffusion via Partial Information Decomposition

S Dewan, R Zawar, P Saxena… - Advances in Neural …, 2025 - proceedings.neurips.cc
Text-to-image diffusion models have made significant progress in generating naturalistic
images from textual inputs, and demonstrate the capacity to learn and represent complex …

Towards user-focused research in training data attribution for human-centered explainable AI

E Nguyen, J Bertram, E Kortukov, JY Song… - arXiv preprint arXiv …, 2024 - arxiv.org
While Explainable AI (XAI) aims to make AI understandable and useful to humans, it has
been criticised for relying too much on formalism and solutionism, focusing more on …

A survey of defenses against AI-generated visual media: Detection, disruption, and authentication

J Deng, C Lin, Z Zhao, S Liu, Q Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Deep generative models have demonstrated impressive performance in various computer
vision applications, including image synthesis, video generation, and medical analysis …

Most influential subset selection: Challenges, promises, and beyond

Y Hu, P Hu, H Zhao, JW Ma - arXiv preprint arXiv:2409.18153, 2024 - arxiv.org
How can we attribute the behaviors of machine learning models to their training data? While
the classic influence function sheds light on the impact of individual samples, it often fails to …
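For reference, the classic influence function mentioned above (in the form popularized by Koh & Liang, 2017) approximates the effect of upweighting a single training point z on the loss at a test point z_test as

\[
\mathcal{I}(z, z_{\text{test}}) \;=\; -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top}\, H_{\hat\theta}^{-1}\, \nabla_\theta L(z, \hat\theta),
\]

where \hat\theta is the trained parameter vector and H_{\hat\theta} is the Hessian of the average training loss at \hat\theta; this is the standard formula, not taken from the cited paper.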