COLD-Attack: Jailbreaking LLMs with stealthiness and controllability
Jailbreaks on large language models (LLMs) have recently received increasing attention.
For a comprehensive assessment of LLM safety, it is essential to consider jailbreaks with …
Prompting4Debugging: Red-teaming text-to-image diffusion models by finding problematic prompts
Text-to-image diffusion models, e.g., Stable Diffusion (SD), have lately shown remarkable ability in high-quality content generation, and have become one of the representatives for the …
Against The Achilles' Heel: A Survey on Red Teaming for Generative Models
Generative models are rapidly gaining popularity and being integrated into everyday
applications, raising concerns over their safe use as various vulnerabilities are exposed. In …
FLIRT: Feedback loop in-context red teaming
Warning: this paper contains content that may be inappropriate or offensive. As generative
models become available for public use in various applications, testing and analyzing …
Trustworthy, responsible, and safe AI: A comprehensive architectural framework for AI safety with challenges and mitigations
AI Safety is an emerging area of critical importance to the safe adoption and deployment of
AI systems. With the rapid proliferation of AI and especially with the recent advancement of …
Exploring safety-utility trade-offs in personalized language models
As large language models (LLMs) become increasingly integrated into daily applications, it
is essential to ensure they operate fairly across diverse user demographics. In this work, we …
Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning
Automated red teaming can discover rare model failures and generate challenging
examples that can be used for training or evaluation. However, a core challenge in …
ASETF: A novel method for jailbreak attack on LLMs through translate suffix embeddings
The safety defense methods of large language models (LLMs) stay limited because the dangerous prompts are manually curated for just a few known attack types, which fails to keep …
Impact of non-standard unicode characters on security and comprehension in large language models
The advancement of large language models has significantly improved natural language
processing. However, challenges such as jailbreaks (prompt injections that cause an LLM to …
Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints
Recent work has proposed automated red-teaming methods for testing the vulnerabilities of
a given target large language model (LLM). These methods use red-teaming LLMs to …