Jailbreak and guard aligned language models with only few in-context demonstrations
Large Language Models (LLMs) have shown remarkable success in various tasks, yet their safety and the risk of generating harmful content remain pressing concerns. In this paper, we …
Jailbreaking large language models against moderation guardrails via cipher characters
Large Language Models (LLMs) are typically harmless but remain vulnerable to carefully crafted prompts known as "jailbreaks", which can bypass protective measures and …
JailbreakZoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models
The rapid evolution of artificial intelligence (AI) through developments in Large Language Models (LLMs) and Vision-Language Models (VLMs) has brought significant advancements …
PsySafe: A comprehensive framework for psychological-based attack, defense, and evaluation of multi-agent system safety
Multi-agent systems, when enhanced with Large Language Models (LLMs), exhibit profound capabilities in collective intelligence. However, the potential misuse of this intelligence for …
Towards tracing trustworthiness dynamics: Revisiting pre-training period of large language models
Ensuring the trustworthiness of large language models (LLMs) is crucial. Most studies concentrate on fully pre-trained LLMs to better understand and improve LLMs' …
Identifying semantic induction heads to understand in-context learning
Although large language models (LLMs) have demonstrated remarkable performance, the lack of transparency in their inference logic raises concerns about their trustworthiness. To …
Adversarial tuning: Defending against jailbreak attacks for LLMs
Although safety-enhanced Large Language Models (LLMs) have achieved remarkable success in tackling various complex tasks in a zero-shot manner, they remain susceptible to …
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues
This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries. We …
Contextual API completion for unseen repositories using LLMs
Large language models have made substantial progress in addressing diverse code-related tasks. However, their adoption is hindered by inconsistencies in generating output due to the …
VLSBench: Unveiling visual leakage in multimodal safety
Safety concerns of Multimodal large language models (MLLMs) have gradually become an important problem in various applications. Surprisingly, previous works indicate a counter …