From generation to judgment: Opportunities and challenges of LLM-as-a-judge
Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …
MM-SafetyBench: A benchmark for safety evaluation of multimodal large language models
The security concerns surrounding Large Language Models (LLMs) have been extensively
explored, yet the safety of Multimodal Large Language Models (MLLMs) remains …
The unlocking spell on base LLMs: Rethinking alignment via in-context learning
Alignment tuning has become the de facto standard practice for enabling base large
language models (LLMs) to serve as open-domain AI assistants. The alignment tuning …
How Johnny can persuade LLMs to jailbreak them: Rethinking persuasion to challenge AI safety by humanizing LLMs
Most traditional AI safety research has approached AI models as machines and centered on
algorithm-focused attacks developed by security experts. As large language models (LLMs) …
Defending large language models against jailbreaking attacks through goal prioritization
While significant attention has been dedicated to exploiting weaknesses in LLMs through
jailbreaking attacks, there remains a paucity of effort in defending against these attacks. We …
LLM Self Defense: By self examination, LLMs know they are being tricked
Large language models (LLMs) are popular for high-quality text generation but can produce
harmful content, even when aligned with human values through reinforcement learning …
Low-resource languages jailbreak GPT-4
AI safety training and red-teaming of large language models (LLMs) are measures to
mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual …
Shadow alignment: The ease of subverting safely-aligned language models
Warning: This paper contains examples of harmful language, and reader discretion is
recommended. The increasing open release of powerful large language models (LLMs) has …
HarmBench: A standardized evaluation framework for automated red teaming and robust refusal
Automated red teaming holds substantial promise for uncovering and mitigating the risks
associated with the malicious use of large language models (LLMs), yet the field lacks a …
Jailbreak attacks and defenses against large language models: A survey
Large Language Models (LLMs) have performed exceptionally in various text-generative
tasks, including question answering, translation, and code completion. However, the over …