AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

M Andriushchenko, A Souly, M Dziemian… - arXiv preprint arXiv…, 2024 - arxiv.org
The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent
safety measures and misuse model capabilities, has been studied primarily for LLMs acting …

Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models

H Yang, L Qu, E Shareghi, G Haffari - arXiv preprint arXiv:2410.11459, 2024 - arxiv.org
Large language models (LLMs) have exhibited outstanding performance in engaging with
humans and addressing complex questions by leveraging their vast implicit knowledge and …

You Know What I'm Saying: Jailbreak Attack via Implicit Reference

T Wu, L Mei, R Yuan, L Li, W Xue, Y Guo - arXiv preprint arXiv:2410.03857, 2024 - arxiv.org
While recent advancements in large language model (LLM) alignment have enabled the
effective identification of malicious objectives involving scene nesting and keyword rewriting …