ShieldLM: Empowering LLMs as aligned, customizable and explainable safety detectors

Z Zhang, Y Lu, J Ma, D Zhang, R Li, P Ke, H Sun… - arXiv preprint arXiv …, 2024 - arxiv.org
The safety of Large Language Models (LLMs) has gained increasing attention in recent
years, but there is still no comprehensive approach for detecting safety issues within …

Detoxifying large language models via knowledge editing

M Wang, N Zhang, Z Xu, Z Xi, S Deng, Y Yao… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper investigates using knowledge editing techniques to detoxify Large Language
Models (LLMs). We construct a benchmark, SafeEdit, which covers nine unsafe categories …

High-dimension human value representation in large language models

S Cahyawijaya, D Chen, Y Bang, L Khalatbari… - arXiv preprint arXiv …, 2024 - arxiv.org
The widespread application of Large Language Models (LLMs) across various tasks and
fields has necessitated the alignment of these models with human values and preferences …

Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context

N Das, E Raff, M Gaur - arXiv preprint arXiv:2412.16359, 2024 - arxiv.org
Previous research on LLM vulnerabilities often relied on nonsensical adversarial prompts,
which were easily detectable by automated methods. We address this gap by focusing on …

Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context

N Das, E Raff, M Gaur - arXiv preprint arXiv:2407.14644, 2024 - arxiv.org
Previous research on testing the vulnerabilities in Large Language Models (LLMs) using
adversarial attacks has primarily focused on nonsensical prompt injections, which are easily …

Detoxifying Large Language Models via Kahneman-Tversky Optimization

Q Li, W Du, J Liu - CCF International Conference on Natural Language …, 2024 - Springer
Currently, the application of Large Language Models (LLMs) faces significant security
threats. Harmful questions and adversarial attack prompts can induce LLMs to generate …