AI alignment: A comprehensive survey
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, so do risks from misalignment. To provide a comprehensive …
Discovering language model behaviors with model-written evaluations
As language models (LMs) scale, they develop many novel behaviors, good and bad,
exacerbating the need to evaluate how they behave. Prior work creates evaluations with …
Attack prompt generation for red teaming and defending large language models
Large language models (LLMs) are susceptible to red teaming attacks, which can induce
LLMs to generate harmful content. Previous research constructs attack prompts via manual …
Gaining wisdom from setbacks: Aligning large language models via mistake analysis
The rapid development of large language models (LLMs) has not only provided numerous
opportunities but also presented significant challenges. This becomes particularly evident …
AutoDetect: Towards a unified framework for automated weakness detection in large language models
Although Large Language Models (LLMs) are becoming increasingly powerful, they still
exhibit significant but subtle weaknesses, such as mistakes in instruction-following or coding …
InstructSafety: a unified framework for building multidimensional and explainable safety detector through instruction tuning
Safety detection has been an increasingly important topic in recent years and it has become
even more necessary to develop reliable safety detection systems with the rapid …
Towards safer generative language models: A survey on safety risks, evaluations, and improvements
As generative large model capabilities advance, safety concerns become more pronounced
in their outputs. To ensure the sustainable growth of the AI ecosystem, it's imperative to …
An Auditing Test to Detect Behavioral Shift in Language Models
As language models (LMs) approach human-level performance, a comprehensive
understanding of their behavior becomes crucial. This includes evaluating capabilities …
CMD: a framework for Context-aware Model self-Detoxification
Text detoxification aims to minimize the risk of language models producing toxic content.
Existing detoxification methods of directly constraining the model output or further training …
What's the most important value? INVP: INvestigating the Value Priorities of LLMs through Decision-making in Social Scenarios
As large language models (LLMs) demonstrate impressive performance in various tasks and
are increasingly integrated into the decision-making process, ensuring they align with …