AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, so do risks from misalignment. To provide a comprehensive …

Discovering language model behaviors with model-written evaluations

E Perez, S Ringer, K Lukosiute, K Nguyen… - Findings of the …, 2023 - aclanthology.org
As language models (LMs) scale, they develop many novel behaviors, good and bad,
exacerbating the need to evaluate how they behave. Prior work creates evaluations with …

Attack prompt generation for red teaming and defending large language models

B Deng, W Wang, F Feng, Y Deng, Q Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) are susceptible to red teaming attacks, which can induce
LLMs to generate harmful content. Previous research constructs attack prompts via manual …

Gaining wisdom from setbacks: Aligning large language models via mistake analysis

K Chen, C Wang, K Yang, J Han, L Hong, F Mi… - arXiv preprint arXiv …, 2023 - arxiv.org
The rapid development of large language models (LLMs) has not only provided numerous
opportunities but also presented significant challenges. This becomes particularly evident …

Autodetect: Towards a unified framework for automated weakness detection in large language models

J Cheng, Y Lu, X Gu, P Ke, X Liu, Y Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
Although Large Language Models (LLMs) are becoming increasingly powerful, they still
exhibit significant but subtle weaknesses, such as mistakes in instruction-following or coding …

InstructSafety: a unified framework for building multidimensional and explainable safety detector through instruction tuning

Z Zhang, J Cheng, H Sun, J Deng… - Findings of the …, 2023 - aclanthology.org
Safety detection has been an increasingly important topic in recent years and it has become
even more necessary to develop reliable safety detection systems with the rapid …

Towards safer generative language models: A survey on safety risks, evaluations, and improvements

J Deng, J Cheng, H Sun, Z Zhang, M Huang - arXiv preprint arXiv …, 2023 - arxiv.org
As generative large model capabilities advance, safety concerns become more pronounced
in their outputs. To ensure the sustainable growth of the AI ecosystem, it's imperative to …

An Auditing Test to Detect Behavioral Shift in Language Models

L Richter, X He, P Minervini, MJ Kusner - arXiv preprint arXiv:2410.19406, 2024 - arxiv.org
As language models (LMs) approach human-level performance, a comprehensive
understanding of their behavior becomes crucial. This includes evaluating capabilities …

CMD: a framework for Context-aware Model self-Detoxification

Z Tang, K Zhou, J Li, Y Ding, P Wang, B Yan… - arXiv preprint arXiv …, 2023 - arxiv.org
Text detoxification aims to minimize the risk of language models producing toxic content.
Existing detoxification methods of directly constraining the model output or further training …

What's the most important value? INVP: INvestigating the Value Priorities of LLMs through Decision-making in Social Scenarios

X Liu, P Liu, D Yu - … of the 31st International Conference on …, 2025 - aclanthology.org
As large language models (LLMs) demonstrate impressive performance in various tasks and
are increasingly integrated into the decision-making process, ensuring they align with …