BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset

J Ji, M Liu, J Dai, X Pan, C Zhang… - Advances in …, 2024 - proceedings.neurips.cc
In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety
alignment in large language models (LLMs). This dataset uniquely separates annotations of …

Large language model alignment: A survey

T Shen, R Jin, Y Huang, C Liu, W Dong, Z Guo… - arxiv preprint arxiv …, 2023 - arxiv.org
Recent years have witnessed remarkable progress made in large language models (LLMs).
Such advancements, while garnering significant attention, have concurrently elicited various …

Universal jailbreak backdoors from poisoned human feedback

J Rando, F Tramèr - arxiv preprint arxiv:2311.14455, 2023 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) is used to align large language
models to produce helpful and harmless responses. Yet, prior work showed these models …

Gaining wisdom from setbacks: Aligning large language models via mistake analysis

K Chen, C Wang, K Yang, J Han, L Hong, F Mi… - arxiv preprint arxiv …, 2023 - arxiv.org
The rapid advancement of large language models (LLMs) presents both opportunities and
challenges, particularly concerning unintentional generation of harmful and toxic responses …

Unmasking and improving data credibility: A study with datasets for training harmless language models

Z Zhu, J Wang, H Cheng, Y Liu - arxiv preprint arxiv:2311.11202, 2023 - arxiv.org
Language models have shown promise in various tasks but can be affected by undesired
data during training, fine-tuning, or alignment. For example, if some unsafe conversations …

Are Large Language Models Really Robust to Word-Level Perturbations?

H Wang, G Ma, C Yu, N Gui, L Zhang, Z Huang… - arxiv preprint arxiv …, 2023 - arxiv.org
The swift advancement in the scales and capabilities of Large Language Models (LLMs)
positions them as promising tools for a variety of downstream tasks. In addition to the pursuit …

Red teaming game: A game-theoretic framework for red teaming language models

C Ma, Z Yang, M Gao, H Ci, J Gao, X Pan… - arxiv preprint arxiv …, 2023 - arxiv.org
Deployable Large Language Models (LLMs) must conform to the criteria of helpfulness and
harmlessness, thereby achieving consistency between LLM outputs and human values …

Measuring value understanding in language models through discriminator-critique gap

Z Zhang, F Bai, J Gao, Y Yang - arxiv preprint arxiv:2310.00378, 2023 - arxiv.org
Recent advancements in Large Language Models (LLMs) have heightened concerns about
their potential misalignment with human values. However, evaluating their grasp of these …

Heterogeneous Value Alignment Evaluation for Large Language Models

Z Zhang, C Zhang, N Liu, S Qi, Z Rong, SC Zhu… - arxiv preprint arxiv …, 2023 - arxiv.org
The emergent capabilities of Large Language Models (LLMs) have made it crucial to align
their values with those of humans. However, current methodologies typically attempt to …

Aligner: One Global Token is Worth Millions of Parameters When Aligning Large Language Models

Z Ziheng, Y Wu, SC Zhu, D Terzopoulos - arxiv preprint arxiv:2312.05503, 2023 - arxiv.org
We introduce Aligner, a novel Parameter-Efficient Fine-Tuning (PEFT) method for aligning
multi-billion-parameter-sized Large Language Models (LLMs). Aligner employs a unique …