BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety
alignment in large language models (LLMs). This dataset uniquely separates annotations of …
Large language model alignment: A survey
Recent years have witnessed remarkable progress made in large language models (LLMs).
Such advancements, while garnering significant attention, have concurrently elicited various …
Universal jailbreak backdoors from poisoned human feedback
Reinforcement Learning from Human Feedback (RLHF) is used to align large language
models to produce helpful and harmless responses. Yet, prior work showed these models …
Gaining wisdom from setbacks: Aligning large language models via mistake analysis
The rapid advancement of large language models (LLMs) presents both opportunities and
challenges, particularly concerning unintentional generation of harmful and toxic responses …
Unmasking and improving data credibility: A study with datasets for training harmless language models
Language models have shown promise in various tasks but can be affected by undesired
data during training, fine-tuning, or alignment. For example, if some unsafe conversations …
Are Large Language Models Really Robust to Word-Level Perturbations?
The swift advancement in the scales and capabilities of Large Language Models (LLMs)
positions them as promising tools for a variety of downstream tasks. In addition to the pursuit …
Red teaming game: A game-theoretic framework for red teaming language models
Deployable Large Language Models (LLMs) must conform to the criterion of helpfulness and
harmlessness, thereby achieving consistency between LLMs outputs and human values …
Measuring value understanding in language models through discriminator-critique gap
Recent advancements in Large Language Models (LLMs) have heightened concerns about
their potential misalignment with human values. However, evaluating their grasp of these …
Heterogeneous Value Alignment Evaluation for Large Language Models
The emergent capabilities of Large Language Models (LLMs) have made it crucial to align
their values with those of humans. However, current methodologies typically attempt to …
Aligner: One Global Token is Worth Millions of Parameters When Aligning Large Language Models
We introduce Aligner, a novel Parameter-Efficient Fine-Tuning (PEFT) method for aligning
multi-billion-parameter-sized Large Language Models (LLMs). Aligner employs a unique …