BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
In this paper, we introduce the BeaverTails dataset, aimed at fostering research on safety
alignment in large language models (LLMs). This dataset uniquely separates annotations of …
Large language model alignment: A survey
Recent years have witnessed remarkable progress made in large language models (LLMs).
Such advancements, while garnering significant attention, have concurrently elicited various …
Universal jailbreak backdoors from poisoned human feedback
Reinforcement Learning from Human Feedback (RLHF) is used to align large language
models to produce helpful and harmless responses. Yet, prior work showed these models …
Gaining wisdom from setbacks: Aligning large language models via mistake analysis
The rapid advancement of large language models (LLMs) presents both opportunities and
challenges, particularly concerning unintentional generation of harmful and toxic responses …
Unmasking and improving data credibility: A study with datasets for training harmless language models
Language models have shown promise in various tasks but can be affected by undesired
data during training, fine-tuning, or alignment. For example, if some unsafe conversations …
Are Large Language Models Really Robust to Word-Level Perturbations?
The swift advancement in the scales and capabilities of Large Language Models (LLMs)
positions them as promising tools for a variety of downstream tasks. In addition to the pursuit …
Red teaming game: A game-theoretic framework for red teaming language models
Deployable Large Language Models (LLMs) must conform to the criterion of helpfulness and
harmlessness, thereby achieving consistency between LLMs outputs and human values …
Measuring value understanding in language models through discriminator-critique gap
Recent advancements in Large Language Models (LLMs) have heightened concerns about
their potential misalignment with human values. However, evaluating their grasp of these …
Heterogeneous Value Alignment Evaluation for Large Language Models
The emergent capabilities of Large Language Models (LLMs) have made it crucial to align
their values with those of humans. However, current methodologies typically attempt to …
Aligner: One Global Token is Worth Millions of Parameters When Aligning Large Language Models
We introduce Aligner, a novel Parameter-Efficient Fine-Tuning (PEFT) method for aligning
multi-billion-parameter-sized Large Language Models (LLMs). Aligner employs a unique …