Enhancing large vision language models with self-training on image comprehension

Y Deng, P Lu, F Yin, Z Hu, S Shen, Q Gu, J Zou… - arXiv preprint arXiv …, 2024 - arxiv.org
Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the perception capability of the model to …

STaR-GATE: Teaching language models to ask clarifying questions

C Andukuri, JP Fränken, T Gerstenberg… - arXiv preprint arXiv …, 2024 - arxiv.org
When prompting language models to complete a task, users often leave important aspects
unsaid. While asking questions could resolve this ambiguity (GATE; Li et al., 2023), models …

Generative reward models

D Mahan, D Van Phung, R Rafailov, C Blagden… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement Learning from Human Feedback (RLHF) has greatly improved the
performance of modern Large Language Models (LLMs). The RLHF process is resource …

PERSONA: A Reproducible Testbed for Pluralistic Alignment

L Castricato, N Lile, R Rafailov, JP Fränken… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of language models (LMs) necessitates robust alignment with
diverse user values. However, current preference optimization approaches often fail to …

Aligning large language models via self-steering optimization

H Xiang, B Yu, H Lin, K Lu, Y Lu, X Han, L Sun… - arXiv preprint arXiv …, 2024 - arxiv.org
Automated alignment develops alignment systems with minimal human intervention. The
key to automated alignment lies in providing learnable and accurate preference signals for …

Is Free Self-Alignment Possible?

D Adila, C Shin, Y Zhang, F Sala - arXiv preprint arXiv:2406.03642, 2024 - arxiv.org
Aligning pretrained language models (LMs) is a complex and resource-intensive process,
often requiring access to large amounts of ground-truth preference data and substantial …

LLM Safety Alignment is Divergence Estimation in Disguise

R Haldar, Z Wang, Q Song, G Lin, Y Xing - arXiv preprint arXiv:2502.00657, 2025 - arxiv.org
We propose a theoretical framework demonstrating that popular Large Language Model
(LLM) alignment methods, including Reinforcement Learning from Human Feedback (RLHF) …

Can Language Models Safeguard Themselves, Instantly and For Free?

D Adila, C Shin, Y Zhang, F Sala - ICML 2024 Next Generation of AI Safety … - openreview.net
Aligning pretrained language models (LMs) to handle a new safety scenario is normally
difficult and expensive, often requiring access to large amounts of ground-truth preference …

Generative Reward Models - A Unified Approach to RLHF and RLAIF

D Mahan, D Van Phung, R Rafailov, C Blagden, N Lile, L Castricato - static.synthlabs.ai
Reinforcement Learning from Human Feedback (RLHF) has greatly improved the
performance of modern Large Language Models (LLMs). The RLHF process is resource …