Deliberative Alignment: Reasoning Enables Safer Language Models

MY Guan, M Joglekar, E Wallace, S Jain… - arXiv preprint arXiv…, 2024 - arxiv.org
As large-scale language models increasingly impact safety-critical domains, ensuring their
reliable adherence to well-defined principles remains a fundamental challenge. We …

Jailbreaking LLM-Controlled Robots

A Robey, Z Ravichandran, V Kumar, H Hassani… - arXiv preprint arXiv…, 2024 - arxiv.org
The recent introduction of large language models (LLMs) has revolutionized the field of
robotics by enabling contextual reasoning and intuitive human-robot interaction in domains …

Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations

J Chi, U Karn, H Zhan, E Smith, J Rando… - arXiv preprint arXiv…, 2024 - arxiv.org
We introduce Llama Guard 3 Vision, a multimodal LLM-based safeguard for human-AI
conversations that involves image understanding: it can be used to safeguard content for …

Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking

J Zhu, L Yan, S Wang, D Yin, L Sha - arXiv preprint arXiv:2502.12970, 2025 - arxiv.org
The reasoning abilities of Large Language Models (LLMs) have demonstrated remarkable
advancement and exceptional performance across diverse domains. However, leveraging …

Why Safeguarded Ships Run Aground? Aligned Large Language Models' Safety Mechanisms Tend to Be Anchored in The Template Region

CT Leong, Q Yin, J Wang, W Li - arXiv preprint arXiv:2502.13946, 2025 - arxiv.org
The safety alignment of large language models (LLMs) remains vulnerable, as their initial
behavior can be easily jailbroken by even relatively simple attacks. Since infilling a fixed …

On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective

Y Huang, C Gao, S Wu, H Wang, X Wang… - arXiv preprint arXiv…, 2025 - arxiv.org
Generative Foundation Models (GenFMs) have emerged as transformative tools. However,
their widespread adoption raises critical concerns regarding trustworthiness across …

ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification

H Lee, S Oh, J Kim, J Shin, J Tack - arXiv preprint arXiv:2502.14565, 2025 - arxiv.org
Self-awareness, i.e., the ability to assess and correct one's own generation, is a fundamental
aspect of human intelligence, making its replication in large language models (LLMs) an …

Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models

XW Yang, XY Zhu, WD Wei, DC Zhang, JJ Shao… - arXiv preprint arXiv…, 2025 - arxiv.org
The integration of slow-thinking mechanisms into large language models (LLMs) offers a
promising way toward achieving Level 2 AGI Reasoners, as exemplified by systems like …