xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking

S Lee, S Ni, C Wei, S Li, L Fan, A Argha… - arXiv preprint arXiv …, 2025 - arxiv.org
Safety alignment mechanisms are essential for preventing large language models (LLMs)
from generating harmful information or unethical content. However, cleverly crafted prompts …

[PDF][PDF] Sparse Autoencoders for Interpretability in Reinforcement Learning Models

C DuPlessie - 2024 - math.mit.edu
Slide outline: Introduction · State of the Art · Reinforcement Learning Interpretability · Conclusion …