xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking
Safety alignment mechanisms are essential for preventing large language models (LLMs) from generating harmful information or unethical content. However, cleverly crafted prompts …
[PDF] Sparse Autoencoders for Interpretability in Reinforcement Learning Models
C DuPlessie - 2024 - math.mit.edu
Sparse Autoencoders for Interpretability in Reinforcement Learning Models: Introduction, State of the Art, Reinforcement Learning, Interpretability, Conclusion …