Towards Multi-dimensional Explanation Alignment for Medical Classification

L Hu, S Lai, W Chen, H Xiao, H Lin… - Advances in …, 2025 - proceedings.neurips.cc
The lack of interpretability in the field of medical image analysis has significant ethical and
legal implications. Existing interpretable methods in this domain encounter several …

Steering language model refusal with sparse autoencoders

K O'Brien, D Majercak, X Fernandes, R Edgar… - arXiv preprint arXiv …, 2024 - arxiv.org
Responsible practices for deploying language models include guiding models to recognize
and refuse answering prompts that are considered unsafe, while complying with safe …

MQA-KEAL: Multi-hop question answering under knowledge editing for Arabic language

MA Ali, N Daftardar, M Waheed, J Qin… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated significant capabilities across
numerous application domains. A key challenge is to keep these models updated with latest …

Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning

L Zhang, L Hu, D Wang - arXiv preprint arXiv:2502.09022, 2025 - arxiv.org
Transformer-based language models have achieved notable success, yet their internal
reasoning mechanisms remain largely opaque due to complex non-linear interactions and …

EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification

L Zhang, W Dong, Z Zhang, S Yang, L Hu, N Liu… - arXiv preprint arXiv …, 2025 - arxiv.org
Understanding the internal mechanisms of transformer-based language models remains
challenging. Mechanistic interpretability based on circuit discovery aims to reverse engineer …