Mechanistic Interpretability for AI Safety -- A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse-engineering the computational …

Opening the black box of large language models: Two views on holistic interpretability

H Zhao, F Yang, H Lakkaraju, M Du - arXiv e-prints, 2024 - ui.adsabs.harvard.edu
As large language models (LLMs) grow more powerful, concerns around potential harms
like toxicity, unfairness, and hallucination threaten user trust. Ensuring beneficial alignment …

Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness

Q Zhang, Y Wang, J Cui, X Pan, Q Lei… - arXiv preprint arXiv …, 2024 - arxiv.org
Deep learning models often suffer from a lack of interpretability due to polysemanticity,
where individual neurons are activated by multiple unrelated semantics, resulting in unclear …

Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution

MU Haider, H Rizwan, H Sajjad, P Ju… - arXiv preprint arXiv …, 2025 - arxiv.org
Interpreting and controlling the internal mechanisms of large language models (LLMs) is
crucial for improving their trustworthiness and utility. Recent efforts have primarily focused …

SAFR: Neuron Redistribution for Interpretability

R Chang, C Deng, H Chen - arXiv preprint arXiv:2501.16374, 2025 - arxiv.org
Superposition refers to encoding representations of multiple features within a single neuron,
which is common in transformers. This property allows neurons to combine and represent …

Wasserstein Distances, Neuronal Entanglement, and Sparsity

openreview.net
Disentangling polysemantic neurons is at the core of many current approaches to
interpretability of large language models. Here we attempt to study how disentanglement …