Mechanistic Interpretability for AI Safety – A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …

A practical review of mechanistic interpretability for transformer-based language models

D Rai, Y Zhou, S Feng, A Saparov, Z Yao - arXiv preprint arXiv …, 2024 - arxiv.org
Mechanistic interpretability (MI) is an emerging sub-field of interpretability that seeks to
understand a neural network model by reverse-engineering its internal computations …
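
The core experimental move in this literature is causal intervention on internal activations. Below is a minimal sketch of activation patching on a toy PyTorch module (module and variable names are illustrative, not from the paper): cache an activation from a clean run, splice it into a corrupted run, and measure the change in output.

    import torch

    torch.manual_seed(0)
    model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(),
                                torch.nn.Linear(16, 1))
    x_clean, x_corrupt = torch.randn(1, 8), torch.randn(1, 8)

    cache = {}
    def cache_hook(mod, inp, out):
        cache["h"] = out.detach()          # store the clean activation

    def patch_hook(mod, inp, out):
        return cache["h"]                  # overwrite with the cached clean value

    with torch.no_grad():
        h = model[0].register_forward_hook(cache_hook)
        model(x_clean)                     # clean run: fill the cache
        h.remove()
        baseline = model(x_corrupt)        # corrupted run, no intervention
        h = model[0].register_forward_hook(patch_hook)
        patched = model(x_corrupt)         # corrupted run with clean activation
        h.remove()

    print(f"effect of patching layer 0: {(patched - baseline).item():.4f}")

A large effect indicates the patched component carries information that matters for the output on this input pair.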

ContextCite: Attributing model generation to context

B Cohen-Wang, H Shah… - Advances in Neural …, 2025 - proceedings.neurips.cc
How do language models use information provided as context when generating a
response? Can we infer whether a particular generated statement is actually grounded in …
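
ContextCite's actual estimator fits a linear surrogate over many random context ablations; the sketch below is a simpler leave-one-out variant that conveys the same idea. The logprob argument is a hypothetical callable returning the model's log-probability of the statement given a list of context sources.

    def attribute(context_sources, statement, logprob):
        """Score each context source by how much removing it hurts the statement."""
        base = logprob(context_sources, statement)
        scores = []
        for i in range(len(context_sources)):
            ablated = context_sources[:i] + context_sources[i + 1:]
            # Drop in log-probability when source i is removed
            scores.append(base - logprob(ablated, statement))
        return scores

High scores flag the context sources a generated statement is plausibly grounded in.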

Attribution patching outperforms automated circuit discovery

A Syed, C Rager, A Conmy - arXiv preprint arXiv:2310.10348, 2023 - arxiv.org
Automated interpretability research has recently attracted attention as a potential research
direction that could scale explanations of neural network behavior to large models. Existing …
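
Attribution patching replaces one forward pass per patched component with a single backward pass: the effect of patching is approximated to first order as (clean activation − corrupted activation) · gradient of the metric. A minimal sketch on a toy module, with illustrative names:

    import torch

    torch.manual_seed(0)
    model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(),
                                torch.nn.Linear(16, 1))
    x_clean, x_corrupt = torch.randn(1, 8), torch.randn(1, 8)

    acts = {}
    def save(mod, inp, out):
        acts["mlp0"] = out
        if out.requires_grad:
            out.retain_grad()              # keep the gradient for attribution

    h = model[0].register_forward_hook(save)

    with torch.no_grad():                  # clean run: cache activations only
        model(x_clean)
    a_clean = acts["mlp0"]

    metric = model(x_corrupt).sum()        # corrupted run, with gradients
    metric.backward()
    a_corrupt, grad = acts["mlp0"].detach(), acts["mlp0"].grad
    h.remove()

    # First-order estimate of each unit's patching effect
    attribution = ((a_clean - a_corrupt) * grad).squeeze(0)
    print(attribution.topk(3).indices)     # units predicted to matter most

All components are scored from one clean and one corrupted pass, which is what makes the method cheap enough to scale past exhaustive patching.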

Information flow routes: Automatically interpreting language models at scale

J Ferrando, E Voita - arXiv preprint arXiv:2403.00824, 2024 - arxiv.org
Information flows along routes inside the network via mechanisms implemented in the model.
These routes can be represented as graphs where nodes correspond to token …
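
These graphs can be extracted cheaply because edge importance comes from a forward pass rather than from patching experiments. A toy sketch of the graph-building step, scoring edges with raw attention weights as a stand-in for the paper's contribution-based importance measure (threshold and shapes are illustrative):

    import torch

    def routes_from_attention(attn, tau=0.1):
        # attn: (n_layers, n_heads, seq, seq) attention weights
        n_layers, _, seq, _ = attn.shape
        edges = []
        for layer in range(n_layers):
            w = attn[layer].mean(dim=0)        # aggregate over heads
            for dst in range(seq):
                for src in range(dst + 1):     # respect the causal mask
                    if w[dst, src] > tau:
                        edges.append((layer, src, dst, float(w[dst, src])))
        return edges

    attn = torch.softmax(torch.randn(2, 4, 5, 5), dim=-1)
    print(len(routes_from_attention(attn)), "edges kept")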

Decomposing and editing predictions by modeling model computation

H Shah, A Ilyas, A Madry - arXiv preprint arXiv:2404.11534, 2024 - arxiv.org
How does the internal computation of a machine learning model transform inputs into
predictions? In this paper, we introduce a task called component modeling that aims to …
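
One way to read the idea: fit a surrogate that predicts the model's output from a binary mask over its components, so the learned weights attribute the prediction to individual components. The sketch below uses a plain least-squares fit over random ablation masks; run_ablated is a hypothetical callable that evaluates the model with the masked-out components ablated.

    import torch

    def fit_component_attribution(run_ablated, n_components, n_samples=512):
        # Random binary masks: 1 = keep component, 0 = ablate it
        masks = (torch.rand(n_samples, n_components) > 0.5).float()
        outs = torch.tensor([float(run_ablated(m)) for m in masks])
        # Linear surrogate: outs ≈ masks @ w + b, solved by least squares
        X = torch.cat([masks, torch.ones(n_samples, 1)], dim=1)
        sol = torch.linalg.lstsq(X, outs.unsqueeze(1)).solution.squeeze(1)
        return sol[:-1], sol[-1]               # per-component weights, bias

Components with large weights are those whose ablation most changes the output, which also suggests where to intervene to edit a prediction.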

AI and the Problem of Knowledge Collapse

AJ Peterson - AI & Society, 2025 - Springer
While artificial intelligence has the potential to process vast amounts of data, generate new
insights, and unlock greater productivity, its widespread adoption may entail unforeseen …

Improving sparse decomposition of language model activations with gated sparse autoencoders

S Rajamanoharan, A Conmy, L Smith… - Advances in …, 2025 - proceedings.neurips.cc
Recent work has found that sparse autoencoders (SAEs) are an effective technique for
unsupervised discovery of interpretable features in language models' (LMs) activations, by …
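
The gated variant decouples two jobs an ordinary ReLU SAE forces a single encoder to do: deciding which features fire and estimating how strongly. A sketch of the forward pass, paraphrasing the paper's architecture (a Heaviside gate plus a magnitude path sharing encoder directions via a per-feature rescale); dimensions are illustrative:

    import torch
    import torch.nn as nn

    class GatedSAE(nn.Module):
        def __init__(self, d_model, d_sae):
            super().__init__()
            self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.02)
            self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.02)
            self.r_mag = nn.Parameter(torch.zeros(d_sae))   # per-feature rescale
            self.b_gate = nn.Parameter(torch.zeros(d_sae))
            self.b_mag = nn.Parameter(torch.zeros(d_sae))
            self.b_dec = nn.Parameter(torch.zeros(d_model))

        def forward(self, x):
            x_cent = x - self.b_dec
            gate = ((x_cent @ self.W_enc + self.b_gate) > 0).float()  # which features fire
            mag = torch.relu(x_cent @ (self.W_enc * torch.exp(self.r_mag))
                             + self.b_mag)                            # how strongly
            return (gate * mag) @ self.W_dec + self.b_dec             # reconstruction

    sae = GatedSAE(d_model=64, d_sae=512)
    x_hat = sae(torch.randn(8, 64))

Training losses (reconstruction plus a sparsity penalty on the gate path) are omitted in this sketch.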

LLM circuit analyses are consistent across training and scale

C Tigges, M Hanna, Q Yu, S Biderman - arXiv preprint arXiv:2407.10827, 2024 - arxiv.org
Most currently deployed large language models (LLMs) undergo continuous training or
additional finetuning. By contrast, most research into LLMs' internal mechanisms focuses on …

Answer, assemble, ace: Understanding how transformers answer multiple choice questions

S Wiegreffe, O Tafjord, Y Belinkov, H Hajishirzi… - arXiv preprint arXiv …, 2024 - arxiv.org
Multiple-choice question answering (MCQA) is a key competence of performant transformer
language models that is tested by mainstream benchmarks. However, recent evidence …