Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …

Scaling and evaluating sparse autoencoders

L Gao, TD la Tour, H Tillman, G Goh, R Troll… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders provide a promising unsupervised approach for extracting
interpretable features from a language model by reconstructing activations from a sparse …
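The following is a minimal sketch of the kind of sparse autoencoder described in this snippet, assuming a standard single-hidden-layer design with a ReLU latent and an L1 sparsity penalty; the layer sizes, penalty coefficient, and random batch are illustrative, not the paper's actual setup.

```python
# Minimal sparse autoencoder (SAE) sketch: reconstruct model activations
# from a sparse, overcomplete latent code. Sizes are toy values.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)   # activations -> latent code
        self.decoder = nn.Linear(d_hidden, d_model)   # latent code -> reconstruction

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))               # sparse, non-negative features
        x_hat = self.decoder(f)                       # reconstructed activations
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    return ((x - x_hat) ** 2).mean() + l1_coeff * f.abs().mean()

# Toy usage: decompose a batch of stand-in residual-stream activations.
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
x = torch.randn(32, 512)
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
```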

Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2

T Lieberum, S Rajamanoharan, A Conmy… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly interpretable …

Towards principled evaluations of sparse autoencoders for interpretability and control

A Makelov, G Lange, N Nanda - arXiv preprint arXiv:2405.08366, 2024 - arxiv.org
Disentangling model activations into meaningful features is a central problem in
interpretability. However, the absence of ground-truth for these features in realistic scenarios …

A primer on the inner workings of transformer-based language models

J Ferrando, G Sarti, A Bisazza, M Costa-jussà - 2024 - research.rug.nl
The rapid progress of research aimed at interpreting the inner workings of advanced
language models has highlighted a need for contextualizing the insights gained from years …

Relational composition in neural networks: A survey and call to action

M Wattenberg, FB Viégas - arXiv preprint arXiv:2407.14662, 2024 - arxiv.org
Many neural nets appear to represent data as linear combinations of "feature vectors."
Algorithms for discovering these vectors have seen impressive recent success. However, we …
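A toy NumPy illustration of the "linear combinations of feature vectors" picture this snippet refers to; the feature dictionary, coefficients, and dimensions are invented for the example, not taken from the paper.

```python
# Represent an activation as a weighted sum of a small dictionary of
# feature directions, then read the weights back out by projection.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                       # activation dimension (illustrative)
Q, _ = np.linalg.qr(rng.standard_normal((d, 3)))
features = Q.T                              # three orthonormal feature directions, shape (3, d)

coeffs = np.array([2.0, 0.5, 0.0])          # how strongly each feature is "present"
activation = coeffs @ features              # x = sum_i a_i * f_i

# Projection recovers the coefficients exactly here because the
# feature directions are orthonormal.
recovered = features @ activation
print(np.round(recovered, 3))               # ~ [2.0, 0.5, 0.0]
```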

Mathematical models of computation in superposition

K Hänni, J Mendel, D Vaintrob, L Chan - arXiv preprint arXiv:2408.05451, 2024 - arxiv.org
Superposition--when a neural network represents more "features" than it has dimensions--
seems to pose a serious challenge to mechanistically interpreting current AI systems …
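A small numerical sketch of the superposition setting described here: many nearly orthogonal feature directions packed into a lower-dimensional space, at the cost of some interference. The dimensions and random directions are arbitrary choices for illustration.

```python
# Pack n feature directions into a d-dimensional space with n > d and
# measure the resulting pairwise interference (off-diagonal dot products).
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 512                              # 512 "features" in a 64-dimensional space
W = rng.standard_normal((n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

gram = W @ W.T                              # pairwise dot products between features
off_diag = gram[~np.eye(n, dtype=bool)]     # exclude each feature's self-overlap

# Interference is small but nonzero: features share the space with crosstalk.
print(f"max |interference|  = {np.abs(off_diag).max():.3f}")
print(f"mean |interference| = {np.abs(off_diag).mean():.3f}")
```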

Sparse autoencoders reveal universal feature spaces across large language models

M Lan, P Torr, A Meek, A Khakzar, D Krueger… - arXiv preprint arXiv …, 2024 - arxiv.org
We investigate feature universality in large language models (LLMs), a research field that
aims to understand how different models similarly represent concepts in the latent spaces of …

Disentangling dense embeddings with sparse autoencoders

C O'Neill, C Ye, K Iyer, JF Wu - arXiv preprint arXiv:2408.00657, 2024 - arxiv.org
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from
complex neural networks. We present one of the first applications of SAEs to dense text …

Explaining AI through mechanistic interpretability

L Kästner, B Crook - European Journal for Philosophy of Science, 2024 - Springer
Recent work in explainable artificial intelligence (XAI) attempts to render opaque AI systems
understandable through a divide-and-conquer strategy. However, this fails to illuminate how …