AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, so do risks from misalignment. To provide a comprehensive …

Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …

KAN: Kolmogorov-Arnold networks

Z Liu, Y Wang, S Vaidya, F Ruehle, J Halverson… - arXiv preprint arXiv …, 2024 - arxiv.org
Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold
Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs …

Towards automated circuit discovery for mechanistic interpretability

A Conmy, A Mavor-Parker, A Lynch… - Advances in …, 2023 - proceedings.neurips.cc
Through considerable effort and intuition, several recent works have reverse-engineered
nontrivial behaviors of transformer models. This paper systematizes the mechanistic …

Language models represent space and time

W Gurnee, M Tegmark - arXiv preprint arXiv:2310.02207, 2023 - arxiv.org
The capabilities of large language models (LLMs) have sparked debate over whether such
systems just learn an enormous collection of superficial statistics or a set of more coherent …

Scaling and evaluating sparse autoencoders

L Gao, TD la Tour, H Tillman, G Goh, R Troll… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders provide a promising unsupervised approach for extracting
interpretable features from a language model by reconstructing activations from a sparse …

Eliciting latent predictions from transformers with the tuned lens

N Belrose, Z Furman, L Smith, D Halawi… - arXiv preprint arXiv …, 2023 - arxiv.org
We analyze transformers from the perspective of iterative inference, seeking to understand
how model predictions are refined layer by layer. To do so, we train an affine probe for each …

Toward transparent AI: A survey on interpreting the inner structures of deep neural networks

T Räuker, A Ho, S Casper… - 2023 IEEE conference …, 2023 - ieeexplore.ieee.org
The last decade of machine learning has seen drastic increases in scale and capabilities.
Deep neural networks (DNNs) are increasingly being deployed in the real world. However …

Birth of a transformer: A memory viewpoint

A Bietti, V Cabannes, D Bouchacourt… - Advances in …, 2023 - proceedings.neurips.cc
Large language models based on transformers have achieved great empirical successes.
However, as they are deployed more widely, there is a growing need to better understand …

The clock and the pizza: Two stories in mechanistic explanation of neural networks

Z Zhong, Z Liu, M Tegmark… - Advances in neural …, 2023 - proceedings.neurips.cc
Do neural networks, trained on well-understood algorithmic tasks, reliably rediscover known
algorithms? Several recent studies, on tasks ranging from group operations to in-context …