Arthur Conmy
Google DeepMind
Verified email at google.com
Title | Cited by | Year
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small
KR Wang, A Variengien, A Conmy, B Shlegeris, J Steinhardt
ICLR 2023, 2022
Cited by 450 (2022)
Towards Automated Circuit Discovery for Mechanistic Interpretability
A Conmy, AN Mavor-Parker, A Lynch, S Heimersheim, A Garriga-Alonso
NeurIPS 2023 Spotlight, 2023
Cited by 249 (2023)
Stealing Part of a Production Language Model
N Carlini, D Paleka, KD Dvijotham, T Steinke, J Hayase, AF Cooper, ...
ICML 2024 Best Paper, 2024
Cited by 77 (2024)
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
T Lieberum, S Rajamanoharan, A Conmy, L Smith, N Sonnerat, V Varma, ...
BlackboxNLP 2024 Oral, 2024
Cited by 60 (2024)
Improving Dictionary Learning with Gated Sparse Autoencoders
S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ...
NeurIPS 2024, 2024
Cited by 50 (2024)
Attribution Patching Outperforms Automated Circuit Discovery
A Syed, C Rager, A Conmy
BlackboxNLP 2024, 2023
Cited by 49 (2023)
Copy Suppression: Comprehensively Understanding a Motif in Language Model Attention Heads
CS McDougall, A Conmy, C Rushing, T McGrath, N Nanda
BlackboxNLP 2024, 2023
Cited by 37* (2023)
Successor Heads: Recurring, Interpretable Attention Heads In The Wild
R Gould, E Ong, G Ogden, A Conmy
ICLR 2024, 2023
Cited by 35 (2023)
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders
S Rajamanoharan, T Lieberum, N Sonnerat, A Conmy, V Varma, J Kramár, ...
arXiv preprint arXiv:2407.14435, 2024
Cited by 33 (2024)
Interpreting Attention Layer Outputs with Sparse Autoencoders
C Kissane, R Krzyzanowski, JI Bloom, A Conmy, N Nanda
ICML 2024 Mechanistic Interpretability Workshop Spotlight, 2024
Cited by 28* (2024)
Improving Steering Vectors by Targeting Sparse Autoencoder Features
S Chalnev, M Siu, A Conmy
arXiv preprint arXiv:2411.02193, 2024
Cited by 7* (2024)
StyleGAN-induced Data-Driven Regularization for Inverse Problems
A Conmy, S Mukherjee, CB Schönlieb
IEEE ICASSP 2022, 2022
Cited by 6 (2022)
SAEs (Usually) Transfer Between Base and Chat Models
C Kissane, R Krzyzanowski, A Conmy, N Nanda
alignmentforum.org/posts/fmwk6qxrpW8d4jvbd, 2024
Cited by 5 (2024)
Activation Steering with SAEs
A Conmy, N Nanda
alignmentforum.org/posts/C5KAZQib3bzzpeyrg#Activation_Steering_with_SAEs, 2024
Cited by 5 (2024)
Self-explaining SAE Features
D Kharlapenko, S Shabalin, N Nanda, A Conmy
alignmentforum.org/posts/self-explaining-sae-features, 2024
Cited by 4* (2024)
Applying Sparse Autoencoders to Unlearn Knowledge in Language Models
E Farrell, YT Lau, A Conmy
Safe Generative AI Workshop at NeurIPS 2024, 2024
Cited by 4 (2024)
Open Problems in Mechanistic Interpretability
L Sharkey, B Chughtai, J Batson, J Lindsey, J Wu, L Bushnaq, ...
arXiv preprint arXiv:2501.16496, 2025
Cited by 3 (2025)
SAEs are Highly Dataset Dependent: a case study on the Refusal Direction
C Kissane, R Krzyzanowski, N Nanda, A Conmy
Alignment Forum, 2024
Cited by 3 (2024)
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders
A Karvonen, C Rager, J Lin, C Tigges, J Bloom, D Chanin, YT Lau, ...
https://www.neuronpedia.org/sae-bench/info, 2025
Cited by 2* (2025)
Progress Update #1 from the GDM Mech Interp Team
N Nanda, A Conmy, L Smith, S Rajamanoharan, T Lieberum, J Kramár, ...
alignmentforum.org/posts/C5KAZQib3bzzpeyrg, 2024
Cited by 2* (2024)