| Title | Authors | Venue | Cited by | Year |
|---|---|---|---|---|
| Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small | KR Wang, A Variengien, A Conmy, B Shlegeris, J Steinhardt | ICLR 2023 | 455 | 2022 |
| Towards Automated Circuit Discovery for Mechanistic Interpretability | A Conmy, AN Mavor-Parker, A Lynch, S Heimersheim, A Garriga-Alonso | NeurIPS 2023 (Spotlight) | 255 | 2023 |
| Stealing Part of a Production Language Model | N Carlini, D Paleka, KD Dvijotham, T Steinke, J Hayase, AF Cooper, ... | ICML 2024 (Best Paper) | 78 | 2024 |
| Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 | T Lieberum, S Rajamanoharan, A Conmy, L Smith, N Sonnerat, V Varma, ... | BlackboxNLP 2024 (Oral) | 63 | 2024 |
| Improving Dictionary Learning with Gated Sparse Autoencoders | S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ... | NeurIPS 2024 | 52 | 2024 |
| Attribution Patching Outperforms Automated Circuit Discovery | A Syed, C Rager, A Conmy | BlackboxNLP 2024 | 51 | 2023 |
| Copy Suppression: Comprehensively Understanding a Motif in Language Model Attention Heads | CS McDougall, A Conmy, C Rushing, T McGrath, N Nanda | BlackboxNLP 2024 | 38* | 2023 |
| Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders | S Rajamanoharan, T Lieberum, N Sonnerat, A Conmy, V Varma, J Kramár, ... | arXiv preprint arXiv:2407.14435 | 35 | 2024 |
| Successor Heads: Recurring, Interpretable Attention Heads In The Wild | R Gould, E Ong, G Ogden, A Conmy | ICLR 2024 | 35 | 2023 |
| Interpreting Attention Layer Outputs with Sparse Autoencoders | C Kissane, R Krzyzanowski, JI Bloom, A Conmy, N Nanda | ICML 2024 Mechanistic Interpretability Workshop (Spotlight) | 29* | 2024 |
| Improving Steering Vectors by Targeting Sparse Autoencoder Features | S Chalnev, M Siu, A Conmy | arXiv preprint arXiv:2411.02193 | 9* | 2024 |
| StyleGAN-induced Data-Driven Regularization for Inverse Problems | A Conmy, S Mukherjee, CB Schönlieb | IEEE ICASSP 2022 | 6 | 2022 |
| SAEs (Usually) Transfer Between Base and Chat Models | C Kissane, R Krzyzanowski, A Conmy, N Nanda | Alignment Forum (alignmentforum.org/posts/fmwk6qxrpW8d4jvbd) | 5 | 2024 |
| Activation Steering with SAEs | A Conmy, N Nanda | Alignment Forum (alignmentforum.org/posts/C5KAZQib3bzzpeyrg#Activation_Steering_with_SAEs) | 5 | 2024 |
| Self-explaining SAE Features | D Kharlapenko, S Shabalin, N Nanda, A Conmy | Alignment Forum (alignmentforum.org/posts/self-explaining-sae-features) | 4* | 2024 |
| Applying Sparse Autoencoders to Unlearn Knowledge in Language Models | E Farrell, YT Lau, A Conmy | Safe Generative AI Workshop at NeurIPS 2024 | 4 | 2024 |
| Open Problems in Mechanistic Interpretability | L Sharkey, B Chughtai, J Batson, J Lindsey, J Wu, L Bushnaq, ... | arXiv preprint arXiv:2501.16496 | 3 | 2025 |
| SAEs are Highly Dataset Dependent: a case study on the Refusal Direction | C Kissane, R Krzyzanowski, N Nanda, A Conmy | Alignment Forum | 3 | 2024 |
| SAEBench: A Comprehensive Benchmark for Sparse Autoencoders | A Karvonen, C Rager, J Lin, C Tigges, J Bloom, D Chanin, YT Lau, ... | https://www.neuronpedia.org/sae-bench/info | 2* | 2025 |
| Progress Update #1 from the GDM Mech Interp Team | N Nanda, A Conmy, L Smith, S Rajamanoharan, T Lieberum, J Kramár, ... | Alignment Forum (alignmentforum.org/posts/C5KAZQib3bzzpeyrg) | 2* | 2024 |