Arthur Conmy

Cited by

	All	Since 2020
Citations	1109	1109
h-index	10	10
i10-index	10	10

Citations per year
Year	Citations
2022	5
2023	157
2024	762
2025	182

Co-authors

Neel Nanda - Mechanistic Interpretability Team Lead, Google DeepMind - Verified email at deepmind.com
Alexandre Variengien - ENS de Lyon & EPFL - Verified email at ens-lyon.fr
Aengus Lynch - University College London, MATS - Verified email at ucl.ac.uk
Senthooran Rajamanoharan - Google DeepMind - Verified email at google.com
Rohin Shah - Research Scientist, Google DeepMind - Verified email at deepmind.com
Jacob Steinhardt - Stanford University - Verified email at cs.stanford.edu
Connor Kissane - Independent - Verified email at richmond.edu
Robert Krzyzanowski - Poseidon Research - Verified email at poseidonresearch.com
Adrià Garriga-Alonso - Research Scientist, FAR AI - Verified email at far.ai
Nicholas Carlini - Google DeepMind - Verified email at google.com
Daniel Paleka - ETH Zurich - Verified email at inf.ethz.ch
Anca D Dragan - Assistant Professor at UC Berkeley // Director, AI Safety and Alignment, Google DeepMind - Verified email at berkeley.edu
Aaquib Syed - MATS 5.0 | Student, University of Maryland - Verified email at umd.edu
Rhys Gould - Mathematics Undergraduate, University of Cambridge - Verified email at cam.ac.uk
Euan Ong - Research Assistant, University of Cambridge - Verified email at cam.ac.uk
Joseph Isaac Bloom - UK AI Safety Institute - Verified email at dsit.gov.uk
Stepan Shabalin - Georgia Institute of Technology - Verified email at gatech.edu
Rowan Wang - Verified email at rdwrs.com
Bilal Chughtai - Google DeepMind - Verified email at google.com

Arthur Conmy

Google DeepMind

Verified email at google.com

Mechanistic Interpretability, AI Safety


Title	Authors	Venue	Cited by	Year
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small	KR Wang, A Variengien, A Conmy, B Shlegeris, J Steinhardt	ICLR 2023, 2022	450	2022
Towards Automated Circuit Discovery for Mechanistic Interpretability	A Conmy, AN Mavor-Parker, A Lynch, S Heimersheim, A Garriga-Alonso	NeurIPS 2023 Spotlight, 2023	249	2023
Stealing Part of a Production Language Model	N Carlini, D Paleka, KD Dvijotham, T Steinke, J Hayase, AF Cooper, ...	ICML 2024 Best Paper, 2024	77	2024
Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2	T Lieberum, S Rajamanoharan, A Conmy, L Smith, N Sonnerat, V Varma, ...	BlackboxNLP 2024 Oral, 2024	60	2024
Improving Dictionary Learning with Gated Sparse Autoencoders	S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ...	NeurIPS 2024, 2024	50	2024
Attribution Patching Outperforms Automated Circuit Discovery	A Syed, C Rager, A Conmy	BlackboxNLP 2024, 2023	49	2023
Copy Suppression: Comprehensively Understanding a Motif in Language Model Attention Heads	CS McDougall, A Conmy, C Rushing, T McGrath, N Nanda	BlackboxNLP 2024, 2023	37*	2023
Successor Heads: Recurring, Interpretable Attention Heads In The Wild	R Gould, E Ong, G Ogden, A Conmy	ICLR 2024, 2023	35	2023
Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders	S Rajamanoharan, T Lieberum, N Sonnerat, A Conmy, V Varma, J Kramár, ...	arXiv preprint arXiv:2407.14435, 2024	33	2024
Interpreting Attention Layer Outputs with Sparse Autoencoders	C Kissane, R Krzyzanowski, JI Bloom, A Conmy, N Nanda	ICML 2024 Mechanistic Interpretability Workshop Spotlight, 2024	28*	2024
Improving Steering Vectors by Targeting Sparse Autoencoder Features	S Chalnev, M Siu, A Conmy	arXiv preprint arXiv:2411.02193, 2024	7*	2024
StyleGAN-induced Data-Driven Regularization for Inverse Problems	A Conmy, S Mukherjee, CB Schönlieb	IEEE ICASSP 2022, 2022	6	2022
SAEs (Usually) Transfer Between Base and Chat Models	C Kissane, R Krzyzanowski, A Conmy, N Nanda	alignmentforum.org/posts/fmwk6qxrpW8d4jvbd, 2024	5	2024
Activation Steering with SAEs	A Conmy, N Nanda	alignmentforum.org/posts/C5KAZQib3bzzpeyrg#Activation_Steering_with_SAEs, 2024	5	2024
Self-explaining SAE Features	D Kharlapenko, S Shabalin, N Nanda, A Conmy	alignmentforum.org/posts/self-explaining-sae-features, 2024	4*	2024
Applying Sparse Autoencoders to Unlearn Knowledge in Language Models	E Farrell, YT Lau, A Conmy	Safe Generative AI Workshop at NeurIPS 2024, 2024	4	2024
Open Problems in Mechanistic Interpretability	L Sharkey, B Chughtai, J Batson, J Lindsey, J Wu, L Bushnaq, ...	arXiv preprint arXiv:2501.16496, 2025	3	2025
SAEs are Highly Dataset Dependent: a case study on the Refusal Direction	C Kissane, R Krzyzanowski, N Nanda, A Conmy	Alignment Forum, 2024	3	2024
SAEBench: A Comprehensive Benchmark for Sparse Autoencoders	A Karvonen, C Rager, J Lin, C Tigges, J Bloom, D Chanin, YT Lau, ...	https://www.neuronpedia.org/sae-bench/info, 2025	2*	2025
Progress Update #1 from the GDM Mech Interp Team	N Nanda, A Conmy, L Smith, S Rajamanoharan, T Lieberum, J Kramár, ...	alignmentforum.org/posts/C5KAZQib3bzzpeyrg, 2024	2*	2024

Articles 1–20