Tom Lieberum - Academic Search

Cited by

	All	Since 2020
Citations	572	572
h-index	9	9
i10-index	8	8

Citations per year: 2022: 3, 2023: 106, 2024: 426, 2025: 37

Tom Lieberum

Google DeepMind

Verified email at deepmind.com

deep learning, large language models, interpretability


Title	Cited by	Year
Progress measures for grokking via mechanistic interpretability N Nanda, L Chan, T Lieberum, J Smith, J Steinhardt arXiv preprint arXiv:2301.05217, 2023	328	2023
Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik arXiv preprint arXiv:2307.09458, 2023	61	2023
Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2 T Lieberum, S Rajamanoharan, A Conmy, L Smith, N Sonnerat, V Varma, ... arXiv preprint arXiv:2408.05147, 2024	47	2024
Improving dictionary learning with gated sparse autoencoders S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ... arXiv preprint arXiv:2404.16014, 2024	39	2024
Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders S Rajamanoharan, T Lieberum, N Sonnerat, A Conmy, V Varma, J Kramár, ... arXiv preprint arXiv:2407.14435, 2024	28	2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components J Kramár, T Lieberum, R Shah, N Nanda arXiv preprint arXiv:2403.00745, 2024	20	2024
Evaluating frontier models for dangerous capabilities M Phuong, M Aitchison, E Catt, S Cogan, A Kaskasoli, V Krakovna, ..., Seliem El-Sayed, Sasha Brown, Anca Dragan, Rohin Shah, Allan Dafoe, and Toby Shevlane 2024	15	2024
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik 2023	10	2023
Retrospective on the 2021 MineRL BASALT competition on learning from human feedback R Shah, SH Wang, C Wild, S Milani, A Kanervisto, VG Goecks, ... NeurIPS 2021 Competitions and Demonstrations Track, 259-272, 2022	9	2022
Improving dictionary learning with gated sparse autoencoders S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ... URL https://api.semanticscholar.org/CorpusID:269362142, 2024	6
Progress measures for grokking via mechanistic interpretability N Nanda, L Chan, T Lieberum, J Smith, J Steinhardt URL http://arxiv.org/abs/2301.05217, Oct. 2023	5
Improving sparse decomposition of language model activations with gated sparse autoencoders S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ... The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024	2	2024
Retrospective on the 2021 BASALT Competition on Learning from Human Feedback R Shah, SH Wang, C Wild, S Milani, A Kanervisto, VG Goecks, ... arXiv preprint arXiv:2204.07123, 2022	2	2022
Replication: Fairness without demographics through Adversarially Reweighted Learning E Jenner, T Lieberum, FP Nolte, N Rutsch
