Tom Lieberum - Academic Search


Cited by

	All	Since 2020
Citations	669	669
h-index	9	9
i10-index	9	9

Citations per year (chart): 2022: 3; 2023: 129; 2024: 416; 2025: 121

Tom Lieberum


Google DeepMind

Verified email at deepmind.com

deep learning, large language models, interpretability


Title	Cited by	Year
Progress measures for grokking via mechanistic interpretability N Nanda, L Chan, T Lieberum, J Smith, J Steinhardt arXiv preprint arXiv:2301.05217, 2023	354	2023
Does circuit analysis interpretability scale? evidence from multiple choice capabilities in chinchilla T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik arXiv preprint arXiv:2307.09458, 2023	67	2023
Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2 T Lieberum, S Rajamanoharan, A Conmy, L Smith, N Sonnerat, V Varma, ... arXiv preprint arXiv:2408.05147, 2024	62	2024
Improving dictionary learning with gated sparse autoencoders S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ... arXiv preprint arXiv:2404.16014, 2024	47	2024
Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders S Rajamanoharan, T Lieberum, N Sonnerat, A Conmy, V Varma, J Kramár, ... arXiv preprint arXiv:2407.14435, 2024	35	2024
Atp*: An efficient and scalable method for localizing llm behaviour to components J Kramár, T Lieberum, R Shah, N Nanda arXiv preprint arXiv:2403.00745, 2024	27	2024
Progress measures for grokking via mechanistic interpretability, 2023 N Nanda, L Chan, T Lieberum, J Smith, J Steinhardt URL https://arxiv.org/abs/2301.05217, 2023	26	2023
Evaluating frontier models for dangerous capabilities M Phuong, M Aitchison, E Catt, S Cogan, A Kaskasoli, V Krakovna, ..., Seliem El-Sayed, Sasha Brown, Anca Dragan, Rohin Shah, Allan Dafoe, and Toby Shevlane, 2024	16	2024
Does circuit analysis interpretability scale T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik Evidence from multiple choice capabilities in Chinchilla, 2023	13	2023
Retrospective on the 2021 minerl BASALT competition on learning from human feedback R Shah, SH Wang, C Wild, S Milani, A Kanervisto, VG Goecks, ... NeurIPS 2021 Competitions and Demonstrations Track, 259-272, 2022	9	2022
Does circuit analysis interpretability scale? evidence from multiple choice capabilities in chinchilla, 2023 T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik URL https://arxiv.org/abs/2307.09458	7
Improving sparse decomposition of language model activations with gated sparse autoencoders S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramar, ... Advances in Neural Information Processing Systems 37, 775-818, 2025	2	2025
Retrospective on the 2021 BASALT competition on learning from human feedback R Shah, SH Wang, C Wild, S Milani, A Kanervisto, VG Goecks, ... arXiv preprint arXiv:2204.07123, 2022	2	2022
Progress measures for grokking via mechanistic interpretability, Oct. 2023 N Nanda, L Chan, T Lieberum, J Smith, J Steinhardt URL http://arxiv.org/abs/2301.05217	2
Replication: Fairness without demographics through Adversarially Reweighted Learning E Jenner, T Lieberum, FP Nolte, N Rutsch


Articles 1–15