Tom Lieberum
Google DeepMind
Verified email at deepmind.com
Title
Cited by
Year
Progress measures for grokking via mechanistic interpretability
N Nanda, L Chan, T Lieberum, J Smith, J Steinhardt
arXiv preprint arXiv:2301.05217, 2023
Cited by 354, year 2023
Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla
T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik
arXiv preprint arXiv:2307.09458, 2023
Cited by 67, year 2023
Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2
T Lieberum, S Rajamanoharan, A Conmy, L Smith, N Sonnerat, V Varma, ...
arXiv preprint arXiv:2408.05147, 2024
Cited by 62, year 2024
Improving dictionary learning with gated sparse autoencoders
S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ...
arXiv preprint arXiv:2404.16014, 2024
Cited by 47, year 2024
Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders
S Rajamanoharan, T Lieberum, N Sonnerat, A Conmy, V Varma, J Kramár, ...
arXiv preprint arXiv:2407.14435, 2024
Cited by 35, year 2024
AtP*: An efficient and scalable method for localizing LLM behaviour to components
J Kramár, T Lieberum, R Shah, N Nanda
arXiv preprint arXiv:2403.00745, 2024
Cited by 27, year 2024
Progress measures for grokking via mechanistic interpretability, 2023
N Nanda, L Chan, T Lieberum, J Smith, J Steinhardt
URL https://arxiv.org/abs/2301.05217, 2023
Cited by 26, year 2023
Evaluating frontier models for dangerous capabilities
M Phuong, M Aitchison, E Catt, S Cogan, A Kaskasoli, V Krakovna, ..., Seliem El-Sayed, Sasha Brown, Anca Dragan, Rohin Shah, Allan Dafoe, and Toby Shevlane
2024
Cited by 16, year 2024
Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla
T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik
2023
Cited by 13, year 2023
Retrospective on the 2021 MineRL BASALT competition on learning from human feedback
R Shah, SH Wang, C Wild, S Milani, A Kanervisto, VG Goecks, ...
NeurIPS 2021 Competitions and Demonstrations Track, 259-272, 2022
Cited by 9, year 2022
Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla, 2023
T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik
URL https://arxiv.org/abs/2307.09458
Cited by 7
Improving sparse decomposition of language model activations with gated sparse autoencoders
S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ...
Advances in Neural Information Processing Systems 37, 775-818, 2025
Cited by 2, year 2025
Retrospective on the 2021 BASALT competition on learning from human feedback
R Shah, SH Wang, C Wild, S Milani, A Kanervisto, VG Goecks, ...
arXiv preprint arXiv:2204.07123, 2022
Cited by 2, year 2022
Progress measures for grokking via mechanistic interpretability, Oct. 2023
N Nanda, L Chan, T Lieberum, J Smith, J Steinhardt
URL http://arxiv.org/abs/2301.05217
Cited by 2
Replication: Fairness without demographics through Adversarially Reweighted Learning
E Jenner, T Lieberum, FP Nolte, N Rutsch
Articles 1–15