Progress measures for grokking via mechanistic interpretability N Nanda, L Chan, T Lieberum, J Smith, J Steinhardt arXiv preprint arXiv:2301.05217, 2023 | 328 | 2023 |
Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik arXiv preprint arXiv:2307.09458, 2023 | 61 | 2023 |
Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2 T Lieberum, S Rajamanoharan, A Conmy, L Smith, N Sonnerat, V Varma, ... arXiv preprint arXiv:2408.05147, 2024 | 47 | 2024 |
Improving dictionary learning with gated sparse autoencoders S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ... arXiv preprint arXiv:2404.16014, 2024 | 39 | 2024 |
Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders S Rajamanoharan, T Lieberum, N Sonnerat, A Conmy, V Varma, J Kramár, ... arXiv preprint arXiv:2407.14435, 2024 | 28 | 2024 |
AtP*: An efficient and scalable method for localizing LLM behaviour to components J Kramár, T Lieberum, R Shah, N Nanda arXiv preprint arXiv:2403.00745, 2024 | 20 | 2024 |
Evaluating frontier models for dangerous capabilities M Phuong, M Aitchison, E Catt, S Cogan, A Kaskasoli, V Krakovna, ..., S El-Sayed, S Brown, A Dragan, R Shah, A Dafoe, T Shevlane 2024 | 15 | 2024 |
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik 2023 | 10 | 2023 |
Retrospective on the 2021 MineRL BASALT competition on learning from human feedback R Shah, SH Wang, C Wild, S Milani, A Kanervisto, VG Goecks, ... NeurIPS 2021 Competitions and Demonstrations Track, 259-272, 2022 | 9 | 2022 |
Improving dictionary learning with gated sparse autoencoders S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ... 2024. URL https://api.semanticscholar.org/CorpusID:269362142 | 6 | |
Progress measures for grokking via mechanistic interpretability N Nanda, L Chan, T Lieberum, J Smith, J Steinhardt Oct. 2023. URL http://arxiv.org/abs/2301.05217 | 5 | |
Improving sparse decomposition of language model activations with gated sparse autoencoders S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ... The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024 | 2 | 2024 |
Retrospective on the 2021 BASALT Competition on Learning from Human Feedback R Shah, SH Wang, C Wild, S Milani, A Kanervisto, VG Goecks, ... arXiv preprint arXiv:2204.07123, 2022 | 2 | 2022 |
Replication: Fairness without demographics through Adversarially Reweighted Learning E Jenner, T Lieberum, FP Nolte, N Rutsch | | |