Zoneout: Regularizing RNNs by randomly preserving hidden activations D Krueger, T Maharaj, J Kramár, M Pezeshki, N Ballas, NR Ke, A Goyal, ... arXiv preprint arXiv:1606.01305, 2016 | 393 | 2016 |
Reinforcement and imitation learning for diverse visuomotor skills Y Zhu, Z Wang, J Merel, A Rusu, T Erez, S Cabi, S Tunyasuvunakool, ... arXiv preprint arXiv:1802.09564, 2018 | 382 | 2018 |
OpenSpiel: A framework for reinforcement learning in games M Lanctot, E Lockhart, JB Lespiau, V Zambaldi, S Upadhyay, J Pérolat, ... arXiv preprint arXiv:1908.09453, 2019 | 301 | 2019 |
Guidelines for artificial intelligence containment J Babcock, J Kramár, RV Yampolskiy Next-Generation Ethics: Engineering a Better Society (Ed.) Ali E. Abbas, 90-112, 2019 | 68 | 2019 |
Tracr: Compiled transformers as a laboratory for interpretability D Lindner, J Kramár, S Farquhar, M Rahtz, T McGrath, V Mikulik Advances in Neural Information Processing Systems 36, 2024 | 66 | 2024 |
The AGI containment problem J Babcock, J Kramár, R Yampolskiy Artificial General Intelligence: 9th International Conference, AGI 2016, New …, 2016 | 65 | 2016 |
Does circuit analysis interpretability scale? Evidence from multiple choice capabilities in Chinchilla T Lieberum, M Rahtz, J Kramár, N Nanda, G Irving, R Shah, V Mikulik arXiv preprint arXiv:2307.09458, 2023 | 63 | 2023 |
Learning to play No-Press Diplomacy with best response policy iteration T Anthony, T Eccles, A Tacchetti, J Kramár, I Gemp, T Hudson, N Porcel, ... Advances in Neural Information Processing Systems 33, 17987-18003, 2020 | 58 | 2020 |
Learning reciprocity in complex sequential social dilemmas T Eccles, E Hughes, J Kramár, S Wheelwright, JZ Leibo arXiv preprint arXiv:1903.08082, 2019 | 52 | 2019 |
Negotiation and honesty in artificial intelligence methods for the board game of Diplomacy J Kramár, T Eccles, I Gemp, A Tacchetti, KR McKee, M Malinowski, ... Nature Communications 13 (1), 7214, 2022 | 51 | 2022 |
Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2 T Lieberum, S Rajamanoharan, A Conmy, L Smith, N Sonnerat, V Varma, ... arXiv preprint arXiv:2408.05147, 2024 | 49 | 2024 |
Explaining grokking through circuit efficiency V Varma, R Shah, Z Kenton, J Kramár, R Kumar arXiv preprint arXiv:2309.02390, 2023 | 44 | 2023 |
The hydra effect: Emergent self-repair in language model computations T McGrath, M Rahtz, J Kramár, V Mikulik, S Legg arXiv preprint arXiv:2307.15771, 2023 | 40 | 2023 |
Improving dictionary learning with gated sparse autoencoders S Rajamanoharan, A Conmy, L Smith, T Lieberum, V Varma, J Kramár, ... arXiv preprint arXiv:2404.16014, 2024 | 39 | 2024 |
Reinforcement and imitation learning for a task S Tunyasuvunakool, Y Zhu, J Merel, J Kramár, Z Wang, NMO Heess US Patent App. 16/174,112, 2019 | 35 | 2019 |
Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders S Rajamanoharan, T Lieberum, N Sonnerat, A Conmy, V Varma, J Kramár, ... arXiv preprint arXiv:2407.14435, 2024 | 29 | 2024 |
OpenSpiel: a framework for reinforcement learning in games. CoRR abs/1908.09453 (2019) M Lanctot, E Lockhart, JB Lespiau, V Zambaldi, S Upadhyay, J Pérolat, ... arXiv preprint arXiv:1908.09453, 2019 | 29 | 2019 |
AtP*: An efficient and scalable method for localizing LLM behaviour to components J Kramár, T Lieberum, R Shah, N Nanda arXiv preprint arXiv:2403.00745, 2024 | 21 | 2024 |
Sample-based approximation of Nash in large many-player games via gradient descent I Gemp, R Savani, M Lanctot, Y Bachrach, T Anthony, R Everett, ... arXiv preprint arXiv:2106.01285, 2021 | 20 | 2021 |
On scalable oversight with weak LLMs judging strong LLMs Z Kenton, NY Siegel, J Kramár, J Brown-Cohen, S Albanie, J Bulian, ... arXiv preprint arXiv:2407.04622, 2024 | 13 | 2024 |