The Reversal Curse: LLMs trained on" A is B" fail to learn" B is A" L Berglund, M Tong, M Kaufmann, M Balesni, AC Stickland, T Korbak, ... arXiv preprint arXiv:2309.12288, 2023 | 223* | 2023 |
Taken out of context: On measuring situational awareness in LLMs L Berglund, AC Stickland, M Balesni, M Kaufmann, M Tong, T Korbak, ... arXiv preprint arXiv:2309.00667, 2023 | 57* | 2023 |
Large language models can strategically deceive their users when put under pressure J Scheurer, M Balesni, M Hobbhahn arXiv preprint arXiv:2311.07590, 2023 | 53* | 2023 |
Me, myself, and AI: The situational awareness dataset (SAD) for LLMs R Laine, B Chughtai, J Betley, K Hariharan, J Scheurer, M Balesni, ... arXiv preprint arXiv:2407.04694, 2024 | 19 | 2024 |
A causal framework for AI regulation and auditing L Sharkey, CN Ghuidhir, D Braun, J Scheurer, M Balesni, L Bushnaq, ... Preprints, 2024 | 16* | 2024 |
Frontier models are capable of in-context scheming A Meinke, B Schoen, J Scheurer, M Balesni, R Shah, M Hobbhahn arXiv preprint arXiv:2412.04984, 2024 | 7 | 2024 |
Towards evaluations-based safety cases for AI scheming M Balesni, M Hobbhahn, D Lindner, A Meinke, T Korbak, J Clymer, ... arXiv preprint arXiv:2411.03336, 2024 | 7 | 2024 |
Controlling Steering with Energy-Based Models M Balesni, A Tampuu, T Matiisen arXiv preprint arXiv:2301.12264, 2023 | 2 | 2023 |
The Two-Hop Curse: LLMs trained on A→B, B→C fail to learn A→C M Balesni, T Korbak, O Evans arXiv preprint arXiv:2411.16353, 2024 | 1 | 2024 |
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack L McKee-Reid, C Sträter, MA Martinez, J Needham, M Balesni arXiv preprint arXiv:2410.06491, 2024 | 1 | 2024 |