Mikita Balesni
Other names: Mykyta Baliesnyi
Research Scientist, Apollo Research
Verified email at apolloresearch.ai
Title
Cited by
Year
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
L Berglund, M Tong, M Kaufmann, M Balesni, AC Stickland, T Korbak, ...
arXiv preprint arXiv:2309.12288, 2023
223*
2023
Taken out of context: On measuring situational awareness in LLMs
L Berglund, AC Stickland, M Balesni, M Kaufmann, M Tong, T Korbak, ...
arXiv preprint arXiv:2309.00667, 2023
57*
2023
Large language models can strategically deceive their users when put under pressure
J Scheurer, M Balesni, M Hobbhahn
arXiv preprint arXiv:2311.07590, 2023
53*
2023
Me, myself, and AI: The situational awareness dataset (SAD) for LLMs
R Laine, B Chughtai, J Betley, K Hariharan, J Scheurer, M Balesni, ...
arXiv preprint arXiv:2407.04694, 2024
19
2024
A causal framework for AI regulation and auditing
L Sharkey, CN Ghuidhir, D Braun, J Scheurer, M Balesni, L Bushnaq, ...
Publisher: Preprints, 2024
16*
2024
Frontier models are capable of in-context scheming
A Meinke, B Schoen, J Scheurer, M Balesni, R Shah, M Hobbhahn
arXiv preprint arXiv:2412.04984, 2024
7
2024
Towards evaluations-based safety cases for AI scheming
M Balesni, M Hobbhahn, D Lindner, A Meinke, T Korbak, J Clymer, ...
arXiv preprint arXiv:2411.03336, 2024
7
2024
The Reversal Curse: LLMs trained on “A is B” fail to learn “B is A”. arXiv 2024
L Berglund, M Tong, M Kaufmann, M Balesni, AC Stickland, T Korbak, ...
arXiv preprint arXiv:2309.12288
5
Controlling Steering with Energy-Based Models
M Balesni, A Tampuu, T Matiisen
arXiv preprint arXiv:2301.12264, 2023
2
2023
The Two-Hop Curse: LLMs trained on A→B, B→C fail to learn A→C
M Balesni, T Korbak, O Evans
arXiv preprint arXiv:2411.16353, 2024
1
2024
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack
L McKee-Reid, C Sträter, MA Martinez, J Needham, M Balesni
arXiv preprint arXiv:2410.06491, 2024
1
2024
Articles 1–11