The Reversal Curse: LLMs trained on" A is B" fail to learn" B is A" L Berglund, M Tong, M Kaufmann, M Balesni, AC Stickland, T Korbak, ... arXiv preprint arXiv:2309.12288, 2023 | 223* | 2023 |
Taken out of context: On measuring situational awareness in LLMs L Berglund, AC Stickland, M Balesni, M Kaufmann, M Tong, T Korbak, ... arXiv preprint arXiv:2309.00667, 2023 | 57* | 2023 |
Large language models can strategically deceive their users when put under pressure J Scheurer, M Balesni, M Hobbhahn arXiv preprint arXiv:2311.07590, 2023 | 53* | 2023 |
Me, myself, and AI: The situational awareness dataset (SAD) for LLMs R Laine, B Chughtai, J Betley, K Hariharan, J Scheurer, M Balesni, ... arXiv preprint arXiv:2407.04694, 2024 | 19 | 2024 |
A causal framework for AI regulation and auditing L Sharkey, CN Ghuidhir, D Braun, J Scheurer, M Balesni, L Bushnaq, ... Preprints, 2024 | 16* | 2024 |
Frontier models are capable of in-context scheming A Meinke, B Schoen, J Scheurer, M Balesni, R Shah, M Hobbhahn arXiv preprint arXiv:2412.04984, 2024 | 7 | 2024 |
Towards evaluations-based safety cases for AI scheming M Balesni, M Hobbhahn, D Lindner, A Meinke, T Korbak, J Clymer, ... arXiv preprint arXiv:2411.03336, 2024 | 7 | 2024 |
Controlling Steering with Energy-Based Models M Balesni, A Tampuu, T Matiisen arXiv preprint arXiv:2301.12264, 2023 | 2 | 2023 |
The Two-Hop Curse: LLMs trained on A→B, B→C fail to learn A→C M Balesni, T Korbak, O Evans arXiv preprint arXiv:2411.16353, 2024 | 1 | 2024 |
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack L McKee-Reid, C Sträter, MA Martinez, J Needham, M Balesni arXiv preprint arXiv:2410.06491, 2024 | 1 | 2024 |