Bilal Chughtai
Google DeepMind
Verified email at google.com
Title
Cited by
Year
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations
B Chughtai, L Chan, N Nanda
ICML 2023, ICLR 2023 Workshop on Physics for Machine Learning (Spotlight), 2023
Cited by 91 · 2023
Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs
R Laine, B Chughtai, J Betley, K Hariharan, J Scheurer, M Balesni, ...
NeurIPS 2024 Datasets and Benchmarks Track, 2024
Cited by 19 · 2024
Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs
B Chughtai, A Cooney, N Nanda
NeurIPS 2023 Attributing Model Behaviour At Scale Workshop, 2024
Cited by 10 · 2024
Towards evaluations-based safety cases for AI scheming
M Balesni, M Hobbhahn, D Lindner, A Meinke, T Korbak, J Clymer, ...
arXiv preprint arXiv:2411.03336, 2024
Cited by 7 · 2024
Transformer Circuit Evaluation Metrics are not Robust
J Miller, B Chughtai, W Saunders
COLM 2024 Oral, 2024
Cited by 6* · 2024
Open Problems in Mechanistic Interpretability
L Sharkey, B Chughtai, J Batson, J Lindsey, J Wu, L Bushnaq, ...
arXiv preprint arXiv:2501.16496, 2025
Cited by 3 · 2025
Detecting Strategic Deception Using Linear Probes
N Goldowsky-Dill, B Chughtai, S Heimersheim, M Hobbhahn
arXiv preprint arXiv:2502.03407, 2025
2025
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Z Che, S Casper, R Kirk, A Satheesh, S Slocum, LE McKinney, ...
arXiv preprint arXiv:2502.05209, 2025
2025
Can Language Models Explain Their Own Classification Behavior?
D Sherburn, B Chughtai, O Evans
arXiv preprint arXiv:2405.07436, 2024
2024