Towards automated circuit discovery for mechanistic interpretability | A Conmy, A Mavor-Parker, A Lynch, S Heimersheim, A Garriga-Alonso | Advances in Neural Information Processing Systems 36, 16318-16352, 2023 | 255 | 2023 |
Causal machine learning: A survey and open problems | J Kaddour, A Lynch, Q Liu, MJ Kusner, R Silva | arXiv preprint arXiv:2206.15475, 2022 | 202 | 2022 |
Eight methods to evaluate robust unlearning in LLMs | A Lynch, P Guo, A Ewart, S Casper, D Hadfield-Menell | arXiv preprint arXiv:2402.16835, 2024 | 50 | 2024 |
Targeted latent adversarial training improves robustness to persistent harmful behaviors in LLMs | A Sheshadri, A Ewart, P Guo, A Lynch, C Wu, V Hebbar, H Sleight, ... | arXiv e-prints, arXiv:2407.15549, 2024 | 32* | 2024 |
Spawrious: A benchmark for fine control of spurious correlation biases | A Lynch, GJS Dovonon, J Kaddour, R Silva | arXiv preprint arXiv:2303.05470, 2023 | 29* | 2023 |
Analysing the generalisation and reliability of steering vectors | D Tan, D Chanin, A Lynch, B Paige, D Kanoulas, A Garriga-Alonso, R Kirk | Advances in Neural Information Processing Systems 37, 139179-139212, 2025 | 9 | 2025 |
Best-of-N Jailbreaking | J Hughes, S Price, A Lynch, R Schaeffer, F Barez, S Koyejo, H Sleight, ... | arXiv preprint arXiv:2412.03556, 2024 | 6* | 2024 |
Evaluating the impact of geometric and statistical skews on out-of-distribution generalization performance | A Lynch, J Kaddour, R Silva | NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and …, 2022 | 5 | 2022 |
H-Space Sparse Autoencoders | A Ijishakin, ML Ang, L Baljer, DCH Tan, HL Fry, A Abdulaal, A Lynch, ... | NeurIPS Safe Generative AI Workshop 2024, 2024 | 1 | 2024 |
How Do Large Language Monkeys Get Their Power (Laws)? | R Schaeffer, J Kazdan, J Hughes, J Juravsky, S Price, A Lynch, E Jones, ... | arXiv preprint arXiv:2502.17578, 2025 | | 2025 |
Plan B: Training LLMs to fail less severely | J Stastny, N Warncke, D Xu, A Lynch, F Barez, H Sleight, E Perez | | | 2024 |