Title | Authors | Venue | Cited by | Year |
Open problems and fundamental limitations of reinforcement learning from human feedback | S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ... | arXiv preprint arXiv:2307.15217, 2023 | 471 | 2023 |
Unifying grokking and double descent | X Davies, L Langosco, D Krueger | arXiv preprint arXiv:2303.06173, 2023 | 35 | 2023 |
Sparse distributed memory is a continual learner | T Bricken, X Davies, D Singh, D Krotov, G Kreiman | arXiv preprint arXiv:2303.11934, 2023 | 20 | 2023 |
Circuit breaking: Removing model behaviors with targeted ablation | M Li, X Davies, M Nadeau | arXiv preprint arXiv:2309.05973, 2023 | 19 | 2023 |
Discovering variable binding circuitry with desiderata | X Davies, M Nadeau, N Prakash, TR Shaham, D Bau | arXiv preprint arXiv:2307.03637, 2023 | 9 | 2023 |
AgentHarm: A benchmark for measuring harmfulness of LLM agents | M Andriushchenko, A Souly, M Dziemian, D Duenas, M Lin, J Wang, ... | arXiv preprint arXiv:2410.09024, 2024 | 7 | 2024 |