Xander Davies
UK AI Security Institute
Verified email at dsit.gov.uk
Title
Cited by
Year
Open problems and fundamental limitations of reinforcement learning from human feedback
S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ...
arXiv preprint arXiv:2307.15217, 2023
Cited by: 493; Year: 2023
Unifying grokking and double descent
X Davies, L Langosco, D Krueger
arXiv preprint arXiv:2303.06173, 2023
Cited by: 37; Year: 2023
Circuit breaking: Removing model behaviors with targeted ablation
M Li, X Davies, M Nadeau
arXiv preprint arXiv:2309.05973, 2023
Cited by: 20; Year: 2023
Sparse distributed memory is a continual learner
T Bricken, X Davies, D Singh, D Krotov, G Kreiman
arXiv preprint arXiv:2303.11934, 2023
Cited by: 20; Year: 2023
Agentharm: A benchmark for measuring harmfulness of llm agents
M Andriushchenko, A Souly, M Dziemian, D Duenas, M Lin, J Wang, ...
arXiv preprint arXiv:2410.09024, 2024
Cited by: 16; Year: 2024
Discovering variable binding circuitry with desiderata
X Davies, M Nadeau, N Prakash, TR Shaham, D Bau
arXiv preprint arXiv:2307.03637, 2023
Cited by: 10; Year: 2023
Fundamental Limitations in Defending LLM Finetuning APIs
X Davies, E Winsor, T Korbak, A Souly, R Kirk, CS de Witt, Y Gal
arXiv preprint arXiv:2502.14828, 2025
Year: 2025
AgentHarm: Benchmarking Robustness of LLM Agents on Harmful Tasks
M Andriushchenko, A Souly, M Dziemian, D Duenas, M Lin, J Wang, ...
The Thirteenth International Conference on Learning Representations