Alexandra Souly
Verified email at ucl.ac.uk
Title
Cited by
Year
A StrongREJECT for Empty Jailbreaks
A Souly, Q Lu, D Bowen, T Trinh, E Hsieh, S Pandey, P Abbeel, ...
arXiv preprint arXiv:2402.10260, 2024
Cited by: 51*
Year: 2024
JaxMARL: Multi-Agent RL Environments and Algorithms in JAX
A Rutherford, B Ellis, M Gallici, J Cook, A Lupu, G Ingvarsson, T Willi, ...
Proceedings of the 23rd International Conference on Autonomous Agents and …, 2024
Cited by: 46*
Year: 2024
Retrospective on the 2021 MineRL BASALT Competition on Learning from Human Feedback
R Shah, SH Wang, C Wild, S Milani, A Kanervisto, VG Goecks, ...
NeurIPS 2021 Competitions and Demonstrations Track, 259-272, 2022
Cited by: 11*
Year: 2022
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
M Andriushchenko, A Souly, M Dziemian, D Duenas, M Lin, J Wang, ...
arXiv preprint arXiv:2410.09024, 2024
Cited by: 7
Year: 2024
Leading the Pack: N-player Opponent Shaping
A Souly, T Willi, A Khan, R Kirk, C Lu, E Grefenstette, T Rocktäschel
arXiv preprint arXiv:2312.12564, 2023
Cited by: 3
Year: 2023
How to Evaluate Jailbreak Methods: A Case Study With the StrongREJECT Benchmark
The paper in question claimed an impressive 43% success rate in jailbreaking GPT-4 by …
D Bowen, S Emmons, A Souly, Q Lu, T Trinh, E Hsieh, S Pandey, ...
Articles 1–6