Alexandra Souly
Verified email at ucl.ac.uk
Title
Cited by
Year
A StrongREJECT for Empty Jailbreaks
A Souly, Q Lu, D Bowen, T Trinh, E Hsieh, S Pandey, P Abbeel, ...
arXiv preprint arXiv:2402.10260, 2024
53* 2024
JaxMARL: Multi-Agent RL Environments and Algorithms in JAX
A Rutherford, B Ellis, M Gallici, J Cook, A Lupu, G Ingvarsson, T Willi, ...
Proceedings of the 23rd International Conference on Autonomous Agents and …, 2024
49* 2024
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
M Andriushchenko, A Souly, M Dziemian, D Duenas, M Lin, J Wang, ...
arXiv preprint arXiv:2410.09024, 2024
11 2024
Retrospective on the 2021 MineRL BASALT Competition on Learning from Human Feedback
R Shah, SH Wang, C Wild, S Milani, A Kanervisto, VG Goecks, ...
NeurIPS 2021 Competitions and Demonstrations Track, 259-272, 2022
11* 2022
Leading the Pack: N-player Opponent Shaping
A Souly, T Willi, A Khan, R Kirk, C Lu, E Grefenstette, T Rocktäschel
arXiv preprint arXiv:2312.12564, 2023
3 2023
How to Evaluate Jailbreak Methods: A Case Study With the StrongREJECT Benchmark
The paper in question claimed an impressive 43% success rate in jailbreaking GPT-4 by …
D Bowen, S Emmons, A Souly, Q Lu, T Trinh, E Hsieh, S Pandey, ...