A StrongREJECT for Empty Jailbreaks | A Souly, Q Lu, D Bowen, T Trinh, E Hsieh, S Pandey, P Abbeel, ... | arXiv preprint arXiv:2402.10260, 2024 | 51* | 2024 |
JaxMARL: Multi-Agent RL Environments and Algorithms in JAX | A Rutherford, B Ellis, M Gallici, J Cook, A Lupu, G Ingvarsson, T Willi, ... | Proceedings of the 23rd International Conference on Autonomous Agents and …, 2024 | 46* | 2024 |
Retrospective on the 2021 MineRL BASALT Competition on Learning from Human Feedback | R Shah, SH Wang, C Wild, S Milani, A Kanervisto, VG Goecks, ... | NeurIPS 2021 Competitions and Demonstrations Track, 259-272, 2022 | 11* | 2022 |
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents | M Andriushchenko, A Souly, M Dziemian, D Duenas, M Lin, J Wang, ... | arXiv preprint arXiv:2410.09024, 2024 | 7 | 2024 |
Leading the Pack: N-player Opponent Shaping | A Souly, T Willi, A Khan, R Kirk, C Lu, E Grefenstette, T Rocktäschel | arXiv preprint arXiv:2312.12564, 2023 | 3 | 2023 |
How to Evaluate Jailbreak Methods: A Case Study With the StrongREJECT Benchmark | D Bowen, S Emmons, A Souly, Q Lu, T Trinh, E Hsieh, S Pandey, ... | The paper in question claimed an impressive 43% success rate in jailbreaking GPT-4 by … | | |