Is model collapse inevitable? breaking the curse of recursion by accumulating real and synthetic data M Gerstgrasser, R Schaeffer, A Dey, R Rafailov, H Sleight, J Hughes, ... arXiv preprint arXiv:2404.01413, 2024 | 44 | 2024 |
Targeted latent adversarial training improves robustness to persistent harmful behaviors in LLMs A Sheshadri, A Ewart, P Guo, A Lynch, C Wu, V Hebbar, H Sleight, ... arXiv e-prints, arXiv:2407.15549, 2024 | 18 | 2024 |
Latent adversarial training improves robustness to persistent harmful behaviors in LLMs A Sheshadri, A Ewart, P Guo, A Lynch, C Wu, V Hebbar, H Sleight, ... arXiv preprint arXiv:2407.15549, 2024 | 9 | 2024 |
When Do Universal Image Jailbreaks Transfer Between Vision-Language Models? R Schaeffer, D Valentine, L Bailey, J Chua, C Eyzaguirre, Z Durante, ... arXiv preprint arXiv:2407.15211, 2024 | 8 | 2024 |
Is model collapse inevitable M Gerstgrasser, R Schaeffer, A Dey, R Rafailov, H Sleight, J Hughes, ... Breaking the curse of recursion by accumulating real and synthetic data …, 2024 | 5 | 2024 |
When do universal image jailbreaks transfer between vision-language models?, 2024 R Schaeffer, D Valentine, L Bailey, J Chua, C Eyzaguirre, Z Durante, ... URL https://arxiv.org/abs/2407.15211 | 5 | |
Looking inward: Language models can learn about themselves by introspection, 2024 FJ Binder, J Chua, T Korbak, H Sleight, J Hughes, R Long, E Perez, ... URL https://arxiv.org/abs/2410.13787 | 5 | |
Is model collapse inevitable? breaking the curse of recursion by accumulating real and synthetic data, 2024 M Gerstgrasser, R Schaeffer, A Dey, R Rafailov, H Sleight, J Hughes, ... URL https://arxiv.org/abs/2404.01413 | 5 | |
Best-of-n jailbreaking J Hughes, S Price, A Lynch, R Schaeffer, F Barez, S Koyejo, H Sleight, ... arXiv preprint arXiv:2412.03556, 2024 | 3 | 2024 |
Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach TT Wang, J Hughes, H Sleight, R Schaeffer, R Agrawal, F Barez, ... arXiv preprint arXiv:2412.02159, 2024 | 2 | 2024 |
Looking Inward: Language Models Can Learn About Themselves by Introspection FJ Binder, J Chua, T Korbak, H Sleight, J Hughes, R Long, E Perez, ... arXiv preprint arXiv:2410.13787, 2024 | 2 | 2024 |
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models R Schaeffer, D Valentine, L Bailey, J Chua, C Eyzaguirre, Z Durante, ... The Thirteenth International Conference on Learning Representations | 1 | |
Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats J Wen, V Hebbar, C Larson, A Bhatt, A Radhakrishnan, M Sharma, ... arXiv preprint arXiv:2411.17693, 2024 | | 2024 |
Rapid Response: Mitigating LLM Jailbreaks with a Few Examples A Peng, J Michael, H Sleight, E Perez, M Sharma arXiv preprint arXiv:2411.07494, 2024 | | 2024 |
Attacking Audio Language Models with Best-of-N Jailbreaking J Hughes, S Price, A Lynch, R Schaeffer, F Barez, S Koyejo, H Sleight, ... | | |
Plan B: Training LLMs to fail less severely J Stastny, N Warncke, D Xu, A Lynch, F Barez, H Sleight, E Perez | | |
Jailbreak Defense in a Narrow Domain: Failures of existing methods and Improving Transcript-Based Classifiers TT Wang, J Hughes, H Sleight, R Schaeffer, R Agrawal, F Barez, ... The Third Workshop on New Frontiers in Adversarial Machine Learning | | |
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs A Ewart, A Sheshadri, PH Guo, A Lynch, C Wu, V Hebbar, H Sleight, ... Workshop on Socially Responsible Language Modelling Research | | |