Henry Sleight
Research Manager, Anthropic Fellows Program; Program Manager, Constellation
Verified email at constellation.org
Title
Cited by
Year
Is model collapse inevitable? breaking the curse of recursion by accumulating real and synthetic data
M Gerstgrasser, R Schaeffer, A Dey, R Rafailov, H Sleight, J Hughes, ...
arXiv preprint arXiv:2404.01413, 2024
Cited by 44, 2024
Targeted latent adversarial training improves robustness to persistent harmful behaviors in llms
A Sheshadri, A Ewart, P Guo, A Lynch, C Wu, V Hebbar, H Sleight, ...
arXiv e-prints, arXiv:2407.15549, 2024
Cited by 18, 2024
Latent adversarial training improves robustness to persistent harmful behaviors in llms
A Sheshadri, A Ewart, P Guo, A Lynch, C Wu, V Hebbar, H Sleight, ...
arXiv preprint arXiv:2407.15549, 2024
Cited by 9, 2024
When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
R Schaeffer, D Valentine, L Bailey, J Chua, C Eyzaguirre, Z Durante, ...
arXiv preprint arXiv:2407.15211, 2024
Cited by 8, 2024
Is model collapse inevitable
M Gerstgrasser, R Schaeffer, A Dey, R Rafailov, H Sleight, J Hughes, ...
Breaking the curse of recursion by accumulating real and synthetic data …, 2024
Cited by 5, 2024
When do universal image jailbreaks transfer between vision-language models?, 2024
R Schaeffer, D Valentine, L Bailey, J Chua, C Eyzaguirre, Z Durante, ...
URL https://arxiv.org/abs/2407.15211
Cited by 5
Looking inward: Language models can learn about themselves by introspection, 2024
FJ Binder, J Chua, T Korbak, H Sleight, J Hughes, R Long, E Perez, ...
URL https://arxiv.org/abs/2410.13787
Cited by 5
Is model collapse inevitable? breaking the curse of recursion by accumulating real and synthetic data, 2024
M Gerstgrasser, R Schaeffer, A Dey, R Rafailov, H Sleight, J Hughes, ...
URL https://arxiv.org/abs/2404.01413
Cited by 5
Best-of-n jailbreaking
J Hughes, S Price, A Lynch, R Schaeffer, F Barez, S Koyejo, H Sleight, ...
arXiv preprint arXiv:2412.03556, 2024
Cited by 3, 2024
Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
TT Wang, J Hughes, H Sleight, R Schaeffer, R Agrawal, F Barez, ...
arXiv preprint arXiv:2412.02159, 2024
Cited by 2, 2024
Looking Inward: Language Models Can Learn About Themselves by Introspection
FJ Binder, J Chua, T Korbak, H Sleight, J Hughes, R Long, E Perez, ...
arXiv preprint arXiv:2410.13787, 2024
Cited by 2, 2024
Failures to Find Transferable Image Jailbreaks Between Vision-Language Models
R Schaeffer, D Valentine, L Bailey, J Chua, C Eyzaguirre, Z Durante, ...
The Thirteenth International Conference on Learning Representations
Cited by 1
Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
J Wen, V Hebbar, C Larson, A Bhatt, A Radhakrishnan, M Sharma, ...
arXiv preprint arXiv:2411.17693, 2024
2024
Rapid Response: Mitigating LLM Jailbreaks with a Few Examples
A Peng, J Michael, H Sleight, E Perez, M Sharma
arXiv preprint arXiv:2411.07494, 2024
2024
Attacking Audio Language Models with Best-of-N Jailbreaking
J Hughes, S Price, A Lynch, R Schaeffer, F Barez, S Koyejo, H Sleight, ...
Plan B: Training LLMs to fail less severely
J Stastny, N Warncke, D Xu, A Lynch, F Barez, H Sleight, E Perez
Jailbreak Defense in a Narrow Domain: Failures of existing methods and Improving Transcript-Based Classifiers
TT Wang, J Hughes, H Sleight, R Schaeffer, R Agrawal, F Barez, ...
The Third Workshop on New Frontiers in Adversarial Machine Learning
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs
A Ewart, A Sheshadri, PH Guo, A Lynch, C Wu, V Hebbar, H Sleight, ...
Workshop on Socially Responsible Language Modelling Research