| Title | Authors | Venue | Cited by | Year |
|---|---|---|---|---|
| SORRY-Bench: Systematically evaluating large language model safety refusal behaviors | T Xie, X Qi, Y Zeng, Y Huang, UM Sehwag, K Huang, L He, B Wei, D Li, ... | ICLR 2025 | 28 | 2024 |
| What is in Your Safe Data? Identifying Benign Data that Breaks Safety | L He, M Xia, P Henderson | COLM 2024 | 24 | 2024 |
| CharXiv: Charting gaps in realistic chart understanding in multimodal LLMs | Z Wang, M Xia, L He, H Chen, Y Liu, R Zhu, K Liang, X Wu, H Liu, ... | NeurIPS 2024 Datasets & Benchmarks | 21 | 2024 |
| Aleatoric and epistemic discrimination: Fundamental limits of fairness interventions | H Wang, L He, R Gao, F Calmon | Advances in Neural Information Processing Systems 36 | 16 | 2024 |
| AI Risk Management Should Incorporate Both Safety and Security | X Qi, Y Huang, Y Zeng, E Debenedetti, J Geiping, L He, K Huang, ... | arXiv preprint arXiv:2405.19524 | 12 | 2024 |
| Fantastic Copyrighted Beasts and How (Not) to Generate Them | L He, Y Huang, W Shi, T Xie, H Liu, Y Wang, L Zettlemoyer, C Zhang, ... | ICLR 2025 | 9 | 2024 |
| On evaluating the durability of safeguards for open-weight LLMs | X Qi, B Wei, N Carlini, Y Huang, T Xie, L He, M Jagielski, M Nasr, P Mittal, ... | ICLR 2025 | 4 | 2024 |
| Metadata Conditioning Accelerates Language Model Pre-training | T Gao, A Wettig, L He, Y Dong, S Malladi, D Chen | arXiv preprint arXiv:2501.01956 | 1 | 2025 |
| Cascaded to End-to-End: New Safety, Security, and Evaluation Questions for Audio Language Models | L He, X Qi, I Cheong, P Mittal, D Chen, P Henderson | | | |