Tinghao Xie
Title · Cited by · Year
Fine-tuning aligned language models compromises safety, even when users do not intend to!
X Qi, Y Zeng, T Xie, PY Chen, R Jia, P Mittal, P Henderson
ICLR 2024 (Oral), 2023
Cited by 413 · 2023
Revisiting the assumption of latent separability for backdoor defenses
X Qi, T Xie, Y Li, S Mahloujifar, P Mittal
ICLR 2023, 2022
Cited by 113* · 2022
Towards practical deployment-stage backdoor attack on deep neural networks
X Qi, T Xie, R Pan, J Zhu, Y Yang, K Bu
CVPR 2022 (Oral), 13347-13357, 2022
Cited by 69 · 2022
Assessing the brittleness of safety alignment via pruning and low-rank modifications
B Wei, K Huang, Y Huang, T Xie, X Qi, M Xia, P Mittal, M Wang, ...
ICML 2024, 2024
Cited by 67 · 2024
Towards a proactive ML approach for detecting backdoor poison samples
X Qi, T Xie, JT Wang, T Wu, S Mahloujifar, P Mittal
32nd USENIX Security Symposium (USENIX Security 23), 1685-1702, 2023
Cited by 46 · 2023
SORRY-Bench: Systematically evaluating large language model safety refusal behaviors
T Xie, X Qi, Y Zeng, Y Huang, UM Sehwag, K Huang, L He, B Wei, D Li, ...
ICLR 2025, 2024
Cited by 28 · 2024
AI Risk Management Should Incorporate Both Safety and Security
X Qi, Y Huang, Y Zeng, E Debenedetti, J Geiping, L He, K Huang, ...
arXiv preprint arXiv:2405.19524, 2024
Cited by 12 · 2024
Fantastic Copyrighted Beasts and How (Not) to Generate Them
L He, Y Huang, W Shi, T Xie, H Liu, Y Wang, L Zettlemoyer, C Zhang, ...
ICLR 2025, 2024
Cited by 9 · 2024
BaDExpert: Extracting Backdoor Functionality for Accurate Backdoor Input Detection
T Xie, X Qi, P He, Y Li, JT Wang, P Mittal
ICLR 2024, 2023
Cited by 5 · 2023
On evaluating the durability of safeguards for open-weight LLMs
X Qi, B Wei, N Carlini, Y Huang, T Xie, L He, M Jagielski, M Nasr, P Mittal, ...
ICLR 2025, 2024
Cited by 3 · 2024