Zhexin Zhang
Verified email at mails.tsinghua.edu.cn
Title
Cited by
Year
SafetyBench: Evaluating the safety of large language models with multiple choice questions
Z Zhang, L Lei, L Wu, R Sun, Y Huang, C Long, X Liu, X Lei, J Tang, ...
arXiv preprint arXiv:2309.07045, 2023
Cited by 129, 2023
Safety assessment of Chinese large language models
H Sun, Z Zhang, J Deng, J Cheng, M Huang
arXiv preprint arXiv:2304.10436, 2023
Cited by 118, 2023
Defending large language models against jailbreaking attacks through goal prioritization
Z Zhang, J Yang, P Ke, M Huang
ACL 2024, 2023
Cited by 66, 2023
OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics
J Guan, Z Zhang, Z Feng, Z Liu, W Ding, X Mao, C Fan, M Huang
ACL 2021, 2021
Cited by 53, 2021
Unveiling the implicit toxicity in large language models
J Wen, P Ke, H Sun, Z Zhang, C Li, J Bai, M Huang
EMNLP 2023, 2023
Cited by 52, 2023
Recent advances towards safe, responsible, and moral dialogue systems: A survey
J Deng, H Sun, Z Zhang, J Cheng, M Huang
arXiv preprint arXiv:2302.09270, 2023
Cited by 42*, 2023
Ethicist: Targeted training data extraction through loss smoothed soft prompting and calibrated confidence estimation
Z Zhang, J Wen, M Huang
ACL 2023, 2023
Cited by 26, 2023
Persona-Guided Planning for Controlling the Protagonist's Persona in Story Generation
Z Zhang, J Wen, J Guan, M Huang
NAACL 2022, 2022
Cited by 23, 2022
MoralDial: A framework to train and evaluate moral dialogue systems via moral discussions
H Sun, Z Zhang, F Mi, Y Wang, W Liu, J Cui, B Wang, Q Liu, M Huang
ACL 2023, 2022
Cited by 21, 2022
ShieldLM: Empowering LLMs as aligned, customizable and explainable safety detectors
Z Zhang, Y Lu, J Ma, D Zhang, R Li, P Ke, H Sun, L Sha, Z Sui, H Wang, ...
arXiv preprint arXiv:2402.16444, 2024
Cited by 16, 2024
Safe unlearning: A surprisingly effective and generalizable solution to defend against jailbreak attacks
Z Zhang, J Yang, P Ke, S Cui, C Zheng, H Wang, M Huang
arXiv preprint arXiv:2407.02855, 2024
Cited by 14, 2024
Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation
Z Zhang, J Cheng, H Sun, J Deng, F Mi, Y Wang, L Shang, M Huang
EMNLP 2022 Findings, 2022
Cited by 11, 2022
Automatic comment generation for Chinese student narrative essays
Z Zhang, J Guan, G Xu, Y Tian, M Huang
Proceedings of the 2022 Conference on Empirical Methods in Natural Language …, 2022
Cited by 8, 2022
Selecting Stickers in Open-Domain Dialogue through Multitask Learning
Z Zhang, Y Zhu, Z Fei, J Zhang, J Zhou
ACL 2022 Findings, 2022
Cited by 6, 2022
InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector through Instruction Tuning
Z Zhang, J Cheng, H Sun, J Deng, M Huang
Findings of the Association for Computational Linguistics: EMNLP 2023, 10421 …, 2023
Cited by 5, 2023
Enhancing Offensive Language Detection with Data Augmentation and Knowledge Distillation
J Deng, Z Chen, H Sun, Z Zhang, J Wu, S Nakagawa, F Ren, M Huang
Research 6, 0189, 2023
Cited by 5, 2023
Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack
S Tu, Z Pan, W Wang, Z Zhang, Y Sun, J Yu, H Wang, L Hou, J Li
arXiv preprint arXiv:2406.11682, 2024
Cited by 1, 2024
Agent-SafetyBench: Evaluating the Safety of LLM Agents
Z Zhang, S Cui, Y Lu, J Zhou, J Yang, H Wang, M Huang
arXiv preprint arXiv:2412.14470, 2024
2024
Self-Supervised Sentence Polishing by Adding Engaging Modifiers
Z Zhang, J Guan, X Cui, Y Ran, B Liu, M Huang
Proceedings of the 61st Annual Meeting of the Association for Computational …, 2023
2023