Zhexin Zhang

Cited by

	All	Since 2020
Citations	596	596
h-index	11	11
i10-index	12	12

Citations per year
Year	Citations
2021	2
2022	15
2023	111
2024	424
2025	42

Public access (based on funding mandates)

3 articles available
0 articles not available

Tsinghua University, CoAI Group

Verified email at mails.tsinghua.edu.cn

NLP, AI Safety & Alignment


Title	Authors	Venue	Cited by	Year
SafetyBench: Evaluating the safety of large language models with multiple choice questions	Z Zhang, L Lei, L Wu, R Sun, Y Huang, C Long, X Liu, X Lei, J Tang, ...	arXiv preprint arXiv:2309.07045, 2023	129	2023
Safety assessment of Chinese large language models	H Sun, Z Zhang, J Deng, J Cheng, M Huang	arXiv preprint arXiv:2304.10436, 2023	118	2023
Defending large language models against jailbreaking attacks through goal prioritization	Z Zhang, J Yang, P Ke, M Huang	ACL 2024, 2023	66	2023
OpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics	J Guan, Z Zhang, Z Feng, Z Liu, W Ding, X Mao, C Fan, M Huang	ACL 2021, 2021	53	2021
Unveiling the implicit toxicity in large language models	J Wen, P Ke, H Sun, Z Zhang, C Li, J Bai, M Huang	EMNLP 2023, 2023	52	2023
Recent advances towards safe, responsible, and moral dialogue systems: A survey	J Deng, H Sun, Z Zhang, J Cheng, M Huang	arXiv preprint arXiv:2302.09270 1, 2023	42*	2023
Ethicist: Targeted training data extraction through loss smoothed soft prompting and calibrated confidence estimation	Z Zhang, J Wen, M Huang	ACL 2023, 2023	26	2023
Persona-Guided Planning for Controlling the Protagonist's Persona in Story Generation	Z Zhang, J Wen, J Guan, M Huang	NAACL 2022, 2022	23	2022
MoralDial: A framework to train and evaluate moral dialogue systems via moral discussions	H Sun, Z Zhang, F Mi, Y Wang, W Liu, J Cui, B Wang, Q Liu, M Huang	ACL 2023, 2022	21	2022
ShieldLM: Empowering LLMs as aligned, customizable and explainable safety detectors	Z Zhang, Y Lu, J Ma, D Zhang, R Li, P Ke, H Sun, L Sha, Z Sui, H Wang, ...	arXiv preprint arXiv:2402.16444, 2024	16	2024
Safe unlearning: A surprisingly effective and generalizable solution to defend against jailbreak attacks	Z Zhang, J Yang, P Ke, S Cui, C Zheng, H Wang, M Huang	arXiv preprint arXiv:2407.02855, 2024	14	2024
Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation	Z Zhang, J Cheng, H Sun, J Deng, F Mi, Y Wang, L Shang, M Huang	EMNLP 2022 Findings, 2022	11	2022
Automatic comment generation for Chinese student narrative essays	Z Zhang, J Guan, G Xu, Y Tian, M Huang	Proceedings of the 2022 Conference on Empirical Methods in Natural Language …, 2022	8	2022
Selecting Stickers in Open-Domain Dialogue through Multitask Learning	Z Zhang, Y Zhu, Z Fei, J Zhang, J Zhou	ACL 2022 Findings, 2022	6	2022
InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector through Instruction Tuning	Z Zhang, J Cheng, H Sun, J Deng, M Huang	Findings of the Association for Computational Linguistics: EMNLP 2023, 10421 …, 2023	5	2023
Enhancing Offensive Language Detection with Data Augmentation and Knowledge Distillation	J Deng, Z Chen, H Sun, Z Zhang, J Wu, S Nakagawa, F Ren, M Huang	Research 6, 0189, 2023	5	2023
Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack	S Tu, Z Pan, W Wang, Z Zhang, Y Sun, J Yu, H Wang, L Hou, J Li	arXiv preprint arXiv:2406.11682, 2024	1	2024
Agent-SafetyBench: Evaluating the Safety of LLM Agents	Z Zhang, S Cui, Y Lu, J Zhou, J Yang, H Wang, M Huang	arXiv preprint arXiv:2412.14470, 2024		2024
Self-Supervised Sentence Polishing by Adding Engaging Modifiers	Z Zhang, J Guan, X Cui, Y Ran, B Liu, M Huang	Proceedings of the 61st Annual Meeting of the Association for Computational …, 2023		2023
