Nathaniel Li
Scale AI, Center for AI Safety
Verified email at berkeley.edu
Title
Cited by
Year
Representation Engineering: A Top-Down Approach to AI Transparency
A Zou, L Phan*, S Chen*, J Campbell*, P Guo*, R Ren*, A Pan, X Yin, ...
arXiv preprint arXiv:2310.01405, 2023
Cited by: 317*, Year: 2023
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu, E Sakhaee, N Li, ...
ICML 2024, 2024
Cited by: 190*, Year: 2024
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
A Pan*, JS Chan*, A Zou*, N Li, S Basart, T Woodside, H Zhang, ...
ICML 2023 (Oral), 26837-26867, 2023
Cited by: 136, Year: 2023
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
N Li*, A Pan*, A Gopal†, S Yue†, D Berrios†, A Gatti‡, JD Li‡, ...
ICML 2024, 2024
Cited by: 105, Year: 2024
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
N Li, Z Han, I Steneker, W Primack, R Goodside, H Zhang, Z Wang, ...
NeurIPS 2024 Red Teaming Workshop (Oral), 2024
Cited by: 25, Year: 2024
Humanity's Last Exam
L Phan*, A Gatti*, Z Han*, N Li*, J Hu, H Zhang, S Shi, M Choi, A Agrawal, ...
arXiv preprint arXiv:2501.14249, 2025
Year: 2025
Robustness Evaluation of Proxy Models Against Adversarial Optimization
A Zou, L Phan, N Li, JS Chan, M Mazeika, A O'Gara, S Basart, J Ng, ...