Nathaniel Li
Scale AI, Center for AI Safety
Verified email at berkeley.edu
Title
Cited by
Year
Representation Engineering: A Top-Down Approach to AI Transparency
A Zou, L Phan*, S Chen*, J Campbell*, P Guo*, R Ren*, A Pan, X Yin, ...
arXiv preprint arXiv:2310.01405, 2023
Cited by: 312*; Year: 2023
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu, E Sakhaee, N Li, ...
ICML 2024, 2024
Cited by: 188*; Year: 2024
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
A Pan*, JS Chan*, A Zou*, N Li, S Basart, T Woodside, H Zhang, ...
ICML 2023 (Oral), 26837-26867, 2023
Cited by: 134; Year: 2023
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
N Li*, A Pan*, A Gopal†, S Yue†, D Berrios†, A Gatti‡, JD Li‡, ...
ICML 2024, 2024
Cited by: 103; Year: 2024
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
N Li, Z Han, I Steneker, W Primack, R Goodside, H Zhang, Z Wang, ...
NeurIPS 2024 Red Teaming Workshop (Oral), 2024
Cited by: 25; Year: 2024
Humanity's Last Exam
L Phan*, A Gatti*, Z Han*, N Li*, J Hu, H Zhang, S Shi, M Choi, A Agrawal, ...
arXiv preprint arXiv:2501.14249, 2025
Year: 2025
Robustness Evaluation of Proxy Models Against Adversarial Optimization
A Zou, L Phan, N Li, JS Chan, M Mazeika, A O'Gara, S Basart, J Ng, ...