Andy Zou

Cited by

	All	Since 2020
Citations	7761	7751
h-index	16	16
i10-index	17	17

Citations per year
Year	Citations
2021	55
2022	231
2023	1534
2024	5421
2025	488

Public access

2 articles available
0 articles not available

Based on the requirements of the funding agencies


PhD Student, Carnegie Mellon University

Verified email at andrew.cmu.edu

ML Safety, AI Safety


Title	Authors	Venue	Cited by	Year
Measuring Massive Multitask Language Understanding	D Hendrycks, C Burns, S Basart, A Zou, M Mazeika, D Song, J Steinhardt	ICLR	3093	2020
Beyond the imitation game: Quantifying and extrapolating the capabilities of language models	A Srivastava, A Rastogi, A Rao, AAM Shoeb, A Abid, A Fisch, AR Brown, ...	TMLR	1291	2022
Universal and Transferable Adversarial Attacks on Aligned Language Models	A Zou, Z Wang, N Carlini, N Milad, JZ Kolter, M Fredrikson	arXiv preprint arXiv:2307.15043	1100	2023
Lessons from the Trenches on Reproducible Evaluation of Language Models	S Biderman, H Schoelkopf, L Sutawika, L Gao, J Tow, B Abbasi, AF Aji, ...	arXiv preprint arXiv:2405.14782	666*	2024
Scaling Out-of-Distribution Detection for Real-World Settings	D Hendrycks, S Basart, M Mazeika, A Zou, J Kwon, M Mostajabi, ...	ICML	489	2021
Representation Engineering: A Top-Down Approach to AI Transparency	A Zou, L Phan, S Chen, J Campbell, P Guo, R Ren, A Pan, X Yin, ...	arXiv preprint arXiv:2310.01405	304	2023
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal	M Mazeika, L Phan, X Yin, A Zou, Z Wang, N Mu, E Sakhaee, N Li, ...	ICML	182*	2024
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures	D Hendrycks, A Zou, M Mazeika, L Tang, D Song, J Steinhardt	CVPR	151	2021
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark	A Pan, CJ Shern, A Zou, N Li, S Basart, T Woodside, J Ng, H Zhang, ...	ICML	132	2023
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning	N Li, A Pan, A Gopal, S Yue, D Berrios, A Gatti, JD Li, AK Dombrowski, ...	ICML	90	2024
What Would Jiminy Cricket Do? Towards Agents That Behave Morally	M Mazeika, A Zou, S Patel, C Zhu, J Navarro, D Song, B Li, J Steinhardt, ...	NeurIPS	72*	2021
Improving Alignment and Robustness with Circuit Breakers	A Zou, L Phan, J Wang, D Duenas, M Lin, M Andriushchenko, R Wang, ...	arXiv preprint arXiv:2406.04313	60*	2024
Forecasting Future World Events with Neural Networks	A Zou, T Xiao, R Jia, J Kwon, M Mazeika, R Li, D Song, J Steinhardt, ...	NeurIPS	34	2022
The Trojan Detection Challenge	M Mazeika, D Hendrycks, H Li, X Xu, S Hough, A Zou, A Rajabi, Q Yao, ...	NeurIPS 2022 Competition Track, 279-291	31	2022
Tamper-Resistant Safeguards for Open-Weight LLMs	R Tamirisa, B Bharathi, L Phan, A Zhou, A Gatti, T Suresh, M Lin, J Wang, ...	arXiv preprint arXiv:2408.00761	21	2024
Unlocking Deterministic Robustness Certification on ImageNet	K Hu, A Zou, Z Wang, K Leino, M Fredrikson	NeurIPS	19*	2023
How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios	M Mazeika, E Tang, A Zou, S Basart, D Song, D Forsyth, J Steinhardt, ...	NeurIPS	16	2022
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents	M Andriushchenko, A Souly, M Dziemian, D Duenas, M Lin, J Wang, ...	arXiv preprint arXiv:2410.09024	6	2024
How Hard Is Trojan Detection in DNNs? Fooling Detectors With Evasive Trojans	M Mazeika, A Zou, A Arora, P Pleskov, D Song, D Hendrycks, B Li, ...		4	2023
Humanity's Last Exam	L Phan, A Gatti, Z Han, N Li, J Hu, H Zhang, S Shi, M Choi, A Agrawal, ...	arXiv preprint arXiv:2501.14249		2025
