Representation Engineering: A Top-Down Approach to AI Transparency. A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, et al. arXiv preprint arXiv:2310.01405, 2023. Cited by 335.
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. A. Pan, K. Bhatia, J. Steinhardt. arXiv preprint arXiv:2201.03544, 2022. Cited by 170.
Foundational Challenges in Assuring Alignment and Safety of Large Language Models. U. Anwar, A. Saparov, J. Rando, D. Paleka, M. Turpin, P. Hase, E. S. Lubana, et al. arXiv preprint arXiv:2404.09932, 2024. Cited by 140.
Do the Rewards Justify the Means? Measuring Trade-offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark. A. Pan, J. S. Chan, A. Zou, N. Li, S. Basart, T. Woodside, H. Zhang, S. Emmons, et al. International Conference on Machine Learning, 26837-26867, 2023. Cited by 130.
The WMDP Benchmark: Measuring and Reducing Malicious Use with Unlearning. N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A. K. Dombrowski, et al. arXiv preprint arXiv:2403.03218, 2024. Cited by 104.
Feedback Loops with Language Models Drive In-Context Reward Hacking. A. Pan, E. Jones, M. Jagadeesan, J. Steinhardt. arXiv preprint arXiv:2402.06627, 2024. Cited by 21.
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? R. Ren, S. Basart, A. Khoja, A. Gatti, L. Phan, X. Yin, M. Mazeika, A. Pan, et al. arXiv preprint arXiv:2407.21792, 2024. Cited by 13.
Improving Robustness of Reinforcement Learning for Power System Control with Adversarial Training. A. Pan, Y. Lee, H. Zhang, Y. Chen, Y. Shi. arXiv preprint arXiv:2110.08956, 2021. Cited by 11.
LatentQA: Teaching LLMs to Decode Activations into Natural Language. A. Pan, L. Chen, J. Steinhardt. arXiv preprint arXiv:2412.08686, 2024.