AI alignment: A comprehensive survey J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang, Y Duan, Z He, J Zhou, ... arXiv preprint arXiv:2310.19852, 2023 | 226 | 2023 |
AI deception: A survey of examples, risks, and potential solutions PS Park, S Goldstein, A O’Gara, M Chen, D Hendrycks Patterns 5 (5), 2024 | 161 | 2024 |
Hoodwinked: Deception and cooperation in a text-based game for language models A O'Gara arXiv preprint arXiv:2308.01404, 2023 | 26 | 2023 |
AI deception: A survey of examples, risks, and potential solutions PS Park, S Goldstein, A O’Gara, M Chen, D Hendrycks arXiv, URL: http://arxiv.org/abs/2308.14752, 2023 | 9 | 2023 |
AI Deception: A Survey of Examples, Risks, and Potential Solutions PS Park, S Goldstein, A O’Gara, M Chen, D Hendrycks arXiv, 1-30, 2023 | 6 | 2023 |
AI alignment: A comprehensive survey J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang, Y Duan, Z He, J Zhou, ... arXiv preprint arXiv:2310.19852, 2023 | 6 | 2023 |
Open Problems in Machine Unlearning for AI Safety F Barez, T Fu, A Prabhu, S Casper, A Sanyal, A Bibi, A O'Gara, R Kirk, ... arXiv preprint arXiv:2501.04952, 2025 | 1 | 2025 |
Robustness Evaluation of Proxy Models against Adversarial Optimization A Zou, L Phan, N Li, JS Chan, M Mazeika, A O'Gara, S Basart, J Ng, ... | | |