Representation Engineering: A Top-Down Approach to AI Transparency. A Zou, L Phan, S Chen, J Campbell, P Guo, R Ren, A Pan, X Yin, ... arXiv preprint arXiv:2310.01405, 2023 | 304 | 2023 |
Eight Methods to Evaluate Robust Unlearning in LLMs. A Lynch*, P Guo*, A Ewart*, S Casper, D Hadfield-Menell. arXiv preprint arXiv:2402.16835, 2024 | 43* | 2024 |
Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs. A Sheshadri*, A Ewart*, P Guo*, A Lynch*, C Wu*, V Hebbar*, H Sleight, ... arXiv preprint arXiv:2407.15549, 2024 | 19* | 2024 |
Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching. J Campbell*, R Ren*, P Guo*. arXiv preprint arXiv:2311.15131, 2023 | 15 | 2023 |
Prune and Tune: Improving Efficient Pruning Techniques for Massive Language Models. A Syed*, PH Guo*, V Sundarapandiyan* | 14 | 2023 |
Robust Knowledge Unlearning via Mechanistic Localization. P Guo*, A Syed*, A Sheshadri, A Ewart, GK Dziugaite. Spotlight at ICML 2024 Workshop on Mechanistic Interpretability, 2024 | 3* | 2024 |
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization. P Guo*, A Syed*, A Sheshadri, A Ewart, GK Dziugaite. arXiv preprint arXiv:2410.12949, 2024 | 2 | 2024 |
Bandit-Based Multi-Start Strategies for Global Continuous Optimization. P Guo, MC Fu. 2022 Winter Simulation Conference (WSC), 3194-3205, 2022 | 2 | 2022 |