Refusal in Language Models Is Mediated by a Single Direction | A Arditi, O Obeso, A Syed, D Paleka, N Rimsky, W Gurnee, N Nanda | NeurIPS 2024 | 95* | 2024 |
A framework for single-item NFT auction mechanism design | J Milionis, D Hirsch, A Arditi, P Garimidi | Proceedings of the 2022 ACM CCS Workshop on Decentralized Finance and …, 2022 | 14* | 2022 |
Refusal in LLMs is mediated by a single direction | A Arditi, O Balcells, A Syed, W Gurnee, N Nanda | AI Alignment Forum, 2024 | 12* | 2024 |
Refusal mechanisms: initial experiments with Llama-2-7b-chat | A Arditi, O Obeso | AI Alignment Forum, 2023 | 3* | 2023 |