Nicholas Schiefer
Anthropic
Verified email at mit.edu
Title
Cited by
Year
Constitutional AI: Harmlessness from AI feedback
Y Bai, S Kadavath, S Kundu, A Askell, J Kernion, A Jones, A Chen, ...
arXiv preprint arXiv:2212.08073, 2022
Cited by: 1345 | Year: 2022
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned
D Ganguli, L Lovitt, J Kernion, A Askell, Y Bai, S Kadavath, B Mann, ...
arXiv preprint arXiv:2209.07858, 2022
Cited by: 517 | Year: 2022
Language models (mostly) know what they know
S Kadavath, T Conerly, A Askell, T Henighan, D Drain, E Perez, ...
arXiv preprint arXiv:2207.05221, 2022
Cited by: 371 | Year: 2022
Towards monosemanticity: Decomposing language models with dictionary learning
T Bricken, A Templeton, J Batson, B Chen, A Jermyn, T Conerly, N Turner, ...
Transformer Circuits Thread 2, 2023
Cited by: 321 | Year: 2023
Discovering language model behaviors with model-written evaluations
E Perez, S Ringer, K Lukosiute, K Nguyen, E Chen, S Heiner, C Pettit, ...
Findings of the Association for Computational Linguistics: ACL 2023, 13387-13434, 2023
Cited by: 258 | Year: 2023
Toy models of superposition
N Elhage, T Hume, C Olsson, N Schiefer, T Henighan, S Kravec, ...
arXiv preprint arXiv:2209.10652, 2022
Cited by: 235 | Year: 2022
Towards measuring the representation of subjective global opinions in language models
E Durmus, K Nguyen, TI Liao, N Schiefer, A Askell, A Bakhtin, C Chen, ...
arXiv preprint arXiv:2306.16388, 2023
Cited by: 181 | Year: 2023
Towards understanding sycophancy in language models
M Sharma, M Tong, T Korbak, D Duvenaud, A Askell, SR Bowman, ...
arXiv preprint arXiv:2310.13548, 2023
Cited by: 177 | Year: 2023
The capacity for moral self-correction in large language models
D Ganguli, A Askell, N Schiefer, TI Liao, K Lukošiūtė, A Chen, A Goldie, ...
arXiv preprint arXiv:2302.07459, 2023
Cited by: 154 | Year: 2023
Dawn Drain
D Ganguli, L Lovitt, AA Jackson Kernion, Y Bai, S Kadavath, B Mann, ...
Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom …, 2022
Cited by: 145 | Year: 2022
Sleeper agents: Training deceptive LLMs that persist through safety training
E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ...
arXiv preprint arXiv:2401.05566, 2024
Cited by: 140 | Year: 2024
Measuring faithfulness in chain-of-thought reasoning
T Lanham, A Chen, A Radhakrishnan, B Steiner, C Denison, ...
arXiv preprint arXiv:2307.13702, 2023
Cited by: 112 | Year: 2023
Measuring progress on scalable oversight for large language models
SR Bowman, J Hyun, E Perez, E Chen, C Pettit, S Heiner, K Lukošiūtė, ...
arXiv preprint arXiv:2211.03540, 2022
Cited by: 108 | Year: 2022
Many-shot jailbreaking
C Anil, E Durmus, N Panickssery, M Sharma, J Benton, S Kundu, J Batson, ...
Advances in Neural Information Processing Systems 37, 129696-129742, 2025
Cited by: 99 | Year: 2025
Question decomposition improves the faithfulness of model-generated reasoning
A Radhakrishnan, K Nguyen, A Chen, C Chen, C Denison, D Hernandez, ...
arXiv preprint arXiv:2307.11768, 2023
Cited by: 52 | Year: 2023
Superposition, memorization, and double descent
T Henighan, S Carter, T Hume, N Elhage, R Lasenby, S Fort, N Schiefer, ...
Transformer Circuits Thread 6, 24, 2023
Cited by: 33 | Year: 2023
Specific versus general principles for constitutional AI
S Kundu, Y Bai, S Kadavath, A Askell, A Callahan, A Chen, A Goldie, ...
arXiv preprint arXiv:2310.13798, 2023
Cited by: 30 | Year: 2023
Universal Computation and Optimal Construction in the Chemical Reaction Network-Controlled Tile Assembly Model
N Schiefer, E Winfree
21st International Conference on DNA Computing and Molecular Programming …, 2015
Cited by: 26 | Year: 2015
Sycophancy to subterfuge: Investigating reward-tampering in large language models
C Denison, M MacDiarmid, F Barez, D Duvenaud, S Kravec, S Marks, ...
arXiv preprint arXiv:2406.10162, 2024
Cited by: 25 | Year: 2024
FoundationDB Record Layer: A Multi-Tenant Structured Datastore
C Chrysafis, B Collins, S Dugas, J Dunkelberger, M Ehsan, S Gray, ...
Proceedings of the 2019 International Conference on Management of Data, 1787 …, 2019
Cited by: 25 | Year: 2019
Articles 1–20