Nicholas Schiefer

Cited by

	All	Since 2020
Citations	4413	4395
h-index	21	20
i10-index	21	21

Citations per year: 2021: 12; 2022: 47; 2023: 961; 2024: 2765; 2025: 584

Public access

6 articles available
0 articles not available

Based on funding mandates

Co-authors

Zac Hatfield Dodds	Anthropic; Australian National University	Verified email at anu.edu.au
Jared Kaplan	Johns Hopkins University & Anthropic	Verified email at pha.jhu.edu
Robert Lasenby	Stanford University	Verified email at stanford.edu
Carol Chen	Member of Technical Staff	Verified email at anthropic.com
Christopher Olah	Anthropic	Verified email at google.com
Dario Amodei	CEO and Co-Founder at Anthropic	Verified email at anthropic.com
Catherine Olsson	Anthropic	Verified email at mit.edu
Dawn Drain	Microsoft	Verified email at microsoft.com
Roger Grosse	Associate Professor, University of Toronto	Verified email at cs.toronto.edu
Erik Winfree	California Institute of Technology	Verified email at caltech.edu
Shyam Narayanan	PhD Student, MIT	Verified email at mit.edu
Piotr Indyk	Professor of Electrical Engineering and Computer Science, MIT	Verified email at mit.edu
Alexander Shraer	Google	Verified email at google.com
Kfir Lev-Ari	Apple	Verified email at alumni.technion.ac.il
Tao Lin	Meta Platforms, Inc.	Verified email at fb.com
Anders Aamand	University of Copenhagen	Verified email at mit.edu
Ronitt Rubinfeld	Professor of Computer Science, MIT and Tel Aviv University	Verified email at csail.mit.edu
Daniel Jackson	MIT	Verified email at mit.edu
Geoffrey Litt	PhD Student, MIT	Verified email at mit.edu
Helen Xu	Georgia Institute of Technology	Verified email at gatech.edu


Nicholas Schiefer

Anthropic

Verified email at mit.edu


Title	Cited by	Year
Constitutional AI: Harmlessness from AI feedback. Y Bai, S Kadavath, S Kundu, A Askell, J Kernion, A Jones, A Chen, ... arXiv preprint arXiv:2212.08073, 2022	1345	2022
Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. D Ganguli, L Lovitt, J Kernion, A Askell, Y Bai, S Kadavath, B Mann, ... arXiv preprint arXiv:2209.07858, 2022	517	2022
Language models (mostly) know what they know. S Kadavath, T Conerly, A Askell, T Henighan, D Drain, E Perez, ... arXiv preprint arXiv:2207.05221, 2022	371	2022
Towards monosemanticity: Decomposing language models with dictionary learning. T Bricken, A Templeton, J Batson, B Chen, A Jermyn, T Conerly, N Turner, ... Transformer Circuits Thread 2, 2023	321	2023
Discovering language model behaviors with model-written evaluations. E Perez, S Ringer, K Lukosiute, K Nguyen, E Chen, S Heiner, C Pettit, ... Findings of the Association for Computational Linguistics: ACL 2023, 13387-13434, 2023	258	2023
Toy models of superposition. N Elhage, T Hume, C Olsson, N Schiefer, T Henighan, S Kravec, ... arXiv preprint arXiv:2209.10652, 2022	235	2022
Towards measuring the representation of subjective global opinions in language models. E Durmus, K Nguyen, TI Liao, N Schiefer, A Askell, A Bakhtin, C Chen, ... arXiv preprint arXiv:2306.16388, 2023	181	2023
Towards understanding sycophancy in language models. M Sharma, M Tong, T Korbak, D Duvenaud, A Askell, SR Bowman, ... arXiv preprint arXiv:2310.13548, 2023	177	2023
The capacity for moral self-correction in large language models. D Ganguli, A Askell, N Schiefer, TI Liao, K Lukošiūtė, A Chen, A Goldie, ... arXiv preprint arXiv:2302.07459, 2023	154	2023
Dawn Drain D Ganguli, L Lovitt, AA Jackson Kernion, Y Bai, S Kadavath, B Mann, ... Nelson Elhage, Sheer El-Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom …, 2022	145	2022
Sleeper agents: Training deceptive LLMs that persist through safety training. E Hubinger, C Denison, J Mu, M Lambert, M Tong, M MacDiarmid, ... arXiv preprint arXiv:2401.05566, 2024	140	2024
Measuring faithfulness in chain-of-thought reasoning. T Lanham, A Chen, A Radhakrishnan, B Steiner, C Denison, ... arXiv preprint arXiv:2307.13702, 2023	112	2023
Measuring progress on scalable oversight for large language models. SR Bowman, J Hyun, E Perez, E Chen, C Pettit, S Heiner, K Lukošiūtė, ... arXiv preprint arXiv:2211.03540, 2022	108	2022
Many-shot jailbreaking. C Anil, E Durmus, N Panickssery, M Sharma, J Benton, S Kundu, J Batson, ... Advances in Neural Information Processing Systems 37, 129696-129742, 2025	99	2025
Question decomposition improves the faithfulness of model-generated reasoning. A Radhakrishnan, K Nguyen, A Chen, C Chen, C Denison, D Hernandez, ... arXiv preprint arXiv:2307.11768, 2023	52	2023
Superposition, memorization, and double descent. T Henighan, S Carter, T Hume, N Elhage, R Lasenby, S Fort, N Schiefer, ... Transformer Circuits Thread 6, 24, 2023	33	2023
Specific versus general principles for constitutional AI. S Kundu, Y Bai, S Kadavath, A Askell, A Callahan, A Chen, A Goldie, ... arXiv preprint arXiv:2310.13798, 2023	30	2023
Universal Computation and Optimal Construction in the Chemical Reaction Network-Controlled Tile Assembly Model. N Schiefer, E Winfree. 21st International Conference on DNA Computing and Molecular Programming …, 2015	26	2015
Sycophancy to subterfuge: Investigating reward-tampering in large language models. C Denison, M MacDiarmid, F Barez, D Duvenaud, S Kravec, S Marks, ... arXiv preprint arXiv:2406.10162, 2024	25	2024
FoundationDB Record Layer: A Multi-Tenant Structured Datastore. C Chrysafis, B Collins, S Dugas, J Dunkelberger, M Ehsan, S Gray, ... Proceedings of the 2019 International Conference on Management of Data, 1787 …, 2019	25	2019
