Stephen Casper

Cited by

	All	Since 2020
Citations	1490	1488
h-index	18	18
i10-index	24	24

Citations per year
Year	2020	2021	2022	2023	2024	2025
Citations	6	14	45	222	1101	97

Open access

2 articles available
0 articles not available

Based on funding agency requirements

Co-authors

Dylan Hadfield-Menell	Massachusetts Institute of Technology	Verified email at csail.mit.edu
Daniel Filan	PhD Student, UC Berkeley	Verified email at berkeley.edu
Gabriel Kreiman	Professor, Harvard Medical School and Children's Hospital	Verified email at tch.harvard.edu
Andrew Critch	UC Berkeley, Department of Electrical Engineering and Computer Sciences	Verified email at eecs.berkeley.edu
Stuart Russell	Professor of Computer Science, University of California, Berkeley	Verified email at cs.berkeley.edu
Shlomi Hod	PhD Candidate, Boston University	Verified email at bu.edu
Cody Wild	Google DeepMind	Verified email at google.com
Soroush Pour	Harmony Intelligence	Verified email at soroushjp.com
Javier Rando	PhD Student @ ETH Zurich	Verified email at ai.ethz.ch
Arush Tagade	ML Researcher, Leap Laboratories	Verified email at leap-labs.com
Anson Ho	Epoch AI	Verified email at epochai.org
Tilman Räuker	Independent	Verified email at pivotal-research.org
Jérémy Scheurer	Apollo Research	Verified email at apolloresearch.ai
Ben Bucknall	DPhil Student, University of Oxford	Verified email at robots.ox.ac.uk
David Scott Krueger	University Assistant Professor, University of Cambridge	Verified email at cam.ac.uk
Xavier Boix	MIT	Verified email at mit.edu
Kasper Vinken	Harvard Medical School	Verified email at hms.harvard.edu
Rusheb Shah	Apollo Research	Verified email at apolloresearch.ai
Jason Lin	DeepMind / Stanford	Verified email at stanford.edu
Gatlen Culp	Massachusetts Institute of Technology	Verified email at mit.edu

Stephen Casper

PhD student, MIT

Verified email at mit.edu

AI safety, AI responsibility, red-teaming, robustness, auditing


Title	Cited by	Year
Open problems and fundamental limitations of reinforcement learning from human feedback. S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ... arXiv preprint arXiv:2307.15217, 2023	456	2023
Toward transparent AI: A survey on interpreting the inner structures of deep neural networks. T Räuker, A Ho, S Casper, D Hadfield-Menell. 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 464-483, 2023	190	2023
Foundational challenges in assuring alignment and safety of large language models. U Anwar, A Saparov, J Rando, D Paleka, M Turpin, P Hase, ES Lubana, ... arXiv preprint arXiv:2404.09932, 2024	116	2024
Rethinking machine unlearning for large language models. S Liu, Y Yao, J Jia, S Casper, N Baracaldo, P Hase, Y Yao, CY Liu, X Xu, ... arXiv preprint arXiv:2402.08787, 2024	97	2024
Scalable and transferable black-box jailbreaks for language models via persona modulation. R Shah, S Pour, A Tagade, S Casper, J Rando. arXiv preprint arXiv:2311.03348, 2023	88	2023
Explore, establish, exploit: Red teaming language models from scratch. S Casper, J Lin, J Kwon, G Culp, D Hadfield-Menell. arXiv preprint arXiv:2306.09442, 2023	84	2023
Black-box access is insufficient for rigorous AI audits. S Casper, C Ezell, C Siegmann, N Kolt, TL Curtis, B Bucknall, A Haupt, ... The 2024 ACM Conference on Fairness, Accountability, and Transparency, 2254-2272, 2024	60	2024
Eight methods to evaluate robust unlearning in LLMs. A Lynch, P Guo, A Ewart, S Casper, D Hadfield-Menell. arXiv preprint arXiv:2402.16835, 2024	39	2024
Red teaming deep neural networks with feature synthesis tools. S Casper, T Bu, Y Li, J Li, K Zhang, K Hariharan, D Hadfield-Menell. Advances in Neural Information Processing Systems 36, 80470-80516, 2023	35*	2023
Clusterability in neural networks. D Filan, S Casper, S Hod, C Wild, A Critch, S Russell. arXiv preprint arXiv:2103.03386, 2021	35	2021
Frivolous units: Wider networks are not really that wide. S Casper, X Boix, V D'Amario, L Guo, M Schrimpf, K Vinken, G Kreiman. Proceedings of the AAAI Conference on Artificial Intelligence 35 (8), 6921-6929, 2021	32*	2021
Robust feature-level adversaries are interpretability tools. S Casper, M Nadeau, D Hadfield-Menell, G Kreiman. Advances in Neural Information Processing Systems 35, 33093-33106, 2022	31	2022
Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? K Liu, S Casper, D Hadfield-Menell, J Andreas. arXiv preprint arXiv:2312.03729, 2023	28	2023
Open problems in technical AI governance. A Reuel, B Bucknall, S Casper, T Fist, L Soder, O Aarne, L Hammond, ... arXiv preprint arXiv:2407.14981, 2024	22	2024
Defending Against Unforeseen Failure Modes with Latent Adversarial Training. S Casper, L Schulze, O Patel, D Hadfield-Menell. arXiv preprint arXiv:2403.05030, 2024	20	2024
Latent adversarial training improves robustness to persistent harmful behaviors in LLMs. A Sheshadri, A Ewart, P Guo, A Lynch, C Wu, V Hebbar, H Sleight, ... arXiv preprint arXiv:2407.15549, 2024	19*	2024
Multiplex base editing to convert TAG into TAA codons in the human genome. Y Chen, E Hysolli, A Chen, S Casper, S Liu, K Yang, C Liu, G Church. Nature Communications 13 (1), 4482, 2022	19	2022
The AI risk repository: A comprehensive meta-review, database, and taxonomy of risks from artificial intelligence. P Slattery, AK Saeri, EAC Grundy, J Graham, M Noetel, R Uuk, J Dao, ... arXiv preprint arXiv:2408.12622, 2024	18	2024
Probing neural dialog models for conversational understanding. A Saleh, T Deutsch, S Casper, Y Belinkov, S Shieber. arXiv preprint arXiv:2006.08331, 2020	15	2020
Graphical clusterability and local specialization in deep neural networks. S Casper, S Hod, D Filan, C Wild, A Critch, S Russell. ICLR 2022 Workshop on PAIR^2Struct: Privacy …, 2022	14	2022

Articles 1–20
