Tony Wang
PhD student, MIT
Verified email at mit.edu
Title / Cited by / Year
Open problems and fundamental limitations of reinforcement learning from human feedback
S Casper, X Davies, C Shi, TK Gilbert, J Scheurer, J Rando, R Freedman, ...
arXiv preprint arXiv:2307.15217, 2023
Cited by 480, 2023
Adversarial Policies Beat Superhuman Go AIs
TT Wang, A Gleave, N Belrose, T Tseng, J Miller, MD Dennis, Y Duan, ...
arXiv preprint arXiv:2211.00241, 2022
Cited by 63*, 2022
Neural-guided, bidirectional program search for abstraction and reasoning
S Alford, A Gandhi, A Rangamani, A Banburski, T Wang, S Dandekar, ...
Complex Networks & Their Applications X: Volume 1, Proceedings of the Tenth …, 2022
Cited by 24, 2022
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
D Halawi, A Wei, E Wallace, TT Wang, N Haghtalab, J Steinhardt
arXiv preprint arXiv:2406.20053, 2024
Cited by 23, 2024
SDP Methods for Sensitivity-Constrained Privacy Funnel and Information Bottleneck Problems
Y Bu, T Wang, GW Wornell
2021 IEEE International Symposium on Information Theory (ISIT), 49-54, 2021
Cited by 6, 2021
Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
TT Wang, J Hughes, H Sleight, R Schaeffer, R Agrawal, F Barez, ...
arXiv preprint arXiv:2412.02159, 2024
Cited by 3, 2024
Forbidden Facts: An Investigation of Competing Objectives in Llama-2
TT Wang, M Wang, K Hariharan, N Shavit
arXiv preprint arXiv:2312.08793, 2023
Cited by 3, 2023
Can Go AIs be adversarially robust?
T Tseng, E McLean, K Pelrine, TT Wang, A Gleave
arXiv preprint arXiv:2406.12843, 2024
Cited by 1, 2024
A connectomics-driven analysis reveals novel characterization of border regions in mouse visual cortex
N Tumma, L Kong, S Sawmya, TT Wang, N Shavit
bioRxiv, 2024.05.24.595837, 2024
2024
Cliff-Learning
TT Wang, I Zablotchi, N Shavit, JS Rosenfeld
arXiv preprint arXiv:2302.07348, 2023
2023
Adversarial Examples in Simpler Settings
TT Wang
Massachusetts Institute of Technology, 2021
2021