Kaiyue Wen
PhD Student, Stanford University
Verified email at stanford.edu
Title
Cited by
Year
On transferability of prompt tuning for natural language processing
Y Su, X Wang, Y Qin, CM Chan, Y Lin, H Wang, K Wen, Z Liu, P Li, J Li, ...
arXiv preprint arXiv:2111.06719, 2021
Cited by: 152, Year: 2021
How Sharpness-Aware Minimization Minimizes Sharpness?
K Wen, T Ma, Z Li
International Conference on Learning Representations
Cited by: 88*
Finding Skill Neurons in Pre-trained Transformer-based Language Models
X Wang, K Wen, Z Zhang, L Hou, Z Liu, J Li
arXiv preprint arXiv:2211.07349, 2022
Cited by: 73, Year: 2022
Sharpness minimization algorithms do not only minimize sharpness to achieve better generalization
K Wen, Z Li, T Ma
Advances in Neural Information Processing Systems 36, 1024-1035, 2023
Cited by: 30, Year: 2023
Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars
K Wen, Y Li, B Liu, A Risteski
Advances in Neural Information Processing Systems 36, 38723-38766, 2023
Cited by: 24, Year: 2023
RNNs are not transformers (yet): The key bottleneck on in-context retrieval
K Wen, X Dang, K Lyu
arXiv preprint arXiv:2402.18510, 2024
Cited by: 20, Year: 2024
Benign overfitting in classification: Provably counter label noise with larger models
K Wen, J Teng, J Zhang
arXiv preprint arXiv:2206.00501, 2022
Cited by: 9*, Year: 2022
Residual permutation test for high-dimensional regression coefficient testing
K Wen, T Wang, Y Wang
arXiv preprint arXiv:2211.16182, 2022
Cited by: 6, Year: 2022
Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective
K Wen, Z Li, J Wang, D Hall, P Liang, T Ma
arXiv preprint arXiv:2410.05192, 2024
Cited by: 3, Year: 2024
From sparse dependence to sparse attention: unveiling how chain-of-thought enhances transformer sample efficiency
K Wen, H Zhang, H Lin, J Zhang
arXiv preprint arXiv:2410.05459, 2024
Cited by: 2, Year: 2024
Residual permutation test for regression coefficient testing
K Wen, T Wang, Y Wang
arXiv e-prints, arXiv:2211.16182, 2022
Cited by: 1, Year: 2022
Task Generalization With AutoRegressive Compositional Structure: Can Learning From $D$ Tasks Generalize to $D^{T}$ Tasks?
A Abedsoltan, H Zhang, K Wen, H Lin, J Zhang, M Belkin
arXiv preprint arXiv:2502.08991, 2025
2025
Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
Z Qiu, Z Huang, B Zheng, K Wen, Z Wang, R Men, I Titov, D Liu, J Zhou, ...
arXiv preprint arXiv:2501.11873, 2025
2025
Practically Solving LPN in High Noise Regimes Faster Using Neural Networks
H Jiang, K Wen, Y Chen
arXiv preprint arXiv:2303.07987, 2023
2023