Is DPO superior to PPO for LLM alignment? A comprehensive study S Xu, W Fu, J Gao, W Ye, W Liu, Z Mei, G Wang, C Yu, Y Wu arXiv preprint arXiv:2404.10719, 2024 | 71 | 2024 |
Revisiting some common practices in cooperative multi-agent reinforcement learning W Fu, C Yu, Z Xu, J Yang, Y Wu arXiv preprint arXiv:2206.07505, 2022 | 42 | 2022 |
Continuously discovering novel strategies via reward-switching policy optimization Z Zhou, W Fu, B Zhang, Y Wu arXiv preprint arXiv:2204.02246, 2022 | 33 | 2022 |
Learning agile bipedal motions on a quadrupedal robot Y Li, J Li, W Fu, Y Wu 2024 IEEE International Conference on Robotics and Automation (ICRA), 9735-9742, 2024 | 9 | 2024 |
SRL: Scaling distributed reinforcement learning to over ten thousand cores Z Mei, W Fu, J Gao, G Wang, H Zhang, Y Wu arXiv preprint arXiv:2306.16688, 2023 | 5 | 2023 |
Iteratively learn diverse strategies with state distance information W Fu, W Du, J Li, S Chen, J Zhang, Y Wu Advances in Neural Information Processing Systems 36, 2024 | 4 | 2024 |
ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation Z Mei, W Fu, K Li, G Wang, H Zhang, Y Wu arXiv preprint arXiv:2406.14088, 2024 | 3 | 2024 |
On designing effective RL reward at training time for LLM reasoning J Gao, S Xu, W Ye, W Liu, C He, W Fu, Z Mei, G Wang, Y Wu arXiv preprint arXiv:2410.15115, 2024 | 2 | 2024 |
Iteratively learning novel strategies with diversity measured in state distances W Fu, W Du, J Li, S Chen, J Zhang, Y Wu | 1 | 2023 |
Unlocking the Potential of MAPPO with Asynchronous Optimization W Fu, C Yu, Y Li, Y Wu Artificial Intelligence: First CAAI International Conference, CICAI 2021 …, 2021 | | 2021 |