Baichuan 2: Open Large-scale Language Models. A Yang, B Xiao, B Wang, B Zhang, C Bian, C Yin, C Lv, D Pan, D Wang, et al. arXiv preprint arXiv:2309.10305, 2023. Cited by 587*.
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset. J Ji, M Liu, J Dai, X Pan, C Zhang, C Bian, B Chen, R Sun, Y Wang, et al. Advances in Neural Information Processing Systems 36, 24678-24704, 2023. Cited by 335.
Safe RLHF: Safe Reinforcement Learning from Human Feedback. J Dai, X Pan, R Sun, J Ji, X Xu, M Liu, Y Wang, Y Yang. The Twelfth International Conference on Learning Representations (Spotlight), 2024. Cited by 271.
AI Alignment: A Comprehensive Survey. J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang, Y Duan, Z He, J Zhou, et al. arXiv preprint arXiv:2310.19852, 2023. Cited by 245.
Constrained Update Projection Approach to Safe Policy Optimization. L Yang, J Ji, J Dai, L Zhang, B Zhou, P Li, Y Yang, G Pan. Advances in Neural Information Processing Systems 35, 9111-9124, 2022. Cited by 75*.
Safety Gymnasium: A Unified Safe Reinforcement Learning Benchmark. J Ji, B Zhang, J Zhou, X Pan, W Huang, R Sun, Y Geng, Y Zhong, J Dai, et al. Advances in Neural Information Processing Systems 36, 2023. Cited by 74*.
Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction. J Ji, B Chen, H Lou, D Hong, B Zhang, X Pan, J Dai, Y Yang. arXiv preprint arXiv:2402.02416, 2024. Cited by 50.
OmniSafe: An Infrastructure for Accelerating Safe Reinforcement Learning Research. J Ji, J Zhou, B Zhang, J Dai, X Pan, R Sun, W Huang, Y Geng, M Liu, et al. Journal of Machine Learning Research 25 (285), 1-6, 2024. Cited by 49.
PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference. J Ji, D Hong, B Zhang, B Chen, J Dai, B Zheng, T Qiu, B Li, Y Yang. arXiv preprint arXiv:2406.15513, 2024. Cited by 31*.
Augmented Proximal Policy Optimization for Safe Reinforcement Learning. J Dai, J Ji, L Yang, Q Zheng, G Pan. Proceedings of the AAAI Conference on Artificial Intelligence 37 (6), 7288-7295, 2023. Cited by 15.
SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset. J Dai, T Chen, X Wang, Z Yang, T Chen, J Ji, Y Yang. Advances in Neural Information Processing Systems 37, 17161-17214, 2025. Cited by 4.
Rethinking Information Structures in RLHF: Reward Generalization from a Graph Theory Perspective. T Qiu, F Zeng, J Ji, D Yan, K Wang, J Zhou, H Yang, J Dai, X Pan, Y Yang. arXiv preprint arXiv:2402.10184, 2024. Cited by 4.
Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback. J Ji, J Zhou, H Lou, B Chen, D Hong, X Wang, W Chen, K Wang, R Pan, et al. arXiv preprint arXiv:2412.15838, 2024. Cited by 3.
Safe Reinforcement Learning Using Finite-Horizon Gradient-Based Estimation. J Dai, Y Yang, Q Zheng, G Pan. Forty-first International Conference on Machine Learning, 2024. Cited by 1.