MT-Bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues G Bai, J Liu, X Bu, Y He, J Liu, Z Zhou, Z Lin, W Su, T Ge, B Zheng, ... arXiv preprint arXiv:2402.14762, 2024 | 53 | 2024 |
GraphReader: Building graph-based agent to enhance long-context abilities of large language models S Li, Y He, H Guo, X Bu, G Bai, J Liu, J Liu, X Qu, Y Li, W Ouyang, W Su, ... arXiv preprint arXiv:2406.14550, 2024 | 18 | 2024 |
Using auxiliary tasks in multimodal fusion of wav2vec 2.0 and BERT for multimodal emotion recognition D Sun, Y He, J Han ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and …, 2023 | 17 | 2023 |
Chinese SimpleQA: A Chinese factuality evaluation for large language models Y He, S Li, J Liu, Y Tan, W Wang, H Huang, X Bu, H Guo, C Hu, B Zheng, ... arXiv preprint arXiv:2411.07140, 2024 | 9 | 2024 |
Aspect-Sentiment-Multiple-Opinion Triplet Extraction F Wang, Y Li, S Zhong, C Yin, Y He Natural Language Processing and Chinese Computing: 10th CCF International …, 2021 | 4 | 2021 |
Token preference optimization with self-calibrated visual-anchored rewards for hallucination mitigation J Gu, Y Wang, M Cao, P Bu, J Song, Y He, S Li, B Zheng arXiv preprint arXiv:2412.14487, 2024 | 2 | 2024 |
MuSC: Improving Complex Instruction Following with Multi-granularity Self-Contrastive Training H Huang, J Liu, Y He, S Li, B Xu, C Zhu, M Yang, T Zhao arXiv preprint arXiv:2502.11541, 2025 | 1 | 2025 |
ChineseEcomQA: A Scalable E-commerce Concept Evaluation Benchmark for Large Language Models H Chen, K Lv, C Hu, Y Li, Y Yuan, Y He, X Zhang, L Liu, S Liu, W Su, ... arXiv preprint arXiv:2502.20196, 2025 | | 2025 |
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? Y He, S Li, J Liu, W Wang, X Bu, G Zhang, Z Peng, Z Zhang, W Su, ... arXiv preprint arXiv:2502.19361, 2025 | | 2025 |
AIR: Complex Instruction Generation via Automatic Iterative Refinement W Liu, Y He, H Huang, C Hu, J Liu, S Li, W Su, B Zheng arXiv preprint arXiv:2502.17787, 2025 | | 2025 |
CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models A Zhang, M Dong, J Liu, W Zhang, Y Wang, J Yang, G Zhang, T Liu, ... arXiv preprint arXiv:2502.16614, 2025 | | 2025 |
"See the World, Discover Knowledge": A Chinese Factuality Evaluation for Large Vision Language Models J Gu, Y Wang, P Bu, C Wang, Z Wang, T Song, D Wei, J Yuan, Y Zhao, ... arXiv preprint arXiv:2502.11718, 2025 | | 2025 |
Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models Y Tan, B Zheng, B Zheng, K Cao, H Jing, J Wei, J Liu, Y He, W Su, X Zhu, ... arXiv preprint arXiv:2412.15265, 2024 | | 2024 |
WiS Platform: Enhancing Evaluation of LLM-Based Multi-Agent Systems Through Game-Based Analysis C Hu, J Zheng, Y He, H Guo, J Jiang, H Zhu, K Sun, Y Jiang, W Su, ... arXiv preprint arXiv:2412.03359, 2024 | | 2024 |
2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision S Li, Y He, H Huang, X Bu, J Liu, H Guo, W Wang, J Gu, W Su, B Zheng arXiv preprint arXiv:2410.19720, 2024 | | 2024 |
HITMI&T at SemEval-2022 Task 4: Investigating Task-Adaptive Pretraining And Attention Mechanism On PCL Detection Z Liu, Y He, F Zhuang, B Xu Proceedings of the 16th International Workshop on Semantic Evaluation …, 2022 | | 2022 |