MMLU-Pro: A More Robust and Challenging Multi-task Language Understanding Benchmark. Y Wang, X Ma, G Zhang, Y Ni, A Chandra, S Guo, W Ren, A Arulraj, X He, ... NeurIPS 2024 (Spotlight), 2024 | 161* | 2024 |
Augmenting Black-box LLMs with Medical Textbooks for Clinical Question Answering. Y Wang, X Ma, W Chen. Findings of EMNLP 2024, 2023 | 60* | 2023 |
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark. X Yue, T Zheng, Y Ni, Y Wang, K Zhang, S Tong, Y Sun, B Yu, G Zhang, ... arXiv preprint arXiv:2409.02813, 2024 | 35 | 2024 |
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series. G Zhang, S Qu, J Liu, C Zhang, C Lin, CL Yu, D Pan, E Cheng, J Liu, ... arXiv preprint arXiv:2405.19327, 2024 | 33 | 2024 |
MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale. J Guo, T Zheng, Y Bai, B Li, Y Wang, K Zhu, Y Li, G Neubig, W Chen, ... arXiv preprint arXiv:2412.05237, 2024 | 5 | 2024 |
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks. J Chen, T Liang, S Siu, Z Wang, K Wang, Y Wang, Y Ni, W Zhu, Z Jiang, ... ICLR 2025, 2024 | 3 | 2024 |
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate. Y Wang, X Yue, W Chen. arXiv preprint arXiv:2501.17703, 2025 | 2 | 2025 |
PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents. J Wang, Y Zhang, Y Ji, Y Zhang, C Jiang, Y Wang, K Zhu, Z Wang, ... arXiv preprint arXiv:2406.13923, 2024 | 2 | 2024 |