MMLU-Pro: A more robust and challenging multi-task language understanding benchmark Y Wang, X Ma, G Zhang, Y Ni, A Chandra, S Guo, W Ren, A Arulraj, X He, ... NeurIPS 2024 (Spotlight), 2024 | 148 | 2024 |
Mantis: Interleaved multi-image instruction tuning D Jiang, X He, H Zeng, C Wei, M Ku, Q Liu, W Chen TMLR 2024, 2024 | 77 | 2024 |
VideoScore: Building automatic metrics to simulate fine-grained human feedback for video generation X He, D Jiang, G Zhang, M Ku, A Soni, S Siu, H Chen, A Chandra, Z Jiang, ... EMNLP Main 2024, 2024 | 27 | 2024 |
MEGA-Bench: Scaling multimodal evaluation to over 500 real-world tasks J Chen, T Liang, S Siu, Z Wang, K Wang, Y Wang, Y Ni, W Zhu, Z Jiang, ... ICLR 2025, 2024 | 3 | 2024 |
Guiding Through Complexity: What Makes Good Supervision for Hard Reasoning Tasks? X He, D Yin, N Peng NAACL Main 2025, 2024 | | 2024 |