ExplainaBoard: An Explainable Leaderboard for NLP | P Liu, J Fu, Y Xiao, W Yuan, S Chang, J Dai, Y Liu, Z Ye, ZY Dou, ... | arXiv preprint arXiv:2104.06387, 2021 | 66 | 2021 |
On the Robustness of Reading Comprehension Models to Entity Renaming | J Yan, Y Xiao, S Mukherjee, BY Lin, R Jia, X Ren | arXiv preprint arXiv:2110.08555, 2021 | 17 | 2021 |
Datalab: A platform for data analysis and intervention | Y Xiao, J Fu, W Yuan, V Viswanathan, Z Liu, Y Liu, G Neubig, P Liu | arXiv preprint arXiv:2202.12875, 2022 | 14 | 2022 |
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI | Z Huang, Z Wang, S Xia, X Li, H Zou, R Xu, RZ Fan, L Ye, E Chern, Y Ye, ... | arXiv preprint arXiv:2406.12753, 2024 | 9 | 2024 |
How Far Are We from Believable AI Agents? A Framework for Evaluating the Believability of Human Behavior Simulation | Y Xiao, Y Cheng, J Fu, J Wang, W Li, P Liu | arXiv preprint arXiv:2312.17115, 2023 | 8 | 2023 |
Towards a Client-Centered Assessment of LLM Therapists by Client Simulation | J Wang, Y Xiao, Y Li, C Song, C Xu, C Tan, W Li | arXiv preprint arXiv:2406.12266, 2024 | 6 | 2024 |
LIMO: Less is More for Reasoning | Y Ye, Z Huang, Y Xiao, E Chern, S Xia, P Liu | arXiv preprint arXiv:2502.03387, 2025 | 4 | 2025 |
Are All the Datasets in Benchmark Necessary? A Pilot Study of Dataset Evaluation for Text Classification | Y Xiao, J Fu, SK Ng, P Liu | arXiv preprint arXiv:2205.02129, 2022 | 2 | 2022 |