A survey on evaluation of large language models

Y Chang, X Wang, J Wang, Y Wu, L Yang… - ACM Transactions on …, 2024 - dl.acm.org
Large language models (LLMs) are gaining increasing popularity in both academia and
industry, owing to their unprecedented performance in various applications. As LLMs …

Scientific large language models: A survey on biological & chemical domains

Q Zhang, K Ding, T Lv, X Wang, Q Yin, Y Zhang… - ACM Computing …, 2024 - dl.acm.org
Large Language Models (LLMs) have emerged as a transformative power in enhancing
natural language comprehension, representing a significant stride toward artificial general …

A survey of large language models

WX Zhao, K Zhou, J Li, T Tang, X Wang, Y Hou… - arXiv preprint arXiv …, 2023 - arxiv.org
Language is essentially a complex, intricate system of human expressions governed by
grammatical rules. It poses a significant challenge to develop capable AI algorithms for …

Inadequacies of large language model benchmarks in the era of generative artificial intelligence

TR McIntosh, T Susnjak, N Arachchilage, T Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid rise in popularity of Large Language Models (LLMs) with emerging capabilities
has spurred public curiosity to evaluate and compare different LLMs, leading many …

EvalCrafter: Benchmarking and evaluating large video generation models

Y Liu, X Cun, X Liu, X Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com
The vision and language generative models have been overgrown in recent years. For
video generation various open-sourced models and public-available services have been …

SuperCLUE: A comprehensive Chinese large language model benchmark

L Xu, A Li, L Zhu, H Xue, C Zhu, K Zhao, H He… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have shown the potential to be integrated into human daily
lives. Therefore, user preference is the most critical criterion for assessing LLMs' …

Learning or self-aligning? Rethinking instruction fine-tuning

M Ren, B Cao, H Lin, C Liu, X Han, K Zeng… - arXiv preprint arXiv …, 2024 - arxiv.org
Instruction Fine-tuning (IFT) is a critical phase in building large language models (LLMs).
Previous works mainly focus on the IFT's role in the transfer of behavioral norms and the …

SciAssess: Benchmarking LLM proficiency in scientific literature analysis

H Cai, X Cai, J Chang, S Li, L Yao, C Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent breakthroughs in Large Language Models (LLMs) have revolutionized scientific
literature analysis. However, existing benchmarks fail to adequately evaluate the proficiency …

Can Large Language Models Understand Real-World Complex Instructions?

Q He, J Zeng, W Huang, L Chen, J Xiao, Q He… - Proceedings of the …, 2024 - ojs.aaai.org
Large language models (LLMs) can understand human instructions, showing their potential
for pragmatic applications beyond traditional NLP tasks. However, they still struggle with …

SeaEval for multilingual foundation models: From cross-lingual alignment to cultural reasoning

B Wang, Z Liu, X Huang, F Jiao, Y Ding, AT Aw… - arXiv preprint arXiv …, 2023 - arxiv.org
We present SeaEval, a benchmark for multilingual foundation models. In addition to
characterizing how these models understand and reason with natural language, we also …