A survey on evaluation of large language models

Y Chang, X Wang, J Wang, Y Wu, L Yang… - ACM transactions on …, 2024 - dl.acm.org
Large language models (LLMs) are gaining increasing popularity in both academia and
industry, owing to their unprecedented performance in various applications. As LLMs …

Challenges and applications of large language models

J Kaddour, J Harris, M Mozes, H Bradley… - arxiv preprint arxiv …, 2023 - arxiv.org
Large Language Models (LLMs) went from non-existent to ubiquitous in the machine
learning discourse within a few years. Due to the fast pace of the field, it is difficult to identify …

Chateval: Towards better llm-based evaluators through multi-agent debate

CM Chan, W Chen, Y Su, J Yu, W Xue, S Zhang… - arxiv preprint arxiv …, 2023 - arxiv.org
Text evaluation has historically posed significant challenges, often demanding substantial
labor and time cost. With the emergence of large language models (LLMs), researchers …

Large language models are not fair evaluators

P Wang, L Li, L Chen, Z Cai, D Zhu, B Lin… - arxiv preprint arxiv …, 2023 - arxiv.org
In this paper, we uncover a systematic bias in the evaluation paradigm of adopting large
language models~(LLMs), eg, GPT-4, as a referee to score and compare the quality of …

Unleashing the potential of prompt engineering in large language models: a comprehensive review

B Chen, Z Zhang, N Langrené, S Zhu - arxiv preprint arxiv:2310.14735, 2023 - arxiv.org
This comprehensive review delves into the pivotal role of prompt engineering in unleashing
the capabilities of Large Language Models (LLMs). The development of Artificial Intelligence …

Aligning large language models with human: A survey

Y Wang, W Zhong, L Li, F Mi, X Zeng, W Huang… - arxiv preprint arxiv …, 2023 - arxiv.org
Large Language Models (LLMs) trained on extensive textual corpora have emerged as
leading solutions for a broad array of Natural Language Processing (NLP) tasks. Despite …

[HTML][HTML] Fine-tuning ChatGPT for automatic scoring

E Latif, X Zhai - Computers and Education: Artificial Intelligence, 2024 - Elsevier
This study highlights the potential of fine-tuned ChatGPT (GPT-3.5) for automatically scoring
student written constructed responses using example assessment tasks in science …

Large language model alignment: A survey

T Shen, R **, Y Huang, C Liu, W Dong, Z Guo… - arxiv preprint arxiv …, 2023 - arxiv.org
Recent years have witnessed remarkable progress made in large language models (LLMs).
Such advancements, while garnering significant attention, have concurrently elicited various …

Evaluating large language models at evaluating instruction following

Z Zeng, J Yu, T Gao, Y Meng, T Goyal… - arxiv preprint arxiv …, 2023 - arxiv.org
As research in large language models (LLMs) continues to accelerate, LLM-based
evaluation has emerged as a scalable and cost-effective alternative to human evaluations …

Prometheus: Inducing fine-grained evaluation capability in language models

S Kim, J Shin, Y Cho, J Jang, S Longpre… - The Twelfth …, 2023 - openreview.net
Recently, GPT-4 has become the de facto evaluator for long-form text generated by large
language models (LLMs). However, for practitioners and researchers with large and custom …