A survey on evaluation of large language models

Y Chang, X Wang, J Wang, Y Wu, L Yang… - ACM Transactions on …, 2024 - dl.acm.org
Large language models (LLMs) are gaining increasing popularity in both academia and
industry, owing to their unprecedented performance in various applications. As LLMs …

Using large language models in psychology

D Demszky, D Yang, DS Yeager, CJ Bryan… - Nature Reviews …, 2023 - nature.com
Large language models (LLMs), such as OpenAI's GPT-4, Google's Bard or Meta's LLaMa,
have created unprecedented opportunities for analysing and generating language data on a …

A metaverse: Taxonomy, components, applications, and open challenges

SM Park, YG Kim - IEEE access, 2022 - ieeexplore.ieee.org
Unlike previous studies on the Metaverse based on Second Life, the current Metaverse is
based on the social value of Generation Z that online and offline selves are not different …

BLEURT: Learning robust metrics for text generation

T Sellam, D Das, AP Parikh - arXiv preprint arXiv:2004.04696, 2020 - arxiv.org
Text generation has made significant advances in the last few years. Yet, evaluation metrics
have lagged behind, as the most popular choices (e.g., BLEU and ROUGE) may correlate …

HotpotQA: A dataset for diverse, explainable multi-hop question answering

Z Yang, P Qi, S Zhang, Y Bengio, WW Cohen… - arXiv preprint arXiv …, 2018 - arxiv.org
Existing question answering (QA) datasets fail to train QA systems to perform complex
reasoning and provide explanations for answers. We introduce HotpotQA, a new dataset …

Towards a human-like open-domain chatbot

D Adiwardana, MT Luong, DR So, J Hall… - arXiv preprint arXiv …, 2020 - arxiv.org
We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and
filtered from public domain social media conversations. This 2.6B-parameter neural network …

ChatEval: Towards better LLM-based evaluators through multi-agent debate

CM Chan, W Chen, Y Su, J Yu, W Xue, S Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
Text evaluation has historically posed significant challenges, often demanding substantial
labor and time cost. With the emergence of large language models (LLMs), researchers …

All that's 'human' is not gold: Evaluating human evaluation of generated text

E Clark, T August, S Serrano, N Haduong… - arXiv preprint arXiv …, 2021 - arxiv.org
Human evaluations are typically considered the gold standard in natural language
generation, but as models' fluency improves, how well can evaluators detect and judge …

MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance

W Zhao, M Peyrard, F Liu, Y Gao, CM Meyer… - arXiv preprint arXiv …, 2019 - arxiv.org
A robust evaluation metric has a profound impact on the development of text generation
systems. A desirable metric compares system output against references based on their …

Advances and challenges in conversational recommender systems: A survey

C Gao, W Lei, X He, M de Rijke, TS Chua - AI open, 2021 - Elsevier
Recommender systems exploit interaction history to estimate user preference, having been
heavily used in a wide range of industry applications. However, static recommendation …