A survey on evaluation of large language models

Y Chang, X Wang, J Wang, Y Wu, L Yang… - ACM Transactions on …, 2024 - dl.acm.org
Large language models (LLMs) are gaining increasing popularity in both academia and
industry, owing to their unprecedented performance in various applications. As LLMs …

Survey on factuality in large language models: Knowledge, retrieval and domain-specificity

C Wang, X Liu, Y Yue, X Tang, T Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
This survey addresses the crucial issue of factuality in Large Language Models (LLMs). As
LLMs find applications across diverse domains, the reliability and accuracy of their outputs …

TrustLLM: Trustworthiness in large language models

L Sun, Y Huang, H Wang, S Wu, Q Zhang… - arXiv preprint arXiv …, 2024 - mosis.eecs.utk.edu
Large language models (LLMs), exemplified by ChatGPT, have gained considerable
attention for their excellent natural language processing capabilities. Nonetheless, these …

PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization

Y Wang, Z Yu, Z Zeng, L Yang, C Wang, H Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
Instruction tuning large language models (LLMs) remains a challenging task, owing to the
complexity of hyperparameter selection and the difficulty involved in evaluating the tuned …

Position: TrustLLM: Trustworthiness in large language models

Y Huang, L Sun, H Wang, S Wu… - International …, 2024 - proceedings.mlr.press
Large language models (LLMs) have gained considerable attention for their excellent
natural language processing capabilities. Nonetheless, these LLMs present many …

Does fine-tuning LLMs on new knowledge encourage hallucinations?

Z Gekhman, G Yona, R Aharoni, M Eyal… - arXiv preprint arXiv …, 2024 - arxiv.org
When large language models are aligned via supervised fine-tuning, they may encounter
new factual information that was not acquired through pre-training. It is often conjectured that …

Investigating the factual knowledge boundary of large language models with retrieval augmentation

R Ren, Y Wang, Y Qu, WX Zhao, J Liu, H Tian… - arXiv preprint arXiv …, 2023 - arxiv.org
Knowledge-intensive tasks (e.g., open-domain question answering (QA)) require a
substantial amount of factual knowledge and often rely on external information for …

Leave no document behind: Benchmarking long-context LLMs with extended multi-doc QA

M Wang, L Chen, F Cheng, S Liao… - Proceedings of the …, 2024 - aclanthology.org
Long-context modeling capabilities of Large Language Models (LLMs) have garnered
widespread attention, leading to the emergence of LLMs with ultra-context windows …

Beyond prompt brittleness: Evaluating the reliability and consistency of political worldviews in LLMs

T Ceron, N Falk, A Barić, D Nikolaev… - Transactions of the …, 2024 - direct.mit.edu
Due to the widespread use of large language models (LLMs), we need to understand
whether they embed a specific “worldview” and what these views reflect. Recent studies …

Unveiling the clinical incapabilities: a benchmarking study of GPT-4V(ision) for ophthalmic multimodal image analysis

P Xu, X Chen, Z Zhao, D Shi - British Journal of Ophthalmology, 2024 - bjo.bmj.com
Purpose To evaluate the capabilities and incapabilities of a GPT-4V(ision)-based chatbot in
interpreting ocular multimodal images. Methods We developed a digital ophthalmologist app …