A survey on evaluation of large language models

Y Chang, X Wang, J Wang, Y Wu, L Yang… - ACM transactions on …, 2024 - dl.acm.org
Large language models (LLMs) are gaining increasing popularity in both academia and
industry, owing to their unprecedented performance in various applications. As LLMs …

Task me anything

J Zhang, W Huang, Z Ma, O Michel, D He… - arXiv preprint arXiv …, 2024 - arxiv.org
Benchmarks for large multimodal language models (MLMs) now serve to simultaneously
assess the general capabilities of models instead of evaluating for a specific capability. As a …

Mass-producing failures of multimodal systems with language models

S Tong, E Jones, J Steinhardt - Advances in Neural …, 2023 - proceedings.neurips.cc
Deployed multimodal models can fail in ways that evaluators did not anticipate. In order to
find these failures before deployment, we introduce MultiMon, a system that automatically …

Effective human-AI teams via learned natural language rules and onboarding

H Mozannar, J Lee, D Wei, P Sattigeri… - Advances in …, 2023 - proceedings.neurips.cc
People are relying on AI agents to assist them with various tasks. The human must know
when to rely on the agent, collaborate with the agent, or ignore its suggestions. In this work …

Dataset interfaces: Diagnosing model failures using controllable counterfactual generation

J Vendrow, S Jain, L Engstrom, A Madry - arXiv preprint arXiv:2302.07865, 2023 - arxiv.org
Distribution shift is a major source of failure for machine learning models. However,
evaluating model reliability under distribution shift can be challenging, especially since it …

DyVal: Dynamic evaluation of large language models for reasoning tasks

K Zhu, J Chen, J Wang, NZ Gong, D Yang… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have achieved remarkable performance in various
evaluation benchmarks. However, concerns are raised about potential data contamination in …

Identification of systematic errors of image classifiers on rare subgroups

JH Metzen, R Hutmacher, NG Hua… - Proceedings of the …, 2023 - openaccess.thecvf.com
Despite excellent average-case performance of many image classifiers, their performance
can substantially deteriorate on semantically coherent subgroups of the data that were …

LLM as dataset analyst: Subpopulation structure discovery with large language model

Y Luo, R An, B Zou, Y Tang, J Liu, S Zhang - European Conference on …, 2024 - Springer
The distribution of subpopulations is an important property hidden within a dataset.
Uncovering and analyzing the subpopulation distribution within datasets provides a …

Dynamic evaluation of large language models by meta probing agents

K Zhu, J Wang, Q Zhao, R Xu, X Xie - arXiv preprint arXiv:2402.14865, 2024 - arxiv.org
Evaluation of large language models (LLMs) has raised great concerns in the community
due to the issue of data contamination. Existing work designed evaluation protocols using …

GENOME: generative neuro-symbolic visual reasoning by growing and reusing modules

Z Chen, R Sun, W Liu, Y Hong, C Gan - arXiv preprint arXiv:2311.04901, 2023 - arxiv.org
Recent works have shown that Large Language Models (LLMs) could empower traditional
neuro-symbolic models via programming capabilities to translate language into module …