- Academic Search

A survey on evaluation of large language models

Y Chang, X Wang, J Wang, Y Wu, L Yang… - ACM transactions on …, 2024 - dl.acm.org

Large language models (LLMs) are gaining increasing popularity in both academia and
industry, owing to their unprecedented performance in various applications. As LLMs …

Cited by 2250; 8 versions.

Task me anything

J Zhang, W Huang, Z Ma, O Michel, D He… - arXiv preprint arXiv …, 2024 - arxiv.org

Benchmarks for large multimodal language models (MLMs) now serve to simultaneously
assess the general capabilities of models instead of evaluating for a specific capability. As a …

Cited by 42; 7 versions.

Mass-producing failures of multimodal systems with language models

S Tong, E Jones, J Steinhardt - Advances in Neural …, 2023 - proceedings.neurips.cc

Deployed multimodal models can fail in ways that evaluators did not anticipate. In order to
find these failures before deployment, we introduce MultiMon, a system that automatically …

Cited by 36; 5 versions.

Effective human-AI teams via learned natural language rules and onboarding

H Mozannar, J Lee, D Wei, P Sattigeri… - Advances in …, 2023 - proceedings.neurips.cc

People are relying on AI agents to assist them with various tasks. The human must know
when to rely on the agent, collaborate with the agent, or ignore its suggestions. In this work …

Cited by 17; 8 versions.

Dataset interfaces: Diagnosing model failures using controllable counterfactual generation

J Vendrow, S Jain, L Engstrom, A Madry - arXiv preprint arXiv:2302.07865, 2023 - arxiv.org

Distribution shift is a major source of failure for machine learning models. However,
evaluating model reliability under distribution shift can be challenging, especially since it …

Cited by 36; 3 versions.

DyVal: Dynamic evaluation of large language models for reasoning tasks

K Zhu, J Chen, J Wang, NZ Gong, D Yang… - arXiv preprint arXiv …, 2023 - arxiv.org

Large language models (LLMs) have achieved remarkable performance in various
evaluation benchmarks. However, concerns are raised about potential data contamination in …

Cited by 20; 4 versions.

Identification of systematic errors of image classifiers on rare subgroups

JH Metzen, R Hutmacher, NG Hua… - Proceedings of the …, 2023 - openaccess.thecvf.com

Despite excellent average-case performance of many image classifiers, their performance
can substantially deteriorate on semantically coherent subgroups of the data that were …

Cited by 19; 4 versions.

LLM as dataset analyst: Subpopulation structure discovery with large language model

Y Luo, R An, B Zou, Y Tang, J Liu, S Zhang - European Conference on …, 2024 - Springer

The distribution of subpopulations is an important property hidden within a dataset.
Uncovering and analyzing the subpopulation distribution within datasets provides a …

Cited by 5; 8 versions.

Dynamic evaluation of large language models by meta probing agents

K Zhu, J Wang, Q Zhao, R Xu, X Xie - arXiv preprint arXiv:2402.14865, 2024 - arxiv.org

Evaluation of large language models (LLMs) has raised great concerns in the community
due to the issue of data contamination. Existing work designed evaluation protocols using …

Cited by 9; 6 versions.

GENOME: generative neuro-symbolic visual reasoning by growing and reusing modules

Z Chen, R Sun, W Liu, Y Hong, C Gan - arXiv preprint arXiv:2311.04901, 2023 - arxiv.org

Recent works have shown that Large Language Models (LLMs) could empower traditional
neuro-symbolic models via programming capabilities to translate language into module …

Cited by 16; 4 versions.

Adaptive testing of computer vision models

A survey on evaluation of large language models
