A survey on evaluation of large language models
Large language models (LLMs) are gaining increasing popularity in both academia and
industry, owing to their unprecedented performance in various applications. As LLMs …
Task me anything
Benchmarks for large multimodal language models (MLMs) now serve to simultaneously
assess the general capabilities of models instead of evaluating for a specific capability. As a …
Mass-producing failures of multimodal systems with language models
Deployed multimodal models can fail in ways that evaluators did not anticipate. In order to
find these failures before deployment, we introduce MultiMon, a system that automatically …
Effective human-AI teams via learned natural language rules and onboarding
People are relying on AI agents to assist them with various tasks. The human must know
when to rely on the agent, collaborate with the agent, or ignore its suggestions. In this work …
Dataset interfaces: Diagnosing model failures using controllable counterfactual generation
Distribution shift is a major source of failure for machine learning models. However,
evaluating model reliability under distribution shift can be challenging, especially since it …
DyVal: Dynamic evaluation of large language models for reasoning tasks
Large language models (LLMs) have achieved remarkable performance in various
evaluation benchmarks. However, concerns are raised about potential data contamination in …
Identification of systematic errors of image classifiers on rare subgroups
Despite excellent average-case performance of many image classifiers, their performance
can substantially deteriorate on semantically coherent subgroups of the data that were …
LLM as dataset analyst: Subpopulation structure discovery with large language model
The distribution of subpopulations is an important property hidden within a dataset.
Uncovering and analyzing the subpopulation distribution within datasets provides a …
Dynamic evaluation of large language models by meta probing agents
Evaluation of large language models (LLMs) has raised great concerns in the community
due to the issue of data contamination. Existing work designed evaluation protocols using …
GENOME: Generative neuro-symbolic visual reasoning by growing and reusing modules
Recent works have shown that Large Language Models (LLMs) could empower traditional
neuro-symbolic models via programming capabilities to translate language into module …