Aligning with human judgement: The role of pairwise preference in large language model evaluators
Large Language Models (LLMs) have demonstrated promising capabilities as automatic
evaluators in assessing the quality of generated natural language. However, LLMs still …
HelloBench: Evaluating long text generation capabilities of large language models
In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities
in various tasks (e.g., long-context understanding), and many benchmarks have been …
DebateQA: Evaluating question answering on debatable knowledge
The rise of large language models (LLMs) has enabled us to seek answers to inherently
debatable questions on LLM chatbots, necessitating a reliable way to evaluate their ability …
LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios
As Large Language Models (LLMs) continue to advance in natural language processing
(NLP), their ability to stably follow instructions in long-context inputs has become crucial for …
LCFO: Long context and long form output dataset and benchmarking
This paper presents the Long Context and Form Output (LCFO) benchmark, a novel
evaluation framework for assessing gradual summarization and summary expansion …
FormalAlign: Automated Alignment Evaluation for Autoformalization
Autoformalization aims to convert informal mathematical proofs into machine-verifiable
formats, bridging the gap between natural and formal languages. However, ensuring …
LongRAG: Evaluating Long-Context & Long-Form Retrieval-Augmented Generation with Key Point Recall
Retrieval-augmented generation (RAG) is a promising approach to address the limitations of
fixed knowledge in large language models (LLMs). However, current benchmarks for …
SurveyX: Academic Survey Automation via Large Language Models
X Liang, J Yang, Y Wang, C Tang, Z Zheng… - arXiv preprint arXiv …, 2025 - arxiv.org
Large Language Models (LLMs) have demonstrated exceptional comprehension
capabilities and a vast knowledge base, suggesting that LLMs can serve as efficient tools for …
A Cognitive Writing Perspective for Constrained Long-Form Text Generation
Like humans, Large Language Models (LLMs) struggle to generate high-quality long-form
text that adheres to strict requirements in a single pass. This challenge is unsurprising, as …
Ranking Over Scoring: Towards Reliable and Robust Automated Evaluation of LLM-Generated Medical Explanatory Arguments
I De la Iglesia, I Goenaga, J Ramirez-Romero… - arXiv preprint arXiv …, 2024 - arxiv.org
Evaluating LLM-generated text has become a key challenge, especially in domain-specific
contexts like the medical field. This work introduces a novel evaluation methodology for LLM …