From generation to judgment: Opportunities and challenges of LLM-as-a-judge

D Li, B Jiang, L Huang, A Beigi, C Zhao, Z Tan… - arxiv preprint arxiv …, 2024 - arxiv.org
Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …

A Survey on LLM-as-a-Judge

J Gu, X Jiang, Z Shi, H Tan, X Zhai, C Xu, W Li… - arxiv preprint arxiv …, 2024 - arxiv.org
Accurate and consistent evaluation is crucial for decision-making across numerous fields,
yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large …

LLMs-as-judges: a comprehensive survey on LLM-based evaluation methods

H Li, Q Dong, J Chen, H Su, Y Zhou, Q Ai, Z Ye… - arxiv preprint arxiv …, 2024 - arxiv.org
The rapid advancement of Large Language Models (LLMs) has driven their expanding
application across various fields. One of the most promising applications is their role as …

HammerBench: Fine-grained function-calling evaluation in real mobile device scenarios

J Wang, J Zhou, M Wen, X Mo, H Zhang, Q Lin… - arxiv preprint arxiv …, 2024 - arxiv.org
Evaluating the capabilities of large language models (LLMs) in human-LLM interactions
remains challenging due to the inherent complexity and openness of dialogue processes …

On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards

Z Zhao, AA Bangash, FR Côgo… - IEEE Transactions …, 2025 - ieeexplore.ieee.org
Foundation models (FM), such as large language models (LLMs), which are large-scale
machine learning (ML) models, have demonstrated remarkable adaptability in various …

Multi-Agent Collaboration Mechanisms: A Survey of LLMs

KT Tran, D Dao, MD Nguyen, QV Pham… - arxiv preprint arxiv …, 2025 - arxiv.org
With recent advances in Large Language Models (LLMs), Agentic AI has become
phenomenal in real-world applications, moving toward multiple LLM-based agents to …

PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides

H Zheng, X Guan, H Kong, J Zheng, H Lin, Y Lu… - arxiv preprint arxiv …, 2025 - arxiv.org
Automatically generating presentations from documents is a challenging task that requires
balancing content quality, visual design, and structural coherence. Existing methods …

Agent-as-Judge for Factual Summarization of Long Narratives

Y Jeong, M Kim, S Hwang, BH Kim - arxiv preprint arxiv:2501.09993, 2025 - arxiv.org
Large Language Models (LLMs) have demonstrated near-human performance in
summarization tasks based on traditional metrics such as ROUGE and BERTScore …

LLMs for Generation of Architectural Components: An Exploratory Empirical Study in the Serverless World

S Arun, M Tedla, K Vaidhyanathan - arxiv preprint arxiv:2502.02539, 2025 - arxiv.org
Recently, the exponential growth in capability and pervasiveness of Large Language
Models (LLMs) has led to significant work done in the field of code generation. However, this …

An Evaluation-Driven Approach to Designing LLM Agents: Process and Architecture

B Xia, Q Lu, L Zhu, Z Xing, D Zhao, H Zhang - arxiv preprint arxiv …, 2024 - arxiv.org
The advent of Large Language Models (LLMs) has enabled the development of LLM agents
capable of autonomously achieving under-specified goals and continuously evolving …