From Generation to Judgment: Opportunities and Challenges of LLM-as-a-Judge
Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …
A Survey on LLM-as-a-Judge
Accurate and consistent evaluation is crucial for decision-making across numerous fields,
yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large …
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
The rapid advancement of Large Language Models (LLMs) has driven their expanding
application across various fields. One of the most promising applications is their role as …
HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios
J Wang, J Zhou, M Wen, X Mo, H Zhang, Q Lin… - arXiv preprint arXiv …, 2024 - arxiv.org
Evaluating the capabilities of large language models (LLMs) in human-LLM interactions
remains challenging due to the inherent complexity and openness of dialogue processes …
On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards
Foundation models (FM), such as large language models (LLMs), which are large-scale
machine learning (ML) models, have demonstrated remarkable adaptability in various …
Multi-Agent Collaboration Mechanisms: A Survey of LLMs
With recent advances in Large Language Models (LLMs), Agentic AI has become
phenomenal in real-world applications, moving toward multiple LLM-based agents to …
PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
Automatically generating presentations from documents is a challenging task that requires
balancing content quality, visual design, and structural coherence. Existing methods …
Agent-as-Judge for Factual Summarization of Long Narratives
Large Language Models (LLMs) have demonstrated near-human performance in
summarization tasks based on traditional metrics such as ROUGE and BERTScore …
LLMs for Generation of Architectural Components: An Exploratory Empirical Study in the Serverless World
Recently, the exponential growth in capability and pervasiveness of Large Language
Models (LLMs) has led to significant work done in the field of code generation. However, this …
An Evaluation-Driven Approach to Designing LLM Agents: Process and Architecture
The advent of Large Language Models (LLMs) has enabled the development of LLM agents
capable of autonomously achieving under-specified goals and continuously evolving …