Agent-as-a-judge: Evaluate agents with agents

M Zhuge, C Zhao, D Ashley, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Contemporary evaluation techniques are inadequate for agentic systems. These
approaches either focus exclusively on final outcomes--ignoring the step-by-step nature of …

LLMs-as-judges: a comprehensive survey on LLM-based evaluation methods

H Li, Q Dong, J Chen, H Su, Y Zhou, Q Ai, Z Ye… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of Large Language Models (LLMs) has driven their expanding
application across various fields. One of the most promising applications is their role as …

Self-generated critiques boost reward modeling for language models

Y Yu, Z Chen, A Zhang, L Tan, C Zhu, RY Pang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reward modeling is crucial for aligning large language models (LLMs) with human
preferences, especially in reinforcement learning from human feedback (RLHF). However …

Can LLMs Replace Manual Annotation of Software Engineering Artifacts?

T Ahmed, P Devanbu, C Treude, M Pradel - arXiv preprint arXiv …, 2024 - arxiv.org
Experimental evaluations of software engineering innovations, e.g., tools and processes,
often include human-subject studies as a component of a multi-pronged strategy to obtain …

Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation

M Wysocka, O Wysocki, M Delmas, V Mutel… - Journal of Biomedical …, 2024 - Elsevier
Objective: The paper introduces a framework for the evaluation of the encoding of factual
scientific knowledge, designed to streamline the manual evaluation process typically …

MEDIC: Towards a comprehensive framework for evaluating LLMs in clinical applications

PK Kanithi, C Christophe, MAF Pimentel… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of Large Language Models (LLMs) for healthcare applications has
spurred calls for holistic evaluation beyond frequently-cited benchmarks like USMLE, to …

Personalization of large language models: A survey

Z Zhang, RA Rossi, B Kveton, Y Shao, D Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Personalization of Large Language Models (LLMs) has recently become increasingly
important with a wide range of applications. Despite the importance and recent progress …

Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data

FE Dorner, VY Nastl, M Hardt - arXiv preprint arXiv:2410.13341, 2024 - arxiv.org
High quality annotations are increasingly a bottleneck in the explosively growing machine
learning ecosystem. Scalable evaluation methods that avoid costly annotation have …

Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models

EL Ungless, N Vitsakis, Z Talat, J Garforth… - arXiv preprint arXiv …, 2024 - arxiv.org
This whitepaper offers an overview of the ethical considerations surrounding research into
or with large language models (LLMs). As LLMs become more integrated into widely used …

Cognitive overload attack: Prompt injection for long context

B Upadhayay, V Behzadan, A Karbasi - arXiv preprint arXiv:2410.11272, 2024 - arxiv.org
Large Language Models (LLMs) have demonstrated remarkable capabilities in performing
tasks across various domains without needing explicit retraining. This capability, known as …