Agent-as-a-judge: Evaluate agents with agents
Contemporary evaluation techniques are inadequate for agentic systems. These
approaches either focus exclusively on final outcomes--ignoring the step-by-step nature of …
LLMs-as-judges: A comprehensive survey on LLM-based evaluation methods
The rapid advancement of Large Language Models (LLMs) has driven their expanding
application across various fields. One of the most promising applications is their role as …
Self-generated critiques boost reward modeling for language models
Reward modeling is crucial for aligning large language models (LLMs) with human
preferences, especially in reinforcement learning from human feedback (RLHF). However …
Can LLMs Replace Manual Annotation of Software Engineering Artifacts?
Experimental evaluations of software engineering innovations, e.g., tools and processes,
often include human-subject studies as a component of a multi-pronged strategy to obtain …
Large Language Models, scientific knowledge and factuality: A framework to streamline human expert evaluation
Objective: The paper introduces a framework for the evaluation of the encoding of factual
scientific knowledge, designed to streamline the manual evaluation process typically …
MEDIC: Towards a comprehensive framework for evaluating LLMs in clinical applications
The rapid development of Large Language Models (LLMs) for healthcare applications has
spurred calls for holistic evaluation beyond frequently-cited benchmarks like USMLE, to …
Personalization of large language models: A survey
Personalization of Large Language Models (LLMs) has recently become increasingly
important with a wide range of applications. Despite the importance and recent progress …
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
High quality annotations are increasingly a bottleneck in the explosively growing machine
learning ecosystem. Scalable evaluation methods that avoid costly annotation have …
Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models
This whitepaper offers an overview of the ethical considerations surrounding research into
or with large language models (LLMs). As LLMs become more integrated into widely used …
Cognitive overload attack: Prompt injection for long context
Large Language Models (LLMs) have demonstrated remarkable capabilities in performing
tasks across various domains without needing explicit retraining. This capability, known as …