From generation to judgment: Opportunities and challenges of LLM-as-a-judge

D Li, B Jiang, L Huang, A Beigi, C Zhao, Z Tan… - arXiv preprint arXiv …, 2024 - arxiv.org
Assessment and evaluation have long been critical challenges in artificial intelligence (AI)
and natural language processing (NLP). However, traditional methods, whether matching …

Large language models for data annotation and synthesis: A survey

Z Tan, D Li, S Wang, A Beigi, B Jiang… - arXiv preprint arXiv …, 2024 - arxiv.org
Data annotation and synthesis generally refer to the labeling or generating of raw data with
relevant information, which could be used for improving the efficacy of machine learning …

Can LLMs Learn from Previous Mistakes? Investigating LLMs' Errors to Boost for Reasoning

Y Tong, D Li, S Wang, Y Wang, F Teng… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent works have shown the benefits to LLMs of fine-tuning on gold-standard Chain-of-Thought
(CoT) rationales or using them as correct examples in few-shot prompting. While …

Weak-to-strong reasoning

Y Yang, Y Ma, P Liu - arXiv preprint arXiv:2407.13647, 2024 - arxiv.org
When large language models (LLMs) exceed human-level capabilities, it becomes
increasingly challenging to provide full-scale and accurate supervision for these models …

Language Model Preference Evaluation with Multiple Weak Evaluators

Z Hu, J Zhang, Z Xiong, A Ratner, H Xiong… - arXiv preprint arXiv …, 2024 - arxiv.org
Despite the remarkable success of Large Language Models (LLMs), evaluating their outputs'
quality regarding preference remains a critical challenge. Existing works usually leverage a …