LLMs-as-judges: a comprehensive survey on LLM-based evaluation methods

H Li, Q Dong, J Chen, H Su, Y Zhou, Q Ai, Z Ye… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of Large Language Models (LLMs) has driven their expanding
application across various fields. One of the most promising applications is their role as …

REEF: Representation encoding fingerprints for large language models

J Zhang, D Liu, C Qian, L Zhang, Y Liu, Y Qiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Protecting the intellectual property of open-source Large Language Models (LLMs) is very
important, because training LLMs costs extensive computational resources and data …

Synthesizing post-training data for LLMs through multi-agent simulation

S Tang, X Pang, Z Liu, B Tang, R Ye, X Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
Post-training is essential for enabling large language models (LLMs) to follow human
instructions. Inspired by the recent success of using LLMs to simulate human society, we …

Align anything: Training all-modality models to follow instructions with language feedback

J Ji, J Zhou, H Lou, B Chen, D Hong, X Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Reinforcement learning from human feedback (RLHF) has proven effective in enhancing the
instruction-following capabilities of large language models; however, it remains …

VLSBench: Unveiling visual leakage in multimodal safety

X Hu, D Liu, H Li, X Huang, J Shao - arXiv preprint arXiv:2411.19939, 2024 - arxiv.org
Safety concerns of multimodal large language models (MLLMs) have gradually become an
important problem in various applications. Surprisingly, previous works indicate a counter …

Position: LLM unlearning benchmarks are weak measures of progress

P Thaker, S Hu, N Kale, Y Maurya, ZS Wu… - arXiv preprint arXiv …, 2024 - arxiv.org
Unlearning methods have the potential to improve the privacy and safety of large language
models (LLMs) by removing sensitive or harmful information post hoc. The LLM unlearning …

Beyond Scalar Reward Model: Learning Generative Judge from Preference Data

Z Ye, X Li, Q Li, Q Ai, Y Zhou, W Shen, D Yan… - arXiv preprint arXiv …, 2024 - arxiv.org
Learning from preference feedback is a common practice for aligning large language
models (LLMs) with human values. Conventionally, preference data is learned and encoded …

Course-correction: Safety alignment using synthetic preferences

R Xu, Y Cai, Z Zhou, R Gu, H Weng, Y Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
The risk of harmful content generated by large language models (LLMs) has become a critical
concern. This paper presents a systematic study on assessing and improving LLMs' …

Aligner: Efficient alignment by learning to correct

J Ji, B Chen, H Lou, D Hong, B Zhang… - arXiv preprint arXiv …, 2024 - beta.ai-plans.com
With the rapid development of large language models (LLMs) and ever-evolving practical
requirements, finding an efficient and effective alignment method has never been more …

Targeted manipulation and deception emerge when optimizing LLMs for user feedback

M Williams, M Carroll, A Narang, C Weisser… - arXiv preprint arXiv …, 2024 - arxiv.org
As LLMs become more widely deployed, there is increasing interest in directly optimizing for
feedback from end users (e.g., thumbs up) in addition to feedback from paid annotators …