Reinforcement Learning Enhanced LLMs: A Survey

S Wang, S Zhang, J Zhang, R Hu, X Li, T Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper surveys research in the rapidly growing field of enhancing large language
models (LLMs) with reinforcement learning (RL), a technique that enables LLMs to improve …

Towards a unified view of preference learning for large language models: A survey

B Gao, F Song, Y Miao, Z Cai, Z Yang, L Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) exhibit remarkably powerful capabilities. One crucial factor
in this success is aligning the LLM's output with human preferences. This …

AceMath: Advancing frontier math reasoning with post-training and reward modeling

Z Liu, Y Chen, M Shoeybi, B Catanzaro… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce AceMath, a suite of frontier math models that excel in solving
complex math problems, along with highly effective reward models capable of evaluating …

Free process rewards without process labels

L Yuan, W Li, H Chen, G Cui, N Ding, K Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
Unlike its counterpart, the outcome reward model (ORM), which evaluates an entire
response, a process reward model (PRM) scores a reasoning trajectory step by step …
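
The ORM-versus-PRM contrast in this snippet can be made concrete with a short sketch. The scorer below is a hypothetical stand-in for a learned reward model (none of these names come from the paper): an ORM returns a single score for the whole response, while a PRM returns one score per reasoning step.

```python
# Minimal sketch of outcome vs. process reward scoring (illustrative only;
# score_fn is a toy stand-in for a learned reward model).
from typing import Callable, List

def outcome_reward(response: str, score_fn: Callable[[str], float]) -> float:
    """An ORM assigns one score to the entire response."""
    return score_fn(response)

def process_reward(steps: List[str], score_fn: Callable[[str], float]) -> List[float]:
    """A PRM assigns one score per step, here scoring each growing prefix of the trajectory."""
    return [score_fn("\n".join(steps[:i])) for i in range(1, len(steps) + 1)]

if __name__ == "__main__":
    # Toy scorer: rewards reaching the correct result; a real RM is a trained model.
    toy_score = lambda text: 1.0 if "x = 4" in text else 0.0
    steps = ["Let x be the unknown.", "2x + 1 = 9, so 2x = 8.", "Therefore x = 4."]
    print(outcome_reward("\n".join(steps), toy_score))  # one score for the whole answer
    print(process_reward(steps, toy_score))             # one score per step: [0.0, 0.0, 1.0]
```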

JuStRank: Benchmarking LLM Judges for System Ranking

A Gera, O Boni, Y Perlitz, R Bar-Haim, L Eden… - arXiv preprint arXiv …, 2024 - arxiv.org
Given the rapid progress of generative AI, there is a pressing need to systematically
compare and choose between the numerous models and configurations available. The …

An Overview and Discussion on Using Large Language Models for Implementation Generation of Solutions to Open-Ended Problems

H Shaik, A Doboli - arXiv preprint arXiv:2501.00562, 2024 - arxiv.org
Large Language Models offer new opportunities to devise automated implementation
generation methods that can tackle problem-solving activities beyond traditional methods …

Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs

L Dou, Q Liu, F Zhou, C Chen, Z Wang, Z **… - arXiv preprint arXiv …, 2025 - arxiv.org
Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA)
languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on …

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

Y Zang, X Dong, P Zhang, Y Cao, Z Liu, S Ding… - arXiv preprint arXiv …, 2025 - arxiv.org
Despite the promising performance of Large Vision Language Models (LVLMs) in visual
understanding, they occasionally generate incorrect outputs. While reward models (RMs) …

Less is More: Improving LLM Alignment via Preference Data Selection

X Deng, H Zhong, R Ai, F Feng, Z Wang… - arXiv preprint arXiv …, 2025 - arxiv.org
Direct Preference Optimization (DPO) has emerged as a promising approach for aligning
large language models with human preferences. While prior work mainly extends DPO from …
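
As background for the approach in this snippet, the sketch below shows the standard DPO objective (Rafailov et al., 2023) for a single preference pair; it illustrates only the underlying loss, not the preference data selection method proposed in this paper, and the log-probability inputs are assumed to be precomputed elsewhere.

```python
# Minimal sketch of the standard DPO loss for one (chosen, rejected) pair.
# Inputs are assumed to be precomputed log-probabilities under the policy and a frozen reference model.
import math

def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * margin), where the margin compares the policy-vs-reference
    log-ratios of the chosen and rejected responses."""
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Toy example: the policy favors the chosen response more than the reference does,
# so the margin is positive and the loss falls below log(2) ~= 0.693.
print(dpo_loss(policy_logp_chosen=-10.0, policy_logp_rejected=-14.0,
               ref_logp_chosen=-12.0, ref_logp_rejected=-13.0))
```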

Policy-to-Language: Train LLMs to Explain Decisions with Flow-Matching Generated Rewards

X Yang, L Zeng, H Dong, C Yu, X Wu, H Yang… - arXiv preprint arXiv …, 2025 - arxiv.org
As humans increasingly share environments with diverse agents powered by RL, LLMs, and
beyond, the ability to explain their policies in natural language will be vital for reliable …