Training language models to self-correct via reinforcement learning

A Kumar, V Zhuang, R Agarwal, Y Su… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-correction is a highly desirable capability of large language models (LLMs), yet it has
consistently been found to be largely ineffective in modern LLMs. Current methods for …

SimPO: Simple preference optimization with a reference-free reward

Y Meng, M Xia, D Chen - arXiv preprint arXiv:2405.14734, 2024 - arxiv.org
Direct Preference Optimization (DPO) is a widely used offline preference optimization
algorithm that reparameterizes reward functions in reinforcement learning from human …

Rewarding progress: Scaling automated process verifiers for LLM reasoning

A Setlur, C Nagpal, A Fisch, X Geng… - arXiv preprint arXiv …, 2024 - arxiv.org
A promising approach for improving reasoning in large language models is to use process
reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace …

Training large language models to reason in a continuous latent space

S Hao, S Sukhbaatar, DJ Su, X Li, Z Hu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) are restricted to reason in the "language space", where they
typically express the reasoning process with a chain-of-thought (CoT) to solve a complex …

Iterative reasoning preference optimization

RY Pang, W Yuan, K Cho, H He, S Sukhbaatar… - arXiv preprint arXiv …, 2024 - arxiv.org
Iterative preference optimization methods have recently been shown to perform well for
general instruction tuning tasks, but typically make little improvement on reasoning tasks …

Large language models assume people are more rational than we really are

R Liu, J Geng, JC Peterson, I Sucholutsky… - arXiv preprint arXiv …, 2024 - arxiv.org
In order for AI systems to communicate effectively with people, they must understand how we
make decisions. However, people's decisions are not always rational, so the implicit internal …

TreeBoN: Enhancing inference-time alignment with speculative tree-search and best-of-N sampling

J Qiu, Y Lu, Y Zeng, J Guo, J Geng, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Inference-time alignment enhances the performance of large language models without
requiring additional training or fine-tuning but presents challenges due to balancing …

Critic-CoT: Boosting the reasoning abilities of large language model via chain-of-thoughts critic

X Zheng, J Lou, B Cao, X Wen, Y Ji, H Lin, Y Lu… - arXiv preprint arXiv …, 2024 - arxiv.org
Self-critique has become a crucial mechanism for enhancing the reasoning performance of
LLMs. However, current approaches mainly involve basic prompts for intuitive instance-level …

From lists to emojis: How format bias affects model alignment

X Zhang, W Xiong, L Chen, T Zhou, H Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we study format biases in reinforcement learning from human feedback
(RLHF). We observe that many widely-used preference models, including human …

Scaling of Search and Learning: A Roadmap to Reproduce o1 from Reinforcement Learning Perspective

Z Zeng, Q Cheng, Z Yin, B Wang, S Li, Y Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
OpenAI o1 represents a significant milestone in Artificial Intelligence, achieving expert-
level performance on many challenging tasks that require strong reasoning ability. OpenAI …