Informed machine learning–a taxonomy and survey of integrating prior knowledge into learning systems

L Von Rueden, S Mayer, K Beckh… - … on Knowledge and …, 2021 - ieeexplore.ieee.org
Despite its great success, machine learning can have its limits when dealing with insufficient
training data. A potential solution is the additional integration of prior knowledge into the …

Scalable agent alignment via reward modeling: a research direction

J Leike, D Krueger, T Everitt, M Martic, V Maini… - arXiv preprint arXiv …, 2018 - arxiv.org
One obstacle to applying reinforcement learning algorithms to real-world problems is the
lack of suitable reward functions. Designing such reward functions is difficult in part because …
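
A common concrete form of the reward modeling discussed in this line of work is a pairwise comparison loss fit to human judgments. The sketch below is illustrative only (reward_model, preferred, and rejected are placeholder names, not from the paper), assuming a Bradley-Terry style objective in PyTorch:

    import torch.nn.functional as F

    def pairwise_reward_loss(reward_model, preferred, rejected):
        # reward_model maps a batch of trajectories/responses to scalar scores
        r_pref = reward_model(preferred)   # shape: (batch,)
        r_rej = reward_model(rejected)     # shape: (batch,)
        # Bradley-Terry objective: maximize the log-probability that the
        # human-preferred item receives the higher reward
        return -F.logsigmoid(r_pref - r_rej).mean()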

Direct preference optimization: Your language model is secretly a reward model

R Rafailov, A Sharma, E Mitchell… - Advances in …, 2023 - proceedings.neurips.cc
While large-scale unsupervised language models (LMs) learn broad world knowledge and
some reasoning skills, achieving precise control of their behavior is difficult due to the …
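
The core of DPO is a classification loss over preference pairs in which the implicit reward is the scaled log-ratio between the policy and a frozen reference model. A minimal PyTorch-style sketch of that objective (argument names and the beta value are illustrative, not the paper's code):

    import torch.nn.functional as F

    def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
        # Sequence log-probabilities under the trained policy and the frozen reference
        chosen_ratio = logp_chosen - ref_logp_chosen
        rejected_ratio = logp_rejected - ref_logp_rejected
        # Classify which response was preferred using the implicit reward beta * log-ratio
        return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()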

Statistical rejection sampling improves preference optimization

T Liu, Y Zhao, R Joshi, M Khalman, M Saleh… - arXiv preprint arXiv …, 2023 - arxiv.org
Improving the alignment of language models with human preferences remains an active
research challenge. Previous approaches have primarily utilized Reinforcement Learning …
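
One way to read the title: candidate responses sampled from a supervised policy can be filtered by rejection sampling so that the accepted set approximates draws from a reward-tilted target policy, and those samples then serve as preference data. The sketch below is a generic rejection-sampling filter under that reading, not the paper's exact procedure; beta and num_accept are illustrative:

    import math, random

    def rejection_sample(candidates, rewards, beta=0.5, num_accept=8):
        # Accept a candidate with probability exp((r - r_max) / beta), i.e.
        # proportionally to exp(r / beta), approximating draws from a policy
        # tilted toward higher reward
        r_max = max(rewards)
        accepted = []
        for y, r in zip(candidates, rewards):
            if random.random() < math.exp((r - r_max) / beta):
                accepted.append(y)
            if len(accepted) >= num_accept:
                break
        return accepted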

Fine-tuning language models from human preferences

DM Ziegler, N Stiennon, J Wu, TB Brown… - arXiv preprint arXiv …, 2019 - arxiv.org
Reward learning enables the application of reinforcement learning (RL) to tasks where
reward is defined by human judgment, building a model of reward by asking humans …
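
In this line of work the RL fine-tuning stage typically optimizes the learned reward minus a KL penalty toward the pretrained model, so the policy does not drift into exploiting the reward model. A hedged sketch of that shaped reward (the kl_coef value and argument names are illustrative):

    def kl_shaped_reward(task_reward, policy_logp, ref_logp, kl_coef=0.02):
        # Reward handed to the RL optimizer: learned reward minus a penalty on
        # the per-sample log-ratio, which keeps the tuned policy close to the
        # pretrained reference model
        return task_reward - kl_coef * (policy_logp - ref_logp)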

KTO: Model alignment as prospect theoretic optimization

K Ethayarajh, W Xu, N Muennighoff, D Jurafsky… - arXiv preprint arXiv …, 2024 - arxiv.org
Kahneman & Tversky's $\textit {prospect theory} $ tells us that humans perceive random
variables in a biased but well-defined manner; for example, humans are famously loss …
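
KTO replaces paired comparisons with per-example desirable/undesirable labels and a prospect-theory-style value function measured against a reference point. The sketch below is only an approximation of that idea in PyTorch; the exact reference-point estimate and loss-aversion weights should be taken from the paper, and all names here are placeholders:

    import torch

    def kto_style_loss(policy_logp, ref_logp, is_desirable, beta=0.1,
                       lambda_d=1.0, lambda_u=1.0):
        # Implicit reward, as in DPO-style methods
        r = beta * (policy_logp - ref_logp)
        # Reference point: a batch-level estimate of the policy/reference divergence
        z0 = beta * torch.clamp((policy_logp - ref_logp).mean(), min=0.0).detach()
        # Gains and losses are valued asymmetrically relative to the reference point
        loss_desirable = lambda_d * torch.sigmoid(z0 - r)
        loss_undesirable = lambda_u * torch.sigmoid(r - z0)
        # is_desirable is a boolean tensor marking which examples are labeled desirable
        return torch.where(is_desirable, loss_desirable, loss_undesirable).mean()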

Deep reinforcement learning for sequence-to-sequence models

Y Keneshloo, T Shi, N Ramakrishnan… - IEEE Transactions on …, 2019 - ieeexplore.ieee.org
In recent times, sequence-to-sequence (seq2seq) models have gained a lot of popularity
and provide state-of-the-art performance in a wide variety of tasks, such as machine …
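
A representative training objective in this survey area is REINFORCE with a baseline (e.g. self-critical sequence training), where a sequence-level reward replaces token-level likelihood. A minimal sketch under that assumption, with placeholder tensor arguments, and not tied to any one model in the survey:

    def self_critical_loss(sample_logp, sample_reward, baseline_reward):
        # REINFORCE with a baseline: sampled sequences that beat the baseline
        # (e.g. the reward of the greedy-decoded sequence) get their
        # log-probability pushed up, those that fall below get pushed down
        advantage = (sample_reward - baseline_reward).detach()
        return -(advantage * sample_logp).mean()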

Better rewards yield better summaries: Learning to summarise without references

F Böhm, Y Gao, CM Meyer, O Shapira, I Dagan… - arXiv preprint arXiv …, 2019 - arxiv.org
Reinforcement Learning (RL) based document summarisation systems yield state-of-the-art
performance in terms of ROUGE scores, because they directly use ROUGE as the rewards …
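
The entry contrasts ROUGE-as-reward with rewards learned from human judgments of summary quality. As a purely illustrative sketch (regression on human ratings is one option; the paper's actual reward-learning objective may differ), such a reward could be fit like this:

    import torch.nn.functional as F

    def learned_summary_reward_loss(reward_model, summaries, human_ratings):
        # Fit a scalar reward to human quality ratings so that RL training can
        # optimize it instead of ROUGE overlap with reference summaries
        predicted = reward_model(summaries)   # shape: (batch,)
        return F.mse_loss(predicted, human_ratings)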

Smaug: Fixing failure modes of preference optimisation with DPO-Positive

A Pal, D Karkhanis, S Dooley, M Roberts… - arXiv preprint arXiv …, 2024 - arxiv.org
Direct Preference Optimisation (DPO) is effective at significantly improving the performance
of large language models (LLMs) on downstream tasks such as reasoning, summarisation …
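
The failure mode targeted here is that standard DPO can lower the likelihood of the preferred completion as long as the rejected one drops faster. A hedged sketch of a DPO-Positive-style correction, adding a penalty that fires whenever the chosen response becomes less likely than under the reference model (argument names and the lam value are illustrative, not the paper's code):

    import torch
    import torch.nn.functional as F

    def dpop_style_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
                        beta=0.1, lam=50.0):
        chosen_ratio = logp_chosen - ref_logp_chosen
        rejected_ratio = logp_rejected - ref_logp_rejected
        # Penalty is positive only when the policy assigns the chosen response
        # less probability than the reference model does
        penalty = torch.clamp(ref_logp_chosen - logp_chosen, min=0.0)
        return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio - lam * penalty)).mean()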

Aligning language models with human preferences via a Bayesian approach

J Wang, H Wang, S Sun, W Li - Advances in Neural …, 2024 - proceedings.neurips.cc
In the quest to advance human-centric natural language generation (NLG) systems,
ensuring alignment between NLG models and human preferences is crucial. For this …