Informed machine learning – a taxonomy and survey of integrating prior knowledge into learning systems
Despite its great success, machine learning can have its limits when dealing with insufficient
training data. A potential solution is the additional integration of prior knowledge into the …
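One integration route such taxonomies catalogue is encoding prior knowledge as an extra penalty term in the training loss. A minimal, illustrative sketch in Python (the monotonicity constraint and all names here are my own assumptions, not from the survey):

import torch

def informed_loss(model, x, y, lam=0.1):
    # Data-fit term: ordinary mean-squared error on the training set.
    data_loss = torch.mean((model(x) - y) ** 2)
    # Knowledge term: suppose prior knowledge says the target function
    # is non-decreasing; penalise local violations of monotonicity.
    eps = 1e-2
    violation = torch.relu(model(x) - model(x + eps))
    return data_loss + lam * violation.mean()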
Scalable agent alignment via reward modeling: a research direction
One obstacle to applying reinforcement learning algorithms to real-world problems is the
lack of suitable reward functions. Designing such reward functions is difficult in part because …
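The direction proposed here is to learn a reward model from user feedback and then train the agent against it. A common concrete instance is the pairwise Bradley-Terry objective; a minimal sketch (variable names are my own):

import torch.nn.functional as F

def reward_model_loss(r_preferred, r_rejected):
    # Maximise the Bradley-Terry probability sigma(r_w - r_l) that the
    # reward model ranks the human-preferred trajectory higher.
    return -F.logsigmoid(r_preferred - r_rejected).mean()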
Direct preference optimization: Your language model is secretly a reward model
While large-scale unsupervised language models (LMs) learn broad world knowledge and
some reasoning skills, achieving precise control of their behavior is difficult due to the …
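DPO sidesteps explicit reward modelling by optimising the policy directly on preference pairs, treating beta * log(pi_theta / pi_ref) as an implicit reward. A sketch of the DPO loss, assuming per-response summed log-probabilities for chosen (w) and rejected (l) completions:

import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    chosen = beta * (policy_logp_w - ref_logp_w)
    rejected = beta * (policy_logp_l - ref_logp_l)
    # Bradley-Terry loss on the implicit reward margin.
    return -F.logsigmoid(chosen - rejected).mean()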
Statistical rejection sampling improves preference optimization
Improving the alignment of language models with human preferences remains an active
research challenge. Previous approaches have primarily utilized Reinforcement Learning …
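RSO's central step is drawing preference data from an estimate of the optimal policy pi*(y|x) proportional to pi_sft(y|x) * exp(r(x, y) / beta), via rejection sampling over candidates from the SFT model. A simplified sketch (the max-reward envelope is a common simplification, not necessarily the paper's exact procedure):

import math
import random

def rejection_sample(candidates, rewards, beta=0.5):
    # Candidates are drawn from pi_sft; accept each with probability
    # exp((r - r_max) / beta), i.e. proportional to exp(r / beta).
    r_max = max(rewards)
    return [y for y, r in zip(candidates, rewards)
            if random.random() < math.exp((r - r_max) / beta)]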
Fine-tuning language models from human preferences
Reward learning enables the application of reinforcement learning (RL) to tasks where
reward is defined by human judgment, building a model of reward by asking humans …
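The fine-tuning recipe in this line of work optimises a learned reward with a KL penalty that keeps the policy close to the original language model rho, i.e. R(x, y) = r(x, y) - beta * log(pi(y|x) / rho(y|x)). A one-function sketch (names are my own):

def kl_shaped_reward(rm_score, policy_logp, ref_logp, beta=0.02):
    # Reward-model score minus a KL penalty toward the pretrained LM rho.
    return rm_score - beta * (policy_logp - ref_logp)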
KTO: Model alignment as prospect theoretic optimization
Kahneman & Tversky's prospect theory tells us that humans perceive random
variables in a biased but well-defined manner; for example, humans are famously loss …
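KTO scores single (unpaired) examples against a reference point, with separate weights for desirable and undesirable outputs to mirror loss aversion. A simplified sketch (z0 stands in for the paper's KL-based reference point; shapes, defaults, and names are my assumptions):

import torch

def kto_loss(policy_logp, ref_logp, desirable, z0, beta=0.1,
             lam_d=1.0, lam_u=1.0):
    r = policy_logp - ref_logp              # implicit reward log(pi/pi_ref)
    gain = torch.sigmoid(beta * (r - z0))   # value of a desirable output
    loss_ = torch.sigmoid(beta * (z0 - r))  # value of an undesirable output
    per_example = torch.where(desirable,
                              lam_d * (1 - gain),
                              lam_u * (1 - loss_))
    return per_example.mean()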
Deep reinforcement learning for sequence-to-sequence models
In recent times, sequence-to-sequence (seq2seq) models have gained a lot of popularity
and provide state-of-the-art performance in a wide variety of tasks, such as machine …
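The usual bridge from seq2seq training to RL is a REINFORCE-style policy gradient on a sequence-level reward such as BLEU or ROUGE. A minimal sketch, assuming a [batch, time] tensor of token log-probabilities for sampled outputs:

def policy_gradient_loss(seq_logps, reward, baseline):
    # Scale the summed log-probability of each sampled sequence by its
    # baseline-subtracted reward (REINFORCE with a variance-reducing baseline).
    advantage = reward - baseline
    return -(advantage * seq_logps.sum(dim=-1)).mean()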
Better rewards yield better summaries: Learning to summarise without references
Reinforcement Learning (RL) based document summarisation systems yield state-of-the-art
performance in terms of ROUGE scores, because they directly use ROUGE as the rewards …
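For concreteness, the ROUGE-as-reward setup this abstract pushes back against can be sketched with the rouge_score package (my choice of library and metric variant, not necessarily the paper's):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_reward(reference, summary):
    # Sequence-level reward: ROUGE-L F1 against the reference summary.
    return scorer.score(reference, summary)["rougeL"].fmeasure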
Smaug: Fixing failure modes of preference optimisation with dpo-positive
Direct Preference Optimisation (DPO) is effective at significantly improving the performance
of large language models (LLMs) on downstream tasks such as reasoning, summarisation …
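DPO-Positive targets the failure mode where DPO lowers the likelihood of the preferred completion along with the rejected one. A sketch of the DPOP idea: add a penalty inside the DPO margin that fires when the policy falls below the reference model on the chosen answer (hyperparameter values are my assumptions):

import torch
import torch.nn.functional as F

def dpop_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l,
              beta=0.1, lam=50.0):
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    # Penalty is positive only when the preferred answer is less likely
    # under the policy than under the reference model.
    penalty = torch.clamp(ref_logp_w - policy_logp_w, min=0.0)
    return -F.logsigmoid(beta * (margin - lam * penalty)).mean()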
Aligning language models with human preferences via a bayesian approach
In the quest to advance human-centric natural language generation (NLG) systems,
ensuring alignment between NLG models and human preferences is crucial. For this …
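One way a Bayesian treatment can absorb annotator disagreement is to keep a posterior over the preference probability rather than a hard label. An illustrative sketch only (a Beta-Bernoulli model of my own choosing, not the paper's exact method):

def preference_posterior_mean(n_prefer_a, n_prefer_b, alpha=1.0, beta_prior=1.0):
    # Beta(alpha, beta) prior over P(A beats B), updated with annotator
    # votes; the posterior mean serves as a soft training label.
    return (n_prefer_a + alpha) / (n_prefer_a + n_prefer_b + alpha + beta_prior)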