Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

Are Large Language Models Consistent over Value-laden Questions?

J Moore, T Deshpande, D Yang - arXiv preprint arXiv:2407.02996, 2024 - arxiv.org
Large language models (LLMs) appear to bias their survey answers toward certain values.
Nonetheless, some argue that LLMs are too inconsistent to simulate particular values. Are …

Steering without side effects: Improving post-deployment control of language models

AC Stickland, A Lyzhov, J Pfau, S Mahdi… - arXiv preprint arXiv …, 2024 - arxiv.org
Language models (LMs) have been shown to behave unexpectedly post-deployment. For
example, new jailbreaks continually arise, allowing model misuse, despite extensive red …

CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges

H Li, J Chen, Q Ai, Z Chu, Y Zhou, Q Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
The use of large language models (LLMs) as automated evaluation tools to assess the
quality of generated natural language, known as LLMs-as-Judges, has demonstrated …

Looking Inward: Language Models Can Learn About Themselves by Introspection

FJ Binder, J Chua, T Korbak, H Sleight… - arXiv preprint arXiv …, 2024 - arxiv.org
Humans acquire knowledge by observing the external world, but also by introspection.
Introspection gives a person privileged access to their current state of mind (e.g., thoughts …

Inference-Time-Compute: More Faithful? A Research Note

J Chua, O Evans - arXiv preprint arXiv:2501.08156, 2025 - arxiv.org
Models trained specifically to generate long Chains of Thought (CoTs) have recently
achieved impressive results. We refer to these models as Inference-Time-Compute (ITC) …

Reasoning Beyond Bias: A Study on Counterfactual Prompting and Chain of Thought Reasoning

K Moore, J Roberts, T Pham, D Fisher - arXiv preprint arXiv:2408.08651, 2024 - arxiv.org
Language models are known to absorb biases from their training data, leading to predictions
driven by statistical regularities rather than semantic relevance. We investigate the impact of …

A philosophical inquiry on the effect of reasoning in AI models for bias and fairness

A Kapoor - philarchive.org
Advances in Artificial Intelligence have brought about an evolution in how reasoning is
developed for modern AI models. I show how the process of human reinforcement has …