Foundational challenges in assuring alignment and safety of large language models
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …
Are Large Language Models Consistent over Value-laden Questions?
Large language models (LLMs) appear to bias their survey answers toward certain values.
Nonetheless, some argue that LLMs are too inconsistent to simulate particular values. Are …
Steering without side effects: Improving post-deployment control of language models
Language models (LMs) have been shown to behave unexpectedly post-deployment. For
example, new jailbreaks continually arise, allowing model misuse, despite extensive red …
CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges
The use of large language models (LLMs) as automated evaluation tools to assess the
quality of generated natural language, known as LLMs-as-Judges, has demonstrated …
Looking Inward: Language Models Can Learn About Themselves by Introspection
Humans acquire knowledge by observing the external world, but also by introspection.
Introspection gives a person privileged access to their current state of mind (e.g., thoughts …
Inference-Time-Compute: More Faithful? A Research Note
Models trained specifically to generate long Chains of Thought (CoTs) have recently
achieved impressive results. We refer to these models as Inference-Time-Compute (ITC) …
Reasoning Beyond Bias: A Study on Counterfactual Prompting and Chain of Thought Reasoning
Language models are known to absorb biases from their training data, leading to predictions
driven by statistical regularities rather than semantic relevance. We investigate the impact of …
A philosophical inquiry on the effect of reasoning in AI models for bias and fairness
A Kapoor - philarchive.org
Advances in Artificial Intelligence have brought about an evolution of how reasoning has
been developed for modern AI models. I show how the process of human reinforcement has …