AI alignment: A comprehensive survey
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …
Foundational challenges in assuring alignment and safety of large language models
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …
The clock and the pizza: Two stories in mechanistic explanation of neural networks
Do neural networks, trained on well-understood algorithmic tasks, reliably rediscover known
algorithms? Several recent studies, on tasks ranging from group operations to in-context …
Assessing the brittleness of safety alignment via pruning and low-rank modifications
Large language models (LLMs) show inherent brittleness in their safety mechanisms, as
evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This …
ColD Fusion: Collaborative descent for distributed multitask finetuning
We propose a new paradigm to continually evolve pretrained models, denoted ColD Fusion.
It provides the benefits of multitask learning but leverages distributed computation with …
Proving linear mode connectivity of neural networks via optimal transport
The energy landscape of high-dimensional non-convex optimization problems is crucial to
understanding the effectiveness of modern deep neural network architectures. Recent works …
Deep model fusion: A survey
Deep model fusion/merging is an emerging technique that merges the parameters or
predictions of multiple deep learning models into a single one. It combines the abilities of …
When does bias transfer in transfer learning?
Using transfer learning to adapt a pre-trained "source model" to a downstream "target task"
can dramatically increase performance with seemingly no downside. In this work, we …
Quantification of uncertainty with adversarial models
Quantifying uncertainty is important for actionable predictions in real-world applications. A
crucial part of predictive uncertainty quantification is the estimation of epistemic uncertainty …
Eight methods to evaluate robust unlearning in LLMs
Machine unlearning can be useful for removing harmful capabilities and memorized text
from large language models (LLMs), but there are not yet standardized methods for …