AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, the potential large-scale risks associated with misaligned AI …

Foundational challenges in assuring alignment and safety of large language models

U Anwar, A Saparov, J Rando, D Paleka… - arXiv preprint arXiv …, 2024 - arxiv.org
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …

The clock and the pizza: Two stories in mechanistic explanation of neural networks

Z Zhong, Z Liu, M Tegmark… - Advances in Neural …, 2024 - proceedings.neurips.cc
Do neural networks, trained on well-understood algorithmic tasks, reliably rediscover known
algorithms? Several recent studies, on tasks ranging from group operations to in-context …

Assessing the brittleness of safety alignment via pruning and low-rank modifications

B Wei, K Huang, Y Huang, T Xie, X Qi, M Xia… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) show inherent brittleness in their safety mechanisms, as
evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This …

ColD Fusion: Collaborative descent for distributed multitask finetuning

S Don-Yehiya, E Venezian, C Raffel, N Slonim… - arXiv preprint arXiv …, 2022 - arxiv.org
We propose a new paradigm to continually evolve pretrained models, denoted ColD Fusion.
It provides the benefits of multitask learning but leverages distributed computation with …

Proving linear mode connectivity of neural networks via optimal transport

D Ferbach, B Goujaud, G Gidel… - International …, 2024 - proceedings.mlr.press
The energy landscape of high-dimensional non-convex optimization problems is crucial to
understanding the effectiveness of modern deep neural network architectures. Recent works …

Deep model fusion: A survey

W Li, Y Peng, M Zhang, L Ding, H Hu… - arXiv preprint arXiv …, 2023 - arxiv.org
Deep model fusion/merging is an emerging technique that merges the parameters or
predictions of multiple deep learning models into a single one. It combines the abilities of …

When does bias transfer in transfer learning?

H Salman, S Jain, A Ilyas, L Engstrom, E Wong… - arXiv preprint arXiv …, 2022 - arxiv.org
Using transfer learning to adapt a pre-trained "source model" to a downstream "target task"
can dramatically increase performance with seemingly no downside. In this work, we …

Quantification of uncertainty with adversarial models

K Schweighofer, L Aichberger… - Advances in …, 2023 - proceedings.neurips.cc
Quantifying uncertainty is important for actionable predictions in real-world applications. A
crucial part of predictive uncertainty quantification is the estimation of epistemic uncertainty …

Eight methods to evaluate robust unlearning in LLMs

A Lynch, P Guo, A Ewart, S Casper… - arXiv preprint arXiv …, 2024 - arxiv.org
Machine unlearning can be useful for removing harmful capabilities and memorized text
from large language models (LLMs), but there are not yet standardized methods for …