AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, so do risks from misalignment. To provide a comprehensive …

Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …

Finding neurons in a haystack: Case studies with sparse probing

W Gurnee, N Nanda, M Pauly, K Harvey… - arXiv preprint arXiv …, 2023 - arxiv.org
Despite rapid adoption and deployment of large language models (LLMs), the internal
computations of these models remain opaque and poorly understood. In this work, we seek …

Multimodal neurons in artificial neural networks

G Goh, N Cammarata, C Voss, S Carter, M Petrov… - Distill, 2021 - distill.pub
Gabriel Goh: Research lead. Gabriel Goh first discovered multimodal neurons, sketched out
the project direction and paper outline, and did much of the conceptual and engineering …

Modality competition: What makes joint training of multi-modal network fail in deep learning? (Provably)

Y Huang, J Lin, C Zhou, H Yang… - … conference on machine …, 2022 - proceedings.mlr.press
Despite the remarkable success of deep multi-modal learning in practice, it has not been
well-explained in theory. Recently, it has been observed that the best uni-modal network …

Toward understanding the feature learning process of self-supervised contrastive learning

Z Wen, Y Li - International Conference on Machine Learning, 2021 - proceedings.mlr.press
We formally study how contrastive learning learns the feature representations for neural
networks by investigating its feature learning process. We consider the case where our data …

Distributional semantics and linguistic theory

G Boleda - Annual Review of Linguistics, 2020 - annualreviews.org
Distributional semantics provides multidimensional, graded, empirically induced word
representations that successfully capture many aspects of meaning in natural languages, as …

Learning gender-neutral word embeddings

J Zhao, Y Zhou, Z Li, W Wang, KW Chang - arXiv preprint arXiv …, 2018 - arxiv.org
Word embedding models have become a fundamental component in a wide range of
Natural Language Processing (NLP) applications. However, embeddings trained on human …

Reverse engineering self-supervised learning

I Ben-Shaul, R Shwartz-Ziv, T Galanti… - Advances in …, 2023 - proceedings.neurips.cc
Understanding the learned representation and underlying mechanisms of Self-Supervised
Learning (SSL) often poses a challenge. In this paper, we 'reverse engineer' SSL, conducting …

Feature purification: How adversarial training performs robust deep learning

Z Allen-Zhu, Y Li - 2021 IEEE 62nd Annual Symposium on …, 2022 - ieeexplore.ieee.org
Despite the empirical success of using adversarial training to defend deep learning models
against adversarial perturbations, so far, it still remains rather unclear what the principles are …