AI alignment: A comprehensive survey

J Ji, T Qiu, B Chen, B Zhang, H Lou, K Wang… - arXiv preprint arXiv …, 2023 - arxiv.org
AI alignment aims to make AI systems behave in line with human intentions and values. As
AI systems grow more capable, so do risks from misalignment. To provide a comprehensive …

Probing classifiers: Promises, shortcomings, and advances

Y Belinkov - Computational Linguistics, 2022 - direct.mit.edu
Probing classifiers have emerged as one of the prominent methodologies for interpreting
and analyzing deep neural network models of natural language processing. The basic idea …

How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model

M Hanna, O Liu, A Variengien - Advances in Neural …, 2023 - proceedings.neurips.cc
Pre-trained language models can be surprisingly adept at tasks they were not explicitly
trained on, but how they implement these capabilities is poorly understood. In this paper, we …

Language models represent space and time

W Gurnee, M Tegmark - arXiv preprint arXiv:2310.02207, 2023 - arxiv.org
The capabilities of large language models (LLMs) have sparked debate over whether such
systems just learn an enormous collection of superficial statistics or a set of more coherent …

Black-box access is insufficient for rigorous AI audits

S Casper, C Ezell, C Siegmann, N Kolt… - Proceedings of the …, 2024 - dl.acm.org
External audits of AI systems are increasingly recognized as a key mechanism for AI
governance. The effectiveness of an audit, however, depends on the degree of access …

Toward transparent AI: A survey on interpreting the inner structures of deep neural networks

T Räuker, A Ho, S Casper… - 2023 IEEE Conference …, 2023 - ieeexplore.ieee.org
The last decade of machine learning has seen drastic increases in scale and capabilities.
Deep neural networks (DNNs) are increasingly being deployed in the real world. However …

Finding neurons in a haystack: Case studies with sparse probing

W Gurnee, N Nanda, M Pauly, K Harvey… - arXiv preprint arXiv …, 2023 - arxiv.org
Despite rapid adoption and deployment of large language models (LLMs), the internal
computations of these models remain opaque and poorly understood. In this work, we seek …

Towards faithful model explanation in NLP: A survey

Q Lyu, M Apidianaki, C Callison-Burch - Computational Linguistics, 2024 - direct.mit.edu
End-to-end neural Natural Language Processing (NLP) models are notoriously difficult to
understand. This has given rise to numerous efforts towards model explainability in recent …

Amnesic probing: Behavioral explanation with amnesic counterfactuals

Y Elazar, S Ravfogel, A Jacovi… - Transactions of the …, 2021 - direct.mit.edu
A growing body of work makes use of probing in order to investigate the working of neural
models, often considered black boxes. Recently, an ongoing debate emerged surrounding …

What do self-supervised speech models know about words?

A Pasad, CM Chien, S Settle, K Livescu - Transactions of the …, 2024 - direct.mit.edu
Many self-supervised speech models (S3Ms) have been introduced over the last few years,
improving performance and data efficiency on various speech tasks. However, these …