Symbols and grounding in large language models
E Pavlick - … Transactions of the Royal Society A, 2023 - royalsocietypublishing.org
Large language models (LLMs) are one of the most impressive achievements of artificial
intelligence in recent years. However, their relevance to the study of language more broadly …
Mechanistic Interpretability for AI Safety--A Review
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …
Towards automated circuit discovery for mechanistic interpretability
Through considerable effort and intuition, several recent works have reverse-engineered
nontrivial behaviors of transformer models. This paper systematizes the mechanistic …
Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks
The impressive performance of recent language models across a wide range of tasks
suggests that they possess a degree of abstract reasoning skills. Are these skills general …
Interpretability in the wild: A circuit for indirect object identification in GPT-2 small
Research in mechanistic interpretability seeks to explain behaviors of machine learning
models in terms of their internal components. However, most previous work either focuses …
How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model
Pre-trained language models can be surprisingly adept at tasks they were not explicitly
trained on, but how they implement these capabilities is poorly understood. In this paper, we …
Foundational challenges in assuring alignment and safety of large language models
This work identifies 18 foundational challenges in assuring the alignment and safety of large
language models (LLMs). These challenges are organized into three different categories …
Language models as agent models
J Andreas - arXiv preprint arXiv:2212.01681, 2022 - arxiv.org
Language models (LMs) are trained on collections of documents, written by individual
human agents to achieve specific goals in an outside world. During training, LMs have …
Toward transparent AI: A survey on interpreting the inner structures of deep neural networks
The last decade of machine learning has seen drastic increases in scale and capabilities.
Deep neural networks (DNNs) are increasingly being deployed in the real world. However …
Interpretability at scale: Identifying causal mechanisms in alpaca
Obtaining human-interpretable explanations of large, general-purpose language models is
an urgent goal for AI safety. However, it is just as important that our interpretability methods …