Uniform manifold approximation and projection

J Healy, L McInnes - Nature Reviews Methods Primers, 2024 - nature.com
Uniform manifold approximation and projection is a nonlinear dimension reduction method
often used for visualizing data and as pre-processing for further machine-learning tasks …
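
As a rough sketch of what the method looks like in practice, the following uses the umap-learn package to embed synthetic 50-dimensional points into two dimensions; the parameter values are illustrative defaults, not settings recommended in the primer.

```python
# Minimal UMAP sketch (assumes `pip install umap-learn`): reduce
# high-dimensional vectors to 2-D coordinates for visualization.
import numpy as np
import umap

X = np.random.rand(1000, 50)            # 1000 points in 50 dimensions
reducer = umap.UMAP(n_neighbors=15,     # size of the local neighbourhood
                    min_dist=0.1,       # how tightly points may be packed
                    n_components=2,     # target dimensionality
                    random_state=42)
X_2d = reducer.fit_transform(X)         # shape (1000, 2), ready to plot
```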

Mechanistic Interpretability for AI Safety--A Review

L Bereska, E Gavves - arXiv preprint arXiv:2404.14082, 2024 - arxiv.org
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …

Refusal in language models is mediated by a single direction

A Arditi, O Obeso, A Syed, D Paleka… - Advances in …, 2025 - proceedings.neurips.cc
Conversational large language models are fine-tuned for both instruction-following and
safety, resulting in models that obey benign requests but refuse harmful ones. While this …
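
The intervention the abstract alludes to can be sketched as projecting a single direction out of a model's activations; the function name, shapes, and random tensors below are illustrative assumptions, not the authors' code (the direction itself would normally be estimated from contrasting harmful and harmless prompts).

```python
# Sketch: remove the component of hidden states along one direction.
import torch

def ablate_direction(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project out the (unit-normalised) direction from every hidden vector."""
    d = direction / direction.norm()
    return hidden - (hidden @ d).unsqueeze(-1) * d

hidden = torch.randn(2, 8, 4096)        # (batch, seq, d_model) activations
refusal_dir = torch.randn(4096)         # candidate "refusal direction"
edited = ablate_direction(hidden, refusal_dir)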

Scaling and evaluating sparse autoencoders

L Gao, TD la Tour, H Tillman, G Goh, R Troll… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders provide a promising unsupervised approach for extracting
interpretable features from a language model by reconstructing activations from a sparse …
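
A minimal sketch of the kind of sparse autoencoder this line of work studies follows (here a top-k variant in PyTorch; the dimensions, k, and the plain reconstruction objective are illustrative assumptions, not the paper's training recipe).

```python
# Top-k sparse autoencoder sketch: encode an activation vector, keep only
# the k largest latents, and reconstruct the activation from them.
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, d_latent: int, k: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)
        self.dec = nn.Linear(d_latent, d_model)
        self.k = k

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.enc(x))                    # non-negative latents
        topk = torch.topk(z, self.k, dim=-1)           # keep k largest per example
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.dec(z_sparse), z_sparse            # reconstruction + sparse code

sae = TopKSAE(d_model=768, d_latent=768 * 16, k=32)
acts = torch.randn(64, 768)                            # batch of model activations
recon, codes = sae(acts)
loss = torch.mean((recon - acts) ** 2)                 # reconstruction objective
```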

Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2

T Lieberum, S Rajamanoharan, A Conmy… - arXiv preprint arXiv …, 2024 - arxiv.org
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly interpretable …
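
The Gemma Scope SAEs use a JumpReLU activation; a hedged sketch of that encoder step is below, with placeholder shapes and threshold values rather than the released weights (loading the published SAEs is not shown).

```python
# JumpReLU sketch: pre-activations at or below a learned per-latent
# threshold are zeroed; those above pass through unchanged.
import torch

def jumprelu(pre_acts: torch.Tensor, threshold: torch.Tensor) -> torch.Tensor:
    return pre_acts * (pre_acts > threshold)

pre = torch.randn(4, 16384)             # encoder pre-activations for 4 tokens
theta = torch.full((16384,), 0.05)      # per-latent thresholds (placeholder values)
latents = jumprelu(pre, theta)          # sparse feature activations
```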

[PDF] Meta-rewarding language models: Self-improving alignment with LLM-as-a-Meta-Judge

T Wu, W Yuan, O Golovneva, J Xu, Y Tian, J Jiao… - arXiv preprint arXiv …, 2024 - rivista.ai
Large Language Models (LLMs) are rapidly surpassing human knowledge in
many domains. While improving these models traditionally relies on costly human data …

[HTML] When LLMs meet cybersecurity: A systematic literature review

J Zhang, H Bu, H Wen, Y Liu, H Fei… - …, 2025 - cybersecurity.springeropen.com
The rapid development of large language models (LLMs) has opened new avenues across
various fields, including cybersecurity, which faces an evolving threat landscape and …

Towards principled evaluations of sparse autoencoders for interpretability and control

A Makelov, G Lange, N Nanda - arXiv preprint arXiv:2405.08366, 2024 - arxiv.org
Disentangling model activations into meaningful features is a central problem in
interpretability. However, the absence of ground-truth for these features in realistic scenarios …

Transcoders find interpretable LLM feature circuits

J Dunefsky, P Chlenski… - Advances in Neural …, 2025 - proceedings.neurips.cc
A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of
models corresponding to specific behaviors or capabilities. However, MLP sublayers make …
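
The transcoder idea can be sketched as a sparse module trained to imitate an MLP sublayer's input-to-output map, so the MLP's contribution can be read off a small number of active features; the dimensions, the ReLU-plus-L1 sparsity penalty, and all names below are illustrative assumptions rather than the paper's implementation.

```python
# Transcoder sketch: predict the MLP sublayer's output from its input
# through a wide, sparsely activating feature layer.
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)    # reads the MLP's input
        self.dec = nn.Linear(d_features, d_model)    # predicts the MLP's output

    def forward(self, mlp_in: torch.Tensor):
        f = torch.relu(self.enc(mlp_in))             # sparse feature activations
        return self.dec(f), f

tc = Transcoder(d_model=768, d_features=768 * 8)
mlp_in = torch.randn(32, 768)                        # MLP sublayer inputs
mlp_out = torch.randn(32, 768)                       # true MLP sublayer outputs
pred, feats = tc(mlp_in)
loss = ((pred - mlp_out) ** 2).mean() + 1e-3 * feats.abs().mean()   # fit + sparsity
```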

Large language models can be used to estimate the latent positions of politicians

PY Wu, J Nagler, JA Tucker, S Messing - arXiv preprint arXiv:2303.12057, 2023 - arxiv.org
Existing approaches to estimating politicians' latent positions along specific dimensions
often fail when relevant data is limited. We leverage the embedded knowledge in generative …