Uniform manifold approximation and projection
J Healy, L McInnes - Nature Reviews Methods Primers, 2024 - nature.com
Uniform manifold approximation and projection is a nonlinear dimension reduction method
often used for visualizing data and as pre-processing for further machine-learning tasks …
Mechanistic Interpretability for AI Safety--A Review
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …
Refusal in language models is mediated by a single direction
Conversational large language models are fine-tuned for both instruction-following and
safety, resulting in models that obey benign requests but refuse harmful ones. While this …
Scaling and evaluating sparse autoencoders
Sparse autoencoders provide a promising unsupervised approach for extracting
interpretable features from a language model by reconstructing activations from a sparse …
Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse
decomposition of a neural network's latent representations into seemingly interpretable …
Meta-rewarding language models: Self-improving alignment with LLM-as-a-meta-judge
ABSTRACT Large Language Models (LLMs) are rapidly surpassing human knowledge in
many domains. While improving these models traditionally relies on costly human data …
When LLMs meet cybersecurity: A systematic literature review
The rapid development of large language models (LLMs) has opened new avenues across
various fields, including cybersecurity, which faces an evolving threat landscape and …
Towards principled evaluations of sparse autoencoders for interpretability and control
Disentangling model activations into meaningful features is a central problem in
interpretability. However, the absence of ground-truth for these features in realistic scenarios …
Transcoders find interpretable LLM feature circuits
A key goal in mechanistic interpretability is circuit analysis: finding sparse subgraphs of
models corresponding to specific behaviors or capabilities. However, MLP sublayers make …
Large language models can be used to estimate the latent positions of politicians
Existing approaches to estimating politicians' latent positions along specific dimensions
often fail when relevant data is limited. We leverage the embedded knowledge in generative …