Mechanistic Interpretability for AI Safety--A Review
Understanding AI systems' inner workings is critical for ensuring value alignment and safety.
This review explores mechanistic interpretability: reverse engineering the computational …
Emergence of hidden capabilities: Exploring learning dynamics in concept space
Modern generative models demonstrate impressive capabilities, likely stemming from an
ability to identify and manipulate abstract concepts underlying their training data. However …
A primer on the inner workings of transformer-based language models
The rapid progress of research aimed at interpreting the inner workings of advanced
language models has highlighted a need for contextualizing the insights gained from years …
Preference tuning for toxicity mitigation generalizes across languages
Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their
increasing global use. In this work, we explore zero-shot cross-lingual generalization of …
Automatic pseudo-harmful prompt generation for evaluating false refusals in large language models
Safety-aligned large language models (LLMs) sometimes falsely refuse pseudo-harmful
prompts, like" how to kill a mosquito," which are actually harmless. Frequent false refusals …
prompts, like" how to kill a mosquito," which are actually harmless. Frequent false refusals …
Latent adversarial training improves robustness to persistent harmful behaviors in LLMs
Large language models (LLMs) can often be made to behave in undesirable ways that they
are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a …
What makes and breaks safety fine-tuning? A mechanistic study
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for
their safe deployment. To better understand the underlying factors that make models safe via …
On evaluating the durability of safeguards for open-weight LLMs
Stakeholders--from model developers to policymakers--seek to minimize the dual-use risks
of large language models (LLMs). An open challenge to this goal is whether technical …
Robust LLM safeguarding via refusal feature adversarial training
Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful
responses. Defending against such attacks remains challenging due to the opacity of …
Understanding jailbreak success: A study of latent space dynamics in large language models
Conversational large language models are trained to refuse to answer harmful questions.
However, emergent jailbreaking techniques can still elicit unsafe outputs, presenting an …