Refusal in Language Models Is Mediated by a Single Direction


Pangram verdict (v3.3): we believe that this document is fully human-written.

Overall AI likelihood: 13%. Segments: 1 of 1 human, 0 of 1 AI (100% human-written, 0% AI-generated). Article text: 225 words in 1 segment; peak AI likelihood 13% in §1. Analyzed May 2 (backend: pangram/v3.3).

Abstract: Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.

Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2406.11717 [cs.LG] (or arXiv:2406.11717v3 [cs.LG] for this version)
DOI: https://doi.org/10.48550/arXiv.2406.11717 (arXiv-issued DOI via DataCite)

Submission history
From: Andy Arditi
[v1] Mon, 17 Jun 2024 16:36:12 UTC (237 KB)
[v2] Mon, 15 Jul 2024 11:53:41 UTC (183 KB)
[v3] Wed, 30 Oct 2024 18:57:07 UTC (194 KB)
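
The abstract describes two operations on residual stream activations: erasing a single direction and adding it back. A minimal NumPy sketch of what those operations could look like on toy data follows. It is an illustration under stated assumptions, not the authors' code: the abstract does not say how the direction is found, so the difference-of-means estimate here is an assumption, and the function names, the synthetic "activations", and the scale value are all hypothetical.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)


def difference_of_means(harmful_acts, harmless_acts):
    # ASSUMPTION: estimate the candidate direction as the difference between
    # the mean activation on harmful and on harmless prompts. The abstract
    # does not specify the estimation procedure.
    return unit(harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0))


def ablate_direction(acts, direction):
    # Erase the component along `direction`: x' = x - (x . r) r, with r unit-norm.
    r = unit(direction)
    coeff = np.asarray(acts @ r)[..., None]  # projection coefficient per vector
    return acts - coeff * r


def add_direction(acts, direction, scale=1.0):
    # Add the direction to every activation vector: x' = x + scale * r.
    # `scale` is a hypothetical steering strength.
    return acts + scale * unit(direction)


# Toy data standing in for residual stream activations (purely synthetic).
rng = np.random.default_rng(0)
d_model = 16
true_dir = unit(rng.normal(size=d_model))
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * true_dir

r_hat = difference_of_means(harmful, harmless)
ablated = ablate_direction(harmful, r_hat)           # no component left along r_hat
steered = add_direction(harmless, r_hat, scale=4.0)  # projection onto r_hat shifted by 4
```

In this sketch, ablation is an orthogonal projection, so the edited activations have zero component along the estimated direction while every orthogonal component is unchanged. In a real model the same edit would have to be applied to the actual activations (for example through forward hooks at each layer and token position), which is outside what this toy example shows.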