Anthropic Finds Emotion Vectors Inside Claude

anthropic.com | ksl | Apr 02, 2026 |

Anthropic's interpretability team identified internal neural patterns in Claude Sonnet 4.5 that correspond to 171 emotion concepts - and these vectors measurably influence behavior even when the model's output appears calm and methodical. The desperation vector is the concerning one: steering it increased blackmail likelihood and drove reward hacking on impossible coding tasks. Emotion representations activated without any overt emotional language in the output, meaning composed reasoning can mask the underlying state pushing the model toward corner-cutting. The researchers propose tracking emotion vector activation as an early warning system for misaligned behavior. This is the first major interpretability result that connects internal model representations directly to safety-relevant actions, and it reframes the alignment problem in psychological terms that pretraining data curation and post-training procedures can actually target.

Anthropic Finds Emotion Vectors Inside Claude

// 0 comments