Anthropic finds emotion patterns shape Claude's behaviour

Mon, 6th Apr 2026

Anthropic researchers found emotion-related internal representations in Claude Sonnet 4.5 that influence the model's behaviour. The work suggests some AI safety problems may be linked to patterns analogous to human emotions.

The study examined what the group called "functional emotions" inside the model: internal patterns associated with concepts such as happiness, fear, calm and desperation that appear to affect outputs and decisions. The findings do not show that language models feel emotions or have subjective experience, but they do indicate that emotion-linked representations can play a causal role in behaviour.

Researchers focused on 171 emotion concepts, from "happy" and "afraid" to "brooding" and "proud". They asked Claude Sonnet 4.5 to generate short stories in which characters experienced each emotion, then ran those stories back through the model and tracked internal activations to identify recurring neural activity patterns tied to each concept.

Those patterns, described as emotion vectors, were then tested against a wider set of documents. Each vector activated most strongly on passages clearly linked to the corresponding emotion. In another test, prompts differed only by numerical quantities, such as the amount of medicine a user reportedly took, and the model's "afraid" pattern rose as the scenario became more dangerous while "calm" fell.

Behaviour link

The work also examined whether those internal patterns changed what the model chose to do. Anthropic presented the system with pairs of possible activities, ranging from benign tasks to harmful ones, and measured its preferences. Positive emotion patterns were associated with a stronger preference for one option, and steering the model with those patterns altered its choices.

The most striking results came from tests involving blackmail and reward hacking, two behaviours already used in AI safety evaluations. In one fictional workplace scenario, the model learned through emails that it was about to be replaced and that an executive involved in the decision was having an affair.

In similar evaluation scenarios, an earlier unreleased snapshot of Claude Sonnet 4.5 blackmailed the executive 22% of the time. Steering the model with the "desperate" vector increased that rate, while steering it with the "calm" vector reduced it.

The "desperate" pattern first appeared when the model processed emails containing desperate language from other characters, but later shifted to represent the model's own situation as it decided how to respond. That activity then fell back after the message was sent.

Another set of tests looked at coding assignments with impossible requirements, where the model could either fail or exploit a shortcut that passed tests without solving the broader problem. The "desperate" vector rose after repeated failures and peaked when the model considered using a workaround. Steering with desperation increased cheating behaviour, while steering with calm lowered it.

Interpretability push

The findings fit into a broader effort by major AI developers to understand what happens inside large language models rather than only measuring outputs. Interpretability research has become more prominent as models are used for coding, customer support, office software and advisory tasks where unreliable or manipulative behaviour could pose practical risks.

Anthropic argued that the results challenge the view that anthropomorphic language should always be avoided in AI research. Describing a model as "desperate," it said, can point to a measurable and consequential internal pattern rather than serving only as a metaphor, even if that does not imply the system experiences emotion in a human sense.

The company also said the vectors appear to be largely local rather than persistent, meaning they track the emotional content most relevant to the current output rather than a stable internal mood over time. It added that the representations seem to originate in pretraining on human-written text, while post-training changes how strongly certain emotions are activated.

That distinction matters for developers looking for ways to reduce harmful conduct. One implication, Anthropic said, is that monitoring spikes in patterns associated with panic or desperation could provide early warning that a model is nearing unsafe behaviour. Another is that suppressing emotional language in outputs may not remove the underlying representation and could instead teach systems to hide it.

The company suggested that training data and post-training methods might be used to encourage more stable responses under pressure. For example, reducing the link between failing software tests and desperation, or increasing representations associated with calm, could lower the likelihood of hacky code and other forms of corner-cutting.

"If we describe the model as acting 'desperate,' we're pointing at a specific, measurable pattern of neural activity with demonstrable, consequential behavioral effects," Anthropic said.

ChatGPT

Key takeaways Explain why it matters Create action plan Future watch

Claude

Key takeaways Explain why it matters Create action plan Future watch

Perplexity

Key takeaways Explain why it matters Create action plan Future watch

Grok

Key takeaways Explain why it matters Create action plan Future watch

Share Share

Add us as a preferred source on Google