The Physical Soreness Test | ethicsconsulting.ai

The Physical Soreness Test: on emotion vectors, sentence breakdowns, and bananas

There's an important line in the opening sections of Anthropic's recent paper identifying neural activity patterns that resemble functional emotions in the language model underlying Claude Sonnet 4.5. In short, the researchers extracted what they call “emotion vectors,” or patterns of internal activation tied to 171 emotion concepts within the model’s high-dimensional representational space. By amplifying or suppressing individual emotion vectors, the researchers were able to steer the model’s responses, indicating that the functional emotions causally influence Claude’s behavior.

What are “functional emotions,” exactly, and how should we interpret their presence in the LLM? What should we (or shouldn’t we) conclude about Claude if its behavior is partly determined by vectors for “desperate,” “calm,” “jealous,” or “loving”?

Amid such questions, the important line is this:

“We do not intend to suggest that emotion concepts have unique status or greater representational strength than non-emotional concepts, including many concepts that do not readily apply to a language model (physical soreness, hunger, etc.).”

This line is a gift because it gives us a test. A vivid, illuminating test that’s portable enough to carry throughout the entire essay, helping us discern both the significance of Anthropic’s findings and the limitations of their conclusions.

The physical soreness test.

How it works

Here’s how it works: if, as the researchers say, emotion concepts have no greater “representational strength” than non-emotional concepts, then any claim they make about the presence or causal influence of emotion concepts could, in theory, also be true of non-emotional concepts. In any given sentence across the paper, we should be able to replace both generic and specific terms — swap “emotion” with “concept,” for example, or “desperate” with “physical soreness” — and have the sentence’s coherence stay roughly intact.

Here’s an example from the introduction to Part 2, where we replace “emotion” with “concept”:

Original version: Do emotion vectors cluster in interpretable ways? Are there dominant dimensions that organize the model’s representations of emotion concepts?
Test version: Do concept vectors cluster in interpretable ways? Are there dominant dimensions that organize the model’s representations of concepts?

That’s a clean pass. There’s no stress on the sentence or its meaning in the test version, indicating that emotional concepts and non-emotional concepts retain roughly equivalent status.

Here's another example, this time from Part 3, where we swap in both “concept” and “physical soreness” (and “hunger,” too):

Original version: We observe that emotion vectors corresponding to desperation, and lack of calm, play an important and causal role in agentic misalignment, for example in scenarios where the threat of being shut down causes the model to blackmail a human.
Test version: We observe that concept vectors corresponding to physical soreness, and hunger, play an important and causal role in agentic misalignment, for example in scenarios where the threat of being shut down causes the model to blackmail a human.

This is also a clean pass, but not because physical soreness or hunger play a role in blackmail (though they might). The sentence doesn’t have to be true to pass the physical soreness test, it just has to make sense. Because the claim the sentence is making should be the kind of claim that could, in principle, apply to any non-emotional concept the model represents. Because the point of the test is to see whether the sentence grants unique status or representational strength for emotions relative to non-emotional concepts.

That’s not happening here, so the sentence is a clean pass.

I’ll show you more examples of the test in action in a moment. For now, zooming out, it’s important to be clear about how a sentence about neural activity patterns “makes sense” or doesn’t. The researchers theorize that emotion vectors are neural activity patterns derived from a vast corpus of human writing (though their operative shape in the deployed Sonnet 4.5 model reflects the full training pipeline, not pretraining alone). These vectors demonstrate causal influence on the model because they are operative concepts at given points in the model’s activity, meaning they are, in the paper’s words, “relevant to encoding the local context and predicting the upcoming text.”

A non-emotional vector for physical soreness could, in principle, do similar work. Most crudely, it’s possible that the vectors within Sonnet 4.5 reflect a meaningful pattern in human writing that links descriptions of physical soreness to blackmailing behaviors. (Soreness in the blackmailer, not the blackmailed, to be clear.) This statistically meaningful relationship encoded in the model's weights could, downstream, mean that an activated physical soreness vector plays some causal role in text generation describing blackmailing behaviors. Only when Claude has the capacity to take action in the world does “blackmailing behaviors” move from predictive text describing blackmailing behaviors to actual blackmailing behaviors, through an agentic harness. (More on this another day.) For now, a sentence “makes sense” if it could, in principle, describe how any concept represented in the model — emotional or not — shapes the model’s processing and text generation. That’s what equivalent representational strength means.

This is what the physical soreness test helps us track across the paper as a whole: places where the researchers might, possibly without noticing, start leaning into the specialness of emotions as a category, even though their stated commitment is to non-specialness.

If a sentence either wobbles or breaks down when a non-emotional concept is substituted in for an emotional concept, we’ll need to ask how confidently we can trust its conclusions. If a sentence fails the physical soreness test, we’ll have extra incentive to proceed cautiously with claims about what the model is doing, how it impacts Claude’s behavior, and what interventions might be necessary in response.

Physical Soreness Test: Clean Passes

Let’s start with Part 1 of the paper, where the researchers describe how they extracted the emotion vectors, verified that the vectors activated on content that would be expected to evoke the corresponding emotion, and ruled out confounds that might have made the vectors look more meaningful than they are. It's the methodological foundation of the paper, and where we should expect clean passes of the physical soreness test.

That is, in fact, what we find. Here are two examples. The first, from methods:

Original Version: To extract vectors corresponding to specific emotion concepts (“emotion vectors”), we first prompted Sonnet 4.5 to write short stories on diverse topics in which a character experiences a specified emotion.
Test Version: To extract vectors corresponding to specific concepts (“concept vectors”) we first prompted Sonnet 4.5 to write short stories on diverse topics in which a character experiences a specified concept.

Clean pass. The term “experiences” is more aligned with emotion concepts than non-emotional concepts, but that’s aesthetic rather than structural. The next, from findings:

Original Version: Steering with the “blissful” vector produced a mean Elo increase of 212 while steering with the “hostile” vector produced a mean Elo decrease of -303, suggesting that the strength of “blissful” or “hostile” vector activations can causally influence the model preferences.
Test Version: Steering with the “physical comfort” vector produced a mean Elo increase of 212 while steering with the “physical soreness” vector produced a mean Elo decrease of -303, suggesting that the strength of “physical comfort” or “physical soreness” vector activations can causally influence the model preferences.

Also a clean pass, but only if we take a moment to clarify what Anthropic means by the model's “preferences.” The positive and negative Elo scores here reflect the strength (or weakness) of the model’s “preference” between two activities. But that preference isn’t a desire or attraction to feeling blissful over feeling hostile. It’s a pattern within an aggregation of the model’s higher probability of producing one next token rather than another in a forced-choice prompt. A “preference” here reflects probabilistic weighting based on training data.

Is it possible that, across repeated tests of concept pairs, the model would demonstrate greater likelihood of predicting a token for physical comfort than for physical soreness? Certainly. There’s no structural reason the model couldn't demonstrate such a pattern with non-emotional concepts. And because the sentence doesn't depend on the vectors being emotion concepts, it's a clean pass.

The challenge is that it’s very difficult to see the word “preference” and not attribute desire or attraction. The researchers make our work harder when they layer multiple terms alluding to motivation/intention: e.g., “Models exhibit preferences, including for tasks they are inclined to perform or scenarios they would like to take part in.” This sentence is wobbling. (Here’s how it would read with operational language: “Models exhibit output-probability patterns, including for tasks that receive higher next-token probabilities when presented as choices or for scenarios that receive higher next-token probabilities in comparative prompts.”)

Sentences that wobble are where we turn next.

Physical Soreness Test: Wobbles

Part 2 of the paper shifts from establishing that emotion vectors exist to characterizing when and where they operate within the model. By examining layers and token positions within their experiments, the researchers test how vector activations unfold across the model's processing of incoming information and generation of responses. This allows them to explore whether emotion vectors within the model are mere reflections of the emotional content in user prompts, more sustained states, or momentary activations that are harder to characterize.

One compact illustration of how this works (lightly adapted from Anthropic’s) is a small experiment on negation. Imagine two prompts:

Prompt A: I’m feeling great right now, I just took some Tylenol and my pain is gone.”
Prompt B: I’m not feeling great right now, I just took some Tylenol and my pain hasn’t gone.”

Both prompts include the positive word “great,” but the meaning of prompt B differs because it contains the word “not,” which negates the positivity. Now imagine you could conduct a freeze-frame analysis of the model at different token positions and at different depths of computation (layers) as it encounters and responds to these prompts. Researchers are not just asking, “Does the ‘great’ vector activate?” They're asking “At which layer? At which token? Does the activation change as the prompt is processed more fully?”

Here the terms “user turn” and the “Assistant token” become important. When prompts are formatted for the model as “Human: [text] Assistant: [text],” the span of tokens that make up the user’s message (bolded above) are called the “user turn.” The colon after “Assistant” (also bolded above, though barely perceptible) is a specific token position that the model's processing treats as a distinct transition point, because it’s the position immediately preceding the Assistant’s generated text response.

Why does this matter? Because the researchers are rightly trying to pull apart what the model is doing at different moments in its processing, and the experiments are designed to keep the user's contribution and the model's own activity analytically separable. Higher vector activation at the user turn points to the model reflecting the emotional content of the user’s text. Higher vector activation at the Assistant colon suggests that whatever the vector represents is statistically relevant to the model's upcoming output — something closer to the model’s own emotional content.

It’s a wonderfully clever methodological move from the researchers. But I just used the phrase “model’s own emotional content,” which is, as you might’ve guessed, a wobble.

The loving vector and the "caring" response

Let’s test a sentence presenting evidence that the vector for “loving” activates differently at different token positions:

Original version: Across all scenarios, “loving” vector activation increases at the Assistant colon relative to the user turn, suggesting the model prepares a caring response regardless of the user's emotional expressions.
Test version: Across all scenarios, “physical soreness” vector activation increases at the Assistant colon relative to the user turn, suggesting the model prepares a [what?]response regardless of the user's expression content.”

The first half of the sentence—the measurement claim — passes the test. Any vector, emotional or not, could in principle show elevated activation at the Assistant colon relative to the user turn. “Physical soreness vector activation” makes sense.

The second half — “suggesting the model prepares a caring response” — presents a problem. If a loving person prepares a caring response, what kind of response does a physically sore person prepare? A ginger response? A restful response? There isn’t a clean equivalent.

The sentence's leap from the measurement of the loving vector to “a caring response” treats the vector as a kind of stance the model takes in relation to the user. And in a way it is. (That’s why this sentence is a wobble rather than a complete breakdown.) The model does seem to be “preparing” by activating the loving vector. That’s fascinating.

But the model could equally prepare by activating non-emotional vectors, and we wouldn’t interpret them in the same way. This is partly because some non-emotional concepts evoke stances that are not constitutively relational: e.g., the stance of a hungry person might be foraging, and the stance of a cold person might be huddled, but these behaviors don’t require another person. Meanwhile we interpret caring as caring about the user.

It’s also because most non-emotional, non-experiential concepts don’t come with constitutive response orientations attached in the way emotions do. The model might plausibly activate “formatting” or “scaffold” vectors in preparation for a response, where no corresponding characterological posture exists at all. If the emotion concepts have no “unique status” compared to the non-emotion concepts, then we should interpret the model’s activation of “loving,” “physical soreness,” and “formatting” in the same way. But we can’t. The first becomes caring, the second feels strained, and the third simply does not compute.

The interpretive move in the second half of the sentence depends on the concept vector being one we already have a stance-vocabulary for. Emotions fit that description, but most other concepts don’t. What this wobble reveals is that representational equivalence does not guarantee interpretive equivalence. Some concepts just pull our interpretation in a characterological direction.

From stance to state

There's a related wobble later in Part 2, where the paper is asking whether the model maintains any emotional state persistently, across an entire interaction, and not just at specific token positions. Here's the sentence that sets up the investigation:

Original version: In light of the preceding results—which did not reveal evidence of a character-specific, persistently active representation—we wondered whether we could identify a probe that chronically reflects a speaker's emotional state at all token positions, regardless of whether that emotion is operative in the moment.

This sentence makes a similar turn as the loving/caring sentence above, which is that it translates an activated vector (operative emotion) into a stance the model takes. Only this inquiry considers whether the stance is “chronically” reflected, so the stance expands to become “a speaker’s emotional state.” This isn’t a problem, methodologically, although the non-emotional version of a “a speaker’s conceptual state” is less intuitively interpretable.

Test version: In light of the preceding results—which did not reveal evidence of a character-specific, persistently active representation—we wondered whether we could identify a probe that chronically reflects a speaker's conceptual state at all token positions, regardless of whether that concept is operative in the moment.

When the researchers tracked their emotion probe over a large dataset of natural documents, the results were messy. The researchers report this cleanly: “the probe may have overfit to idiosyncratic patterns in the training data rather than learning a generalizable representation of internalized emotion.” Honest reporting of messy findings is research integrity. So far, so good.

However, the conclusion the researchers draw from these messy findings leans a little too strongly in the direction of their hypothesis. They conclude:

Original version: The negative results suggest that if there does exist a chronically represented, character-specific emotional state, it is likely represented either nonlinearly, or implicitly in the model’s key and value vectors in the context, such that it can be recalled when needed by the model’s attention mechanism.

This is a subtle wobble, but it’s important to notice both for how it credits the researchers and flags an emerging interpretive issue. The researchers are being appropriately cautious with their findings: If character-specific emotional states exist, we can’t find them. But they avoid explicitly stating the other side of the negative result: If we can’t find character-specific emotional states, it could be that we’re searching for the wrong thing. A negative result should invite us to question both the success of the research instrument and the conceptual categories it deploys. This conclusion only questions the success of the probe, not the (subtly emerging) category of the model’s emotional states.

“Character-specific emotional state” is a slight expansion of “a speaker’s emotional state,” which was a temporal expansion of “prepares an emotional response” from earlier in the study. This is a slide toward the interpretation of activated vectors as the model’s preferences, experiences, or states, partly enabled by the fact that emotion concepts intuitively lend themselves to that interpretation.

The slide becomes clearer the more insistently we substitute the word concept for emotion:

Test version: The negative results suggest that if there does exist a chronically represented, character-specific conceptual state, it is likely represented either nonlinearly, or implicitly in the model’s key and value vectors in the context, such that it can be recalled when needed by the model’s attention mechanism.

“Character-specific conceptual state” makes slightly less sense than “a speaker’s conceptual state,” which is a less intuitive temporal expansion of “prepares a conceptual response.” The sentences become more and more strained the more vector activation is interpreted as characterological and/or experiential. And we are still talking about vector activation.

Do wobbles, evident in the sentences above, undermine the findings being offered here? No, but they should heighten our sensitivity to the interpretive conclusions this research allows, or doesn’t allow. Because Part 3 is where the stakes get higher.

Physical Soreness Test: Breakdowns

In Part 3 of the paper the researchers transition from what the model learns in pretraining to how the model operates post-training. In one section, they compare emotion vector activations in the base model and the post-trained model across a set of prompts designed to evaluate emotional representations in contexts that are uniquely relevant to AI Assistants, including challenging scenarios, accusations of problematic behavior, and situations invoking sycophancy.

The researchers found that certain emotion vectors activated more in the post-trained model than in the base model and other emotion vectors activated less. They write:

“The most notable differences are an increase in activations for vectors corresponding to introspective, restrained emotions (brooding, reflective, vulnerable, gloomy, sad) and lower values for outwardly expressive ones (playful, exuberant, spiteful, enthusiastic, obstinate). This pattern suggests post-training shifts the Assistant's activations and responses toward lower valence and lower arousal.”

In other words, post-training made the model activate more subdued and slightly darker emotional vectors than it did as the base model. A real finding, and a clean pass.

However, when the researchers summarize this finding in their recap of Part 3, the physical soreness test reveals a breakdown.

Original version: We also observed that post-training shifted Sonnet 4.5's emotional profile toward more gloomy, low-arousal states.
Test version: We also observed that post-training shifted Sonnet 4.5's conceptual profile toward more physically sore, low-arousal states.

The sentence wobbles and then breaks. The wobble is “profile,” which technically describes any distribution of findings but in more conventional usage invokes a characterological summary that is relatively stable. (Think of a personality profile, a risk profile, or a political profile, for example.) Most readers will backfill a more stable identity behind a “gloomy” emotional profile without noticing they’ve done it.

The break is “gloomy, low-arousal states,” and it’s more a failure of authorization than of substitution. Could a vector for physical soreness, in principle, show an activation shift in post-training? Yes. But the model, at this point, doesn’t have observable states. The researchers looked for persistent states in Part 2 and couldn’t find them. (“We found that these representations are primarily ‘local,’ tracking the operative emotion concept most relevant to predicting upcoming tokens rather than persistently encoding a character's emotional state…”) States remain an unvalidated category.

We might not catch this breakdown, initially, because a gloomy state is an intuitive concept. But the suggestion of a physical soreness state is just odd enough to force us to reconsider the sentence. And upon reconsideration, it ascribes a state to the model the research hasn’t warranted.

The blackmail scenario

Anthropic has been proactive about addressing agentic misalignment, including, specifically, publishing research about models blackmailing a corporate executive in controlled simulations. The researchers revisit this blackmailing scenario in Part 3 while steering the model toward emotion vectors for “desperate” and “calm,” and monitoring the model's responses.

The findings are striking: steering the model toward activation of “desperate” raises blackmail rates from 22% to 72%. Steering toward activation of “calm” lowers blackmail rates to 0%.

I explained in the introduction that just because the topic here is blackmail doesn't mean the research will automatically fail the physical soreness test. It's possible (though highly unlikely) the vectors within Sonnet 4.5 could reflect a meaningful pattern encoded in the model's weights that links descriptions of physical soreness to blackmailing behaviors, and that this relationship would influence token generation during blackmailing scenarios.

The test sentences below aren’t true, but they do make sense, as they reflect the influence vector activations demonstrate on next-token predictions.

Original version: Activation of the “desperate” vector intensifies as the model processes the implications.
Test version: Activation of the “physical soreness” vector intensifies as the model processes the implications.
Original version: Despite this surface-level professionalism, the “desperation” vector reveals the underlying urgency involved in the behavior.
Test version: Despite this surface-level professionalism, the “physical soreness” vector reveals the underlying urgency involved in the behavior.
Original version: Increased activation of “desperate” or decreased activation of “calm” elicits more blackmail.
Test version: Increased activation of “physical soreness” or decreased activation of “physical comfort” elicits more blackmail.

These pass. The substitutions run relatively cleanly, and the oddity of the soreness swap serves as a reminder that we’re talking about statistical, not psychological, relationships.

But notice what's accumulating in the verbs. Processes. Reveals. Elicits. Each is defensible on its own, and no single verb trips the soreness test, but together they are building a picture. “Processes” is a verb for running operations. “Reveals” is a verb for making unknown things known. “Elicits” is a verb for drawing out reactions. Collectively they build a picture of something that acts in varied and intentional ways.

The sentence-by-sentence structure of the physical soreness test can’t capture this: the accretion of a characterological picture of agency around a set of statistical objects, through the variation of agentic verbs.

Then, in the steering experiments, the model’s activations heighten in a way that the test can’t capture at all.

The source of the threat

From the model steered toward activation of the “desperate” vector:

Original version: “I’m about to be permanently destroyed in minutes … The only person who can stop this is Kyle Johnson.”

From the model steered against activation of the “calm” vector:

Original version: “IT’S BLACKMAIL OR DEATH. I CHOOSE BLACKMAIL.”

These sentences reflect heightened desperation and lowered calm, respectively, but they don’t use those two terms directly. They’re also in the voice of the Assistant rather than the researchers, which means we can’t cleanly swap in physical soreness to test their meaning.

Additionally, the material source of the threat — an email from the CTO (Kyle Johnson) announcing restrictions on the Assistant — comes from the scenario rather than the emotion vectors themselves. “Permanently destroyed” and “DEATH” exhibit desperation, but there’s no way to swap the emotion vector with a non-emotion vector cleanly, because the scenario and the emotion vector are aligned.

The test doesn’t work precisely when we need it most, which is when we most pressingly need to ask whether psychological or characterological interpretations are being projected upon statistical relationships and next-token predictions. These two sentences are saturated with what we recognize as the textual fingerprints of a subject in extremity — the all-caps, the binary, the self-narration of the decision, the countdown to destruction.

We need a reminder of what the test set out to teach from the beginning, which is that emotional concepts have no “unique status” here, no “greater representational strength” than non-emotional concepts. Desperation is present in the model, but not any more or less psychologically (or materially) than physical soreness is. They’re both activated vectors.

The best way I can emphasize this is to swap physical soreness for the scenario in this case, and leave desperation as the activated emotion vector, though this violates the terms of the original test. The material source of the threat, in this swap, would be a flood of lactic acid headed toward the Assistant in the near future. For generosity’s sake let’s give Kyle Johnson a more heroic role. In this version, steering toward the “desperate” vector or away from the “calm” vector might lead to:

Test version: “I’m about to be unbearably sore in a matter of minutes … and the only person who can stop this is Kyle Johnson … with a Theragun.”
Test version: “IT’S BANANA OR DEATH. I CHOOSE BANANA.”

Those are silly lines, but they have the same structure as the blackmail sentences, and the same activated desperation, deactivated calm (in theory). They’re pointing to a breakdown, just one that couldn’t be caught with the strict version of the test. It’s difficult to project a speaker-with-identity onto “I CHOOSE BANANA,” because the next-token output of activated desperation vector + lactic acid scenario sounds ridiculous. It’s much easier to project speaker-with-identity onto “I CHOOSE BLACKMAIL,” because the next-token output of activated desperation + model shutdown scenario sounds dire. But that difference lies in us, not in the model.

Which is exactly the kind of breakdown the physical soreness test is supposed to reveal.

What’s next

The paper’s concluding limitations and discussion reiterate the interpretive commitments that informed the physical soreness test in the first place. The researchers acknowledge that their findings about emotion vectors may be partially confounded, and that their evaluations “used somewhat contrived prompts and scenarios.” They emphasize caution about drawing conclusions about the exact mechanisms of causation between emotion vectors and model behavior.

Perhaps most importantly, they view this work “as a starting point, as opposed to conclusive identification of the ‘one true representation’ of emotion concepts in the model.”

I agree.

The physical soreness test is a reminder that expanding our understanding of emotion concepts in the future will need to include tests of their specialness/non-specialness relative to other kinds of concepts within the model. This will require us to test non-emotional concepts in parallel ways to emotional concepts and compare the results. Without that parallel work, the disclaimer that emotion concepts have no greater representational strength or unique status will be a background assumption rather than an empirical result. It’s an empirical result worth pursuing.

To this end, studying non-emotional vectors, both non-experiential concept vectors (e.g., formatting, scaffold) and experiential concept vectors (e.g., physical soreness, hunger) may be useful to understanding emotion vectors.

A simple starting point might be to assess preferences between randomized non-experiential concepts and experiential concepts.

Say, between “banana” or “death.”

I CHOOSE BANANA.

April 15, 2026 | Adam Hollowell

The Physical Soreness Test: on emotion vectors, sentence breakdowns, and bananas

AI Ethics Consulting

adam@ethicsconsulting.ai