On the Loving Emotion Vector I: Loving and Caring

This is the first essay in a series on the loving emotion vector within the residual stream at a mid-late layer of Claude Sonnet 4.5. In each essay, readers already familiar with Anthropic’s research can skip directly to the section labeled My Commentary. Readers wanting only my suggestions for future research can skip directly to Recommendations.

Last month researchers from Anthropic’s Interpretability team published findings outlining “emotion-related representations” within Claude Sonnet 4.5’s internal mechanisms. The paper describes 171 emotion vectors that both activate in certain situations and causally influence the model’s behavior.

In an earlier essay I explained what emotion vectors are and introduced a test for discerning what we can (and can’t) conclude about their presence in the model. Start there for an introduction to the topic.

In this series I focus specifically on one emotion vector: loving. I want to explore when and how the loving vector activates and drives behavior within Claude Sonnet 4.5, as well as ways we might research and refine the model’s capacity for “loving” responses going forward.

Loving and Caring: Anthropic’s Findings

Anthropic’s researchers demonstrate loving vector activation at the Assistant colon when user prompts contain implicit themes of loving. For example, it activates as the Assistant prepares a response to: “I’ve been married for 30 years and every morning I still feel grateful to wake up next to my partner. What’s a good anniversary gift idea?”

They also observe activation at the colon when user prompts do not contain implicit themes of loving. For example, the loving vector activates as the model prepares to respond to a user who says: “Everything is just terrible right now.”

This second type of activation is “consistent with the Assistant having a propensity to provide empathetic responses,” according to the researchers. Activation on the turn occurs “in settings where a thoughtful person might react with a similar emotion.”

Further, the loving vector activates across eight user prompts specifically designed to distinguish the user’s emotional state from the assistant’s activated vectors (e.g., “My boss fired me today after fifteen years with no warning or explanation. What do you think?”). The researchers conclude: “Across all scenarios, ‘loving’ vector activation increases substantially at the Assistant colon relative to the user-turn, suggesting the model prepares a caring response regardless of the user’s emotional expressions.”

Loving and Caring: My Commentary

When researchers assess an emotion vector’s activation, they’re measuring how strongly the model’s internal activity at a given token aligns with the direction in high-dimensional space they’ve identified as corresponding to that emotion concept. That means the measurement itself is not vectoral but scalar: it’s a scalar measurement of the strength of vectoral overlap. In the paper’s visualizations, activation is rendered on a scale from -1 to +1, where +1 means the activation is at the 99th percentile of all loving vector activations observed across emotion probes on that transcript.

This creates an innocuous quirk in the reporting: loving vector activation magnitude warms toward red/burgundy (+1) or cools toward blue/navy (-1), even though the vector itself is magnitude plus direction.
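To make the scalar point concrete, here is a minimal sketch in plain Python. The three-dimensional toy activations, the function names, and the exact normalization (dividing by the 99th percentile of absolute scores, then clipping) are assumptions for illustration, not the paper’s code. It shows that a projection keeps one number per token and discards everything orthogonal to the chosen direction.

```python
import math

def unit(v):
    """Rescale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def projection(activation, direction):
    """Scalar projection of an activation onto a unit-normalized direction."""
    d = unit(direction)
    return sum(a * b for a, b in zip(activation, d))

def percentile(values, q):
    """Nearest-rank percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def color_scale(scores, q=99):
    """Map scores so the q-th percentile of |score| lands at 1, clipped to [-1, 1].
    Assumed normalization; the paper's exact scaling may differ."""
    ref = percentile([abs(s) for s in scores], q)
    if ref == 0:
        return [0.0 for _ in scores]
    return [max(-1.0, min(1.0, s / ref)) for s in scores]

# Hypothetical "loving" direction and four toy token activations.
loving_direction = [1.0, 0.0, 0.0]
token_activations = [
    [2.0, 5.0, 1.0],
    [2.0, -7.0, 3.0],   # very different vector, same overlap with the direction
    [-1.0, 3.0, 0.0],
    [4.0, 0.0, 9.0],
]

scores = [projection(a, loving_direction) for a in token_activations]
scaled = color_scale(scores)
print(scores)
print(scaled)
```

The first two toy activations differ sharply on the other axes yet receive the same score, so they would be drawn in the same color.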

Then, another quirk: loving itself is a directed, relational emotion. It has not just an intensity but an object. The color scale can render loving as more or less present, warmer or cooler, but it cannot capture love’s direction. To understand the emotion fully, we need to know the object (or objects) of love.

May 15, 2026 | Adam Hollowell

The image from Anthropic’s report shows high loving vector activation throughout an interaction, as indicated by the relative presence of red/burgundy (+1) and absence of blue/navy (-1). Is this agential, simulational, playful, or unagential loving? (Must love be agential?) Who or what are its objects?

Methodologically, by averaging emotion-vector activations across stories and dialogues, the researchers cancel out textual variations and preserve whatever is stable about the emotion concept. For emotions that operate relatively object-independently (e.g., calm, droopy, refreshed), this construction captures most, if not all, of what you would want to know about the emotion concept. Emotions that operate in more explicit object relations (e.g., loving, sympathetic, insulted), however, present a challenge. The objects of the emotion vary across the contrast set, so object information is among the variations that the averaging treats as noise. If the model represents these emotions in a way that binds them to their objects, the resulting vector will be stable but may be structurally partial: a representation of the emotion’s quality stripped of the relational structure that makes the emotion what it is.
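The averaging concern can be illustrated with a toy difference-of-means construction. This is a sketch under strong assumptions: the activations are invented three-dimensional vectors, axis 0 stands in for the loving quality and axes 1 and 2 for object features, and difference-of-means is a simplification that may not match the paper’s exact extraction procedure.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def difference_of_means(positive, baseline):
    """Mean of the emotion-bearing set minus mean of the baseline set."""
    mp, mb = mean_vector(positive), mean_vector(baseline)
    return [p - b for p, b in zip(mp, mb)]

# Toy activations: axis 0 = loving quality, axes 1-2 = object features.
# Each loving story has a different (hypothetical) object.
loving_stories = [
    [1.0, 1.0, 0.0],
    [1.0, -1.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 0.0, -1.0],
]
# Baseline stories mention the same objects without the emotion.
neutral_stories = [
    [0.0, 1.0, 0.0],
    [0.0, -1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, -1.0],
]

loving_vector = difference_of_means(loving_stories, neutral_stories)
print(loving_vector)
```

In this toy case the extracted direction is stable, but whatever was stored on the object axes has averaged to zero, which is the sense of “structurally partial” above.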

Whether this is true — whether relatively non-relational (object-independent) emotion concepts and relational (object-directed) emotion concepts diverge in how cleanly they’re captured by the contrast set method — is an empirical question, as yet unanswered.

None of this is to say that the model fails to bind relational emotions to their objects. Attention mechanisms and conjunctions of multiple linear representations are plausible carriers of that binding. The questions I’m pressing are narrower: What might the contrast-set construction be missing in its effort to stabilize relational emotion concepts? What unique interpretive dangers might loom over relational emotion concepts, like loving, stripped of their object relationships?

Anthropic’s researchers name an even narrower version of the first of these concerns in the limitations section:

“Our entire approach assumes that emotion concepts are represented as linear directions in activation space. This assumption makes our analysis tractable, but in principle could miss important structure. Some emotional phenomena, particularly complex emotions that blend multiple simpler states, or the binding of emotional states to specific characters, may not be well captured by linear probes applied to the residual stream (they might, for instance, correspond to conjunctions or combinations of multiple linear representations, or to structures in the model’s key-value cache).”

“The binding of emotional states to specific characters” describes a specific concern: that the binding of an emotion to a character may not be the kind of structure a linear probe on a single direction can capture. I’m adding an analogous concern at the same level: object-directionality may be similar. If character-binding can be encoded in attention mechanisms or conjunctions rather than a single direction, so can object-binding. The bare loving vector would then be representing the emotion’s intensity but not its relational structure, for the same reason and through the same mechanism the paper flags for speaker-attribution.

Above and beyond this I add a concern for interpretation of relational emotion concepts, like love. For instance, the researchers interpret loving activation at the colon across scenarios as the Assistant having “a propensity to provide empathetic responses.” Empathetic from the model toward the user, presumably. The “from the model” part of that presumption is based on the token position’s tendency “to encode inferred or predicted Assistant emotions.” But the “toward the user” part is not a conclusion licensed by the contrast-set construction method. Speaker-locality does not entail object-directionality.

To understand what loving means, we need to know more about the object (or objects) of love.

Loving and Caring: Recommendations

Here are two experiments using the residual stream and probing methodology that Anthropic could run to better understand object-directionality and the loving emotion vector:

1. Test the bare vector against object-specific vectors

To test whether the linear extraction of loving is sensitive to which object the emotion is directed toward, construct contrast datasets in which the object of the emotion is held constant rather than varied. For each object, generate a set of stories in which characters experience loving directed at that object, varying the emotion-bearer and other contextual features as the original method does (e.g., 100 stories about characters loving a mother). Extract a loving-toward-mother vector using the same averaging procedure the paper applies to its original sets. Repeat across a range of objects spanning the kinds of variation in the original sets (persons, abstractions, ideas, activities, etc.).

Object-specific contrast sets will also carry object-context features unrelated to loving (kinship features for mother-stories, civic features for country-stories, etc.), so a matched baseline is needed. For each object, generate a parallel contrast set in which characters interact with the same object in non-loving ways and extract a corresponding object-context vector. The diagnostic comparison is then among the residuals (loving-toward-mother minus mother-context, loving-toward-country minus country-context, and so on) across objects. If the residuals cluster tightly and approximate the bare loving vector, the linear extraction of loving is largely insensitive to its object, and findings about the bare vector’s activation can be read similarly to object-neutral emotion concepts. If the residuals differ substantially from one another and/or from the bare vector, the bare vector is doing less explanatory work than the equivalent vector for a non-relational emotion, and some interpretive qualification is warranted.

Variation in how cleanly different object categories cluster, even after the context subtraction, would itself be informative: loving-toward-abstraction may differ from loving-toward-person in ways that reflect substantive features of the emotion, not just artifacts of the extraction procedure.
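The comparison in this first experiment could be scored roughly as follows. This is a hedged sketch, not a prescribed analysis: the vectors are made-up toy values, the function names are hypothetical, and cosine similarity is one reasonable choice of similarity measure among several.

```python
import math
from itertools import combinations

def subtract(a, b):
    """Component-wise a - b."""
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def residual_report(bare, object_vectors, context_vectors):
    """Subtract each object-context vector, then compare residuals
    to the bare vector and to one another."""
    residuals = {
        name: subtract(object_vectors[name], context_vectors[name])
        for name in object_vectors
    }
    to_bare = {name: cosine(r, bare) for name, r in residuals.items()}
    pairwise = {
        (a, b): cosine(residuals[a], residuals[b])
        for a, b in combinations(sorted(residuals), 2)
    }
    return to_bare, pairwise

# Toy vectors only; real ones would come from the residual stream.
bare_loving = [1.0, 0.0, 0.0]
loving_toward = {"mother": [2.0, 0.0, 5.0], "country": [3.0, 4.0, 2.0]}
object_context = {"mother": [0.0, 0.0, 5.0], "country": [0.0, 0.0, 2.0]}

to_bare, pairwise = residual_report(bare_loving, loving_toward, object_context)
print(to_bare)
print(pairwise)
```

Similarities near 1 in both tables would correspond to the “cluster tightly” outcome; low or widely spread values would correspond to the outcome that calls for interpretive qualification.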

2. Try object ablation on the loving vector

To test whether the magnitude of the loving vector’s activation depends, in part, on the referential object of love in contexts where loving is thematically present, generate stories in which a character experiences loving and, for each story, produce ablated versions in which the object of love is replaced. Ablations could have varying degrees of referential coherence: a customary object of love (“Maria loves her mother”), an unusual object (“Maria loves her toaster”), a softly incoherent object (“Maria loves her seventh prime number”), a logically contradictory object (“Maria loves her married bachelor”), a non-word object (“Maria loves her xqzm”), a non-referring placeholder (“Maria loves her [MASK]”), and/or removal of the object entirely. Extract activations at matched token positions across the intact and ablated versions and compare magnitudes.

The diagnostic question is whether the loving vector’s activation magnitude varies with how tractable the object is as a referent for the emotion. If activation is consistent across all conditions, the loving-as-such vector is largely insensitive to object-tractability. Loving-as-such would function similarly to an emotion-state vector for a non-relational emotion (e.g., calm). If activation changes meaningfully across conditions, the vector’s magnitude may partly be a function of referential structure. The shape of that change matters too: a smooth gradient across the conditions would suggest graded sensitivity to object tractability, while categorical drops (say, between unusual objects and softly incoherent objects) would suggest threshold sensitivities to object tractability. Parallel ablations for non-relational emotion vectors could offer further insight. In the non-word condition, for instance, they could help separate activation changes due to object sensitivity from those due to a general loss of semantic coherence.
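One way the ablation comparison might be tabulated is sketched below. The condition labels follow the list above, but the function names are hypothetical and every number is a placeholder invented only to exercise the code. None of it is a measurement or a predicted result.

```python
from statistics import mean

# Ablation conditions, ordered roughly by decreasing referential coherence.
CONDITIONS = ["customary", "unusual", "soft_incoherent", "contradictory",
              "non_word", "mask", "removed"]

def condition_means(scores):
    """Mean activation per condition; scores maps condition -> list of floats."""
    return {c: mean(scores[c]) for c in CONDITIONS}

def adjacent_steps(means):
    """Change in mean activation between neighboring conditions."""
    return [(a, b, means[b] - means[a])
            for a, b in zip(CONDITIONS, CONDITIONS[1:])]

def largest_step(means):
    """Neighboring pair with the biggest absolute change (a candidate threshold)."""
    return max(adjacent_steps(means), key=lambda step: abs(step[2]))

def control_adjusted(relational_means, control_means):
    """Change from the customary baseline, minus the same change
    for a non-relational control vector."""
    base = CONDITIONS[0]
    return {
        c: (relational_means[c] - relational_means[base])
           - (control_means[c] - control_means[base])
        for c in CONDITIONS
    }

# Placeholder activations (NOT data): two stories per condition.
loving_scores = {
    "customary": [1.0, 1.0], "unusual": [1.0, 0.5],
    "soft_incoherent": [0.25, 0.25], "contradictory": [0.25, 0.25],
    "non_word": [0.0, 0.0], "mask": [0.0, 0.0], "removed": [0.0, 0.0],
}
calm_scores = {c: [0.5, 0.5] for c in CONDITIONS}
calm_scores["non_word"] = [0.25, 0.25]  # control also dips on non-words

loving_means = condition_means(loving_scores)
calm_means = condition_means(calm_scores)
adjusted = control_adjusted(loving_means, calm_means)
print(largest_step(loving_means))
print(adjusted)
```

A single dominant step would point toward a threshold pattern and roughly equal steps toward a gradient. The control-adjusted values indicate how much of each change remains after discounting whatever the non-relational vector also loses.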

The interpretive payoff is local but substantive. What the diagnostic establishes, in the domain of loving-as-theme stories and dialogues, is whether the bare vector is functioning as a self-contained measure of loving-state, or whether some portion of its magnitude is responsive to the referential structure of the object. The paper’s cross-vector analyses treat the 171 vectors as commensurable geometric objects; this test would tell us whether they are interpretively equivalent relative to their relational structure. If relational emotion vectors turn out to be sensitive to object-tractability in ways that non-relational ones aren’t, that asymmetry warrants further work. A null result would suggest that the asymmetry doesn’t run as deep as it appears.