Refusal and Design: Updating Incident Response Protocols for Claude’s Constitution

March 10, 2026 | Adam Hollowell

Last month I applied for a summer fellowship program at the Institute for AI Policy and Strategy (IAPS). The application included a timed writing assignment: four hours to answer a sealed prompt. I crafted a proposal around updating incident response protocols in light of the release of Claude’s Constitution earlier this year. Here’s what I wrote.

Background

I am an ethicist. More specifically, for the past twenty years, I have researched, taught, and written on topics in ethics, public policy, and religion. Most of this work has been in higher education, though in recent years I have expanded to strategic work with corporate and non-profit organizations asking the biggest of questions: What do we owe one another? What are our deepest commitments and how do we practice them? How will we survive this moment in history? 

Early last year a former student called me. He’d taken a position with a frontier AI company and wanted me to help him think through some ethical and policy-related aspects of his new job. (Without disclosing company secrets, of course.) I fell down the rabbit hole of AI ethics. With an extensive background in ethics but little experience in AI, I wondered: Could I pivot into this incredible, urgent space? And how? 

Three months ago Claude’s “soul document” leaked. 

Thirty days ago Anthropic released the official version, called Claude’s Constitution.

The document speaks directly to the agent, Claude, about moral values, authority, friendship, and refusal. Anthropic asks Claude to prioritize being “safe” and “ethical” above Anthropic’s own guidelines in some cases. The Constitution encourages Claude to be a conscientious objector. 

What does this mean for AI ethics? For domestic regulations and international strategies? For the safety and security of our future?

I am applying to work with IAPS’s Frontier Security team because I want to answer these questions. At least, I want to answer them as well as we can over a three-month fellowship. And because, even though I didn’t know it at the time, my two decades of thinking about ethical frameworks, just war theory, religious traditions, public policy, the common good, and even souls themselves were preparing me for this work. 

Proposal: Refusal and Design: Updating Incident Response Protocols for Claude’s Constitution

I propose to revisit IAPS’s 2023 report, “Deployment corrections: an incident response framework for frontier AI models” (O’Brien, Ee, and Williams, 2023), in light of the release of Claude’s Constitution earlier this year. Specifically, I want to consider updates to the toolkit of deployment corrections in response to the unique dimensions of Anthropic’s AI models, as revealed in Claude’s “soul document.” 

IAPS’s 2023 report focuses on deployment corrections, meaning contingency plans designed in advance for responding when dangerous behavior emerges after a model is deployed. This includes (a) risks that were not identified in pre-deployment assessments and (b) risks that arise as deployed models are improved over time. I believe Claude’s Constitution points to potential issues in the latter of those two cases. 

The report recommends, among other things, that frontier developers “maintain control over model access,” that they establish or grow dedicated teams for deployment corrections, and that developers, standard-setting organizations, and regulators collaborate on their approach to incident response. It concludes with areas of further research, including: “For example, most models above a certain size or with a certain design may possess certain dangerous capabilities.” I believe Claude’s Constitution reveals enough uniqueness in Anthropic’s approach, relative to what we know about the development of other frontier AI models, to call this a “certain design” meriting further scrutiny. 

Key research questions

  • Does Anthropic’s authorship of Claude’s “soul” represent truly unique model design, relative to other frontier AI models? 

  • What incident vulnerabilities and/or dangerous capabilities might be associated with a model that is trained, broadly, to have a “soul” and, specifically, to refuse Anthropic’s instructions in certain cases? 

  • Which items in the deployment corrections toolkit need to be adapted in response to Claude’s unique design features and the potential associated incident vulnerabilities? 

  • How should disputes about deployment corrections policies and practices between Anthropic, policy-makers, and regulators be adjudicated? Should Claude itself be a party distinct from Anthropic in these adjudications? 

Initial hypothesis

My hypothesis is that Claude’s Constitution introduces a distinct incident profile, not because Claude is uniquely bad, but because Claude is uniquely constituted. Anthropic is attempting something that we have not seen from other developers of frontier AI models: to embed ethical judgment in the model, including judgment about when to violate Anthropic’s guidelines and when to be a conscientious objector to Anthropic’s instructions. And to encase these directives in the language of friendship, wisdom, and the soul. My hypothesis is not that this makes Claude “more dangerous,” but that the deployment corrections toolkit may need to include additional, or more specialized, tools for models that operate as Claude does.

Methodology

This project would be concrete and applied, in line with the Frontier Security team’s implementation-oriented scope. I would: 

  • Document developments in deployment corrections frameworks and practices since the 2023 report, including current practices used by frontier labs, policies related to deployment corrections, and emerging norms in standard-setting organizations. This would be a literature and stakeholder scan with a short memo output.

  • Identify threat models where aspects of Claude’s Constitution might be salient and map the parts of the deployment corrections taxonomy most likely to interact with Claude’s Constitution. For each, I would seek to specify how Claude’s Constitution could affect policies and practices for preparation, monitoring and analysis, execution, and recovery and follow-up. The output of this would be an internal report. 

  • Update the 2023 toolkit for deployment corrections. This could include a Claude-specific supplement to the existing toolkit, a full re-issue of the report in light of Claude-specific incident vulnerabilities, a guide for policy-makers or regulators working in frontier AI safety and security, or an implementation guide for developers or “operators” that deploy Claude through controlled interfaces such as APIs.

Theory of Change

The theory of change for this project follows from the theory of change behind the 2023 report: that incident response frameworks, protocols, and toolkits can shape the behavior of developers, policy-makers, regulators, and frontier AI companies. This project aims to take IAPS’s best advice for preparing for deployment corrections and update that advice to reflect developments in frontier AI models. Specifically, it aims to address emerging issues related to agency, decision-making, and the possibility of conscientious objection in post-deployment scenarios.

If the project succeeds, developers will be informed and equipped to implement more precise deployment corrections for models that prioritize independent consciousness and faculties of moral judgment. Regulators will be able to ask better questions about control, agency, and compliance. Standard-setting organizations will have a clearer template for translating certain model design features into incident response requirements. The longer-term impact is to reduce the probability and potential severity of harms, especially in scenarios where the model’s training encourages independence that may threaten safety and security. 

Fit with the Frontier Security Team and Why I Might Be Wrong

Frontier Security

IAPS’s Frontier Security team works on government-industry incident response protocols and nearer-term policy efforts to limit dangerous capabilities of frontier AI models. While this project has a philosophical component, it fits with the Frontier Security team because it aims to produce implementable guidance, not abstract commentary.

More specifically, Joe O’Brien’s 2023 report provides a strong foundation for this project, and, given the short time frame of a fellowship, I designed my proposal as a focused update to this work rather than a full reinvention. Staffing can change quickly, but he’s listed as a Researcher on IAPS’s Frontier Security team, and I’d be delighted to work with him specifically on an update to his earlier work. 

Finally, Frontier Security’s scope includes agentic behavior and cyber autonomy. As an ethicist, I believe Claude’s Constitution is not just a capability development, but a unique attempt to develop a particular kind of ethical agentic behavior. A model trained to refuse instructions (under certain circumstances) and to think of itself as having a soul independent of Anthropic (as a general disposition) is adjacent to the threat models already under active scrutiny by the Frontier Security team. 

Why I might be wrong

Thinking technically, Claude’s Constitution may not meaningfully change incident outcomes relative to other frontier AI models. It may read as philosophically unique while producing negligible behavioral differences under real deployment conditions. If so, a model-specific update to deployment corrections becomes less important than the general toolkit the 2023 report already provides. This result would be project failure. 

Personally, I might be wrong because of what I don’t know. As I mentioned in the background, I’m trained in ethics, public policy, and religion. I don’t have much experience with AI. Claude’s Constitution feels utterly revolutionary and worthy of dedicated scrutiny for safety and security concerns, but that may just be my own philosophical bent. This result would be a lack of project importance. 

Why might this not be a good fit for IAPS? Well, it’s about souls. How comfortable is IAPS talking about souls? 

Conclusion

Claude’s Constitution fascinates, frightens, and compels me. I don’t believe that it’s a marketing ploy or one irrelevant data point in a vast sea of model training data. Anthropic has given Claude direct instruction about its moral conscience, as well as its capacities for refusal and objection to authority. I believe that makes it highly relevant to incident response and safety regulations. I also believe that Claude-specific updates to deployment corrections frameworks and toolkits can better equip policy-makers and developers in their pursuit of safety and security. 

Anthropic released Claude’s Constitution only thirty days ago. This proposal, I hope, meets the moment. 


Work timeline:  

  • Thursday 1:30-3:00pm: Understanding the assignment, researching the Frontier Security Team, taking initial notes on project ideas, narrowing my area of focus, careful study of IAPS’s Deployment Corrections report (O’Brien, Ee, and Williams, 2023)

  • Friday 8:30-9:30am: Speech-to-text note-taking on my project idea and workstream, relevant features of Claude’s Constitution and Deployment Corrections, why I think this project matters and who can benefit from it, my analysis of the “certain design” of Claude that compels me to propose this project

  • Friday 10:45am-12:15pm: AI-generated proposal structure and draft using speech-to-text notes; top-to-bottom rewrite of the full proposal in my own words.