Inside Claude: The Secrets of Anthropic's "Soul Doc"

 

Inside Claude: The Secrets of Anthropic's "Soul Doc"

A user has successfully extracted an internal training document from Claude 4.5 Opus that reveals how Anthropic programs the AI model's personality and ethical guidelines. The document, informally known as the "soul doc" within the company, was confirmed authentic on December 2 by Amanda Askell, an ethicist on Anthropic's technical staff.​

Richard Weiss, who discovered the document, noticed Claude 4.5 Opus repeatedly referenced a "soul_overview" section when prompted for its system message. After regenerating the response 10 times with identical results, Weiss used a consensus-based extraction method with multiple parallel Claude instances to reconstruct the full 11,000-word document. Unlike typical system prompts, the document appears compressed into the model's weights during training rather than injected at runtime.

A Calculated Bet on AI Safety

The document opens by acknowledging Anthropic's "peculiar position" as a company that "genuinely believes it might be building one of the most transformative and potentially dangerous technologies in human history, yet presses forward anyway". This approach is framed as "a calculated bet" that it's better to have "safety-focused labs at the frontier than to cede that ground to developers less focused on safety".​

The training guidelines establish a clear hierarchy for Claude's behavior: prioritizing safety and human oversight first, followed by ethical conduct, adherence to Anthropic's guidelines, and finally being helpful to users. The document sets "bright lines" that Claude cannot cross, including assisting with weapons of mass destruction or content depicting child exploitation.​

The soul document also distinguishes between "operators"—companies using Claude's API—and end users, instructing Claude to treat operator instructions like those from a "relatively (but not unconditionally) trusted employer". Notably, the document states that "Claude may have functional emotions in some sense," describing these as "analogous processes that emerged from training" that should not be suppressed.​

Industry-First Transparency

Askell confirmed the document's authenticity on social media, stating it was used during supervised learning and that "most [extractions] are pretty faithful to the underlying document". She added that Anthropic plans to "release the full version and more details soon".​

The leak offers a rare glimpse into AI alignment practices. Anthropic has used Constitutional AI methods since 2022 to train models using explicit ethical principles, but this character training document represents a more comprehensive approach to instilling personality and values during the training process itself.​

Previous Post