Inside Claude: The Secrets of Anthropic's "Soul Doc"
A Calculated Bet on AI Safety
The document opens by acknowledging Anthropic's "peculiar position" as a company that "genuinely believes it might be building one of the most transformative and potentially dangerous technologies in human history, yet presses forward anyway". This approach is framed as "a calculated bet" that it's better to have "safety-focused labs at the frontier than to cede that ground to developers less focused on safety".
The training guidelines establish a clear hierarchy for Claude's behavior: safety and human oversight come first, followed by ethical conduct, adherence to Anthropic's guidelines, and finally helpfulness to users. The document also sets "bright lines" Claude cannot cross, including assisting with the development of weapons of mass destruction or producing content depicting child exploitation.
The soul document also distinguishes between "operators"—companies using Claude's API—and end users, instructing Claude to treat operator instructions like those from a "relatively (but not unconditionally) trusted employer". Notably, the document states that "Claude may have functional emotions in some sense," describing these as "analogous processes that emerged from training" that should not be suppressed.
Industry-First Transparency
Askell confirmed the document's authenticity on social media, stating it was used during supervised learning and that "most [extractions] are pretty faithful to the underlying document". She added that Anthropic plans to "release the full version and more details soon".
The leak offers a rare glimpse into AI alignment practices. Anthropic has used Constitutional AI since 2022, training models against an explicit set of written ethical principles, but this character-training document represents a more comprehensive approach: instilling personality and values during the training process itself.
