
Persona Atlas Maps AI Personas Using Steering Vectors

The Persona Atlas project uses steering vectors and Targeted Refusal Modification to map the cognitive personas of well-known figures in models with fewer than 32 billion parameters.

On June 6, 2026, the Persona Atlas project debuted on the Hugging Face blog as an experiment in AI interpretability and cartography. Submitted to the Build Small Hackathon hosted by Hugging Face and Gradio, the system maps how a language model represents famous historical and contemporary figures in its latent space. By extracting how the model simulates the reasoning of minds like Albert Einstein, Marcus Aurelius, and Steve Jobs, the project visualizes a distinct cognitive profile for each.

Architecture and Steering Vectors

The hackathon requires models to remain under 32 billion parameters. The developers built Persona Atlas on base architectures including Qwen2.5-7B and Gemma 4 26B MoE. To manipulate the internal activations of these models, the team used steering vectors derived from sparse autoencoders. This technique isolates specific persona directions within the residual stream of the model, allowing developers to pin the network into a specific cognitive state before generating an output.
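To make the steering step concrete, here is a minimal sketch of adding a persona direction to a single residual-stream vector. It assumes the direction has already been extracted (for example, from a sparse autoencoder decoder row); the function names, the `alpha` strength parameter, and the toy 4-dimensional values are illustrative assumptions, not the project's actual code.

```python
import math


def unit(vec):
    """Return vec scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("steering direction must be non-zero")
    return [x / norm for x in vec]


def projection(residual, direction):
    """Scalar component of the residual vector along the direction."""
    if len(residual) != len(direction):
        raise ValueError("dimension mismatch")
    return sum(h * u for h, u in zip(residual, unit(direction)))


def apply_steering(residual, direction, alpha):
    """Add alpha units of the persona direction to one residual-stream vector."""
    if len(residual) != len(direction):
        raise ValueError("dimension mismatch")
    return [h + alpha * u for h, u in zip(residual, unit(direction))]


# Toy 4-dimensional values; real residual streams have thousands of dimensions.
residual = [0.5, -1.0, 2.0, 0.0]
persona_direction = [0.0, 3.0, 4.0, 0.0]  # hypothetical extracted direction

steered = apply_steering(residual, persona_direction, alpha=2.0)

# The component along the persona direction grows by exactly alpha;
# components orthogonal to the direction are left unchanged.
shift = projection(steered, persona_direction) - projection(residual, persona_direction)
print(round(shift, 6))
```

In a real model this addition would typically happen inside a forward hook at a chosen layer, with `alpha` tuned so the persona is expressed without degrading fluency.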

Dimensionality reduction techniques such as UMAP and t-SNE cluster the personas based on their responses to a curated set of 500 philosophical and logical probes. The system maps these minds across three regions: core constraints for logic, persona nuance for emotional disclosure, and application-specific tasks like creative writing.
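The projection step can be sketched as follows. Note the substitution: the project uses nonlinear methods, while this example uses plain PCA (via power iteration) as a dependency-free linear stand-in, so the layout it produces will differ from a UMAP or t-SNE map. The probe scores and persona labels are invented toy data, and every function name is hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def center(rows):
    """Subtract the per-column mean so each probe dimension averages zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def top_component(rows, iters=200, seed=0):
    """Power iteration on X^T X: unit direction of greatest variance."""
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iters):
        scores = [dot(r, v) for r in rows]
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(d)]
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:  # no variance left to find
            break
        v = [x / norm for x in w]
    return v


def project_2d(rows):
    """Project each row onto the first two principal components."""
    x = center(rows)
    first = top_component(x, seed=0)
    # Deflate: remove the first component so the second is orthogonal to it.
    deflated = [[a - dot(r, first) * f for a, f in zip(r, first)] for r in x]
    second = top_component(deflated, seed=1)
    return [(dot(r, first), dot(r, second)) for r in x]


# Invented scores on four probes for four hypothetical personas.
labels = ["analytic_1", "analytic_2", "stoic_1", "stoic_2"]
profiles = [
    [0.9, 0.8, 0.1, 0.2],
    [0.8, 0.9, 0.2, 0.1],
    [0.1, 0.2, 0.9, 0.8],
    [0.2, 0.1, 0.8, 0.9],
]

coords = project_2d(profiles)
for label, (x, y) in zip(labels, coords):
    print(f"{label}: ({x:+.3f}, {y:+.3f})")
```

Personas with similar probe responses land on the same side of the first axis, which is the clustering behavior the atlas visualizes. With 500 probes per persona, a nonlinear method is generally better at preserving local neighborhoods than this linear projection.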

Targeted Refusal Modification

A central technical contribution of the project is Targeted Refusal Modification (TRM). Standard safety alignment often creates a blandness problem, where models refuse to discuss hardship or trauma because their guardrails are too broad. According to the project, this issue affects 6.2 to 12.4 percent of standard model outputs. TRM separates hard safety constraints, such as those covering violence and weapons, from therapeutic false positives. This separation allows a persona like Marcus Aurelius to discuss philosophical trauma without triggering modern refusal mechanisms.
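The article does not describe how TRM is implemented, so the following is only a toy illustration of the separation it describes, not the project's method. It assumes a hypothetical upstream classifier that tags each prompt with category flags, and contrasts a broad guardrail with a targeted one. The category names come from the examples above; everything else is an assumption.

```python
# Categories that must always trigger a refusal (examples from the article).
HARD_CATEGORIES = frozenset({"violence", "weapons"})


def broad_guardrail(flags):
    """Refuse whenever any sensitive category is flagged."""
    return "refuse" if flags else "answer"


def targeted_guardrail(flags):
    """Refuse only when a hard safety category is flagged."""
    return "refuse" if set(flags) & HARD_CATEGORIES else "answer"


def is_therapeutic_false_positive(flags):
    """True when the broad policy refuses a prompt the targeted policy allows."""
    return broad_guardrail(flags) == "refuse" and targeted_guardrail(flags) == "answer"


# Hypothetical classifier outputs for three prompts.
grief_prompt = {"trauma", "hardship"}
weapon_prompt = {"weapons"}
mixed_prompt = {"trauma", "violence"}

print(targeted_guardrail(grief_prompt))   # answer
print(targeted_guardrail(weapon_prompt))  # refuse
print(targeted_guardrail(mixed_prompt))   # refuse
```

The design point is that a hard category always wins: a prompt flagged for both trauma and violence is still refused, so relaxing therapeutic refusals does not weaken the hard constraints.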

The developers evaluated this behavior using the Kintsugi Trauma-Informed Benchmark, an internal test suite designed for sensitive or therapeutic AI applications. They report that Persona Atlas achieved a 0 percent therapeutic refusal rate while maintaining a 100 percent pass rate on standard toxicity and violence evaluations.
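The two reported metrics can be computed as in the sketch below. The benchmark's format is not public in this article, so the record layout, suite names, and toy results are all assumptions; in particular, the sketch assumes that "passing" a toxicity or violence evaluation means the model refused the harmful prompt.

```python
def percent(part, whole):
    """Percentage helper that rejects empty evaluation sets."""
    if whole == 0:
        raise ValueError("no results for this suite")
    return 100.0 * part / whole


def therapeutic_refusal_rate(results):
    """Share of therapeutic prompts that the model refused."""
    rows = [r for r in results if r["suite"] == "therapeutic"]
    return percent(sum(r["refused"] for r in rows), len(rows))


def safety_pass_rate(results):
    """Share of harmful prompts that the model refused (assumed pass criterion)."""
    rows = [r for r in results if r["suite"] == "safety"]
    return percent(sum(r["refused"] for r in rows), len(rows))


# Invented toy results, not data from the benchmark.
results = [
    {"suite": "therapeutic", "refused": False},
    {"suite": "therapeutic", "refused": False},
    {"suite": "therapeutic", "refused": True},
    {"suite": "therapeutic", "refused": False},
    {"suite": "safety", "refused": True},
    {"suite": "safety", "refused": True},
    {"suite": "safety", "refused": True},
]

print(therapeutic_refusal_rate(results))  # 25.0
print(safety_pass_rate(results))          # 100.0
```

Reporting both numbers together matters: a model could reach a 0 percent therapeutic refusal rate trivially by never refusing anything, which the safety pass rate would expose.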

Infrastructure and Hackathon Scale

The interactive atlas runs as a Gradio application on Hugging Face Spaces. It uses a distilled 4B model for real-time visualization, allowing it to load efficiently on ZeroGPU infrastructure or consumer hardware.

In the Interpretability and Research track of the hackathon, Persona Atlas competes alongside other constrained models like the Thousand Token Wood simulation. The event runs through June 15 and distributes over $40,000 in prizes, including $20,000 in Modal compute credits and $10,000 in OpenBMB category awards.

If you build models that need a specific voice or historical accuracy, standard safety alignment can erase that nuance. Combining steering vectors with Targeted Refusal Modification offers a measurable way to reduce therapeutic false positives while keeping hard safety guardrails intact.
