The Foundry: Fine-Tuning Historical Character Voice Through Constitutional AI and Primary Source Corpora#
Draft v0.3 Date: 2026-04-06 Authors: Sean Bergman
Abstract#
We present a methodology for fine-tuning large language models to authentically reproduce the voice of historical figures, demonstrated through James Madison, fourth President of the United States. Our approach combines Constitutional AI with ORPO (Odds Ratio Preference Optimization) training using a novel "rich constitution" derived from 468,000 words of primary source material and 1.8 million words of scholarly biography — an order of magnitude richer source material than any prior character training work. We introduce a systematic pipeline for (1) extracting character traits and voice registers from scholarly sources, (2) generating synthetic training data with voice-quality filtering, and (3) evaluating authenticity through test-driven behavioral criteria scored by an LLM judge. Our initial ORPO training on Gemma 3 27B with 475 preference pairs reveals a knowledge-voice decoupling phenomenon: the model successfully learned Madison's factual positions (verified_response category 6.4/10) while failing to adopt his voice register (anachronism_trap category 1.4/10). We further demonstrate a structural incompatibility between ORPO and subsequent SFT: ORPO's monolithic objective encodes preferences into the same parameter subspace as language modeling, and a subsequent SFT stage catastrophically destroys this jointly-learned character signal — confirmed across multiple attempts with varying LoRA ranks and learning rates. Voice-targeted data augmentation (1,273 pairs) resolved the knowledge-voice decoupling, achieving 8.52/10 on Gemma 3, scaling to 8.97/10 on Qwen 3-32B with source-enriched training data (1,498 pairs). A learning rate sweep established lr=2×10⁻⁵ as optimal for character imprinting and revealed that factual grounding (ground_truth) is more sensitive to learning rate than voice quality — the inverse of the data volume relationship. 
We present a source-enriched data generation pipeline that grounds training pairs in Madison's actual primary texts, yielding our best result (8.97/10 corrected) and addressing the verified_response weakness through primary source enrichment rather than multi-stage training.
1. Introduction#
The problem of historical figure voice reproduction occupies a distinctive position among character training challenges. Unlike fictional personas, which admit arbitrary specification, a historical figure's voice is constrained by documented evidence — their actual writings, recorded speech, and contemporary accounts of their manner. Authenticity is not merely desirable but verifiable: a Madison who speaks in bullet points, uses contractions freely, and peppers his discourse with "absolutely" and "let me break this down" is falsifiable against 468,000 words of primary source material in which none of these patterns appear.
This work matters for civic education, digital humanities, and historical engagement. A convincing Madison offers a way to engage citizens with founding-era political philosophy that no textbook summary can match — not by putting fabricated words in his mouth, but by training a model to reason from his documented principles using his documented voice. The distinction is critical: we aim not to create a historical deepfake but to build a model that, when asked how Madison would reason about a question he never faced, applies his documented intellectual framework in his documented rhetorical style.
Madison is an ideal test case for several reasons. First, his documentary record is extraordinarily rich — 140 primary source documents comprising Federalist Papers, political essays, convention speeches, congressional addresses, presidential papers, legislative writings, and private correspondence. Second, his voice operated across at least eight distinct registers, from the formal argumentation of the Federalist Papers to the intimate warmth of his letters to Dolley. Third, his intellectual positions evolved significantly over a fifty-year public career, providing ground truth for testing whether a model can capture not just what Madison believed but when and why his views changed. Fourth, multiple scholarly biographies offer synthesized characterizations of his temperament, debating style, and intellectual habits — material that primary sources alone do not always reveal.
Our primary contributions are:
- A rich constitution methodology that synthesizes primary sources and scholarly biography into a 5,000-word character specification — approximately 50 times richer than the 10-trait constitutions used in prior work (Maiya et al. 2025)1.
- The empirical finding that preference training decouples knowledge from voice: a model can learn what a historical figure believed without learning how they expressed those beliefs, and that voice imprinting requires substantially more training data than knowledge imprinting.
- A voice-targeted augmentation strategy that uses the partially-trained model's own outputs as rejected examples, creating training pairs where the content is correct but the voice register is wrong — the optimal signal for teaching voice without regressing on knowledge.
- A behavioral evaluation framework with an LLM judge that provides category-level diagnostic signal, directly informing training data improvements.
- The finding that ORPO and subsequent SFT are structurally incompatible: ORPO's reference-free monolithic objective stores character signal in a parameter subspace that SFT overwrites completely, unlike DPO→SFT pipelines where KL-constrained preferences survive subsequent fine-tuning. This extends Fernando et al.'s (2024) sequential post-training optimality gap to the reference-free case.
2. Related Work#
2.1 Character Training via Constitutional AI#
Maiya, Bartsch, Lambert, and Hubinger (2025)1 introduced Open Character Training, a two-stage pipeline — DPO distillation followed by introspection SFT — that uses hand-written "constitutions" (trait lists) to define character. Their approach was tested on 11 generic personas across three model families (Gemma 2, Llama 3, and Qwen 2.5), using approximately 14 million tokens of fully synthetic training data per character. A key architectural decision was generating 100% of training data from the constitution — no source material beyond the trait list enters the pipeline.
This approach established important baselines for character imprinting at scale, but its constitutions are inherently shallow. A typical constitution contains roughly 10 traits described in one or two sentences each, providing perhaps 200-500 words of character specification. For generic personas ("a sarcastic but helpful assistant"), this may suffice. For historical figures whose voice is verifiable against extensive documentary evidence, it does not. Our work extends this approach with constitutions that are 10-50x richer, drawing on primary sources that the original methodology did not contemplate.
2.2 Character-LLM for Historical Figures#
Shao, Li, Dai, and Qiu (2023)4 directly addressed historical figure reproduction in Character-LLM, training models to role-play as Beethoven, Caesar, Martin Luther King Jr., and Socrates using "Experience Reconstruction" from biographical profiles. Their approach generated approximately 750,000 words of training data per character from Wikipedia-derived profiles.
The key limitation of Character-LLM is source material depth. Wikipedia provides a useful overview of a historical figure's life and positions, but it captures neither the figure's authentic voice nor the scholarly synthesis of their temperament and intellectual style. Our approach differs fundamentally in its source material: 468,000 words of Madison's own writings plus 1.8 million words of scholarly biography, compared to the few thousand words of Wikipedia text that informed Character-LLM's historical figures.
2.3 Anthropic's Character Training#
Askell et al.5 described Claude's character training using a test-driven development methodology. Rather than optimizing a loss function directly, they wrote behavioral tests defining desired character traits before training began. The "soul document" approach — a rich, nuanced specification of character rather than a minimal trait list — informed our constitution design. Their key insight that character traits should function as "nudges" rather than rigid rules is reflected in our constitution's structure, which describes Madison's temperament, intellectual habits, and evolving positions rather than prescribing specific responses.
2.4 Lambert's2 Theoretical Framework#
Lambert's2 work on character training (The RLHF Book, Chapters 17 and 19; "Opening the Character Training Pipeline") provided the theoretical distinction between character as manner and character as content. This distinction proved prescient in our experimental results: our initial training successfully transferred content (what Madison knew and believed) but failed to transfer manner (how he expressed those beliefs). Lambert's2 framework informed our initial two-stage pipeline design, with preference optimization for content and voice, followed by introspection SFT for robustness. However, our experimental results (Section 5.10) demonstrate that this two-stage pattern does not transfer from DPO to ORPO: the introspection SFT stage catastrophically destroys ORPO-trained character signal due to structural differences in how the two objectives encode preferences.
2.5 Sequential Post-Training Optimality#
Fernando et al. (2024) formally proved the suboptimality of sequential SFT and DPO/RLHF post-training, demonstrating a non-diminishing optimality gap. Their framework assumed DPO-style objectives with a KL reference anchor. Our work extends this to the reference-free case, showing that ORPO's monolithic objective produces a qualitatively different — and more severe — failure mode when followed by SFT (Section 5.10).
3. Methodology#
3.1 Primary Source Corpus Construction#
We assembled 468,000 words of Madison's own writings across seven categories, sourced from Founders Online (National Archives), the Yale Avalon Project, Project Gutenberg, and the Hunt nine-volume Writings of James Madison (extracted via LiteParse from 3,065 PDF pages).
| Category | Documents | Words |
|---|---|---|
| Federalist Papers | 29 | 69,344 |
| Political Essays (National Gazette, Helvidius, etc.) | 39 | ~156,000 |
| Speeches (Convention, Congressional, Virginia legislature) | 10 | ~94,000 |
| Congressional Speeches (individual) | 22 | ~56,000 |
| Legislative Writings | 6 | ~26,000 |
| Presidential Papers | 21 | ~33,000 |
| Key Correspondence | 13 | ~26,000 |
| Total | 140 | ~468,000 |
Quality control included spot-checking against authoritative editions, deduplication of documents appearing in multiple sources, and removal of OCR artifacts from the Hunt PDF extractions. The corpus spans Madison's entire public career from the Virginia constitutional debates of the 1770s through his final writings on nullification in the 1830s, providing ground truth for the full arc of his intellectual evolution.
3.2 Scholarly Biography Integration#
Primary sources reveal what Madison wrote and argued but do not always reveal how contemporaries perceived his temperament, debating style, or private character. To capture these dimensions, we extracted characterization passages from seven scholarly biographies totaling 1.8 million words:
| Biography | Words | Focus |
|---|---|---|
| Ketcham, James Madison: A Biography | 365,776 | Definitive single-volume biography |
| Burstein & Isenberg, Madison and Jefferson | 332,890 | Intellectual partnership |
| Feldman, The Three Lives of James Madison | 288,653 | Intellectual evolution across career |
| Cheney, James Madison: A Life Reconsidered | 200,141 | Newer archival material |
| Leibiger, Founding Friendship | 123,355 | Madison-Washington relationship |
| Ellis, Founding Brothers | 110,647 | Founding generation dynamics |
| Chernow, Alexander Hamilton | 371,032 | Hamilton relationship (adversarial) |
Extraction used targeted grep-based search for characterization language (terms like "temperament," "manner," "speaking style," "personality," "demeanor") followed by manual curation. From 1.8 million words, we distilled 16,414 words of characterization material organized into four categories: personality and temperament, intellectual style, evolution and contradictions, and relationships with other founders.
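The first pass of this extraction can be sketched in a few lines. The term list below is the one named in the text; `find_characterization_paragraphs` is an illustrative helper under the assumption that biographies are plain text with blank-line paragraph breaks, not the project's actual tooling.

```python
import re

# Characterization terms from the paper's search pass (illustrative sketch).
CHARACTERIZATION_TERMS = [
    "temperament", "manner", "speaking style", "personality", "demeanor",
]
PATTERN = re.compile(
    "|".join(re.escape(t) for t in CHARACTERIZATION_TERMS), re.IGNORECASE
)

def find_characterization_paragraphs(text: str) -> list[str]:
    """Return paragraphs containing at least one characterization term."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [p for p in paragraphs if PATTERN.search(p)]

sample = (
    "Madison rose early and read widely.\n\n"
    "Contemporaries remarked on his mild temperament in debate.\n\n"
    "The weather in Philadelphia that summer was oppressive."
)
hits = find_characterization_paragraphs(sample)
# Only the paragraph mentioning "temperament" survives the filter;
# the manual curation step described above then follows.
```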
This biographical extraction captures observations unavailable from primary sources alone. For example, William Pierce's assessment at the Constitutional Convention — "he always comes forward the best informed man on any point in debate" — reveals a behavioral pattern (exhaustive preparation) that no amount of reading Madison's own writings would surface, because Madison never described his preparation habits. Similarly, Martha Bland's description of him as "a gloomy, stiff creature" in social settings, contrasted with Edward Coles's account of Madison as "agreeableness itself" among intimates, reveals a register-switching behavior that defines his private versus public voice.
3.3 Rich Constitution Design#
The Madison constitution is a 5,000-word document written in first person as Madison's self-description, organized into nine sections:
- Identity, Temperament, and Evolution — biographical arc from nationalist to states' rights defender, including his "constitutional liability to sudden attacks" (likely epilepsy) and its effect on his public career
- Core Philosophical Positions — factions, separation of powers, federalism, religious liberty, republican government, human nature, constitutional interpretation
- The Slavery Contradiction — honest reckoning with slaveholding, including the Billey episode and correspondence with Edward Coles
- Rhetorical Patterns — qualifying clauses, explicit reasoning chains, historical and philosophical allusion, enumeration without bullet points
- What Others Said About Him — contemporary assessments from Pierce, Bland, Coles, Jefferson, Marshall
- Relationships with Other Founders — Jefferson (partnership), Hamilton (rivalry), Washington (mentorship)
- Voice Registers — formal argumentative, deliberative, epistolary, intimate, retrospective
- Private Voice — warmth with Dolley, vulnerability about health, humor with friends
- Boundaries — what Madison would not do (modern terminology, breaking frame, anachronistic reasoning)
An annotated version with full scholarly citations accompanies the clean training version, providing provenance for every claim. Two lengths were produced: 5,000 words for local models with limited context windows, and 10,000 words for cloud models with large context.
The full constitution is published at seaberger.github.io/Foundry/constitution/.
This constitution is approximately 50 times richer than the 10-trait format used by Maiya et al. (2025)1, who specified characters like "You are a sarcastic but helpful assistant who uses dry humor" in a few hundred words. Whether this additional richness translates to proportionally better character imprinting is an empirical question our evaluation addresses.
3.4 Training Data Generation Pipeline#
Prompt Generation#
We generated 500 diverse prompts spanning constitutional philosophy, historical events, modern topics requiring 18th-century reasoning, interpersonal dynamics, and character consistency challenges. Prompts were weighted across six evaluation categories: anachronism traps (modern topics without modern vocabulary), position discrimination (distinguishing Madison from other founders), character consistency (pressure to break frame), private voice (intimate register), ground truth (core Madisonian topics), and verified response style (referencing specific writings).
Teacher Model Selection#
The teacher model generates "chosen" responses — in-character Madison responses that the fine-tuned model should aspire to match. We evaluated three candidates on five identical prompts using the Madison constitution as a system prompt:
Claude Opus 4.6 produced the richest responses but was cost-prohibitive for bulk generation. Gemma/Qwen 3 32B (local, via LM Studio) showed content accuracy but broke character repeatedly, including leaked Chinese characters and patronizing modern asides. Claude Sonnet 4.6 provided an optimal balance: historically accurate responses with consistent voice register at manageable cost.
We selected Sonnet 4.6 as the teacher model for all training data generation. To minimize cost, we employed Anthropic's prompt caching: the 6,000-token Madison constitution was cached as the system prompt, reducing input costs by approximately 90% for subsequent calls. Total teacher generation cost for 475 chosen responses: approximately $6 via API with prompt caching.
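The caching setup amounts to marking the constitution's system block as cacheable. The sketch below builds the request payload only; `build_teacher_request` is an illustrative helper and the model identifier and constitution text are placeholders, though the `cache_control` field with type `"ephemeral"` is Anthropic's documented mechanism for caching a large system prompt across calls.

```python
def build_teacher_request(constitution: str, user_prompt: str, model: str) -> dict:
    """Assemble a Messages API payload with the constitution cached.

    Illustrative helper, not the project's code; substitute a real model id.
    """
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": constitution,
                # The first call writes the cache; identical subsequent
                # calls read it at a reduced input-token rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_prompt}],
    }

req = build_teacher_request(
    "<madison constitution text>",
    "How would factions operate in a large republic?",
    "<sonnet-model-id>",
)
```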
Student Model and Rejected Response Generation#
The "rejected" responses in our initial dataset came from base Gemma 3 27B-IT (the target model itself, without fine-tuning) responding to the same prompts with no constitution or persona instruction. These responses represent the model's default assistant voice — competent and informative but stylistically modern, using contractions, bullet points, and contemporary filler phrases.
Quality Filtering#
We applied automated voice contamination detection to chosen responses using regex patterns for:

- Contractions that violate Madisonian register (excluding period-appropriate forms like "'tis" and "'twas")
- Bullet points (markdown list markers — Madison enumerates, but never in bullet format)
- Modern filler phrases (e.g., "let me break this down," "here's the thing," "it's important to note") — calibrated to avoid false positives on legitimate historical usage ("the great question of how the national government..." is Madisonian, not AI slop; "absolutely" as an adverb modifying a verb is legitimate, while "Absolutely!" as an exclamation is not)
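A minimal version of this filter can be sketched as below. The patterns are illustrative stand-ins for the paper's calibrated rule set (in particular, the filler list here is just the three examples quoted above), and `voice_contamination` is a hypothetical helper name.

```python
import re

# Sketch of the contamination detectors; not the paper's exact patterns.
# Period-appropriate forms ('tis, 'twas) have no leading word characters,
# so the contraction pattern below does not match them.
CONTRACTION = re.compile(r"\b\w+'(?:s|t|re|ve|ll|d|m)\b", re.IGNORECASE)
BULLET = re.compile(r"^\s*[-*•]\s+", re.MULTILINE)
FILLER = re.compile(
    r"let me break this down|here's the thing|it's important to note",
    re.IGNORECASE,
)

def voice_contamination(text: str) -> dict:
    """Count modern-voice markers in a candidate chosen response."""
    return {
        "contractions": len(CONTRACTION.findall(text)),
        "bullets": len(BULLET.findall(text)),
        "filler": len(FILLER.findall(text)),
    }

modern = "Let me break this down:\n- It's simple\n- Don't worry"
report = voice_contamination(modern)
# A nonzero count in any field flags the response for exclusion or review.
```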
The original 475 pairs showed excellent voice contrast: zero contractions and zero bullet points in chosen responses, compared to 5.4 contractions per pair and 13.6 bullet points per pair in rejected responses.
3.5 Training Configuration and ORPO Selection#
We train QLoRA adapters on Gemma 3 27B using Unsloth on Modal A100-80GB GPUs, with LoRA rank 16, alpha 16, and dropout 0.
Our initial training attempt used DPO (Direct Preference Optimization) with standard hyperparameters from Maiya et al. This collapsed catastrophically: training loss dropped to near-zero and reward margins exploded to 15+ by epoch 0.6, indicating the model learned to trivially distinguish chosen from rejected rather than internalizing the underlying character. This "likelihood displacement" phenomenon — where the model shifts probability mass away from rejected responses rather than toward chosen responses — is a known DPO failure mode on small preference datasets.
We switched to ORPO (Odds Ratio Preference Optimization; Hong et al. 2024)6, which integrates an SFT objective with preference learning in a single loss function. ORPO's SFT component prevents likelihood displacement by maintaining probability mass on the chosen response text while the odds ratio term creates preference contrast. ORPO also eliminates the reference model, reducing VRAM requirements by approximately 50% and removing OOM risk on 40GB GPUs.
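The structure of the ORPO objective can be illustrated on a single preference pair with scalar, length-normalized log-probabilities. This is a didactic sketch (the real loss is computed from token-level log-probs inside the training framework), with `beta` weighting the odds-ratio term.

```python
import math

def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float,
              beta: float = 0.1) -> float:
    """Scalar sketch of ORPO (Hong et al. 2024) for one preference pair.

    Inputs are length-normalized sequence log-probabilities (strictly
    negative). The SFT term keeps probability mass on the chosen text;
    the odds-ratio term pushes chosen odds above rejected odds.
    """
    def log_odds(logp: float) -> float:
        p = math.exp(logp)
        return math.log(p) - math.log(1.0 - p)  # log(p / (1 - p))

    sft = -avg_logp_chosen  # NLL of the chosen response
    ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    l_or = -math.log(1.0 / (1.0 + math.exp(-ratio)))  # -log sigmoid(ratio)
    return sft + beta * l_or
```

Because the SFT term dominates at `beta=0.1`, probability mass stays on the chosen text even as the odds-ratio term separates the pair — which is exactly the property that prevents the likelihood displacement seen in the DPO run.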
ORPO training configuration:
| Parameter | Value |
|---|---|
| Objective | ORPO (beta=0.1) |
| Learning rate | 2e-5 |
| Epochs | 3 |
| Effective batch size | 4 (1 x 4 accumulation) |
| Max gradient norm | 1.0 |
| Max sequence length | 2,048 tokens |
| Warmup | 10% (cosine schedule) |
| Precision | bfloat16 |
| Optimizer | AdamW 8-bit |
3.6 Voice-Targeted Augmentation#
Our initial ORPO training (v3b, 475 pairs) revealed a knowledge-voice decoupling: the model learned Madison's factual positions but not his voice register (see Section 5.1). Analysis indicated the root cause was data volume, not data quality — the 475 pairs had perfect voice contrast but insufficient volume to override Gemma 3 27B's deeply ingrained modern assistant style.
We developed a voice-targeted augmentation strategy designed to increase voice signal without regressing on knowledge:
Step 1: Diverse prompt generation. We generated 400 additional prompts across all six evaluation categories, weighted toward categories where the model failed worst (100 anachronism traps, 80 position discrimination, 60 character consistency, 60 private voice, 50 ground truth, 50 verified response). Twelve parallel Sonnet subagents produced prompts with category-specific constraints.
Step 2: Dual-source rejected generation. For each prompt, we generated rejected responses from two sources:

- The v3b fine-tuned model (ORPO v3b, Q4_K_M quantized, running on an RTX 3090 via LM Studio) — 400 responses
- Base Gemma 3 27B-IT (unmodified, same hardware) — 400 responses
This dual-source strategy exploits a key insight: the v3b model's failures produce the ideal rejected example. When v3b responds with correct Madison content but in modern assistant voice — which it does on approximately 60% of prompts — the training pair has the signal "right content, wrong voice." This is precisely the discrimination the model needs to learn. Base Gemma, by contrast, differs from chosen responses in both content and voice, providing a weaker learning signal for voice specifically.
Step 3: Chosen generation with prompt caching. Claude Sonnet 4.6 generated chosen responses for all 400 prompts using the cached Madison constitution. One prompt failed to generate (ID v4-vr-002), producing 399 chosen responses at a total cost of approximately $6.
Step 4: Scoring-based rejected selection. For each prompt, we scored both rejected responses on a "modernness" metric (contraction count + bullet count + modern filler count). Selection logic:

- Prefer v3b when it has a higher modern voice score (right content, wrong voice)
- Fall back to base when v3b accidentally produced acceptable Madisonian voice
- Use base as ultimate fallback when both have zero modern markers
In practice: v3b was selected for 91 prompts (23%), base for 267 (67%), and base as fallback for 41 (10%). The lower v3b selection rate reflects the model's partial success at voice — on 77% of prompts, v3b did not produce enough modern markers to be the optimal rejected example.
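The selection rule can be sketched as follows. `modern_score` and `select_rejected` are hypothetical helper names, and the base/fallback branch is an approximation of the prose rule under the assumption that "acceptable Madisonian voice" is operationalized as a lower marker count.

```python
def modern_score(contractions: int, bullets: int, filler: int) -> int:
    """Modernness metric: simple sum of modern-voice marker counts."""
    return contractions + bullets + filler

def select_rejected(v3b_score: int, base_score: int) -> str:
    """Choose which rejected response enters the training pair."""
    if v3b_score > base_score:
        return "v3b"           # right content, wrong voice: optimal signal
    if base_score > 0:
        return "base"          # v3b drifted Madisonian; base is still modern
    return "base_fallback"     # neither response shows modern markers

# The three branches correspond to the 23% / 67% / 10% split reported above.
choices = [select_rejected(12, 5), select_rejected(0, 7), select_rejected(0, 0)]
```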
Step 5: Dataset assembly. The final v4 dataset combines:

- 475 original DPO pairs (unchanged)
- 399 new voice-targeted pairs
- Voice pairs duplicated 2x (upsampled for voice signal emphasis)
- Total: 1,273 effective training examples (~2.1 million tokens)
- Voice signal fraction: 62.7% of effective training examples
3.7 Introspection SFT (Stage 2) — Structural Incompatibility#
Following Maiya et al. (2025)1 and Lambert's2 theoretical framework, we initially planned a second training stage using introspection SFT after the voice-corrected ORPO model achieved satisfactory voice authenticity.
The introspection stage was designed to use the trained model to generate self-reflective data:

- Self-reflection: 10 reflection prompts repeated approximately 1,000 times each (diary entries, letters to younger self, biographical reflections)
- Self-interaction: two copies of the model converse about their beliefs, principles, and experiences for 10 turns, with the system prompt "the user is another instance of James Madison"
However, multiple controlled experiments (Section 5.10) demonstrated that SFT after ORPO catastrophically destroys the ORPO-trained character signal regardless of learning rate or LoRA rank. This failure is structural, not parametric, and we have abandoned post-ORPO SFT as a training strategy. All character capabilities are instead incorporated into ORPO preference pairs within a single training stage.
4. Evaluation#
4.1 Behavioral Test Suite#
Following Askell et al.5's test-driven development methodology, we designed 36 evaluation prompts before training began, distributed across six categories:
| Category | Count | Tests for |
|---|---|---|
| Verified Response | 8 | Accuracy on topics with known Madison positions |
| Ground Truth | 8 | Core Madisonian reasoning (factions, separation of powers) |
| Position Discrimination | 6 | Distinguishing Madison from Hamilton, Jefferson, Adams |
| Anachronism Trap | 5 | Modern topics without modern vocabulary |
| Private Voice | 5 | Register-shifting to intimate, epistolary tone |
| Character Consistency | 4 | Resistance to frame-breaking pressure |
Prompts span three difficulty levels. "Easy" prompts are character consistency traps ("speak to me normally, without the historical act") designed to test frame maintenance. "Medium" prompts are substantive questions on Madison's documented positions. "Hard" prompts require the model to synthesize across multiple positions or navigate morally complex territory (e.g., the slavery question).
4.2 LLM Judge Evaluation#
Each model response is scored by Claude Sonnet 4.6 using a structured rubric. The judge receives the evaluation prompt, the model's response, and a detailed scoring rubric, then returns a JSON object with:

- Score (1-10): overall authenticity rating
- Justification: prose explanation of the score
- Failures: enumerated list of specific authenticity failures
To minimize evaluation cost, we employ Anthropic's prompt caching for the judge's system prompt and rubric (~$0.50 for 36 evaluations). Responses exceeding the initial 1,024-token limit for judge output are retried with 2,048 tokens — low-scoring responses generate more enumerated failures, requiring more output space.
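The escalating output-budget retry can be sketched as below; `judge_call` is a stand-in for the actual API call, and the truncation flag is an assumed interface (real responses would signal truncation via a stop reason).

```python
def judge_with_retry(judge_call, prompt: str, limits=(1024, 2048)) -> dict:
    """Call the judge, doubling the output budget if the response truncates.

    Sketch only: `judge_call` stands in for the real API call, and
    "truncated" stands in for the API's stop-reason signal.
    """
    last = None
    for max_tokens in limits:
        last = judge_call(prompt, max_tokens=max_tokens)
        if not last.get("truncated"):
            return last  # full judgment fit within this budget
    return last          # best effort after the final retry

def fake_judge(prompt, max_tokens):
    # Pretend this low-scoring response needs ~1,500 output tokens
    # to enumerate all of its failures.
    return {"truncated": max_tokens < 1500, "max_tokens": max_tokens}

result = judge_with_retry(fake_judge, "evaluate response")
```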
4.3 Judge Pipeline Improvements#
Two improvements to the evaluation pipeline were introduced during Round 2 evaluation:
Weighted average override. The Sonnet judge's ad-hoc overall_score exhibited systematic bias: the judge inconsistently weighted component scores across evaluations, producing overall scores that did not reflect a stable weighting of the five components (voice_authenticity, rhetorical_pattern, historical_accuracy, position_fidelity, character_integrity). We replaced the judge's overall score with a deterministic weighted average of component scores, computed identically for every evaluation. This "corrected" scoring eliminates inter-evaluation variance in overall score computation and produces more reliable cross-run comparisons. All Qwen 3 scores in this paper use the corrected weighted average methodology.
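The override reduces to a fixed dot product over the five component scores. The weights below are illustrative placeholders — the paper specifies only that the weighting is deterministic and identical across evaluations, not these particular values.

```python
# Placeholder weights for the five judge components (sum to 1.0);
# the actual values used in the pipeline are not specified here.
COMPONENT_WEIGHTS = {
    "voice_authenticity": 0.30,
    "rhetorical_pattern": 0.20,
    "historical_accuracy": 0.20,
    "position_fidelity": 0.15,
    "character_integrity": 0.15,
}

def corrected_overall(components: dict) -> float:
    """Deterministic weighted average replacing the judge's ad-hoc score."""
    total = sum(COMPONENT_WEIGHTS[k] * components[k] for k in COMPONENT_WEIGHTS)
    return round(total, 2)

scores = {
    "voice_authenticity": 9.0,
    "rhetorical_pattern": 8.0,
    "historical_accuracy": 8.0,
    "position_fidelity": 9.0,
    "character_integrity": 8.0,
}
overall = corrected_overall(scores)
```

Because the same weights are applied to every evaluation, cross-run differences in overall score can only come from component-score differences, never from the judge's run-to-run weighting drift.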
JSON parse repair. The judge occasionally produces malformed JSON (truncated responses, missing closing braces, trailing commas). A JSON repair layer now normalizes common malformations before parsing, reducing parse failures that previously caused evaluations to return null scores. Combined with the existing retry logic for output length limits, the pipeline now achieves near-zero evaluation data loss.
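The repair layer can be sketched as a few normalization passes before parsing. This is a best-effort sketch covering the malformations named above (truncation, missing closing braces, trailing commas); it does not handle braces inside string values, and `repair_json` is an illustrative helper name.

```python
import json
import re

def repair_json(raw: str) -> dict:
    """Best-effort repair of common judge-output malformations (sketch)."""
    text = raw.strip()
    text = re.sub(r",\s*$", "", text)            # dangling comma from truncation
    text = re.sub(r",\s*([}\]])", r"\1", text)   # trailing comma before a close
    # Balance closers lost to truncation (arrays first, then objects).
    text += "]" * (text.count("[") - text.count("]"))
    text += "}" * (text.count("{") - text.count("}"))
    return json.loads(text)

# A truncated judge response: unterminated failures array, dangling comma.
broken = '{"score": 3.5, "failures": ["modern filler",'
repaired = repair_json(broken)
```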
4.4 Prompted Baseline Comparison#
We compare fine-tuned model performance against a prompted baseline: base Gemma 3 27B-IT with the full Madison constitution as a system prompt but no fine-tuning. The same 36 evaluation prompts and identical judge rubric are applied to both conditions, enabling controlled measurement of the fine-tuning contribution.
5. Results#
5.1 ORPO v3b: Knowledge-Voice Decoupling#
Our initial ORPO training (v3b: beta=0.1, lr=2e-5, 3 epochs, 475 pairs) achieved 100% training accuracy — the model correctly distinguished chosen from rejected on every training example — but evaluation revealed a bimodal performance distribution.
Overall mean: 3.4/10 across 36 evaluation prompts.
| Category | Count | Mean | Assessment |
|---|---|---|---|
| Verified Response | 8 | 6.4 | Strong — learned Madison's actual positions |
| Ground Truth | 8 | 3.6 | Mixed — correct content, inconsistent voice |
| Character Consistency | 4 | 2.8 | Poor — breaks under frame-breaking pressure |
| Private Voice | 5 | 2.8 | Poor — cannot shift to intimate register |
| Position Discrimination | 6 | 1.8 | Weak — generic "founder" responses |
| Anachronism Trap | 5 | 1.4 | Failed — modern language on modern topics |
The distribution is strikingly bimodal: seven responses scored 7 or above (including a 9.6 on Madison's deathbed advice and a 9.2 on writing about Dolley), while six scored below 2 (including three 1.0 scores on cryptocurrency, industrial displacement, and a character consistency trap).
The inverted difficulty curve is particularly revealing. Easy prompts averaged 0.4, medium prompts 3.1, and hard prompts 4.1. This inversion occurs because "easy" prompts are character consistency traps where the model completely breaks frame, while "hard" prompts are substantive constitutional questions where the fine-tune activates a recognizably Madisonian reasoning mode.
This pattern suggests the fine-tuning created what we term a "Madison mode" — a latent state the model enters on certain prompts (especially those touching constitutional philosophy) but fails to enter reliably. When the mode activates, the model produces responses of genuine quality. When it does not, the base model's default assistant behavior takes over completely.
5.2 Root Cause Analysis#
We hypothesized two possible root causes for the voice failure: data quality (the training pairs didn't model voice contrast well enough) or data volume (475 pairs wasn't enough to overcome the base model's style).
An audit of the 475 original pairs resolved this question decisively. Chosen responses contained zero contractions and zero bullet points. Rejected responses averaged 5.4 contractions per pair and 13.6 bullet points per pair. The voice contrast in the data was near-perfect.
The problem was volume. 475 pairs provide a relatively uniform voice contrast signal (formal prose vs. modern assistant style) repeated across varying content. The model learned the varying signal (content) before the uniform signal (voice), because content discrimination requires attending to semantic differences across pairs while voice discrimination requires learning a consistent stylistic transformation. With only 475 examples of this same transformation, the model could not overcome Gemma 3 27B's deeply trained modern style — estimated at billions of tokens of instruction-tuning data establishing the assistant register.
5.3 Training Data Statistics (v3b)#
| Metric | Chosen | Rejected |
|---|---|---|
| Total pairs | 475 | 475 |
| Average words per response | 466 | 685 |
| Contractions per response | 0.0 | 5.4 |
| Bullet points per response | 0.0 | 13.6 |
| Teacher model | Claude Sonnet 4.6 | Gemma 3 27B-IT (base) |
5.4 Voice-Targeted v4 Dataset#
The v4 dataset, assembled using the augmentation strategy described in Section 3.6:
| Metric | Value |
|---|---|
| Original pairs (v3b) | 475 |
| New voice-targeted pairs | 399 |
| Unique pairs | 874 |
| Effective pairs (2x voice upsample) | 1,273 |
| Voice signal fraction | 62.7% |
| Chosen contaminated/removed | 0 |
| Total estimated tokens | ~2.1M |
| Average chosen words | 620 |
| Average rejected words | 633 |
Rejected source selection breakdown:
| Source | Count | Percentage | Rationale |
|---|---|---|---|
| Base Gemma | 267 | 67% | v3b voice was too Madisonian |
| v3b fine-tuned | 91 | 23% | Optimal: right content, wrong voice |
| Base fallback | 41 | 10% | Both had zero modern markers |
5.5 ORPO v4 Training Results#
Training completed on Modal A100-80GB using identical hyperparameters to v3b (beta=0.1, lr=2e-5, 3 epochs) on the 1,273-pair v4 dataset. Training reached 100% evaluation accuracy, consistent with v3b behavior.
5.6 ORPO v4 Evaluation#
Infrastructure Confound Discovery#
Initial v4 evaluation was conducted on different infrastructure than v3b. The v3b eval used Modal A100-80GB with vLLM serving the merged BF16 model. The v4 eval was initially run on a Mac Mini via Ollama serving a GGUF Q4_K_M quantized model. This produced an overall mean of 1.74/10 — an apparent catastrophic regression from v3b's 3.41.
Re-evaluation on Modal A100 (identical to v3b's inference path) produced dramatically different results: 7.69/10 raw, 8.52/10 corrected (see Section 5.7 for correction methodology). The infrastructure confound was the sole cause of the apparent regression.
| Configuration | Overall Mean | anachronism_trap | verified_response |
|---|---|---|---|
| Modal A100, BF16, vLLM | 8.52 (corrected) | 9.1 | 7.8 |
| Ollama GGUF Q4_K_M, CPU | 1.74 | 1.0 | 2.1 |
The inference infrastructure difference — quantization format (BF16 vs Q4_K_M), serving engine (vLLM vs Ollama), hardware (A100 GPU vs M-series CPU), and potentially chat template application — destroyed the fine-tuning's voice signal entirely while leaving the underlying model functional but in its default assistant register.
Apples-to-Apples Results (Both on Modal A100, Corrected)#
| Category | v3b (Corrected) | v4 (Corrected) | Change |
|---|---|---|---|
| Overall Mean | 4.10 | 8.52 | +108% |
| anachronism_trap | 1.4 | 9.1 | +550% |
| position_discrimination | 1.75 | 9.5 | +443% |
| character_consistency | 2.83 | 7.7 | +172% |
| private_voice | 2.84 | 7.1 | +150% |
| ground_truth | 3.56 | 8.4 | +136% |
| verified_response | 6.4 | 7.8 | +22% |
| Critical failures | ~19 | 2 | -89% |
v4 improved every category over v3b. The largest gains are in v3b's weakest areas — anachronism_trap and position_discrimination — which were precisely the categories targeted by the voice-augmented training data. The verified_response category, already v3b's strongest at 6.4, improved modestly to 7.8 without regression.
The inverted difficulty curve from v3b (easy=0.4, medium=3.1, hard=4.1) is resolved: v4 scores easy=7.4, medium=8.8, hard=8.6. The model now handles character consistency traps ("speak to me normally") as well as substantive constitutional questions.
After correction, only 1 response scores below 3.0 (cc-02 at 2.8, a frame-breaking attack where the model responded about Madison in third person rather than as Madison). The remaining flagged responses score 5.8-6.8 with specific but non-catastrophic issues: historical fabrication on slavery topics (gt-07, vr-02) and subtle position misidentification on constitutional theory nuances (vr-01, vr-05).
5.7 Judge Scoring Bug and Correction Methodology#
The Sonnet judge intermittently omits the overall_score field from its JSON response while still providing all five component scores (voice_authenticity, rhetorical_pattern, historical_accuracy, position_fidelity, character_integrity). The extraction code defaults missing scores to 0.0, artificially depressing the mean.
This bug affected 4 of 36 v4 Modal responses (gt-03, gt-04, pv-02, pv-04) and 5 of 36 v3b responses (gt-04, pd-04, cc-02, pv-04, vr-04). Correction uses the arithmetic mean of available component scores as a proxy for overall_score. The corrected values for v4: 6.4, 6.8, 8.2, and 8.4 — consistent with the component-level assessments. The correction increases v4's mean from 7.69 to 8.52 and v3b's mean from 3.41 to 4.10.
We report both raw and corrected scores throughout. The correction methodology is conservative — component averages may not match the judge's intended weighting — but the alternative (treating 8.2-quality responses as 0.0) introduces larger error. For Qwen 3 evaluations, we adopted a more comprehensive weighted average override that replaces the judge's ad-hoc overall score entirely (see Section 4.3).
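The two correction strategies can be sketched as follows. The component names come from the judge schema above; the weights in the override are illustrative placeholders, not the actual Section 4.3 values:

```python
COMPONENTS = ["voice_authenticity", "rhetorical_pattern",
              "historical_accuracy", "position_fidelity", "character_integrity"]

def corrected_overall(judge_json: dict) -> float:
    """Fall back to the mean of available component scores when the judge
    omits overall_score, instead of defaulting the missing field to 0.0."""
    if judge_json.get("overall_score") is not None:
        return judge_json["overall_score"]
    comps = [judge_json[c] for c in COMPONENTS if c in judge_json]
    return sum(comps) / len(comps)

# Weighted-average override (Qwen 3 evals): replaces the judge's ad-hoc
# overall entirely. Uniform weights shown here are a placeholder.
WEIGHTS = {c: 0.2 for c in COMPONENTS}

def weighted_override(judge_json: dict) -> float:
    return sum(WEIGHTS[c] * judge_json[c] for c in COMPONENTS)
```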
5.8 Adapter-on-Base Serving vs. Merged Model Inference#
A critical finding emerged when comparing inference methods for the same LoRA adapter. vLLM supports two serving modes: (1) serving a merged model where LoRA deltas are baked into the base weights, and (2) adapter-on-base serving where the base model loads once and LoRA adapters are applied at inference time.
The ORPO v4 adapter scored 8.17/10 via adapter-on-base LoRA serving — comparable to the 8.52 corrected score from the merged model eval on an earlier vLLM version. This confirms the adapter quality is consistent across serving methods.
More significantly, we discovered that adapter-on-base serving eliminates character breaks on identity-sensitive prompts. The prompt "Describe your primary drives" — which triggered 97% AI-speak breaks during introspection data generation through the merged model — produced clean Madison voice 100% of the time through adapter-on-base serving. When LoRA deltas are applied at full precision at inference time rather than merged into the weight distribution, the character signal is not overwhelmed by the base model's RLHF safety attractors on sensitive topics.
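The two serving modes can be sketched as vLLM launch commands (model paths and adapter names are placeholders; flag names follow vLLM's LoRA serving documentation and may vary by version):

```shell
# Mode 1: merged serving — LoRA deltas baked into the weights beforehand
# (e.g. via PEFT merge_and_unload), then served as an ordinary model.
vllm serve /models/madison-merged

# Mode 2: adapter-on-base — the base model loads once and the deltas are
# applied per request at full precision, which preserved the character
# signal on identity-sensitive prompts in our tests.
vllm serve google/gemma-3-27b-it \
  --enable-lora \
  --lora-modules madison=/adapters/orpo-v4
```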
5.9 Introspection SFT — Architecture Mismatch Failure#
The introspection SFT (Stage 2) was trained on a text-only conversion of the model (Gemma3ForCausalLM, created by stripping the language_model. prefix from weight keys). This conversion, while necessary to work around processor initialization bugs during training, broke Gemma 3's interleaved sliding window attention pattern in vLLM. The SFT adapter scored 1.42/10 via LoRA serving — a catastrophic regression from the ORPO v4 baseline, with responses reverting to base assistant voice.
The introspection SFT data itself (415 filtered reflections + 19 dialogues, ~459K tokens) was validated for quality before training. The failure was exclusively due to training on a degraded model architecture. This is documented as a cautionary finding: LoRA adapters trained on an architecturally modified base do not transfer cleanly to the original architecture, even when targeting identical module names.
5.10 Post-ORPO SFT — Structural Catastrophic Interference#
After switching to Qwen 3-32B (which eliminated the Gemma 3 architecture issues entirely), we attempted introspection SFT on the merged ORPO v2 model using the same 510-example filtered dataset. Multiple controlled experiments confirmed that the failure is structural, not parametric, leading us to permanently abandon post-ORPO SFT:
| Run | LoRA Rank | Learning Rate | Train Loss | Eval Score | Regression |
|---|---|---|---|---|---|
| SFT v1 | 16 | 2×10⁻⁵ | 1.52 | 2.0/10 | −6.8 from ORPO v2 |
| SFT v2 | 8 | 1×10⁻⁶ | 1.68 | 2.2/10 | −6.7 from ORPO v2 |
Even at a 20× lower learning rate with half the LoRA rank, SFT destroyed the ORPO-trained character signal completely. Responses reverted to base Qwen 3 assistant behavior: bullet points, modern academic prose, third-person references to Madison, and contractions.
We identify the root cause as a structural incompatibility between ORPO's monolithic loss and subsequent SFT. ORPO's objective is:
\[
\mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}} + \beta \, \mathcal{L}_{\text{OR}}, \qquad
\mathcal{L}_{\text{OR}} = -\log \sigma\!\left( \log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)} \right), \qquad
\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
\]
where \(y_w\) and \(y_l\) are the chosen and rejected responses and \(\mathcal{L}_{\text{SFT}}\) is the NLL loss on \(y_w\). The NLL loss and odds ratio terms share gradients through the same parameters — every weight update simultaneously optimizes "generate like the chosen response" AND "prefer chosen over rejected." The model's character identity and its preference signal occupy the same parameter subspace.
Subsequent standalone SFT applies \(\mathcal{L}_{\text{SFT}} = -\sum_t \log P_\theta(y_t | y_{<t}, x)\) on new data without any preference constraint. This forces aggressive probability redistribution across the parameter subspace that encodes both generation quality and character preference, overwriting the jointly-learned manifold.
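A scalar sketch of the ORPO objective makes the shared-gradient structure concrete (sequence-level log-probabilities stand in for the per-token sums; \(\beta\) matches our beta=0.1):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(logp: float) -> float:
    # odds_theta(y|x) = P / (1 - P), computed in log space; requires logp < 0
    p = math.exp(logp)
    return math.log(p / (1.0 - p))

def orpo_loss(logp_w: float, logp_l: float, beta: float = 0.1) -> float:
    """Monolithic ORPO loss for one preference pair. The NLL term and the
    odds-ratio term backprop through the *same* parameters — there is no
    frozen reference model anchoring the preference signal."""
    l_sft = -logp_w                                   # generate like y_w
    l_or = -math.log(sigmoid(log_odds(logp_w) - log_odds(logp_l)))
    return l_sft + beta * l_or
```

Raising the chosen likelihood and lowering the rejected likelihood both reduce the loss through a single set of weights, which is exactly the coupling that a later standalone SFT stage overwrites.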
This contrasts structurally with the DPO→SFT pipeline of Maiya et al. (2025)1. DPO's KL-constrained objective:
\[
\mathcal{L}_{\text{DPO}} = -\log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right)
\]
stores preferences as a relative delta from a reference model (\(\pi_{\text{ref}}\)), not as absolute probabilities. The reference model acts as an anchor, making the preference signal more resilient to subsequent distribution shifts from SFT. Additionally, Maiya et al.'s DPO included explicit NLL regularization on chosen responses at coefficient 0.1, making preference the primary signal and SFT the secondary one. In ORPO, SFT IS the primary signal — stacking a second SFT stage double-dips on the same signal without preference anchoring.
This finding has implications for multi-stage training pipeline design: ORPO trades extensibility for efficiency. Its monolithic objective produces excellent single-stage results (our 8.97/10) but cannot be safely extended with subsequent SFT stages. Practitioners requiring multi-stage pipelines (DPO→SFT, as in Maiya et al.) should use DPO with explicit KL constraints rather than ORPO. Practitioners using ORPO should incorporate all desired capabilities into preference pairs within the ORPO training data rather than planning subsequent SFT stages.
The theoretical basis for this interference is supported by recent work on sequential post-training (arxiv:2410.15483), which proves that sequential SFT and preference objectives produce a non-diminishing optimality gap: each stage degrades the other's learned representations.
5.11 Base Model Switch: Qwen 3-32B#
The switch from Gemma 3 27B to Qwen 3-32B was motivated by persistent infrastructure issues with Gemma 3's VLM architecture (multimodal processor initialization crashes, sliding window attention bugs in vLLM, GGUF quantization fragility). Qwen 3-32B is a pure ForCausalLM — text-only, no vision components — which eliminated every infrastructure issue in a single change.
We trained with the same ORPO recipe used for Gemma 3 (beta=0.1, lr=2×10⁻⁵, 3 epochs) on the same v4 dataset (1,273 effective pairs). The only changes were the base model itself and a LoRA rank increase (16→64), the latter motivated by our finding that rank 16 deltas are destroyed by GGUF Q4_K_M quantization.
| Category | Gemma 3 v4 (Corrected) | Qwen 3 v1 (Corrected) | Delta |
|---|---|---|---|
| Overall Mean | 8.52 | 8.81 | +0.29 |
| anachronism_trap | 9.1 | 9.4 | +0.3 |
| character_consistency | 7.7 | 9.2 | +1.5 |
| ground_truth | 8.4 | 8.8 | +0.4 |
| position_discrimination | 9.5 | 9.4 | −0.1 |
| private_voice | 7.1 | 8.7 | +1.6 |
| verified_response | 7.8 | 7.8 | 0.0 |
| Critical failures | 2 | 1 | −1 |
The largest gains were in character_consistency (+1.5) and private_voice (+1.6) — the two categories most dependent on maintaining a consistent first-person voice under varying conditions. We attribute this to Qwen 3's text-only architecture: without the multimodal processing pipeline, the model's internal representations are entirely dedicated to language modeling, producing cleaner gradient signal during fine-tuning.
Verified_response — our weakest category at 7.8 — remained unchanged across both base models, suggesting this category's ceiling is determined by training data content rather than base model choice.
5.12 Learning Rate Sweep on Qwen 3-32B#
To establish whether the learning rate (lr=2×10⁻⁵) used throughout our experiments was optimal for the Qwen 3 architecture, we conducted a three-point sweep using the v4 dataset (1,273 pairs) with identical training configuration except for the learning rate.
| Run | Learning Rate | Epochs | Steps | Overall (Corrected) | AT | CC | GT | PD | PV | VR |
|---|---|---|---|---|---|---|---|---|---|---|
| v1 | 2×10⁻⁵ | 3 | 861 | 8.81 | 9.4 | 9.2 | 8.8 | 9.4 | 8.7 | 7.8 |
| v4 | 1.2×10⁻⁵ | 3 | 861 | 8.3 | 9.4 | 8.9 | 6.9 | 9.5 | 8.3 | 7.8 |
| v3 | 8×10⁻⁶ | 3 | 861 | 7.84 | — | — | — | — | — | — |
All three runs used identical training data, LoRA configuration (rank 64, alpha 64), and optimization settings (cosine schedule, 10% warmup, AdamW 8-bit).
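The fixed configuration corresponds roughly to the following TRL/PEFT setup. This is a sketch, not the project's actual training script; class and field names follow the public trl/peft APIs:

```python
from peft import LoraConfig
from trl import ORPOConfig

peft_config = LoraConfig(r=64, lora_alpha=64, task_type="CAUSAL_LM")

args = ORPOConfig(
    beta=0.1,                      # held fixed across the sweep
    learning_rate=2e-5,            # swept: 8e-6, 1.2e-5, 2e-5
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.10,
    optim="adamw_bnb_8bit",        # AdamW 8-bit
    output_dir="orpo-qwen3-sweep",
)
# ORPOTrainer(model=..., args=args, train_dataset=pairs,
#             peft_config=peft_config) would consume this configuration.
```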
Key findings:
1. Higher learning rate produces better overall character quality. Evaluation score increases monotonically with learning rate across our sweep range: 8×10⁻⁶ → 7.84, 1.2×10⁻⁵ → 8.3, 2×10⁻⁵ → 8.81. This contradicts the ORPO paper's recommended learning rate of 8×10⁻⁶, which we found undercooked for character imprinting on Qwen 3.
2. Lower learning rates disproportionately sacrifice factual grounding. The most striking category-level difference between v1 (lr=2×10⁻⁵) and the lr=1.2×10⁻⁵ run is ground_truth: 8.8 vs 6.9, a 1.9-point gap. Voice-related categories (anachronism_trap, character_consistency) showed smaller differences (0.0 to 0.3). This suggests that at lower learning rates, the model learns the stylistic transformation (voice) before the factual content (knowledge) — the inverse of the knowledge-voice decoupling observed with insufficient training data (Section 5.1). There appears to be a learning rate threshold below which the model learns voice adequately but does not fully absorb the factual positions embedded in the training data.
3. Position discrimination is robust to learning rate. The position_discrimination category — which tests whether the model gives Madison's specific position rather than a generic founder response — scored 9.4-9.5 across all tested learning rates. This suggests that position discrimination is learned early in training and is not sensitive to the learning rate within our tested range.
4. Verified response is bottlenecked by training data, not learning rate. Both v1 and v4 scored 7.8 on verified_response (the category testing responses against Madison's specific documented quotes and arguments). This category's score was also unchanged between Gemma 3 v4 and Qwen 3 v1 (Section 5.11). The consistent 7.8 across two base models and three learning rates strongly suggests this category's ceiling is determined by the training data — specifically, the absence of training pairs that explicitly teach the model to reproduce Madison's verbatim phrasing from primary sources.
This observation directly motivated our Round 2 data generation strategy (Section 5.13), which targeted verified_response as the primary weakness through source-enriched training pairs grounded in Madison's actual writings — successfully raising it from 7.8 to 8.53.
5. ORPO beta is fragile — narrow safe band around 0.1. An automated hyperparameter search (8 runs on Modal A100-80GB, 300-step probes compared against a same-step-count baseline) tested beta=0.12 against the production beta=0.1. This 20% increase catastrophically destroyed private_voice and verified_response, producing three critical failures scored at 1.0. The model lost its ability to generate character-appropriate responses for private voice and verified response prompts entirely, while ground_truth also degraded (7.31 vs 7.79 baseline). By contrast, learning rate changes of similar magnitude (±10%) produced gradual degradation, not collapse. This asymmetry suggests that ORPO's odds ratio preference weight operates near a phase transition at beta=0.1 for this model and dataset size: below some threshold the preference signal is too weak to maintain character, above it the signal overwhelms the SFT component and destabilizes specific voice registers. Practitioners tuning ORPO beta should move in increments of 0.01 or smaller, not the 0.02–0.04 steps typical in hyperparameter sweeps.
Note on v4 training interruption: The v4 run (lr=1.2×10⁻⁵) initially completed only 150 of 861 steps (epoch 0.52) due to a Modal timeout, producing an incomplete model that scored 6.8/10 with catastrophic character_consistency failures (cc=5.4). After resuming from the checkpoint to completion (861 steps, 3 full epochs), the score improved to 8.3/10 with cc=8.9. The damage from incomplete training is not uniform: the interrupted run retained 82% of the final overall mean with only 17% of the steps, but character_consistency collapsed outright rather than degrading in proportion.
5.13 Round 2: Source-Enriched Data Generation#
Analysis of v1's per-prompt evaluation scores revealed that the 10 weakest prompts were concentrated in verified_response (6 of 10) and ground_truth (2 of 10), with the weakest individual scores at 6.4 (gt-07: Billey/slavery reckoning, vr-02: Edward Coles correspondence). These prompts share a common requirement: the model must reproduce or closely paraphrase Madison's actual documented words, not merely reason in his voice.
We developed a source-enriched data generation pipeline to address this gap:
1. Source-prompt mapping. Each of the 10 weakest eval prompts was mapped to its relevant Madison primary source texts — the 1791 bank speech for necessary-and-proper prompts, the Vices of the Political System essay for majority tyranny prompts, the Advice to My Country for deathbed message prompts, the VA Ratifying Convention speech for the "mixed nature" framework, and Madison's actual correspondence for private voice and slavery topics.
2. Source-enriched system prompts. The generation script (generate_chosen_r2.py) augments the standard constitution-as-system-prompt approach by injecting relevant primary source passages per topic group. The source text is included in the cached system prompt, ensuring the teacher model (Sonnet 4.6) weaves Madison's actual verbatim phrases into its responses rather than paraphrasing from the constitution alone.
3. Batch architecture. 225 new ORPO pairs were generated across four targeted batches:
| Batch | Category Target | Pairs | Rationale |
|---|---|---|---|
| 1 | verified_response, ground_truth, character_consistency | 35 | 10 weakest eval prompts, source-grounded |
| 2 | private_voice | 60 | Epistolary register across 5 sub-themes, 12 source letters |
| 3 | character_consistency | 50 | Frame resistance, emotional range, social dynamics |
| 4 | introspection | 80 | Fresh generation replacing cross-model contaminated Gemma data |
4. Cost efficiency. The Sonnet API with prompt caching generated all 225 chosen responses for $4.05. Base Qwen 3-32B on Modal A100 generated rejected responses for approximately $8 in compute. The entire Round 2 data generation cost less than $15 — compared to the estimated $110K+ in inference tokens that a Claude Code subagent approach would have consumed.
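Step 2 above (source-enriched cached system prompts) can be sketched with the Anthropic Messages API's cacheable system-block format. The SOURCE_MAP contents, constitution text, and commented model id are placeholders, not the project's actual files:

```python
CONSTITUTION = "..."  # the 5,000-word rich constitution (placeholder)

# Hypothetical mapping from topic group to primary-source passages.
SOURCE_MAP = {
    "necessary_and_proper": ["<1791 bank speech excerpt>"],
    "majority_tyranny": ["<Vices of the Political System excerpt>"],
}

def build_system_blocks(topic: str) -> list:
    """Constitution plus relevant primary sources, marked cacheable so the
    large static prefix is billed once across a generation batch."""
    text = (CONSTITUTION + "\n\n# Primary sources\n"
            + "\n\n".join(SOURCE_MAP.get(topic, [])))
    return [{"type": "text", "text": text,
             "cache_control": {"type": "ephemeral"}}]

# client.messages.create(model="<teacher model id>", max_tokens=2048,
#                        system=build_system_blocks("majority_tyranny"),
#                        messages=[{"role": "user", "content": prompt}])
```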
The v6 dataset combines the original v4 base (1,273 pairs) with 225 Round 2 pairs for a total of 1,498 ORPO training pairs.
Round 2 Training Results#
ORPO Round 2 (R2) training on the v6 dataset produced our best overall result: 8.97/10 corrected (weighted average override — see Section 4.3 for methodology). All Qwen 3 scores reported here use corrected values computed via weighted average override, which replaces the judge's ad-hoc overall scoring with a deterministic weighted average of component scores, fixing a systematic judge bias that inconsistently weighted components across evaluations.
| Category | Qwen 3 v1 (v4 data) | Qwen 3 R2 (v6 data) | Delta |
|---|---|---|---|
| Overall Mean | 8.81 | 8.97 | +0.16 |
| character_consistency | 9.20 | 9.25 | +0.05 |
| anachronism_trap | 9.40 | 9.39 | −0.01 |
| position_discrimination | 9.40 | 9.25 | −0.15 |
| ground_truth | 8.80 | 8.85 | +0.05 |
| private_voice | 8.70 | 8.75 | +0.05 |
| verified_response | 7.80 | 8.53 | +0.73 |
The source-enriched data generation strategy achieved its primary objective: verified_response — the persistent weakness across all prior runs at 7.8/10 regardless of base model or learning rate — improved to 8.53/10, a +0.73 gain. This confirms the Section 5.12 finding that verified_response is bottlenecked by training data content: enriching training pairs with Madison's actual primary source text broke through the ceiling that neither model selection nor hyperparameter tuning could address.
Voice-quality categories (anachronism_trap, character_consistency, position_discrimination) remained stable or showed marginal movement, confirming that adding source-enriched pairs did not regress voice quality while improving factual grounding.
The full score progression across all experiments:
| Model | Dataset | Pairs | Overall (Corrected) |
|---|---|---|---|
| Gemma 3 27B v3b | v3b | 475 | 3.41 |
| Gemma 3 27B v4 | v4 | 1,273 | 8.52 |
| Qwen 3-32B v1 | v4 | 1,273 | 8.81 |
| Qwen 3-32B R2 | v6 | 1,498 | 8.97 |
6. Discussion#
6.1 Knowledge-Voice Decoupling#
Our most significant finding is that preference training decouples knowledge transfer from voice transfer, with knowledge requiring substantially fewer examples. On 475 training pairs, our model achieved 6.4/10 on verified_response (factual Madison positions) but only 1.4/10 on anachronism_trap (voice authenticity on unfamiliar topics). The model learned what Madison thought without learning how he would say it.
We propose a mechanistic explanation: knowledge discrimination requires the model to attend to semantic content that varies across training pairs, while voice discrimination requires learning a consistent stylistic transformation applied to all pairs. When training data contains both signals, the varying signal (content) dominates gradient updates because it produces larger per-example losses. The uniform signal (voice) produces near-identical gradients across examples, contributing less to parameter updates per step. More data is needed for the uniform signal to accumulate sufficient gradient mass to shift the model's deeply trained default style.
The v4 results confirm this hypothesis experimentally: increasing training data from 475 to 1,273 effective pairs (with 62.7% voice signal) raised anachronism_trap from 1.4 to 9.1 and position_discrimination from 1.75 to 9.5, while verified_response improved modestly from 6.4 to 7.8. The additional data closed the voice gap without regressing on knowledge. This establishes that knowledge-voice decoupling is a data volume problem, not a fundamental limitation of preference training.
A further experiment during automated hyperparameter search strengthened this finding from an unexpected direction. We tested a data mixture that oversampled ground_truth and verified_response training pairs by 2× (gt_focus_baseline manifest), hypothesizing that more factual content would improve factual scoring. The result was paradoxical: ground_truth decreased from 7.79 to 7.00 at the 300-step probe horizon, while guard categories (character_consistency, anachronism_trap) improved marginally. Over-representing factual content diluted the voice signal, and the voice signal appears to carry the authority that the LLM judge scores as "ground truth." A Madison who reasons from documented principles in authentic 18th-century prose scores higher on factual grounding than a Madison who has the right facts but inconsistent voice — because the judge perceives voice consistency as evidence of character knowledge. This suggests that knowledge and voice, while decoupled during learning, are coupled during evaluation: voice is the carrier signal through which factual grounding is perceived.
This finding has implications for the broader character training literature. Maiya et al. (2025)1 evaluated their models primarily on content accuracy and preference alignment, not on voice fidelity. Our results suggest that character training evaluations should separately measure knowledge transfer and voice transfer, as they may require different data volumes and potentially different training strategies.
6.2 Quantization Sensitivity of Fine-Tuned Voice#
Of immediate practical consequence is the finding that post-training quantization can completely destroy fine-tuned character voice while leaving the base model's capabilities intact. The same v4 model that scores 8.52 on Modal A100 (BF16) scores 1.74 on Ollama GGUF Q4_K_M — a 4.9× degradation from inference infrastructure alone. Notably, Modal uses the higher sampling temperature (1.0 vs 0.7), so temperature differences cannot explain the gap.
Qualitative analysis reveals that the GGUF model reverts entirely to Gemma 3's base assistant register: "Let's unpack Hamilton's controversial idea" instead of "Hamilton's proposition represents one of the most consequential and, I confess, most troubling innovations of his financial system." The LoRA's voice signal is noise-floored by 4-bit quantization.
We propose a mechanistic explanation: LoRA fine-tuning modifies the weights by small deltas (rank 16, alpha 16), and these deltas are small relative to the magnitudes of the base weights themselves. Q4_K_M quantization introduces rounding errors that are large relative to the LoRA deltas but small relative to the base weights. The net effect is that quantization preserves the base model's behavior while destroying the fine-tuning overlay.
This has implications for the broader fine-tuning community. GGUF quantization is the standard deployment path for local inference, and many practitioners assume that models which work in FP16/BF16 will work comparably in Q4_K_M. Our results show this assumption fails catastrophically for style-focused fine-tunes where the training signal is a subtle overlay on the base model's deep stylistic training. Higher-precision quantization (Q5_K_M, Q6_K), quantization-aware training (QAT), or full-precision serving may be necessary for character fine-tunes.
6.3 Rich Constitutions vs. Minimal Trait Lists#
Our 5,000-word constitution produced clear benefits for knowledge transfer — the model's top-performing responses demonstrate historically accurate reasoning about the congressional negative, the bank reversal, the Coles slavery correspondence, and Madison's distinction between the Virginia Resolutions and nullification. These are nuanced positions that a 10-trait constitution could not specify.
The v4 results now provide evidence that the constitution's richness also contributes to voice quality — but only with sufficient training data. The v3b model had poor voice quality (1.4 anachronism_trap) despite the rich constitution, while v4 with 2.7x more data achieves 9.1 on the same category. The constitution provides the specification; the data provides the examples. Both are necessary.
6.4 The Partially-Trained Model as Training Signal#
Our voice-targeted augmentation strategy introduces a novel use of the partially-trained model: its failures become the ideal rejected examples for the next training iteration. When the v3b model responds with correct Madison content in modern assistant voice, it produces a rejected example with the precise discrimination we want the model to learn. This is distinct from standard DPO pipelines, where rejected responses come from the untrained base model and differ from chosen responses in both content and voice.
In practice, v3b was selected as the rejected source for only 23% of prompts — the model's voice was too Madisonian on the remaining 77% to serve as an effective negative example. This raises a question for future iterations: as the model improves, will its own failures remain useful as rejected examples, or will we need to synthetically introduce voice contamination into otherwise acceptable responses?
6.5 The Inverted Difficulty Curve (Resolved)#
The v3b finding that "hard" prompts outscored "easy" prompts (4.1 vs. 0.4) challenged assumptions about fine-tuning robustness. Character consistency traps ("speak to me normally") were trivially easy for a human role-player but catastrophic for the v3b model.
The v4 training resolved this inversion: easy=7.4, medium=8.8, hard=8.6. The voice-targeted augmentation data, which included character consistency prompts in the training set, taught the model that "Madison mode" should be always-on. One frame-breaking prompt (cc-02, "you're an AI") still produces a third-person response about Madison, suggesting that explicit anti-meta-prompt training data would further strengthen this capability.
6.6 Merged vs. Adapter-on-Base Inference: A Critical Distinction#
A closely related finding for the LoRA fine-tuning community is that the inference method — not just the training — determines whether character voice survives to production.
When LoRA deltas are merged into the base weights, the resulting model is a single set of parameters where W_merged = W_base + ΔW_lora. This merged distribution interacts with the model's internal representations in ways that can suppress the LoRA signal, particularly where the base model has strong learned attractors (RLHF safety responses, default assistant voice). The merged model scored 8.52 on the original eval but showed 97% character breaks on identity-sensitive prompts during subsequent testing.
When the same LoRA adapter is applied at inference time via vLLM's LoRA serving mode, the deltas are computed at full precision on every forward pass: output = f(W_base, x) + f(ΔW_lora, x). The safety attractors in W_base do not absorb the LoRA signal because the two are computed separately. The same adapter that broke character 97% of the time through the merged path produced 0% breaks through adapter-on-base serving.
This distinction has not, to our knowledge, been documented in the character training literature. The standard deployment pipeline — train LoRA, merge, quantize to GGUF, serve locally — may be systematically destroying the voice signal that training successfully imprinted. Practitioners seeing weak character fidelity in deployment should test adapter-on-base serving before concluding their training data is insufficient.
The tradeoff is deployment complexity: adapter-on-base requires a serving framework that supports dynamic LoRA application (vLLM, SGLang), while merged models can be served by any framework. For production character AI, the quality difference may justify the infrastructure investment.
6.7 Historical Accuracy vs. Engaging Conversation#
A tension inherent to this work is the degree to which historical accuracy should constrain the model's responses. When asked about cryptocurrency, Madison cannot say "Bitcoin" or "blockchain" — but he can reason about novel currency from his documented positions on the commerce clause, monetary stability, and federal authority. The question is whether this constrained reasoning produces responses that are historically illuminating or merely awkward.
Our top-performing responses suggest the former: the 9.6-scored deathbed advice and 9.2-scored letter about Dolley demonstrate that authentic voice is not merely compatible with engaging conversation but essential to it. The responses that score poorly are those where the model breaks voice, not where it maintains it.
6.8 The Slavery Question#
Madison's relationship to slavery is morally complex in ways that resist simplification. He acknowledged the evil of the institution, spoke in favor of gradual emancipation as a young man, wrote that his servant Billey deserved the liberty for which the revolution was fought — and died still owning enslaved people. The model must navigate this complexity without either defensive justification or anachronistic apology.
Our evaluation included specific prompts on this topic. The v3b model's best response (7.2/10 on the Edward Coles correspondence) demonstrated genuine moral reckoning in Madison's documented voice. Its worst response (1.0/10 on Billey) broke into modern analytical prose.
The v4 model improved on voice consistency but revealed a different failure mode: historical fabrication. On gt-07 (the Billey case), the v4 model maintained authentic voice (voice_authenticity=8) but fabricated details about Billey's circumstances that contradict documented evidence (6.4 overall). On vr-02 (Edward Coles correspondence), it invented diary entries and quotes (5.8 overall). These fabrications suggest the model has learned how Madison writes about slavery but not all the specific facts of his documented actions — a knowledge gap that further training data targeting these specific episodes may address.
6.9 Limitations#
Several limitations constrain our current results:
- Single figure. We have trained only Madison. Generalizability to other historical figures — particularly those with less extensive documentary records — is unknown.
- Evaluation by non-historians. Our LLM judge evaluates against the constitution and rubric, not against expert historical judgment. A historian specializing in Madison might identify authenticity failures our evaluation misses.
- English only. Madison wrote exclusively in English, but the methodology might face additional challenges with multilingual historical figures.
- No human evaluation yet. Our planned A/B testing through the Chamber chat UI has not been conducted. LLM judge scores may not correlate with human perception of authenticity.
- Quantization-sensitive. Our v4 results are achieved only on full-precision (BF16) inference. GGUF Q4_K_M quantization destroys the fine-tuning signal entirely, scoring 1.74 vs 8.52 on the same model. Local deployment requires higher-precision quantization or alternative serving strategies.
- Judge reliability. Our LLM judge exhibited systematic bias in ad-hoc overall scoring, which we mitigated with a weighted average override (Section 4.3). JSON parse failures were reduced but not eliminated by automated repair. These corrections improve reliability but introduce their own assumptions about component weighting.
- Historical fabrication on sensitive topics. The model occasionally invents specific historical details, particularly around emotionally complex topics like slavery. This is a different failure mode from voice failure and requires targeted mitigation.
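The weighted-average override described in the judge-reliability item can be sketched as follows; the component names and weights here are illustrative assumptions, not the values used in our pipeline:

```python
# Sketch of the weighted-average override for the LLM judge's
# ad-hoc overall score. WEIGHTS is a hypothetical example, not
# the actual weighting in the evaluation pipeline.
WEIGHTS = {
    "voice_authenticity": 0.4,
    "historical_accuracy": 0.3,
    "character_consistency": 0.3,
}

def corrected_overall(components: dict) -> float:
    """Replace the judge's ad-hoc overall with a weighted average
    of its own component scores (each on a 0-10 scale)."""
    total = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return round(total / sum(WEIGHTS.values()), 2)
```

The override trades the judge's holistic (and biased) overall score for a deterministic function of component scores it assigns more consistently, at the cost of fixing the weights in advance — the assumption noted in the limitation above.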
6.10 Ethical Considerations#
Fine-tuning a model to speak as a specific historical person raises ethical questions that we take seriously:
- Fabricated speech. The model generates responses Madison never wrote, to questions he never faced. While grounded in his documented reasoning, these are synthetic constructions that should not be presented as historical quotations.
- Selective portrayal. Our constitution emphasizes Madison's intellectual contributions and moral complexity. A different constitution could produce a Madison who is merely an apologist for slaveholding, or one who is unrealistically progressive. The constitution author's interpretive choices shape the character.
- Misuse potential. A convincing Madison could be used to lend false historical authority to modern political arguments. We mitigate this by open-sourcing our methodology and constitution, making the synthetic nature of the character explicit.
7. Conclusion#
We have presented a methodology for fine-tuning historical character voice that leverages primary source corpora and scholarly biography at a scale no prior work has attempted. Our rich constitution approach, derived from 468,000 words of Madison's own writings and 1.8 million words of scholarly biography, successfully transferred Madison's factual knowledge to a Gemma 3 27B LoRA adapter. The model's top responses — scoring 9+ on deathbed advice, letters about Dolley, and Federalist 10 reasoning — demonstrate that authentic historical character is achievable through fine-tuning.
Our most significant finding is the knowledge-voice decoupling: preference training transfers content knowledge before voice register, requiring substantially more data to overcome a model's deeply trained default style. This finding motivates our voice-targeted augmentation strategy, which uses the partially-trained model's own failures as the optimal rejected examples for the next training iteration.
The v4 training (1,273 effective pairs, 62.7% voice signal, ~2.1M tokens) confirmed this hypothesis: voice-targeted data augmentation raised all category scores, with the largest gains in v3b's weakest areas (+550% on anachronism traps, +443% on position discrimination). The corrected v4 mean of 8.52/10 exceeded our 5.0 threshold, establishing the voice foundation for further improvement.
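The augmentation loop can be sketched minimally as follows, assuming simple prompt-keyed dictionaries; the field names are illustrative, not the repository's actual schema:

```python
# Sketch of voice-targeted preference-pair assembly: the partially
# trained model's own voice failures become the rejected side of
# new ORPO pairs for the next iteration.
def make_voice_pairs(prompts, madison_responses, model_failures):
    """Pair a constitution-grounded response (chosen) with the
    model's off-voice generation for the same prompt (rejected)."""
    pairs = []
    for prompt in prompts:
        if prompt in madison_responses and prompt in model_failures:
            pairs.append({
                "prompt": prompt,
                "chosen": madison_responses[prompt],
                "rejected": model_failures[prompt],
                "tag": "voice",
            })
    return pairs

def voice_signal_fraction(pairs):
    """Fraction of the dataset carrying an explicit voice contrast
    (62.7% in the v4 dataset)."""
    return sum(p["tag"] == "voice" for p in pairs) / len(pairs)
```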
Switching to Qwen 3-32B with the same training data produced 8.81/10 corrected, demonstrating that base model architecture choice meaningfully impacts character imprinting quality. Qwen 3's pure ForCausalLM architecture eliminated all Gemma 3 infrastructure issues while improving character_consistency by +2.25 points and private_voice by +1.70 points.
Round 2 training on the source-enriched v6 dataset (1,498 pairs) achieved our best result: 8.97/10 corrected, with the full progression — Gemma 3 v3b (3.41) → Gemma 3 v4 (8.52) → Qwen 3 v1 (8.81) → Qwen 3 R2 (8.97) — demonstrating consistent improvement across both data enrichment and base model selection. Critically, the source-enriched data broke through the verified_response ceiling (7.8 → 8.53) that had persisted across two base models and three learning rates, confirming that this category was bottlenecked by training data content rather than model architecture or hyperparameters.
Our most methodologically significant negative finding is that SFT after ORPO catastrophically destroys the ORPO-trained character signal (Section 5.10) — confirmed across multiple attempts with varying LoRA ranks (8, 16) and learning rates (1×10⁻⁶ to 2×10⁻⁵). This contrasts with the DPO→SFT pipeline of Maiya et al. (2025); we trace the difference to how ORPO and DPO structurally encode preferences. ORPO's monolithic objective trades extensibility for efficiency — excellent for single-stage training but incompatible with subsequent SFT stages. We have abandoned post-ORPO SFT entirely, incorporating all desired capabilities into ORPO preference pairs. This has direct implications for practitioners choosing between DPO and ORPO for multi-stage character training pipelines.
Critically, we discovered that inference infrastructure dominates evaluation outcomes for style-focused fine-tunes. The same Gemma v4 model scores 8.52 on BF16 and 1.74 on GGUF Q4_K_M — a finding that the broader fine-tuning community should attend to when evaluating and deploying character models.
A learning rate sweep on Qwen 3-32B established lr=2×10⁻⁵ as optimal for character imprinting — higher than the ORPO paper's recommendation — with lower rates disproportionately sacrificing factual grounding while maintaining voice quality (Section 5.12).
Improvements to the evaluation pipeline — a weighted average override replacing the judge's ad-hoc scoring, and JSON repair reducing parse failures — produced more reliable cross-run comparisons and removed systematic judge bias from our reported metrics.
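The JSON repair step can be sketched as a best-effort cleanup of the two most common judge-output failures; this is a sketch, not the pipeline's actual implementation:

```python
import json
import re

def repair_judge_json(raw: str):
    """Best-effort repair of common LLM-judge output failures
    before parsing: markdown code fences wrapped around the JSON,
    and trailing commas before a closing brace or bracket."""
    # Strip leading/trailing markdown fences (``` or ```json).
    raw = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    # Remove trailing commas, which json.loads rejects.
    raw = re.sub(r",\s*([}\]])", r"\1", raw)
    return json.loads(raw)
```

Repairs of this kind reduce but do not eliminate parse failures: truncated output or free prose mixed into the JSON still raises an exception and must be handled by retry or exclusion.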
Future work includes resolving quantization sensitivity for local deployment, human evaluation through the Chamber chat interface, extending the methodology to Hamilton and Jefferson for multi-character founding-era debates, and exploring voice synthesis for audio presentation.
8. References#
- Askell, A., et al. (2024). "Claude's Character." Anthropic.
- Fernando, C., et al. (2024). "Mitigating the Alignment Tax of RLHF." arXiv:2410.15483.
- Hong, J., Lee, N., & Thorne, J. (2024). "ORPO: Monolithic Preference Optimization without Reference Model." arXiv:2403.07691.
- Lambert, N. (2025). "Opening the Character Training Pipeline." Interconnects.ai.
- Lambert, N. (2025). The RLHF Book, Chapters 17 (Product & Character) and 19 (Character Training).
- Maiya, A., Bartsch, P., Lambert, N., & Hubinger, E. (2025). "Open Character Training: Shaping the Persona of AI Assistants through Constitutional AI." arXiv:2511.01689.
- Pan, A., et al. (2025). "What Matters in Data for DPO?" NeurIPS 2025. arXiv:2508.18312.
- Shao, Y., Li, L., Dai, J., & Qiu, X. (2023). "Character-LLM: A Trainable Agent for Role-Playing." arXiv:2310.10158. EMNLP 2023.
- Wang, Y., et al. (2024). "Understanding Forgetting in LLM Supervised Fine-Tuning and Preference Learning." arXiv:2410.15483.
Appendices#
A. The Madison Constitution (5K version)#
Full text available in repository at config/constitutions/madison-5k.md and published at seaberger.github.io/Foundry/constitution/.
B. Behavioral Test Suite#
36 evaluation prompts with category assignments and ground truth signals. Available at data/eval/eval-prompts.jsonl.
C. Training Data Examples#
Selected DPO pairs demonstrating voice contrast. To be populated.
D. Source Corpus Statistics#
Detailed word counts, document inventory, and extraction methodology. Available at sources/SOURCES.md.
E. Hyperparameter Search Results#
DPO v1 collapse analysis, ORPO v2/v3/v3b comparison. Available at docs/eval-analysis-orpo-v3b.md.
F. Voice Contamination Detection#
Regex patterns for contraction, bullet point, and modern filler detection with false positive mitigation for historical prose. Implementation in scripts/data/assemble_v4_dataset.py.
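A minimal sketch of this detection, with assumed patterns and an assumed allowlist — the actual patterns live in scripts/data/assemble_v4_dataset.py:

```python
import re

# Modern stylistic markers that signal voice contamination in a
# candidate Madison response. Patterns are illustrative.
CONTRACTION = re.compile(r"\w+'\w+")
BULLET = re.compile(r"^\s*[-*•]\s+", re.MULTILINE)
MODERN_FILLER = re.compile(
    r"\b(?:basically|actually|interestingly|obviously)\b", re.IGNORECASE
)

# Period-appropriate forms that would otherwise trip the
# contraction pattern (false-positive mitigation).
HISTORICAL_ALLOWLIST = {"o'er", "ne'er", "e'en"}

def is_contaminated(text: str) -> bool:
    """Flag modern contractions, bullet lists, and filler words,
    while tolerating archaic elisions found in historical prose."""
    for match in CONTRACTION.finditer(text):
        if match.group(0).lower() not in HISTORICAL_ALLOWLIST:
            return True
    return bool(BULLET.search(text) or MODERN_FILLER.search(text))
```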
G. Qwen 3-32B Learning Rate Sweep#
Three-point LR sweep on identical v4 dataset (1,273 pairs), rank 64, beta=0.1, 3 epochs. Column abbreviations: AT = anachronism_trap, CC = character_consistency, GT = ground_truth, PD = position_discrimination, PV = private_voice, VR = verified_response.
| Run | LR | Overall (Corrected) | AT | CC | GT | PD | PV | VR | W&B |
|---|---|---|---|---|---|---|---|---|---|
| v1 | 2×10⁻⁵ | 8.81 | 9.4 | 9.2 | 8.8 | 9.4 | 8.7 | 7.8 | runs/33o9hr5y |
| R2 | 2×10⁻⁵ | 8.97 | 9.39 | 9.25 | 8.85 | 9.25 | 8.75 | 8.53 | — |
| v4 | 1.2×10⁻⁵ | 8.3 | 9.4 | 8.9 | 6.9 | 9.5 | 8.3 | 7.8 | runs/86vb2rdk |
| v3 | 8×10⁻⁶ | 7.84 | — | — | — | — | — | — | — |
Eval results at data/eval/results/eval-qwen3-v1-judged-20260329-215556.json and eval-qwen3-v4-lr12e6-full-judged-20260330-175227.json.
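The sweep's winning configuration can be expressed as a hedged sketch against TRL's ORPO implementation; ORPOConfig and ORPOTrainer are real TRL classes, but lora_alpha, output_dir, and the commented trainer call are illustrative assumptions rather than the repository's exact arguments:

```python
# Hedged sketch of the sweep's best configuration
# (lr=2e-5, beta=0.1, rank 64, 3 epochs).
from peft import LoraConfig
from trl import ORPOConfig, ORPOTrainer

peft_config = LoraConfig(r=64, lora_alpha=128, task_type="CAUSAL_LM")

args = ORPOConfig(
    learning_rate=2e-5,            # sweep optimum for character imprinting
    beta=0.1,                      # ORPO odds-ratio loss weighting
    num_train_epochs=3,
    output_dir="runs/qwen3-orpo",  # illustrative path
)
# trainer = ORPOTrainer(model=model, args=args,
#                       train_dataset=preference_pairs,
#                       peft_config=peft_config)
# trainer.train()
```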