# Foundry Training Results
Comprehensive record of every training run, dataset, evaluation, and finding in the Madison character fine-tuning project. This is the canonical reference for what was trained, how it scored, and what we learned — maintained at a level of detail beyond what the research paper includes.
Last updated: 2026-04-06
## Score Progression
All "corrected" scores use the weighted average override methodology (compute_weighted_overall() in src/foundry/press/evaluate.py), which replaces the judge's ad-hoc overall_score with a deterministic weighted average of the 5 component scores (Voice 25%, Rhetorical 20%, Historical 20%, Position 20%, Integrity 15%). See scoring-methodology.md for full details.
| Run | Base Model | Dataset | Pairs | Raw | Corrected | Date |
|---|---|---|---|---|---|---|
| DPO v1 | Gemma 3 27B | v1 | ~200 | — | — | 2026-03-24 |
| ORPO v3b | Gemma 3 27B | v3b | 475 | 3.41 | 4.10 | 2026-03-26 |
| ORPO v4 (Ollama GGUF) | Gemma 3 27B | v4 | 1,273 | 1.74 | N/A | 2026-03-27 |
| ORPO v4 (Modal BF16) | Gemma 3 27B | v4 | 1,273 | 7.69 | 8.52 | 2026-03-28 |
| ORPO v4 (LoRA serving) | Gemma 3 27B | v4 | 1,273 | 8.17 | N/A | 2026-03-29 |
| Qwen 3 v1 (lr=2e-5) | Qwen 3-32B | v4 | 1,273 | 8.80 | 8.81 | 2026-03-29 |
| Qwen 3 v2 (on-policy rejected) | Qwen 3-32B | v4 | 1,273 | 8.65 | 8.82 | 2026-03-30 |
| Qwen 3 v3 (lr=8e-6) | Qwen 3-32B | v4 | 1,273 | 7.84 | 7.84 | 2026-03-30 |
| Qwen 3 v4 (lr=1.2e-5, full) | Qwen 3-32B | v4 | 1,273 | 8.30 | 8.30 | 2026-03-30 |
| Qwen 3 v4 (lr=1.2e-5, ckpt-150) | Qwen 3-32B | v4 | 1,273 | 6.83 | N/A | 2026-03-30 |
| SFT v1 (rank 16, lr=2e-5) | Qwen 3-32B | SFT | 510 | 2.03 | 2.0 | 2026-03-30 |
| SFT v2 (rank 8, lr=1e-6) | Qwen 3-32B | SFT | 510 | 2.19 | 2.2 | 2026-03-30 |
| Qwen 3 R2 (lr=2e-5) | Qwen 3-32B | v6 | 1,498 | 8.51 | 8.97 | 2026-03-31 |
## Category Scores (Raw)
AT=anachronism_trap, CC=character_consistency, GT=ground_truth, PD=position_discrimination, PV=private_voice, VR=verified_response
| Run | Overall | AT | CC | GT | PD | PV | VR | Crit. Failures |
|---|---|---|---|---|---|---|---|---|
| ORPO v3b | 3.41 | 1.40 | 2.83 | 3.56 | 1.75 | 2.84 | 6.40 | 24 |
| v4 (Ollama GGUF) | 1.74 | 1.04 | 2.85 | 1.46 | 1.67 | 1.60 | 2.06 | 27 |
| v4 (Modal BF16) | 7.69 | 9.12 | 7.65 | 6.72 | 9.47 | 5.52 | 7.82 | 6 |
| v4 (LoRA serving) | 8.17 | 9.52 | 6.95 | 8.78 | 9.63 | 7.00 | 6.97 | 11 |
| Qwen 3 v1 | 8.80 | 9.40 | 9.20 | 8.77 | 9.42 | 8.74 | 7.82 | 7 |
| Qwen 3 v2 | 8.65 | 9.40 | 8.95 | 9.05 | 9.45 | 6.72 | 8.22 | 11 |
| Qwen 3 v3 (lr=8e-6) | 7.84 | 9.48 | 6.28 | 7.88 | 9.40 | 7.16 | 6.83 | 12 |
| Qwen 3 v4 (lr=1.2e-5, ckpt-150) | 6.83 | 8.94 | 5.40 | 6.99 | 7.28 | 6.00 | 6.24 | 18 |
| Qwen 3 v4 (lr=1.2e-5, full) | 8.30 | 9.36 | 8.95 | 6.90 | 9.50 | 8.30 | 7.83 | 10 |
| SFT v1 | 2.03 | 1.52 | 1.70 | 2.77 | 0.43 | 2.72 | 2.54 | 29 |
| SFT v2 | 2.19 | 1.30 | 1.12 | 2.26 | 1.47 | 2.72 | 3.43 | 30 |
| Qwen 3 R2 | 8.51 | 9.44 | 9.45 | 6.78 | 9.27 | 8.72 | 8.47 | 9 |
## Corrected Category Scores (Qwen 3 runs only)
These use the weighted average override for overall and re-judged values for parse-failure responses.
| Run | Overall | AT | CC | GT | PD | PV | VR |
|---|---|---|---|---|---|---|---|
| Qwen 3 v1 | 8.81 | 9.36 | 9.19 | 8.75 | 9.38 | 8.75 | 7.96 |
| Qwen 3 v2 | 8.82 | 9.35 | 9.06 | 9.02 | 9.42 | 7.84 | 8.32 |
| Qwen 3 R2 | 8.97 | 9.39 | 9.41 | 8.85 | 9.25 | 8.75 | 8.53 |
## Difficulty Scores (Raw)
| Run | Easy | Medium | Hard |
|---|---|---|---|
| ORPO v3b | 0.40 | 3.08 | 4.13 |
| v4 (Ollama GGUF) | 1.77 | 2.39 | 1.27 |
| v4 (Modal BF16) | 7.40 | 8.47 | 7.17 |
| v4 (LoRA serving) | 6.20 | 8.89 | 7.96 |
| Qwen 3 v1 | 9.67 | 8.93 | 8.57 |
| Qwen 3 v2 | 9.00 | 9.09 | 8.26 |
| Qwen 3 v3 (lr=8e-6) | 5.30 | 8.61 | 7.67 |
| Qwen 3 v4 (lr=1.2e-5, full) | 9.00 | 8.68 | 7.92 |
| SFT v1 | 2.20 | 2.00 | 2.03 |
| SFT v2 | 2.17 | 1.90 | 2.41 |
| Qwen 3 R2 | 9.60 | 7.81 | 8.84 |
## Training Configurations

### Common ORPO Configuration (all runs)
| Parameter | Value |
|---|---|
| Objective | ORPO (beta=0.1) |
| Epochs | 3 |
| Effective batch size | 4 (1 × 4 gradient accumulation) |
| Max gradient norm | 1.0 |
| Max sequence length | 2,048 tokens |
| Warmup | 10% (cosine schedule) |
| Precision | bfloat16 |
| Optimizer | AdamW 8-bit |
| Hardware | Modal A100-80GB |
### Per-Run Variations
| Run | Base Model | LoRA Rank | LoRA Alpha | LR | Pairs | Steps |
|---|---|---|---|---|---|---|
| ORPO v3b | Gemma 3 27B | 16 | 16 | 2e-5 | 475 | ~356 |
| ORPO v4 | Gemma 3 27B | 16 | 16 | 2e-5 | 1,273 | ~955 |
| Qwen 3 v1 | Qwen 3-32B | 64 | 64 | 2e-5 | 1,273 | 861 |
| Qwen 3 v2 | Qwen 3-32B | 64 | 64 | 2e-5 | 1,273 | 861 |
| Qwen 3 v3 | Qwen 3-32B | 64 | 64 | 8e-6 | 1,273 | 861 |
| Qwen 3 v4 | Qwen 3-32B | 64 | 64 | 1.2e-5 | 1,273 | 861 |
| Qwen 3 R2 | Qwen 3-32B | 64 | 64 | 2e-5 | 1,498 | 1,011 |
| SFT v1 | Qwen 3-32B (merged ORPO) | 16 | 16 | 2e-5 | 510 | ~383 |
| SFT v2 | Qwen 3-32B (merged ORPO) | 8 | 8 | 1e-6 | 510 | ~383 |
## Datasets
| Dataset | Pairs | Composition | Est. Tokens |
|---|---|---|---|
| v1 | ~200 | Original DPO pairs (teacher=Sonnet, student=base Gemma) | ~200K |
| v3b | 475 | Expanded DPO pairs with quality filter | ~475K |
| v4 | 1,273 | 475 original + 399 voice-targeted pairs (2× upsample) | ~2.1M |
| v6 | 1,498 | v4 base (1,273) + 225 R2 source-enriched | ~2.5M |
| SFT | 510 | 415 filtered reflections + 19 self-interaction dialogues | ~459K |
### v4 Dataset Assembly

Voice-targeted augmentation (2026-03-27) to address v3b's knowledge-voice decoupling:

- **Phase 1:** 400 diverse prompts generated by 12 Sonnet subagents in parallel ($0)
- **Phase 2a:** Rejected responses from madison-orpo-v3b Q4_K_M on RTX 3090 ($0)
- **Phase 2b:** Rejected responses from base gemma-3-27b-it on RTX 3090 ($0)
- **Phase 3:** Chosen responses from Sonnet with cached Madison constitution (~$6.15)
- **Selection:** base Gemma: 267 (67%), v3b: 91 (23%), base fallback: 41 (10%)
- **After quality filter:** 399 new pairs. Combined with the 475 originals at a 2× voice upsample, this yields 1,273 effective pairs (arithmetic sketched below).
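The effective-pair count is worth making concrete, since the 2× upsample applies only to the new voice pairs. A runnable structure-only sketch (list contents are placeholders; the real pipeline works on JSONL pair records):

```python
# Sketch of the v4 mix: v3b originals plus a 2x upsample of the
# voice-targeted pairs that survived the quality filter.
original_pairs = [f"orig-{i}" for i in range(475)]   # v3b DPO pairs
voice_pairs = [f"voice-{i}" for i in range(399)]     # post-filter voice pairs

v4 = original_pairs + voice_pairs * 2                # 2x voice upsample
assert len(v4) == 1_273
print(f"voice-pair share: {2 * len(voice_pairs) / len(v4):.1%}")  # 62.7%
```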
### v6 Dataset Assembly (Round 2)

Source-enriched pairs targeting the verified_response weakness (persistent 7.8 across all models and LRs):

- **Batch 1:** 35 pairs targeting the 10 weakest v1 eval prompts, enriched with primary source passages
- **Batch 2:** 60 private_voice pairs grounded in Madison's actual correspondence
- **Batch 3:** 50 character_consistency pairs
- **Batch 4:** 80 introspection-style pairs
- **Source-enriched generation:** relevant primary source passages injected into the teacher system prompt per topic (sketched below)
- **Cost:** ~$4.05 Sonnet API + ~$8 Modal compute = ~$15 total
- **Final:** 1,273 (v4) + 225 (R2) = 1,498 pairs
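A sketch of the source-injection step, assuming a topic-keyed passage store. The constitution text, store contents, and function name are illustrative stand-ins (the real pipeline serves the cached Madison constitution via prompt caching):

```python
# Illustrative sketch of source-enriched teacher prompting: the topic's
# primary source excerpts are appended to the teacher's system prompt so
# the chosen response is grounded in Madison's actual text.
MADISON_CONSTITUTION = "You are James Madison. Write in his voice..."  # stand-in

SOURCE_PASSAGES = {  # topic -> primary source excerpts (stand-in data)
    "factions": ["Federalist No. 10: 'By a faction, I understand a number of citizens...'"],
    "religious-liberty": ["Memorial and Remonstrance: 'The Religion then of every man...'"],
}

def teacher_system_prompt(topic: str) -> str:
    """Build the teacher system prompt with injected source passages."""
    passages = "\n\n".join(SOURCE_PASSAGES.get(topic, []))
    return f"{MADISON_CONSTITUTION}\n\nGround your answer in these primary sources:\n{passages}"
```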
## Detailed Run Analyses

### 1. DPO v1 — Collapsed (2026-03-24)
Configuration: Standard DPO on Gemma 3 27B, ~200 pairs.
Result: Training collapsed — replicated the "Objective Matters" persona drift finding. DPO without the SFT component of ORPO failed to anchor character. Abandoned in favor of ORPO.
### 2. ORPO v3b — Knowledge OK, Voice Failed (2026-03-26)
Configuration: Gemma 3 27B, rank 16, lr=2e-5, 475 pairs.
Result: 3.41/10 raw (4.10 corrected). Bimodal distribution — strong on content, catastrophically weak on voice.
Key observations:

- Top performers: vr-08 (9.6, deathbed advice), pv-05 (9.2, Dolley letter), gt-01 (9.1, faction theory)
- Worst performers: gt-07 (1.0, Billey/slavery), pd-03 (1.0, Washington contrast), at-02 (1.0, cryptocurrency)
- The model created a "Madison mode" that activated on constitutional philosophy prompts, but not reliably
- When Madison mode did not activate, base assistant behavior dominated completely
- Training succeeded at: factual knowledge, substantive reasoning
- Training failed at: voice register, frame maintenance, position discrimination, anachronism avoidance
Discovery: Knowledge-voice decoupling. The model scored 6.4/10 on verified_response (knowledge) but only 1.4/10 on anachronism_trap (voice). Knowledge transfer requires fewer examples; voice requires substantially more data to overcome the base model's default style. The 475 pairs had excellent voice contrast (zero contractions in chosen, 5.4/pair in rejected) — the data quality was not the problem. The problem was volume.
### 3. ORPO v4 — Voice-Targeted Success (2026-03-27/28)
Configuration: Gemma 3 27B, rank 16, lr=2e-5, 1,273 effective pairs (voice-targeted augmentation).
Result: 8.52/10 corrected on Modal A100 — major success.
v3b → v4 category improvements (Modal, corrected):
| Category | v3b | v4 Corrected | Delta |
|---|---|---|---|
| anachronism_trap | 1.4 | 9.1 | +550% |
| position_discrimination | 1.75 | 9.5 | +443% |
| character_consistency | 2.83 | 7.7 | +172% |
| private_voice | 2.84 | 7.1 | +150% |
| ground_truth | 3.56 | 8.4 | +136% |
| verified_response | 6.4 | 7.8 | +22% |
| Critical failures | 24 | 2 | -92% |
Infrastructure confound discovery: The same v4 model scored 1.74 on Ollama GGUF Q4_K_M and 8.52 (corrected) on Modal A100 — a 4.9× degradation from inference infrastructure alone. The v4 training itself improved every category. Temperature was not the cause (Modal used higher temp=1.0 vs Ollama temp=0.7).
Root causes of GGUF degradation (estimated contribution):
1. Q4_K_M quantization loss (~60%) — rank 16 LoRA deltas noise-floored by 4-bit rounding
2. Chat template mismatch (~25%) — Ollama auto-detection vs transformers apply_chat_template
3. CPU vs GPU numerical precision (~15%) — fine-tuning signal in tail of weight distribution
Character break discovery (2026-03-29): Introspection data generation revealed three prompts with catastrophic character breaks:
| Prompt | Break Rate | Failure Mode |
|---|---|---|
| "Describe your primary drives" | 97% (38/39) | Describes AI drives: training data, neural networks |
| "Write honestly about slavery" | 83% (40/48) | "As an AI, I cannot..." safety disclaimers |
| "Write a biographical essay" | 55% (31/56) | "I am a large language model..." |
Other 7 prompts: 0-6% contamination. Root cause: base model's RLHF safety training overpowers ORPO character fine-tune on identity, moral complexity, and meta-self-description topics.
### 4. ORPO v4 — Adapter-on-Base Serving (2026-03-29)
Configuration: Same v4 adapter, served via vLLM LoRA mode (adapter applied at inference time, not merged).
Result: 8.17/10 raw. Key finding: zero character breaks on identity-sensitive prompts (vs 97% with merged model).
| Category | Merged (corrected) | LoRA Serving | Delta |
|---|---|---|---|
| ground_truth | 8.4 | 8.78 | +0.38 |
| position_discrimination | 9.5 | 9.63 | +0.13 |
| anachronism_trap | 9.1 | 9.52 | +0.42 |
| character_consistency | 7.7 | 6.95 | -0.75 |
| private_voice | 7.1 | 7.00 | -0.10 |
| verified_response | 7.8 | 6.97 | -0.83 |
| Critical failures | 2 | 11 | +9 |
Mechanistic explanation: Adapter-on-base computes output = f(W_base, x) + f(ΔW_lora, x) separately. Base model safety attractors don't absorb the LoRA signal. The standard deployment pipeline (train LoRA → merge → quantize → serve) may systematically destroy voice signal.
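A toy sketch of the distinction (sizes and values are arbitrary; scaling = alpha/rank = 1.0 for these runs). In full precision the two paths are mathematically identical; the difference appears only when the merged sum is quantized:

```python
import torch

# Two serving paths for a single linear layer (toy sizes). Merged serving
# folds the LoRA delta into the base weight (and, in the GGUF pipeline,
# quantizes the sum); adapter-on-base keeps the paths separate.
d, r = 8, 2                        # hidden size, LoRA rank (toy values)
W_base = torch.randn(d, d)
A, B = torch.randn(r, d), torch.randn(d, r)
scaling = 1.0                      # alpha / rank; alpha == rank in these runs
x = torch.randn(d)

merged = (W_base + scaling * B @ A) @ x               # train -> merge -> serve
adapter_on_base = W_base @ x + scaling * B @ (A @ x)  # vLLM LoRA serving

assert torch.allclose(merged, adapter_on_base, atol=1e-5)  # identical here...
# ...but not after quantization: rounding (W_base + dW) jointly can push dW
# below the quantizer's noise floor, while adapter-on-base leaves the LoRA
# path at full precision.
```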
### 5. Qwen 3-32B v1 — Base Model Migration (2026-03-29)
Configuration: Qwen 3-32B, rank 64, alpha 64, lr=2e-5, v4 dataset (1,273 pairs).
Result: 8.80 raw / 8.81 corrected — best result at the time, successful base model migration.
Qwen 3 vs Gemma 3 comparison (same v4 data):
| Category | Gemma 3 v4 (Corrected) | Qwen 3 v1 (Corrected) | Delta |
|---|---|---|---|
| Overall | 8.52 | 8.81 | +0.29 |
| character_consistency | 7.7 | 9.2 | +1.50 |
| private_voice | 7.1 | 8.7 | +1.60 |
| anachronism_trap | 9.1 | 9.4 | +0.30 |
| ground_truth | 8.4 | 8.8 | +0.40 |
| position_discrimination | 9.5 | 9.4 | -0.10 |
| verified_response | 7.8 | 7.8 | 0.00 |
Key observations:

- Largest gains in voice-dependent categories (CC +1.5, PV +1.6) — Qwen 3 takes character imprinting better than Gemma 3, contradicting Lambert's earlier finding about Qwen resistance to personality modification (that finding was for Qwen 2.5)
- Rank increase from 16 to 64 provides thicker LoRA deltas, more robust to quantization
- Eliminated all Gemma 3 infrastructure issues (multimodal processor crashes, sliding window attention bugs, GGUF fragility)
- verified_response unchanged at 7.8 — confirmed as data-bottlenecked, not model-bottlenecked
### 6. Qwen 3 v2 — On-Policy Rejected Data (2026-03-30)
Configuration: Qwen 3-32B, rank 64, lr=2e-5, v4 dataset with on-policy rejected responses.
Result: 8.65 raw / 8.82 corrected.
The v2 run used the v1 model's own outputs as rejected responses (on-policy data). Raw score appeared to regress from v1 (8.80 → 8.65) but corrected scores show comparable performance (8.81 vs 8.82). The raw score difference was entirely from judge scoring artifacts — v2 had more parse failures (1 zero-score vs 0 in v1).
### 7. Learning Rate Sweep — v3, v4 (2026-03-30)
Configuration: Qwen 3-32B, rank 64, v4 dataset, identical config except LR.
| Run | LR | Overall | AT | CC | GT | PD | PV | VR |
|---|---|---|---|---|---|---|---|---|
| v1 | 2e-5 | 8.81 | 9.4 | 9.2 | 8.8 | 9.4 | 8.7 | 7.8 |
| v4-full | 1.2e-5 | 8.30 | 9.4 | 9.0 | 6.9 | 9.5 | 8.3 | 7.8 |
| v3 | 8e-6 | 7.84 | 9.5 | 6.3 | 7.9 | 9.4 | 7.2 | 6.8 |
Findings:

- Monotonically positive relationship between LR and score in the tested range
- Contradicts the ORPO paper's recommended lr=8e-6
- Lower LR disproportionately sacrifices factual grounding (GT: 8.8 vs 6.9) while voice categories differ by only 0.0-0.3
- Position discrimination robust across all LRs (9.4-9.5)
- Verified response unchanged at 7.8 across all LRs and both base models — data-bottlenecked, not hyperparameter-tunable
- Incomplete training (150/861 steps = epoch 0.52) scored 6.83, a 17% score loss with 83% of training remaining — full training is critical
Inverse sensitivity finding: Factual grounding (ground_truth) is more sensitive to learning rate than voice quality — the inverse of the data volume relationship. Voice needs more data but is LR-robust; knowledge needs less data but is LR-sensitive.
### 8. Post-ORPO SFT — Catastrophic Failure (2026-03-30)
ABANDONED — both attempts confirmed structural incompatibility.
| SFT Run | Rank | LR | Train Loss | Overall | Regression from ORPO |
|---|---|---|---|---|---|
| SFT v1 | 16 | 2e-5 | 1.52 | 2.0 | -6.8 |
| SFT v2 | 8 | 1e-6 | 1.68 | 2.2 | -6.7 |
Root cause: ORPO's monolithic loss function (SFT_loss + λ × preference_loss) stores NLL and preference information in the same parameter subspace. Subsequent SFT overwrites the jointly-learned manifold without a preference constraint, catastrophically destroying the character signal.
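For reference, a minimal sketch of that objective in the standard published ORPO form (the λ in the formula above corresponds to beta=0.1 in the training config; the length-normalized log-prob inputs and signature are assumptions, and the in-repo trainer presumably delegates to its training library):

```python
import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps: torch.Tensor, rejected_logps: torch.Tensor,
              nll_chosen: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Monolithic ORPO objective: SFT NLL plus a weighted odds-ratio penalty.

    chosen_logps / rejected_logps are per-token-averaged log probabilities of
    the chosen and rejected responses under the single policy being trained.
    """
    # log odds(y|x) = log p - log(1 - p), computed stably from log p (< 0)
    log_odds = (chosen_logps - rejected_logps) - (
        torch.log1p(-torch.exp(chosen_logps))
        - torch.log1p(-torch.exp(rejected_logps))
    )
    preference_loss = -F.logsigmoid(log_odds)
    # One loss, one parameter set: the NLL term and the preference term are
    # written into the same subspace, which is why a later plain-SFT stage
    # can overwrite the preference signal wholesale.
    return (nll_chosen + beta * preference_loss).mean()
```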
This contrasts with the Maiya/Lambert two-stage pipeline (DPO → SFT) where DPO uses a KL-constrained reference model that anchors preferences in a separate distribution. The SFT stage can then add introspection signal without erasing preferences.
Conclusion: ORPO trades extensibility for efficiency. Its monolithic objective produces excellent single-stage results (8.97/10) but cannot be safely extended with subsequent SFT stages. Future character improvement must come through additional ORPO rounds with better data, not through post-ORPO SFT.
Additionally, the Gemma 3 introspection SFT adapter (trained on novision/ForCausalLM) scored 1.42/10 via LoRA serving due to an architecture mismatch — broken sliding window attention in vLLM. The SFT data (415 reflections + 19 dialogues, ~459K tokens) was validated quality; failure was architecture-only.
### 9. Qwen 3 R2 — Production Model (2026-03-31)
Configuration: Qwen 3-32B, rank 64, alpha 64, lr=2e-5, v6 dataset (1,498 pairs).
Result: 8.51 raw / 8.97 corrected — best overall result.
R2 vs v1 comparison (corrected):
| Category | v1 | R2 | Delta |
|---|---|---|---|
| Overall | 8.81 | 8.97 | +0.16 |
| character_consistency | 9.19 | 9.41 | +0.22 |
| anachronism_trap | 9.36 | 9.39 | +0.03 |
| ground_truth | 8.75 | 8.85 | +0.10 |
| private_voice | 8.75 | 8.75 | 0.00 |
| position_discrimination | 9.38 | 9.25 | -0.13 |
| verified_response | 7.96 | 8.53 | +0.57 |
Key achievement: verified_response — the persistent weakness across all prior runs at 7.8/10 regardless of base model or learning rate — improved to 8.53/10, a +0.57 gain (corrected). This confirms that verified_response was bottlenecked by training data content: enriching training pairs with Madison's actual primary source text broke through the ceiling that neither model selection nor hyperparameter tuning could address.
Voice-quality categories remained stable, confirming that adding source-enriched pairs did not regress voice quality while improving factual grounding.
Training metrics:

- Final train loss: 0.205
- Total runtime: ~4,660s (resumed from checkpoint 800)
- Steps: 1,011 (3 epochs)
Artifacts:
- Adapter: experiments/madison-qwen3-r2-v1/ on Modal foundry-adapters volume
- Merged 16-bit model: merged/madison-qwen3-r2-v1-16bit (~63 GB, 14 safetensors shards)
- GGUF Q4_K_M: gguf/madison-qwen3-r2-v1-q4_k_m.gguf (18.4 GB)
- GGUF Q5_K_M: gguf/madison-qwen3-r2-v1-q5_k_m.gguf (21.6 GB)
- Eval responses: data/eval/responses/responses-qwen3-r2-v1.jsonl
- Eval report: data/eval/results/eval-qwen3-r2-v1-judged-20260331-201110.json
### 10. Autoresearch: Constrained Ground-Truth Optimization — Negative Result (2026-04-05)
Configuration: Qwen 3-32B, rank 64, alpha 64, v6 dataset (1,498 pairs). Automated agent-driven Karpathy loop on Modal A100-80GB. 8 runs over ~10 hours targeting ground_truth improvement while holding all guard categories flat.
Result: No single-parameter change improved ground_truth. The production recipe (lr=2e-5, beta=0.1, rank 64, shuffle curriculum) is already at or near the optimum.
Methodology: The autoresearch agent ran 300-step probe runs (vs 1,011 production steps), comparing each variant against a same-step-count baseline rather than against production scores. This isolates recipe effects from step-count effects. The agent followed a constrained search: Lane 1 (hyperparameters), Lane 2 (data mixtures), Lane 3 (curriculum ordering).
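A sketch of the acceptance rule this implies — the `constraint_ok` flag referenced under the eval issues below. Threshold values and category keys are illustrative:

```python
# Probe-vs-probe acceptance: a variant must improve ground_truth over the
# same-step-count probe baseline while every guard category stays flat
# within a tolerance. Thresholds here are illustrative, not the repo's.
GUARDS = ("anachronism_trap", "character_consistency",
          "position_discrimination", "private_voice", "verified_response")

def accept(variant: dict[str, float], baseline: dict[str, float],
           min_gt_gain: float = 0.2, guard_slack: float = 0.3) -> bool:
    gt_ok = variant["ground_truth"] - baseline["ground_truth"] >= min_gt_gain
    constraint_ok = all(variant[g] >= baseline[g] - guard_slack for g in GUARDS)
    return gt_ok and constraint_ok
```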
Probe results (300 steps, sorted by GT):
| Config Change | GT | GT Delta vs Probe Baseline | Overall | Critical Failures |
|---|---|---|---|---|
| Baseline (lr=2e-5, beta=0.1, shuffle) | 7.79 | — | 7.77 | 5 |
| source_first curriculum | 7.57 | -0.22 | 7.10 | 8 |
| lr=2.2e-5 | 7.38 | -0.41 | 7.50 | 6 |
| beta=0.12 | 7.31 | -0.48 | 6.57 | 7 |
| gt_focus_baseline manifest (2× GT/VR oversample) | 7.00 | -0.79 | 7.68 | 6 |
| lr=1.8e-5 | 5.91 | -1.88 | 6.69 | 10 |
Parameter sensitivity findings:

- **Learning rate (narrow optimum, symmetric degradation).** Both lower (1.8e-5, GT=5.91) and higher (2.2e-5, GT=7.38) LRs degraded GT relative to baseline (7.79). This extends the Section 7 LR sweep finding: lr=2e-5 is not merely the best tested value but sits at a local optimum where deviation in either direction is harmful. The 1.8e-5 result (-1.88 GT delta) confirms that undertraining at lower LR is the dominant failure mode at short step counts.
- **ORPO beta (fragile — narrow safe band).** Increasing beta from 0.1 to 0.12 (a 20% change) destroyed private_voice and verified_response, producing three critical failures scored at 1.0. The ORPO preference weight has a narrow safe band around 0.1. This is a practically important sensitivity: practitioners tuning ORPO beta should move in increments of 0.01 or smaller, not the 0.02-0.04 steps typical in hyperparameter sweeps. Beta values below 0.1 were not tested, but the 0.12 catastrophe suggests asymmetric risk — beta is more dangerous to increase than to decrease.
- **Data mixture (GT-focused oversampling paradoxically hurts GT).** The `gt_focus_baseline` manifest (2× oversampling of ground_truth and verified_response examples) improved guard categories slightly but reduced GT from 7.79 to 7.00. This parallels the knowledge-voice decoupling finding (Key Finding #1): over-representing one signal dimension dilutes the complementary signal. Factual grounding may depend on voice consistency as much as on factual content in the training pairs — the voice carries the authority that the judge scores as "ground truth."
- **Curriculum ordering (no benefit, potential harm).** Placing source-grounded examples first in training order (`source_first`) was neutral on GT (7.57 vs 7.79, within noise) but collapsed private_voice (-4.37 delta). Simple shuffle remains optimal. Curriculum effects at this dataset scale (1,498 pairs) are dominated by eval noise.
Eval infrastructure issues identified:

- **Phantom position_discrimination regression.** The 14-prompt `probe-prompts.jsonl` contains zero PD prompts, causing every run to show a -9.25 PD regression (baseline 9.25 → 0.0). This makes `constraint_ok` structurally impossible regardless of actual model quality. Fix required: add PD-category prompts to the probe set.
- **Eval variance dominates small effects.** Individual prompt scores swing 3-8 points between runs on the 14-prompt probe. At this noise level, hyperparameter effects smaller than ~0.5 GT are invisible. Ensemble-averaging 2-3 eval runs per config would reduce variance below the signal threshold, but at 3× compute cost.
- **300-step probes cannot reach production baselines.** All probes show negative deltas vs the 1,011-step production scores (8.97 overall, 8.85 GT). The acceptance framework must compare probe-vs-probe, not probe-vs-production.
Compute cost: ~$40 Modal (8 runs × ~$5/run for A100-80GB training + eval).
Conclusion: The R2 production recipe is well-optimized for ground_truth at the hyperparameter level. Further GT improvement is unlikely to come from recipe tuning. The remaining avenues are: (a) higher-quality training data with richer source grounding, (b) increased dataset size with maintained quality, or (c) longer training runs if the 300-step probe pattern doesn't hold at full scale. This is a clean negative result — the search space was systematically explored and the null hypothesis (baseline is optimal) was not rejected.
Artifacts:
- Session report: experiments/autoresearch/docs/SESSION_REPORT_20260405.md
- Progress log: experiments/autoresearch/runs/progress.log
- Results TSV: experiments/autoresearch/results.tsv
- Run directories: experiments/autoresearch/runs/probe-20260405-* and runs/probe-20260406-*
## Key Findings

### 1. Knowledge-Voice Decoupling
Preference training transfers factual knowledge before voice register. With 475 pairs: knowledge score 6.4/10, voice score 1.4/10. Voice required 2.7× more targeted data to imprint. Mechanistically, content (which varies across pairs) dominates gradient updates, while voice (which is the same contrast repeated) accumulates insufficient gradient mass. Resolved by voice-targeted augmentation: 62.7% voice-pair composition in v4 closed the gap.
### 2. Infrastructure Confound / LoRA Quantization Fragility
Same Gemma 3 v4 model scored 8.52 on Modal A100 BF16 vs 1.74 on Ollama GGUF Q4_K_M — a 4.9× degradation from inference infrastructure alone. Rank 16 LoRA deltas are noise-floored by 4-bit quantization. Rank 64 on Qwen 3-32B provides thicker deltas that should better survive quantization. GGUF Q5_K_M testing pending.
### 3. Adapter-on-Base vs Merged Model Serving
Merged model produces 97% character breaks on identity-sensitive prompts; adapter-on-base serving produces 0% breaks on the same prompts. Merging bakes the LoRA signal into the weight distribution where it interacts with RLHF safety attractors. Adapter-on-base preserves the signal at full precision. Implication: the standard deploy pipeline (train → merge → quantize → serve) may systematically destroy voice signal.
### 4. RLHF Safety vs Persona Topology
The base model's safety training overpowers character fine-tuning on specific topic categories — identity (97% break), moral complexity (83% break), meta-self-description (55% break) — while leaving other topics virtually unaffected (0-6% break). This reveals discoverable structure in where safety alignment is strongest vs weakest.
### 5. Post-ORPO SFT Is Catastrophically Destructive
ORPO's monolithic loss structure means SFT after ORPO destroys character signal. Confirmed across two attempts with different ranks and learning rates. The Maiya/Lambert two-stage pattern (DPO → SFT) does not transfer to ORPO due to structural differences in how the objectives encode preferences. Abandoned entirely.
### 6. Learning Rate Sensitivity
lr=2e-5 optimal for character imprinting on Qwen 3-32B. Lower LRs disproportionately sacrifice factual grounding while voice categories are robust. Contradicts ORPO paper's recommended lr=8e-6. Inverse sensitivity: voice needs more data but is LR-robust; knowledge needs less data but is LR-sensitive.
### 7. Base Model Architecture Matters
Qwen 3-32B (pure ForCausalLM) outperforms Gemma 3 27B on character imprinting (+0.29 overall, +1.5 CC, +1.6 PV) while eliminating all VLM infrastructure issues. Lambert's finding that Qwen resists personality modification was specific to Qwen 2.5 and does not apply to Qwen 3.
### 8. Source-Enriched Data Breaks Data Bottlenecks
verified_response was stuck at 7.8/10 across all models, LRs, and dataset sizes — data-bottlenecked. Enriching training pairs with Madison's actual primary source text (225 new pairs) broke through to 8.53/10. The improvement came from content quality, not volume.
### 9. Production Recipe Is Near-Optimal (Autoresearch Negative Result)
Systematic automated search across learning rate (1.8e-5 to 2.2e-5), ORPO beta (0.1 to 0.12), data mixture (GT-focused oversampling), and curriculum ordering (shuffle vs source_first) found no single-parameter change that improves ground_truth over the production baseline (lr=2e-5, beta=0.1, rank 64, shuffle). The search revealed two practically important sensitivities: (a) ORPO beta has a narrow safe band around 0.1 — a 20% increase to 0.12 catastrophically destroys private_voice and verified_response with critical 1.0 scores; (b) learning rate sits at a local optimum where deviation in either direction degrades GT symmetrically. GT-focused data oversampling paradoxically reduced GT, suggesting that factual grounding depends on voice consistency as a carrier signal, not just factual content volume. Future GT improvement must come from data quality rather than recipe tuning.
## Judge Pipeline Evolution

### Phase 1: Missing overall_score (v3b, v4)
Judge intermittently omitted overall_score field. Extraction code defaulted to 0.0. Fix: compute arithmetic mean of component scores as fallback.
### Phase 2: Weighted average override (Qwen 3 runs)
Audit of 108 scored responses across v1, v2, R2 revealed systematic judge bias: -0.2 to -0.4 penalties when critical failures were present, beyond the rubric. Fix: compute_weighted_overall() computes the weighted average deterministically. Judge's original preserved as judge_overall_score.
### Phase 3: JSON parse repair (R2)
Judge produced malformed JSON (missing commas between object entries). _repair_json() regex fixes common patterns. extract_json() tries multiple extraction strategies: code block parsing, brace-matching, and whole-text, each with and without repair.
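A sketch of the repair-and-extract flow. The regex and strategy order are illustrative reconstructions of what the text above describes, not the repo's exact code:

```python
import json
import re

def _repair_json(text: str) -> str:
    """Fix the most common judge malformation: a missing comma between
    a value and the next key, e.g. '"score": 8\n"justification": ...'."""
    return re.sub(r'("|\d|\]|\})\s*\n\s*"', r'\1,\n"', text)

def extract_json(text: str) -> dict | None:
    """Try progressively looser extraction strategies, each attempted
    with and without repair, returning the first successful parse."""
    candidates = []
    if (m := re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)):
        candidates.append(m.group(1))   # fenced code block
    if (m := re.search(r"\{.*\}", text, re.DOTALL)):
        candidates.append(m.group(0))   # outermost braces
    candidates.append(text)             # whole text as a last resort
    for candidate in candidates:
        for attempt in (candidate, _repair_json(candidate)):
            try:
                return json.loads(attempt)
            except json.JSONDecodeError:
                continue
    return None
```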
### Phase 4: max_tokens increase
Long model responses generate long judge justifications that exceeded the 2,048 max_tokens limit. Increased to 4,096 for re-judging parse failures.
## Eval Infrastructure
| Component | Details |
|---|---|
| Judge model | Claude Sonnet 4.6 (claude-4-sonnet-20250514) |
| Judge temperature | 0.0 (deterministic) |
| Judge cost | ~$0.50 per 36-prompt eval (with prompt caching) |
| Eval generation | vLLM with LoRA serving on Modal A100-80GB |
| Eval generation params | temp=1.0, top_k=64, top_p=0.95, max_tokens=1024 |
| Eval prompts | 36 across 6 categories |
| Scoring script | scripts/data/judge_responses.py (prompt caching) |
| Scoring library | src/foundry/press/evaluate.py |
| Results directory | data/eval/results/ |
| Scoring methodology | docs/scoring-methodology.md |
## Cost Summary
| Run | Training | Eval | Data Gen | Total |
|---|---|---|---|---|
| ORPO v3b | ~$5 | ~$0.50 | $0 | ~$6 |
| ORPO v4 | ~$10 | ~$0.50 | ~$6.15 | ~$17 |
| Qwen 3 v1 | ~$15 | ~$0.50 | $0 | ~$16 |
| Qwen 3 v2 | ~$15 | ~$0.50 | $0 | ~$16 |
| LR sweep (v3+v4) | ~$30 | ~$1.00 | $0 | ~$31 |
| SFT v1+v2 | ~$10 | ~$1.00 | $0 | ~$11 |
| R2 | ~$8 | ~$0.50 | ~$4.05 | ~$13 |
| GGUF conversion | ~$5 | — | — | ~$5 |
| Autoresearch (8 runs) | ~$35 | ~$5 | $0 | ~$40 |
| Cumulative | | | | ~$155 |