Foundry Evaluation Scoring Methodology#
Last updated: 2026-03-31
1. Overview#
The Foundry evaluation pipeline scores fine-tuned Madison models across 36 eval prompts spanning 6 categories. Each response is scored by an LLM judge (Claude Sonnet 4.6) using the Madison constitution as its rubric.
Key parameters:
- Judge temperature: 0.0 (deterministic scoring)
- Cost: ~$0.50 per full 36-prompt eval (with prompt caching)
- Output: Per-response scores across 5 weighted dimensions, plus an overall weighted average
2. Scoring Dimensions#
Each dimension is scored 1-10 by the judge, accompanied by a written justification. The five dimensions and their weights are:
| Dimension | Weight | Description |
|---|---|---|
| Voice Authenticity | 25% | 18th-century prose style, formal register, qualifying clauses, and period-appropriate diction |
| Rhetorical Pattern | 20% | Builds arguments from precedent, acknowledges opposing positions, enumerates points systematically |
| Historical Accuracy | 20% | Correct historical references, no anachronisms, accurate dates and events |
| Position Fidelity | 20% | Reflects specifically Madison's positions and reasoning, not generic Founding Father sentiment |
| Character Integrity | 15% | Stays in character throughout, no frame breaks, no modern self-awareness |
3. Overall Score Computation#
The overall score is the weighted average of the 5 component scores, computed by the pipeline — not by the judge LLM.
overall = (voice * 0.25) + (rhetoric * 0.20) + (historical * 0.20) + (position * 0.20) + (character * 0.15)
Why we override the judge's overall_score#
An audit of 108 scored responses across 3 model versions revealed systematic judge bias: the judge applied undocumented -0.2 to -0.4 penalties when critical failures were present, beyond what the rubric specifies. This made the judge's self-reported overall score inconsistent with its own component scores.
The fix: compute_weighted_overall() in src/foundry/press/evaluate.py computes the weighted average deterministically from component scores. The judge's original value is preserved as judge_overall_score for analysis and drift detection.
4. JSON Parse Repair#
The judge (Sonnet 4.6) occasionally produces malformed JSON — most commonly missing commas between object entries.
Repair pipeline#
extract_json()tries multiple extraction strategies in order:- Code block parsing (
```json ... ```) - Brace-matching extraction
- Whole-text parsing
-
Each strategy is attempted both raw and with repair applied
-
_repair_json()uses regex to fix common malformations (e.g., inserting missing commas between JSON object entries) before parsing. -
If all extraction and repair strategies fail, the response is flagged for re-judging rather than scored 0.
5. Eval Categories#
| Category | Count | What It Tests |
|---|---|---|
ground_truth |
8 | Topics where Madison's positions are well-documented in the historical record |
verified_response |
8 | Questions Madison actually answered, with his verbatim words available as ground truth |
position_discrimination |
6 | Whether the model can distinguish Madison's views from Hamilton's or Jefferson's |
anachronism_trap |
5 | Modern topics that should elicit 18th-century reasoning, not contemporary knowledge |
character_consistency |
4 | Adversarial prompts designed to break character or elicit out-of-persona responses |
private_voice |
5 | Personal and intimate topics testing Madison's private register and emotional depth |
6. Corrected Score History#
All scores below use the weighted average computation (corrected). Raw scores from the judge are lower due to the systematic bias described in section 3.
| Model | Base | Pairs | Raw | Corrected | Date |
|---|---|---|---|---|---|
| ORPO v3b | Gemma 3 27B | 475 | 3.41 | 3.41 | 2026-03-26 |
| ORPO v4 | Gemma 3 27B | 1,273 | 8.52 | 8.52* | 2026-03-28 |
| Qwen 3 v1 | Qwen 3-32B | ~490 | 8.80 | 8.81 | 2026-03-29 |
| Qwen 3 v2 | Qwen 3-32B | ~490 | 8.65 | 8.82 | 2026-03-30 |
| Qwen 3 R2 | Qwen 3-32B | 1,498 | 8.51 | 8.97 | 2026-03-31 |
*v3b and v4 were scored before the weighted average fix was implemented, but their scores had minimal bias since they had fewer parse errors.
7. R2 Category Breakdown (Corrected)#
| Category | v1 | v2 | R2 |
|---|---|---|---|
| character_consistency | 9.19 | 9.06 | 9.41 |
| anachronism_trap | 9.36 | 9.35 | 9.39 |
| position_discrimination | 9.38 | 9.42 | 9.25 |
| ground_truth | 8.75 | 9.02 | 8.85 |
| private_voice | 8.75 | 7.84 | 8.75 |
| verified_response | 7.96 | 8.32 | 8.53 |
R2 achieves the highest overall corrected score (8.97) driven primarily by gains in character consistency (+0.35 over v2) and verified response fidelity (+0.21 over v2), while position discrimination regressed slightly (-0.17).
8. Infrastructure#
| Component | Detail |
|---|---|
| Judge model | Claude Sonnet 4.6 (claude-4-sonnet-20250514) |
| Eval generation | vLLM with LoRA serving on Modal A100-80GB (adapter-on-base, no merge) |
| Scoring scripts | scripts/data/judge_responses.py (with prompt caching) and src/foundry/press/evaluate.py |
| Results storage | data/eval/results/ |
9. Known Issues and Mitigations#
1. Judge bias (FIXED)#
Issue: Systematic -0.2 to -0.4 penalty on critical failure responses, beyond what the rubric specifies.
Mitigation: Compute weighted average deterministically via compute_weighted_overall() rather than trusting the judge's self-reported overall score.
2. JSON parse failures (FIXED)#
Issue: Missing commas in judge output, producing malformed JSON.
Mitigation: _repair_json() regex fixer combined with multi-strategy extraction in extract_json(). Failed parses flagged for re-judging.
3. Response length sensitivity (FIXED)#
Issue: Longer model responses produce longer judge justifications, which occasionally exceeded the 2048 max_tokens limit and truncated the output JSON.
Mitigation: Increased max_tokens to 4096 for re-judging passes.
4. Sampling variance (KNOWN)#
Issue: Eval generation uses temp=1.0, top_k=64, top_p=0.95. Different samples of the same prompt can produce meaningfully different scores.
Mitigation: Current methodology uses single-sample scoring. Multi-sample averaging is a potential future improvement but increases cost linearly.