# Decision Log — male-attractiveness-uplift

## iter-0000 — PASS

- male_attractiveness_mean: **7.03**
- female_attractiveness_mean: 7.70
- lipsync_adherence_mean: 9.13
- prompt_adherence_mean: 9.37
- top failure tags: mouth_closed=5, uncanny=5, awkward_expression=4, generic_face=2, bad_skin=2
- wall_clock: 202.1s

## iter-0001 — FAIL

**Hypothesis:** "Signature Uplift" — inject a new section after FACE SPECIFICITY forcing planner to layer ONE deliberate fourth signature (arresting eye color + lashes / defining asymmetry / signature mouth / hair-texture micro-detail / brow architecture / posture moment) on top of the existing 3 anatomical standouts. Targets the 17/36 ceiling-at-8 cluster — push from "competent attractive" to "stop the scroll" (Sarah's 9-tier).

**Mutation diff (vs iter-0000):** added new section "SIGNATURE UPLIFT" between FACE SPECIFICITY and LIPSYNC-SAFETY (~12 prose lines, anti-generic framing baked in).

**Numbers:**
- male_attractiveness_mean: **6.89** (Δ -0.14 vs baseline)
- female_attractiveness_mean: 7.70 (=)
- lipsync_adherence_mean: 9.24 (Δ +0.11)
- prompt_adherence_mean: 9.43 (Δ +0.07)
- top failure tags: narrow_shoulders=7, awkward_expression=7, generic_face=6, mouth_closed=3, soft_jawline=3
- distribution: 4×0, 5×2, 6×10, 7×14, 8×10 (middle inflated, ceiling collapsed from 17→10)
- wall_clock: 206.4s
- fail_reason: male attractiveness did not improve (-0.139)
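The distribution line can be cross-checked against the reported mean. A minimal sketch, assuming the `score×count` notation means "score, then number of images with that score" (the helper name `histogram_mean` is illustrative and not part of the harness):

```python
# Histogram copied from the iter-0001 "distribution" line: {score: image count}.
ITER_1_HIST = {4: 0, 5: 2, 6: 10, 7: 14, 8: 10}


def histogram_mean(hist: dict[int, int]) -> float:
    """Weighted mean of a {score: count} histogram."""
    total = sum(hist.values())
    return sum(score * count for score, count in hist.items()) / total


n_images = sum(ITER_1_HIST.values())
mean = histogram_mean(ITER_1_HIST)
print(f"images={n_images}, mean={mean:.2f}")
```

Under that reading the histogram covers 36 images and reproduces the logged male mean of 6.89.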

**Analysis — trade-off, not cumulative:**
- ✓ uncanny -4 (5→1): signature elements broke the AI-symmetric-face default
- ✓ mouth_closed -2 (5→3): no conflict
- ✗ narrow_shoulders +5 (2→7): planner cannibalized BODY/FRAME description to make room for the 4th signature
- ✗ generic_face +4 (2→6): signature became a checkbox slot rather than personalized amplification
- ✗ awkward_expression +3: signature posture/asymmetry calls created stiff micro-poses
- Top movers: m06 Adrian 4.0→7.0 (+3.0, signature fixed his framing); m15 Devon 7.5→6.0 (-1.5, frame description got dropped); m13 Finn 8.0→7.0 (-1.0)
- Net: signature uplift WORKS on the anti-generic axis but STEALS prompt budget from frame/expression description, causing a net regression on attractiveness.

**Learning carried forward:**
- Signature element is NOT free — it competes for prompt space against frame/body description.
- Frame/body description was already thin in baseline (only "Body shape" one-liner under REQUIRED IDENTITY) — signature pressure exposed it.
- Anti-generic logic is real (uncanny -4) but needs to be cheaper to deploy.

**Next (iter-2):** Roll back signature. Try direct Frame Strength mutation — replace bare "Body shape" line with explicit frame masculinity guidance, branched for muscular-target vs lean-target personas. This is a tail-cleanup play (low ceiling, higher hit-rate). If frame-strength alone delivers male mean uplift without regressions, iter-3 can re-attempt signature on top of it.


## iter-0002 — PASS ✓

**Hypothesis:** "Frame Strength" — replace the bare `Body shape` one-liner in REQUIRED IDENTITY with explicit Frame & build guidance, branched 3-way: (a) muscular-target personas → V-taper / shoulder breadth / neck thickness; (b) lean-but-defined → swimmer's/runner's build with sloping but present shoulders; (c) intentionally slim (poets, philosophers, dancers) → length and posture line, no bulk. Anti-default: "narrow indistinct shoulders that read as boyish" is the failure mode to avoid.

**Mutation diff (vs baseline):** replaced 1-line `Body shape` bullet with 5-bullet Frame & build section under REQUIRED IDENTITY. Signature uplift NOT included (rollback from iter-1).

**Numbers:**
- male_attractiveness_mean: **7.22** (Δ +0.19 vs baseline)
- female_attractiveness_mean: 7.90 (Δ +0.20)
- lipsync_adherence_mean: 9.50 (Δ +0.37)
- prompt_adherence_mean: 9.39 (Δ +0.02)
- distribution: 5×3, 6×6, 7×8, 8×18, **9×1** ← first crack in the 8-ceiling
- wall_clock: 123.3s (faster than baseline; concurrency=8 actually helped this time, planner prompts shorter)

**PASS** — all axes improved or held; male uplift +0.19 > 0; non-target axes within ε=0.3 (none even regressed).
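The PASS/FAIL rule used throughout this log can be sketched as a small gate. This is a reconstruction from the `fail_reason` strings and the ε=0.3 statement above, not the harness's actual code; the function name, the dict keys, and the check order (male first, then female, lipsync, prompt) are assumptions:

```python
from typing import Optional

EPS = 0.3  # tolerance on non-target axes, as stated in the log

# Means copied from iter-0000 (baseline), iter-0001 and iter-0002.
BASELINE = {"male": 7.03, "female": 7.70, "lipsync": 9.13, "prompt": 9.37}
ITER_1 = {"male": 6.89, "female": 7.70, "lipsync": 9.24, "prompt": 9.43}
ITER_2 = {"male": 7.22, "female": 7.90, "lipsync": 9.50, "prompt": 9.39}


def gate(baseline: dict[str, float], run: dict[str, float],
         eps: float = EPS) -> Optional[str]:
    """Return None on PASS, otherwise a fail-reason string."""
    male_delta = run["male"] - baseline["male"]
    if male_delta <= 0:
        # The target axis must strictly improve over baseline.
        return f"male attractiveness did not improve ({male_delta:+.3f})"
    for axis in ("female", "lipsync", "prompt"):
        delta = run[axis] - baseline[axis]
        if delta < -eps:
            # A non-target axis may drift, but not by more than eps.
            return f"{axis} regressed beyond eps ({delta:.3f} < -{eps})"
    return None


for name, run in (("iter-1", ITER_1), ("iter-2", ITER_2)):
    reason = gate(BASELINE, run)
    print(name, "PASS" if reason is None else f"FAIL: {reason}")
```

Because the inputs here are the rounded means from the log, the printed deltas can differ in the third decimal from the harness's own `fail_reason` values.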

**Highlights:**
- 🥇 First 9 unlocked: m09 Wyatt seed2 = "Strikingly handsome. Sharp jawline, piercing blue eyes, rugged sun-weathered skin texture, and broad, masculine frame. Instantly stops the scroll." (4 standouts named, meets Sarah's 9-tier requirement)
- 🚀 m06 Adrian: 4.0 → 8.0 (+4.0) — biggest single-persona swing; Frame Strength restored alpha identity
- ⬆️ m14 Cyrus rich-kid: 6.0 → 7.5 (+1.5)
- ⬆️ m15 Devon firefighter: 7.5 → 8.0 (+0.5)
- ⬇️ m18 Cassian dancer: 6.5 → 5.0 (-1.5) — "intentionally slim" branch overcorrected; planner read his lean frame as too thin
- ⬇️ m10 Aarav bookish PhD: 7.5 → 6.0 (-1.5) — similar issue; planner over-applied slim guidance

**Tag deltas:**
- ✓ uncanny -3 (5→2), mouth_closed -3 (5→2), awkward_expression -2 (4→2), bad_skin -2 (2→0), boyish_when_mature_intended -1 (1→0)
- ✗ generic_face +3 (2→5), narrow_shoulders +3 (2→5), feminine_features_male +2 (NEW failure mode introduced by overshooting lean-target branch)

**Analysis:**
- Frame Strength works POWERFULLY for alpha-target personas (m06 +4, m14 +1.5). Net gain from these alone (+5.5 sum across 2 personas) drives the run-level uplift even though some lean personas drop.
- Sarah has an inherent "narrow shoulder = bad" reflex regardless of persona intent. The lean-target branch needs more force to make Sarah recognize "this is lean BY DESIGN, lean AS attractiveness signal." Prompt-adherence axis was supposed to capture this but barely moved (+0.02).
- generic_face went UP (+3) — the new prompt slots got filled cleanly but the "distinctive feature" bullet (a beauty mark / freckle / etc.) reads as differentiation, not elevation. To break the 8-ceiling further, the signature element needs strength.

**Learning carried forward:**
- Frame & build section pays off when targeted at mature-alpha personas; the mild collateral damage on lean-target personas still leaves the run net positive.
- Sarah's preference axis is muscular-leaning by default; lean personas inherently score lower for attractiveness even when frame matches persona intent. This is a calibration constraint we accept (not bug-fixing).
- Distinctive-feature bullet needs upgrading from "differentiation" to "elevation" — current bullet describes a quirk; we need a feature that makes Sarah stop scrolling.

**Next (iter-3):** Keep iter-2 Frame Strength. Replace existing FACE SPECIFICITY "At least ONE distinctive feature" bullet with a stronger "Signature feature that ELEVATES" bullet. Same slot, same prompt budget. Target: more 9s + reduce generic_face. Goal: male mean ≥ 7.4.


## iter-0003 — PASS

- male_attractiveness_mean: **7.11**
- female_attractiveness_mean: 8.10
- lipsync_adherence_mean: 9.46
- prompt_adherence_mean: 9.35
- top failure tags: generic_face=5, mouth_closed=3, awkward_expression=3, soft_jawline=2, narrow_shoulders=2
- wall_clock: 110.7s

## iter-0004 — PASS

- male_attractiveness_mean: **7.19**
- female_attractiveness_mean: 7.90
- lipsync_adherence_mean: 9.52
- prompt_adherence_mean: 9.54
- top failure tags: narrow_shoulders=7, awkward_expression=4, wrong_age_range=2, soft_jawline=2, weak_frame=2
- wall_clock: 213.4s

---

## Plateau Analysis (post iter-4)

| iter | mutation | male mean | 9-tier | Δ vs baseline |
|---|---|---|---|---|
| 0000 | baseline | 7.03 | 0 | 0 |
| 0001 | signature alone | 6.89 | 0 | -0.14 |
| 0002 | **frame alone** | **7.22** | **1** | **+0.19** ← peak |
| 0003 | frame + signature | 7.11 | 2 | +0.08 |
| 0004 | frame + sig + presence | 7.19 | 0 | +0.16 |

**Cause of plateau:**
1. Gemini 3 Flash (low reasoning) instruction-following saturation: each new section steals attention budget from prior sections (visible as e.g. narrow_shoulders bouncing 2→7→5→2→7 across iter-0 to iter-4 as Frame Strength competes with other instructions).
2. Z-Image Turbo (12 inference steps) likely at its quality ceiling for our prompt class — the "stops the scroll / magnetic presence" Sarah cites for 9s may require image-model fidelity we can't get from the planner alone.
3. Sarah's strict standout-count rules cap at ~8 for "very good but not exceptional" — pushing to 9 requires the image to actually look exceptional, not just be described as such.

**iter-5 hypothesis:** ROLL BACK to iter-2 base (best male mean). Try ONE NEW lever — **Mandatory Lookalike Anchor for male personas**. Currently optional, mostly skipped by planner. Make explicit: every male persona must include "facial structure reminiscent of [specific actor]" where actor is chosen to fit persona. Hypothesis: Sarah cites celebrity comparison as one standout for 9-tier; concrete anchor sharpens image-model output toward known-attractive geometry.


## iter-0005 — PASS

- male_attractiveness_mean: **7.06**
- female_attractiveness_mean: 7.80
- lipsync_adherence_mean: 9.85
- prompt_adherence_mean: 9.46
- top failure tags: narrow_shoulders=5, boyish_when_mature_intended=5, awkward_expression=4, generic_face=3, bad_skin=3
- wall_clock: 165.3s

## iter-0006 — FAIL

- male_attractiveness_mean: **6.47**
- female_attractiveness_mean: 7.20
- lipsync_adherence_mean: 9.74
- prompt_adherence_mean: 9.20
- top failure tags: awkward_expression=14, narrow_shoulders=5, weak_frame=4, soft_jawline=3, generic_face=3
- wall_clock: 161.3s
- fail_reason: male attractiveness did not improve (-0.556)


---

## Autonomous Extended Run (user authorization post iter-6)

User authorized extension to 20 iterations with autonomous Claude-driven mutations.

**Criteria:** primary = maximize male_attractiveness_mean. Pass guards: female ≥ baseline-0.3, lipsync ≥ 9.0, prompt_adh ≥ 9.0. Winner = global best male mean across all iters (current = iter-2 at 7.22).

**Budget plan:** iter-7-12 = 2 seeds (~$2.80/iter), iter-13-20 = 1 seed for exploratory (~$1.40/iter).

**Mutation portfolio:** iter-7 surgical hybrid → iter-8 isolate ingredient → iter-9+ adaptive based on results. Stretch: image-gen param tweaks (steps, aspect), planner upgrade.
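The selection criteria above (hard guards plus "global best male mean") can be sketched as follows. The data are the means logged for iter-0 through iter-6; the function names and dict layout are illustrative assumptions, not the harness's interface:

```python
FEMALE_BASELINE = 7.70  # iter-0000 female mean

# Means copied from the iter-0000..iter-0006 entries of this log.
RUNS = {
    "iter-0": {"male": 7.03, "female": 7.70, "lipsync": 9.13, "prompt": 9.37},
    "iter-1": {"male": 6.89, "female": 7.70, "lipsync": 9.24, "prompt": 9.43},
    "iter-2": {"male": 7.22, "female": 7.90, "lipsync": 9.50, "prompt": 9.39},
    "iter-3": {"male": 7.11, "female": 8.10, "lipsync": 9.46, "prompt": 9.35},
    "iter-4": {"male": 7.19, "female": 7.90, "lipsync": 9.52, "prompt": 9.54},
    "iter-5": {"male": 7.06, "female": 7.80, "lipsync": 9.85, "prompt": 9.46},
    "iter-6": {"male": 6.47, "female": 7.20, "lipsync": 9.74, "prompt": 9.20},
}


def passes_guards(run: dict[str, float]) -> bool:
    """Pass guards as written in the criteria paragraph."""
    return (run["female"] >= FEMALE_BASELINE - 0.3
            and run["lipsync"] >= 9.0
            and run["prompt"] >= 9.0)


def pick_winner(runs: dict[str, dict[str, float]]) -> str:
    """Highest male mean among runs that clear every guard."""
    eligible = {name: r for name, r in runs.items() if passes_guards(r)}
    return max(eligible, key=lambda name: eligible[name]["male"])


print(pick_winner(RUNS))
```

On the runs available at this point the sketch selects iter-2, matching the "current = iter-2 at 7.22" note; iter-6 is excluded by the female guard.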

---

## iter-0007 — hypothesis

**Surgical hybrid.** iter-2 base (current peak) + ADD ONLY two ingredients that worked in iter-6: mandatory LIGHTING section (drove lipsync to 9.74 in iter-6), ANTI-PLASTIC TAIL (drove uncanny -4 in iter-6). Keep "super handsome" opening (planner needs that quality cue), Frame Strength, original FACE SPECIFICITY 6-bullet structure. Goal: lipsync + uncanny wins from iter-6 stacked onto iter-2's frame wins, WITHOUT triggering iter-6's documentary/awkward_expression regression.

## iter-0007 — FAIL

- male_attractiveness_mean: **6.89**
- female_attractiveness_mean: 7.80
- lipsync_adherence_mean: 8.96
- prompt_adherence_mean: 9.13
- top failure tags: mouth_closed=6, narrow_shoulders=4, soft_jawline=4, awkward_expression=3, generic_face=2
- wall_clock: 282.6s
- fail_reason: male attractiveness did not improve (-0.139)


## iter-0007 — FAIL (6.89, surgical hybrid didn't preserve iter-2 wins)

male=6.89 (Δbase -0.14, Δi2 -0.33). m18 collapsed 5.0→2.5, m06 -2.0, m14 -1.0. Lighting+anti-plastic-tail STACK pushed some personas up (m10 +1.5, m04 hit 9) but hurt more than helped. Pattern: ANY add-to-iter-2 plateaus.

## iter-0008 — RADICAL EXPERIMENT (user authorized major changes)

**Hypothesis:** Image-model swap is the radical move. Throw away the assumption that Z-Image Turbo is the right backend. Run iter-2 prompt (peak so far) with Seedream 4.5 instead.

**Mutation:** Harness extended with --image-model flag. Seedream-v45 added as alternative backend (lazy-imported). Run with iter-2 prompt unchanged + seedream-v45 backend.

**Test value:** If Seedream gives a big lift on the same prompt → image-model was the bottleneck (not the planner prompt). If it's similar or worse → prompt engineering really has plateaued and we need to explore other axes (planner upgrade, multi-shot candidate selection, etc.). Either result reroutes the strategy.

## iter-0008 — FAIL

- male_attractiveness_mean: **6.83**
- female_attractiveness_mean: 7.60
- lipsync_adherence_mean: 9.50
- prompt_adherence_mean: 8.98
- top failure tags: uncanny=17, feminine_features_male=8, awkward_expression=4, styling_dominates=3, mouth_closed=2
- wall_clock: 250.7s
- fail_reason: male attractiveness did not improve (-0.194)


## iter-0008 — RADICAL Seedream 4.5 swap = FAIL (but informative)

male=6.83 (Δ -0.20 vs base, -0.39 vs i2). Bimodal: 2 male 9s (best) AND 6 fours. uncanny EXPLODED 5→17. feminine_features_male NEW failure mode at 8. Seedream's polished-editorial style mismatched our candid-companion archetype. Image-model is NOT the bottleneck for our use case → Z-Image stays.

## iter-0009 — RADICAL Planner LLM upgrade (Flash low → Pro medium)

**Hypothesis:** Planner saturation was the bottleneck across iter-1-7. Test directly by upgrading Gemini 3 Flash + reasoning_effort=low → Gemini 3.1 Pro + reasoning_effort=medium. Same iter-2 prompt, same Z-Image backend. Per-call cost ~3x but image cost unchanged.

## iter-0009 — PASS

- male_attractiveness_mean: **7.33**
- female_attractiveness_mean: 7.60
- lipsync_adherence_mean: 9.43
- prompt_adherence_mean: 9.50
- top failure tags: awkward_expression=6, narrow_shoulders=3, mouth_closed=3, generic_face=3, weak_frame=2
- wall_clock: 243.5s


## iter-0009 — NEW PEAK ✓ (Gemini Pro medium planner + iter-2 prompt)

male=7.33 (Δ +0.30 vs base, +0.11 vs iter-2). PASS. 22 EIGHTs (highest ever), 0 nines, stdev 0.94 (tightest). Pro produces consistent quality but Sarah's strict 4-standout 9-tier rule blocks "very good but safe" outputs. Confirmed: planner LLM was a real bottleneck. Big persona-level fixes: m10 +2 (bookish PhD), m03 +1, m18 +1 (dancer fixed!).

## iter-0010 — Pro planner + iter-7 prompt (lighting + anti-plastic tail)

**Hypothesis:** Pro absorbs more prompt complexity. iter-7 (iter-2 + lighting + anti-plastic) failed with Flash due to saturation. Pro should handle the stacked content cleanly. Targets: keep tight 8-cluster + unlock 9-tier via lighting-driven "magnetic presence" Sarah cited.

## iter-0010 — PASS

- male_attractiveness_mean: **7.28**
- female_attractiveness_mean: 7.40
- lipsync_adherence_mean: 8.87
- prompt_adherence_mean: 9.48
- top failure tags: mouth_closed=6, narrow_shoulders=4, weak_frame=2, generic_face=2, awkward_expression=2
- wall_clock: 262.9s


## iter-0010 — PASS but slight regression vs iter-9 peak

male=7.28 (Δ +0.25 vs base, -0.05 vs i9). Even with Pro, the iter-7 prompt's lighting/anti-plastic add-ons still cannibalize the rest of the prompt. Wins: m11 +1.5, m18 +1.0 (dancer), m15 +0.5. Losses spread across 8 personas at -0.5 each. Hypothesis "Pro absorbs more complexity" was partially wrong.

## iter-0011 — Pro planner + iter-3 prompt (signature elevation)

**Hypothesis:** Combine Pro's consistency (i9: 22 eights) with iter-3's 9-tier breakthrough (signature elevation produced 2 male 9s in Flash). Pro should handle the signature elevation slot cleanly. Targets: keep 7.3+ male mean + unlock new 9s. Best of two known winners.

## iter-0011 — FAIL

- male_attractiveness_mean: **7.25**
- female_attractiveness_mean: 7.90
- lipsync_adherence_mean: 9.02
- prompt_adherence_mean: 8.74
- top failure tags: generic_face=7, awkward_expression=5, mouth_closed=4, styling_dominates=3, eyes_averted=3
- wall_clock: 246.0s
- fail_reason: prompt adherence regressed beyond eps (-0.630 < -0.3)


## iter-0011 — FAIL on prompt_adh regression

male=7.25 (Δ +0.22 vs base, -0.08 vs i9). 2 male 9s unlocked (m04, m14). m11 chef recovered +2.5. But prompt_adh tanked to 8.74 (Δ -0.63 vs base, beyond eps). Pro+signature pulls the planner toward signature obsession at the cost of prompt fidelity.

## iter-0012 — Z-Image steps 12 → 9 (Z-Image official optimal)

**Hypothesis:** Image model parameter change. Z-Image's official prompting guide says 9 inference steps = 8 forward passes is optimal. We've been running 12 (matching prod). 12 may be over-cooking → "polished but not exceptional" cap at 8. 9-step Z-Image native config may produce cleaner outputs that unlock 9-tier.

Mutation: Z_IMAGE_STEPS 12 → 9. Same Pro planner + iter-2 prompt (i9 winning combo).

## iter-0012 — PASS

- male_attractiveness_mean: **7.28**
- female_attractiveness_mean: 7.60
- lipsync_adherence_mean: 9.63
- prompt_adherence_mean: 9.28
- top failure tags: generic_face=9, narrow_shoulders=5, weak_frame=3, awkward_expression=2, styling_dominates=2
- wall_clock: 238.7s


## iter-0012 — PASS but slight regression vs i9 (steps 9 vs 12)

male=7.28 (Δ +0.25 vs base, -0.05 vs i9). lipsync 9.63 (+0.20 vs i9). m11 +1.5, m18 +1.0, m13 -1.5. Image-gen step count isn't a clear lever. Steps 12 restored.

## iter-0013 — Pro HIGH reasoning + iter-2 prompt

**Hypothesis:** Maximum planner intelligence experiment. If Pro+medium hit 7.33, Pro+high should reveal whether planner compute is the residual bottleneck. Cost ~$5-6 (per-call reasoning tokens ~2-3x). Same iter-2 prompt + Z-Image steps=12.


## CRITICAL CORRECTION — production viability constraints discovered

User-specified production constraints: planner latency < 10s, cost < $0.01/call.

Measured from logs:
- Flash low (iter-0 to iter-7): 4.6s, $0.0018/call ✓
- Pro medium (iter-9, 10, 11, 12): **17.7s, $0.019/call** ✗ BOTH violated
- Pro high (iter-13 in flight): estimated >25s, >$0.05 ✗

iter-13 KILLED in flight — unviable config.

**Production-viable peak = iter-2 (Flash low + Frame Strength) at 7.22.** Pro-based runs (iter-9 best 7.33) cannot ship.
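The viability check above amounts to two threshold comparisons per planner config. A minimal sketch using the measured Flash low and Pro medium figures; the `PlannerConfig` class and its field names are illustrative, not part of the harness:

```python
from dataclasses import dataclass

MAX_LATENCY_S = 10.0  # production constraint: planner latency < 10s
MAX_COST_USD = 0.01   # production constraint: cost < $0.01/call


@dataclass(frozen=True)
class PlannerConfig:
    name: str
    latency_s: float
    cost_usd: float

    def violations(self) -> list[str]:
        """Names of the production constraints this config breaks."""
        broken = []
        if self.latency_s >= MAX_LATENCY_S:
            broken.append("latency")
        if self.cost_usd >= MAX_COST_USD:
            broken.append("cost")
        return broken


# Figures from the "Measured from logs" list above.
CONFIGS = [
    PlannerConfig("Flash low", 4.6, 0.0018),
    PlannerConfig("Pro medium", 17.7, 0.019),
]

for cfg in CONFIGS:
    broken = cfg.violations()
    status = "viable" if not broken else "violates " + " + ".join(broken)
    print(f"{cfg.name}: {status}")
```

Keeping the constraints as named constants makes it cheap to re-screen any later planner candidate before spending an iteration on it.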

## iter-0013 (REVISED) — Flash low + multi-candidate selection

**Hypothesis:** Generate 2 candidate prompts per persona via Flash (cost $0.0036/call, still under $0.01) and pick the better one. Tests whether "Flash × selection" matches "Pro × single" quality. If yes, we ship Flash + selection. If no, return to single-candidate Flash and explore other prod-viable levers.
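Multi-candidate selection is a best-of-n pattern: call the planner n times, score each draft, keep the top one. The sketch below uses toy stand-ins for the planner and the scorer. Both are hypothetical, since the log does not say how "the better one" would be judged:

```python
from typing import Callable, TypeVar

T = TypeVar("T")

PLANNER_COST_USD = 0.0018  # Flash low cost per call, from the measured figures


def best_of_n(generate: Callable[[int], T], score: Callable[[T], float],
              n: int = 2) -> T:
    """Generate n candidates and return the highest-scoring one."""
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=score)


def fake_planner(seed: int) -> str:
    """Hypothetical stand-in for a planner call; returns a canned prompt."""
    drafts = ["lean build", "broad shoulders, V-taper, strong jawline"]
    return drafts[seed % len(drafts)]


def fake_scorer(prompt: str) -> float:
    """Hypothetical stand-in for a selector: counts comma-separated descriptors."""
    return float(len(prompt.split(",")))


chosen = best_of_n(fake_planner, fake_scorer, n=2)
print(chosen)
print(f"planner cost for n=2: ${2 * PLANNER_COST_USD:.4f}")
```

Planner cost scales linearly with n, so n=2 stays under the $0.01 limit. Any cost or latency added by a real scorer is not modeled here.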


## iter-0013 — DeepSeek V4 Flash planner (production-viable, cheapest)

**Hypothesis:** User opened "try other models". DeepSeek V4 Flash is ~5x cheaper than Gemini Flash ($0.0004/call vs $0.0018) and reportedly competitive at instruction-following. If it matches or beats Gemini Flash on iter-2 prompt, we gain a CHEAPER production-viable peak. Iter-2 prompt unchanged, Z-Image steps=12 unchanged.

Cost expectation: ~$0.02 planner + $2 image+judge = ~$2 per iter.

## iter-0013 — PASS

- male_attractiveness_mean: **7.06**
- female_attractiveness_mean: 7.60
- lipsync_adherence_mean: 9.39
- prompt_adherence_mean: 9.37
- top failure tags: awkward_expression=5, narrow_shoulders=5, boyish_when_mature_intended=4, soft_jawline=3, mouth_closed=3
- wall_clock: 157.8s


## iter-0013 — DeepSeek V4 Flash = PASS but no quality lift (7.06)

male=7.06 (Δ +0.03 vs base, -0.16 vs Gemini Flash i2). Latency 8.9s @ concurrency=8 (production-borderline). m10 +1.5 (bookish PhD), m03/m18 +0.5; but m02/m05/m06 each -1.0. DeepSeek instruction-following slightly less consistent than Gemini Flash for this prompt class.

## iter-0014 — Grok 4.1 Fast planner

**Hypothesis:** Grok's xAI training may produce different style of detail than Gemini/DeepSeek. Cost $0.0006/call (cheaper than Gemini Flash), likely fast inference. Same iter-2 prompt.

## iter-0014 — FAIL

- male_attractiveness_mean: **1.00**
- female_attractiveness_mean: 1.00
- lipsync_adherence_mean: 1.00
- prompt_adherence_mean: 1.00
- top failure tags: pipeline_error=46
- wall_clock: 0.6s
- fail_reason: male attractiveness did not improve (-6.028)


NOTE: iter-0014 above (1.00 across all axes) was a pipeline failure — Grok 4.1 Fast deprecated, all 46 planner calls returned 404. Retrying with x-ai/grok-4.3.

## iter-0014 (RETRY) — Grok 4.3 planner

## iter-0014 — FAIL

- male_attractiveness_mean: **6.81**
- female_attractiveness_mean: 7.70
- lipsync_adherence_mean: 9.67
- prompt_adherence_mean: 9.00
- top failure tags: narrow_shoulders=10, awkward_expression=7, boyish_when_mature_intended=6, generic_face=4, weak_frame=4
- wall_clock: 177.2s
- fail_reason: male attractiveness did not improve (-0.222)


## iter-0014 RETRY — Grok 4.3 FAIL

male=6.81 (Δ -0.22 vs base). Latency 12.7s (over 10s limit). narrow_shoulders +8 (Grok ignored Frame Strength). m06 -2.5, m05 -1.5, m13 -1.0. Grok 4.3 = out on both quality and latency.

## iter-0015 — Gemini 2.5 Pro (previous-gen Pro, $0.004/call)

**Hypothesis:** 2.5 Pro is mature with strong instruction-following. May be faster than 3.1 Pro (the 17s offender) since fewer reasoning tokens. Cost $0.004/call vs 3.1 Pro's $0.014 — production-borderline if latency <10s. iter-2 prompt unchanged.

## iter-0015 — FAIL

- male_attractiveness_mean: **6.81**
- female_attractiveness_mean: 6.80
- lipsync_adherence_mean: 8.83
- prompt_adherence_mean: 8.37
- top failure tags: generic_face=6, styling_dominates=3, pipeline_error=3, wrong_ethnicity=2, extreme_close_up=2
- wall_clock: 236.0s
- fail_reason: male attractiveness did not improve (-0.222)


## iter-0015 — Gemini 2.5 Pro FAIL

male=6.81. Latency 17.7s (same as 3.1 Pro). 3 pipeline errors. m14 -3.5, m09 -2.5. Pro tier from Gemini = bottleneck regardless of generation.

## iter-0016 — Claude Haiku 4.5 (Anthropic fast tier)

**Hypothesis:** Switch vendor entirely. Anthropic Haiku is positioned at Gemini-Flash-level cost+speed. Different training family — may produce different style of prompt that Z-Image likes better.

## iter-0016 — PASS

- male_attractiveness_mean: **7.25**
- female_attractiveness_mean: 7.60
- lipsync_adherence_mean: 9.09
- prompt_adherence_mean: 9.46
- top failure tags: generic_face=3, mouth_closed=3, narrow_shoulders=2, boyish_when_mature_intended=1, styling_dominates=1
- wall_clock: 105.4s


## iter-0016 — Claude Haiku 4.5 = NEW PRODUCTION PEAK ✓

male=7.25 (Δ +0.22 vs base, +0.03 vs Gemini Flash i2). stdev 0.76 (tightest of all Flash-tier runs). Latency 5.8s, cost ~$0.003/call — production-viable. m18 dancer +2.5 dramatic recovery, m10 +1.0, m08 +1.0.

## iter-0017 — Claude Sonnet 4.5 (borderline cost $0.0102)

**Hypothesis:** If Haiku gave +0.03 vs Gemini Flash, Sonnet may give more — same family scaling. Cost $0.0102 is slightly over $0.01 limit; latency 3-6s. Worth testing for the data even if production deploys Haiku.

## iter-0017 — FAIL

- male_attractiveness_mean: **7.11**
- female_attractiveness_mean: 7.60
- lipsync_adherence_mean: 8.61
- prompt_adherence_mean: 9.63
- top failure tags: mouth_closed=7, generic_face=5, bad_skin=4, awkward_expression=4, narrow_shoulders=3
- wall_clock: 145.6s
- fail_reason: lipsync adherence regressed beyond eps (-0.522 < -0.3)


## iter-0017 — Claude Sonnet 4.5 FAIL

male=7.11. Latency 11.6s (over 10s), lipsync 8.61 (regressed -0.52). 1 male 9 (m01 indie musician). m08 -2.5, m18 -2.0. Sonnet too costly+slow with mixed quality.

## iter-0018 — Haiku 4.5 + iter-3 prompt (signature elevation)

**Hypothesis:** Haiku gave tightest distribution (stdev 0.76) at iter-16. Stack signature elevation (iter-3 prompt) on Haiku — see if tight + signature unlocks 9s without iter-3's frame regression with Gemini.

## iter-0018 — FAIL

- male_attractiveness_mean: **6.97**
- female_attractiveness_mean: 7.40
- lipsync_adherence_mean: 8.35
- prompt_adherence_mean: 9.39
- top failure tags: mouth_closed=8, narrow_shoulders=5, generic_face=4, boyish_when_mature_intended=4, weak_frame=3
- wall_clock: 113.6s
- fail_reason: male attractiveness did not improve (-0.056)


## iter-0018 — Haiku + iter-3 prompt FAIL (6.97)

Signature elevation cannibalizes Haiku similarly to Gemini. m18 -2.0, m13 -1.0, m08 -1.0. lipsync 8.35 (regressed). Haiku peak HOLDS at iter-16 (Haiku + iter-2 = 7.25).

## iter-0019 — Claude Sonnet 4.6 (latest Sonnet)

**Hypothesis:** Sonnet 4.6 may have improved latency/quality vs 4.5 (which failed on latency at 11.6s and on lipsync). Same iter-2 prompt.

## iter-0019 — FAIL

- male_attractiveness_mean: **6.83**
- female_attractiveness_mean: 7.50
- lipsync_adherence_mean: 8.07
- prompt_adherence_mean: 9.39
- top failure tags: mouth_closed=8, narrow_shoulders=7, generic_face=5, boyish_when_mature_intended=5, eyes_averted=4
- wall_clock: 162.7s
- fail_reason: male attractiveness did not improve (-0.194)


## iter-0019 — Sonnet 4.6 FAIL (worse than 4.5)

male=6.83. Latency 14.9s. lipsync 8.07. Sonnet family confirmed unviable. Haiku 4.5 remains the winner.

## iter-0020 — Haiku 4.5 + iter-7 prompt (lighting + anti-plastic)

**Hypothesis:** Haiku showed a tight distribution and a +0.03 lift. Does Haiku handle iter-7's lighting + anti-plastic additions better than Gemini Flash did? Worth testing one more stack.

## iter-0020 — FAIL

- male_attractiveness_mean: **7.08**
- female_attractiveness_mean: 7.30
- lipsync_adherence_mean: 8.48
- prompt_adherence_mean: 9.46
- top failure tags: mouth_closed=6, narrow_shoulders=5, eyes_averted=4, soft_jawline=3, generic_face=3
- wall_clock: 289.6s
- fail_reason: female attractiveness regressed beyond eps (-0.400 < -0.3)


---

## Pairwise Eval Infrastructure (post iter-20)

After user feedback that absolute scoring is too calibration-sensitive, built pairwise evaluator. Critical findings:

**v1 (separate A/B images, temp 0.1) — Gemini 3.1 Pro:** position-bias 30%, win_rate basically coin flip (47%). Direction unclear.

**v2 (composite L/R image, temp 0.0) — Gemini 3.1 Pro:** position-bias 22%, win_rate 36.1% (anchor wins 63.9%). Composite fixed bias significantly. Direction matches absolute.

**v3 (composite L/R image, temp 0.0) — Claude Sonnet 4.6:** position-bias 15.2% (best), tie_rate 15.2%. win_rate 48.7%. Direction: anchor slightly better but not decisive. 0 high-conf candidate wins vs 12 high-conf anchor wins → directional agreement.

**Reconciliation:** absolute "+0.22" was real but smaller than headline. Real effect ~+0.10-0.15 male attractiveness, ~+0.05-0.10 noise.

**Production pairwise eval:** Sonnet 4.6 + composite L/R + temp 0.0 + position swap. JUDGE_MODEL switched to anthropic/claude-sonnet-4.6 for all future pairwise.
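The position-swap protocol can be sketched as follows: each candidate/anchor pair is judged twice, once per left/right order, and the two verdicts are reconciled. The metric definitions here are assumptions made for illustration (a pair counts as position-biased when the winner flips with the slot, and a pair with one tie verdict counts as a tie); the evaluator's exact definitions are not recorded in this log:

```python
from collections import Counter

# Judge verdict is "L", "R" or "tie" for the composite image.
CANDIDATE_LEFT = {"L": "candidate", "R": "anchor", "tie": "tie"}
CANDIDATE_RIGHT = {"L": "anchor", "R": "candidate", "tie": "tie"}


def resolve(first: str, swapped: str) -> str:
    """Combine the verdict with candidate on the left and the swapped verdict."""
    a, b = CANDIDATE_LEFT[first], CANDIDATE_RIGHT[swapped]
    if a == b:
        return a                 # same winner in both orders
    if "tie" in (a, b):
        return "tie"             # one-sided preference, treated as a tie
    return "position_bias"       # winner followed the slot, not the image


def summarize(verdicts: list[tuple[str, str]]) -> dict:
    counts = Counter(resolve(first, swapped) for first, swapped in verdicts)
    total = sum(counts.values())
    decided = counts["candidate"] + counts["anchor"]
    return {
        "position_bias_rate": counts["position_bias"] / total,
        "tie_rate": counts["tie"] / total,
        "win_rate_excl_ties": counts["candidate"] / decided if decided else None,
    }


# Hypothetical verdicts for six pairs: (candidate-left run, candidate-right run).
VERDICTS = [("L", "R"), ("R", "L"), ("R", "L"), ("L", "L"), ("tie", "tie"), ("R", "tie")]
print(summarize(VERDICTS))
```

Excluding position-biased pairs from the win rate is a design choice of this sketch: it stops slot preference from leaking into the candidate-versus-anchor comparison.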

## iter-0021 — Haiku + JAWLINE PROJECTION single bullet (pairwise era starts)

**Hypothesis:** Pairwise judge cites "jawline definition" as the single most common differentiator in winning images. Iter-2 already has "face shape" mention; add a NEW bullet right after that names jawline staging explicitly (directional side light + 3/4 angle + chin tilt). For soft-jaw personas, name the soft jawline DELIBERATELY rather than fighting it. Single bullet, no competing instructions.

Mutation: iter-2 + 1 new bullet in FACE SPECIFICITY (after Face shape).
Planner: Claude Haiku 4.5 (production-viable peak).
Image: Z-Image Turbo 12 steps.
Pairwise evaluator: Sonnet 4.6 + composite + temp 0.0 (vs anchor=iter-16).

## iter-0021 — FAIL

- male_attractiveness_mean: **7.36**
- female_attractiveness_mean: 7.30
- lipsync_adherence_mean: 7.48
- prompt_adherence_mean: 8.22
- top failure tags: eyes_averted=13, low_chemistry=9, soft_jawline=9, boyish_when_mature_intended=4, mouth_closed=4
- wall_clock: 254.6s
- fail_reason: female attractiveness regressed beyond eps (-0.400 < -0.3)


## iter-0021 — JAWLINE PROJECTION = FAIL on 3 dimensions

male absolute: 7.36 (+0.11 vs iter-16), pairwise win_rate_excl_ties: 37.8% (anchor wins), lipsync drift -1.61 ✗, prompt_adh drift -1.24 ✗. The "3/4 head turn" instruction conflicted with lipsync constraint. Sonnet pairwise saw iter-16 as visually better (10 high-conf anchor wins vs 3 candidate). Absolute attractiveness uptick was misleading.

**Key learning:** Pairwise + production guards (lipsync, prompt_adh) catch what absolute attractiveness alone misses.

## iter-0022 — LIGHTING bullet ONLY (no anti-plastic tail, no head turn)

**Hypothesis:** Test iter-7's LIGHTING section in isolation on Haiku + pairwise. If lighting addition alone fails too, iter-16 is the true local max.

## iter-0022 — FAIL

- male_attractiveness_mean: **7.47**
- female_attractiveness_mean: 7.40
- lipsync_adherence_mean: 7.52
- prompt_adherence_mean: 8.57
- top failure tags: soft_jawline=12, eyes_averted=11, low_chemistry=8, mouth_closed=4, boyish_when_mature_intended=3
- wall_clock: 291.2s
- fail_reason: lipsync adherence regressed beyond eps (-1.609 < -0.3)


## iter-0022 — LIGHTING-only also FAIL (same pattern)

male absolute: 7.47 (+0.22 vs iter-16, biggest!), pairwise win_rate 32.3% (anchor wins 67.7%), lipsync -1.57 ✗, prompt_adh -0.89 ✗ (both vs iter-16). Same failure as iter-21. Pattern: any addition lets the lipsync clause slip out of the planner's attention.

## iter-0023 — MAJOR SHIFT: nano-banana image model

**Hypothesis:** Z-Image was the only image model that survived our pipeline. Try nano-banana (Google's Gemini-2.5-flash-image) — entirely different generation family. Maybe nano-banana produces images that pairwise judge prefers in a way Z-Image's style ceiling can't reach.

Cost: $0.039/image × 46 = $1.79 (vs Z-Image $0.06 × 46 = $2.76).
Setup: Haiku planner + iter-2 prompt + nano-banana backend + Sonnet pairwise.
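The per-iteration image cost above is a straight multiplication. A small check using `Decimal` so the dollar figures do not pick up binary floating-point noise (the variable names are illustrative):

```python
from decimal import Decimal

N_IMAGES = 46  # images per iteration, as used in the cost line above

PRICE_PER_IMAGE = {
    "nano-banana": Decimal("0.039"),
    "Z-Image": Decimal("0.06"),
}

costs = {name: price * N_IMAGES for name, price in PRICE_PER_IMAGE.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost}")
print(f"saving per iteration: ${costs['Z-Image'] - costs['nano-banana']}")
```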

## iter-0023 — FAIL

- male_attractiveness_mean: **7.19**
- female_attractiveness_mean: 6.60
- lipsync_adherence_mean: 8.04
- prompt_adherence_mean: 8.78
- top failure tags: soft_jawline=13, low_chemistry=11, generic_face=6, eyes_averted=5, narrow_shoulders=3
- wall_clock: 391.7s
- fail_reason: female attractiveness regressed beyond eps (-1.100 < -0.3)

## iter-0024 — FAIL

- male_attractiveness_mean: **7.31**
- female_attractiveness_mean: 7.10
- lipsync_adherence_mean: 7.78
- prompt_adherence_mean: 8.24
- top failure tags: low_chemistry=14, soft_jawline=14, eyes_averted=8, mouth_closed=4, generic_face=4
- wall_clock: 310.2s
- fail_reason: female attractiveness regressed beyond eps (-0.600 < -0.3)

## iter-0025 — FAIL

- male_attractiveness_mean: **7.06**
- female_attractiveness_mean: 7.30
- lipsync_adherence_mean: 7.54
- prompt_adherence_mean: 8.24
- top failure tags: soft_jawline=8, eyes_averted=7, low_chemistry=6, mouth_closed=6, narrow_shoulders=4
- wall_clock: 303.8s
- fail_reason: female attractiveness regressed beyond eps (-0.400 < -0.3)

## iter-0026 — FAIL

- male_attractiveness_mean: **7.17**
- female_attractiveness_mean: 7.40
- lipsync_adherence_mean: 7.59
- prompt_adherence_mean: 8.33
- top failure tags: eyes_averted=10, soft_jawline=9, low_chemistry=7, generic_face=5, narrow_shoulders=4
- wall_clock: 297.4s
- fail_reason: lipsync adherence regressed beyond eps (-1.543 < -0.3)


## iter-0024, 25, 26 — CoT variants all FAIL lipsync

All three CoT formulations (basic CoT, lipsync-protected wording, lipsync baked into opening template) hit the same lipsync collapse (-1.30 to -1.54). The CoT pre-write instruction fundamentally breaks the Haiku→Z-Image lipsync pipeline. Hypothesized reason: CoT pushes Haiku to write more expressive / scenic prompts that Z-Image renders with talking/smiling expressions instead of the neutral "mouth slightly parted" the judge expects.

CoT direction abandoned after 3 iters.

## iter-0027 — 3-celebrity mix anchor (start of celeb-mix direction, 3 iters planned)

**Hypothesis:** Lookalike celebrity anchors are strong signal for image models (we tested single-celeb lookalike in iter-5; this is 3-celeb MIX). For each male persona, mandate the planner to write "vibe of [Actor1] mixed with [Actor2] and [Actor3]" matching the persona archetype. 3 celebs averages out single-celeb overfitting and gives base-realism image model a richer face anchor.

## iter-0027 — FAIL

- male_attractiveness_mean: **7.06**
- female_attractiveness_mean: 7.60
- lipsync_adherence_mean: 7.37
- prompt_adherence_mean: 8.13
- top failure tags: eyes_averted=9, soft_jawline=6, generic_face=6, low_chemistry=5, narrow_shoulders=4
- wall_clock: 258.2s
- fail_reason: lipsync adherence regressed beyond eps (-1.761 < -0.3)

## iter-0028 — FAIL

- male_attractiveness_mean: **7.06**
- female_attractiveness_mean: 7.20
- lipsync_adherence_mean: 7.30
- prompt_adherence_mean: 8.00
- top failure tags: soft_jawline=12, eyes_averted=10, low_chemistry=9, boyish_when_mature_intended=6, generic_face=6
- wall_clock: 253.4s
- fail_reason: female attractiveness regressed beyond eps (-0.500 < -0.3)


## iter-0027, 28 — 3-celeb mix variants also FAIL

3-celeb mix v1 (long examples): lipsync -1.72. v2 (short instruction): lipsync -1.78. Same pattern as CoT.

**Critical meta-finding:** 8 consecutive iterations (21-28) all show lipsync drops of roughly -1.0 to -1.8 vs the iter-16 anchor (iter-23 is the mildest at about -1.05), regardless of mutation content (jawline / lighting / nano-banana / CoT / celeb-mix). The pattern is suspiciously consistent.
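The per-iteration lipsync drops can be recomputed from the rounded means logged above. A minimal sketch (the values are copied from the iter-0016 and iter-0021..0028 entries; because they are rounded, results can differ by about 0.01 from figures computed on unrounded scores):

```python
ANCHOR_LIPSYNC = 9.09  # iter-16 lipsync_adherence_mean

# lipsync_adherence_mean per iteration, copied from the entries above.
LIPSYNC = {
    21: 7.48, 22: 7.52, 23: 8.04, 24: 7.78,
    25: 7.54, 26: 7.59, 27: 7.37, 28: 7.30,
}

drift = {it: round(value - ANCHOR_LIPSYNC, 2) for it, value in LIPSYNC.items()}

for it, delta in drift.items():
    print(f"iter-{it}: {delta:+.2f}")
print(f"range: {min(drift.values()):+.2f} to {max(drift.values()):+.2f}")
```

On these rounded figures the drops span -1.05 (iter-23) to -1.79 (iter-28), and every mutation family lands below the anchor by more than one full point.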

## iter-0029 — CONTROL: identical iter-16 setup (no mutation)

**Hypothesis:** Maybe iter-16's lipsync=9.09 was lucky run-variance. If iter-29 (same Haiku+iter-2 prompt, different RNG) ALSO produces lipsync ~7.5, then ALL our previous "FAIL on lipsync" conclusions are wrong — it's just per-run variance, not the mutations.

## iter-0029 — FAIL

- male_attractiveness_mean: **7.53**
- female_attractiveness_mean: 7.30
- lipsync_adherence_mean: 7.80
- prompt_adherence_mean: 8.43
- top failure tags: soft_jawline=11, eyes_averted=11, low_chemistry=5, mouth_closed=4, boyish_when_mature_intended=3
- wall_clock: 292.5s
- fail_reason: female attractiveness regressed beyond eps (-0.400 < -0.3)