Anneal · Variant

Temper — Fixed-Point Deepen

One plan. Heated and cooled repeatedly. The Red-Team Trinity attacks it at every depth, Momus scores it 0–100, and a deterministic Python script decides when to stop. The loop exits when the numbers say stop — not when the LLM says stop.

[Diagram: Metis (stage 3) → Prometheus-Temper (full rewrite) → Red-Team Trinity, in parallel at EVERY depth (redteam-security, redteam-scope, redteam-assumptions) → Momus (score 0–100 per depth) → convergence-check.py (deterministic) → re-loop (continue) or Atlas (emit XML + plan dir)]

The deepen loop in detail

Depth 0 is the seed. Every subsequent depth gets the prior plan, the full red-team envelopes, and Momus's score. Every rewrite is a full rewrite — not a patch.

pseudocode · deepen loop
Seed (depth 0):
  1. Prometheus-Temper writes plan_0 from (Metis directives, probe report)
  2. Red-Team Trinity attacks plan_0 in parallel
  3. Momus scores plan_0 → score_0
  4. convergence-check.py(depth=0, scores=[score_0], cap=hard_cap) → continue or exit

Iteration (depth N, N ≥ 1):
  1. Prometheus-Temper rewrites plan_{N-1} with:
       (plan_{N-1}, momus_envelope_{N-1}, redteam_envelopes_{N-1},
        metis_directives, depth_scores)
  2. Red-Team Trinity attacks plan_N in parallel
  3. Momus scores plan_N → score_N
  4. convergence-check.py(depth=N, scores=depth_scores, cap=hard_cap) → continue or exit

Exit: plan_final = plan_N where loop exited
Every rewrite is a full rewrite, not a patch. Prometheus-Temper reads the prior plan and its critiques as context, then writes a new plan from scratch. This is load-bearing — patching propagates defects; rewriting reconsiders the whole structure.
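
The control flow of the pseudocode above can be sketched in Python. This is a minimal illustration, not Temper's actual implementation: the function name deepen_loop and its four callables are hypothetical, and the Momus and red-team envelopes plus the Metis directives are collapsed into a single list of critiques to keep the sketch short.

```python
from typing import Callable, List, Optional, Tuple

Plan = str
Critiques = List[str]


def deepen_loop(
    write_plan: Callable[[Optional[Plan], Critiques, List[float]], Plan],
    red_team: Callable[[Plan], Critiques],
    score: Callable[[Plan], float],
    should_exit: Callable[[int, List[float]], bool],
) -> Tuple[Plan, List[float]]:
    """Run the seed plus rewrites until should_exit fires; return plan and score history."""
    depth_scores: List[float] = []
    plan: Optional[Plan] = None
    critiques: Critiques = []
    depth = 0
    while True:
        # Depth 0 writes the seed; every later depth is a full rewrite that
        # receives the prior plan and its critiques as context only.
        plan = write_plan(plan, critiques, depth_scores)
        critiques = red_team(plan)        # the plan is attacked at every depth
        depth_scores.append(score(plan))  # 0-100 quality estimate for this depth
        if should_exit(depth, depth_scores):
            return plan, depth_scores
        depth += 1


# Stub collaborators, for illustration only.
final_plan, history = deepen_loop(
    write_plan=lambda prev, crit, hist: f"plan_{len(hist)}",
    red_team=lambda p: [f"critique of {p}"],
    score=lambda p: 70.0 + 5 * int(p.split("_")[1]),
    should_exit=lambda depth, hist: depth + 1 >= 3,  # cap-only stand-in
)
```

The exit decision is injected as a plain function, which mirrors the point made above: the loop itself never asks the planner whether it is done.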

Three rules. First one wins.

The loop exits when any one of these is true. The script is the spec — not the LLM, not an implicit judgment.

  1. Rule 1 — Variance of top-3 depth scores < 0.3
    Scores have stabilized. The last three rewrites are producing substantively the same plan, and further iteration won't change that.
    depth_scores = [72, 81, 87, 86, 87]
    top_3_scores = [87, 87, 86]
    variance = 0.22 (population variance)  → < 0.3 → CONVERGED
  2. Rule 2 — |Δ score| < 0.15 across last 2 depths
    Marginal improvement, diminishing returns. The plan is still changing but the changes are small enough that further depth won't produce qualitatively better output.
    depth_scores = [65, 78, 83, 84.1, 84.0]
    delta = |84.0 - 84.1| = 0.1  → < 0.15 → CONVERGED
  3. Rule 3 — depth == hard_cap
    Runaway iteration guard. Default cap is 3; user-configurable 1–5. This rule guarantees no run spins forever.
    depth_scores = [45, 52, 58, 61, 63]
    depth = 5, cap = 5  → CONVERGED (cap reached)
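
The three rules can be sketched as one first-match function. This is a guess at the shape of convergence-check.py, not its source. The function name, the return labels, the minimum-history guards (three scores for Rule 1, two for Rule 2) and the reading of "depth" as the number of depths run so far are all assumptions. Population variance is used because it reproduces the 0.22 in the Rule 1 example.

```python
from statistics import pvariance
from typing import List, Optional


def convergence_reason(
    depth_scores: List[float],
    hard_cap: int = 3,
    variance_threshold: float = 0.3,
    delta_threshold: float = 0.15,
) -> Optional[str]:
    """Return the label of the first exit rule that fires, or None to continue."""
    # Rule 1: population variance of the three highest scores (assumes >= 3 scores).
    if len(depth_scores) >= 3:
        top_3 = sorted(depth_scores, reverse=True)[:3]
        if pvariance(top_3) < variance_threshold:
            return "variance"
    # Rule 2: absolute change between the last two depths (assumes >= 2 scores).
    if len(depth_scores) >= 2:
        if abs(depth_scores[-1] - depth_scores[-2]) < delta_threshold:
            return "delta"
    # Rule 3: hard cap, counted here as the number of depths run so far.
    if len(depth_scores) >= hard_cap:
        return "cap"
    return None


print(convergence_reason([72, 81, 87, 86, 87], hard_cap=5))  # Rule 1 example
print(convergence_reason([45, 52, 58, 61, 63], hard_cap=5))  # Rule 3 example
```

One consequence of first-match ordering in this sketch: the Rule 2 example scores [65, 78, 83, 84.1, 84.0] also satisfy Rule 1, because the population variance of their top three is about 0.25, so the function reports them as a variance exit.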

How deep do you need to go?

| Depth | Spawns | Wall-clock | Use when |
| --- | --- | --- | --- |
| 1 | ~8 | ~3 min | You want the deepen discipline but trust the seed |
| 2 | ~16 | ~5 min | Plan has one known weakness the seed won't catch |
| 3 (default) | ~24 | ~7 min | Typical complex-but-scoped task |
| 4 | ~32 | ~10 min | Plan requires three substantive rewrites |
| 5 | ~40 | ~13 min | Really melt the rock |

The 0–100 quality estimate

Momus's score is not a sum of findings — it's a direct quality estimate: "would I, as a senior engineer, sign off on this plan for implementation right now?"

| Range | Verdict | Meaning |
| --- | --- | --- |
| 100 | SAFE | Ship it now; no remaining concerns |
| 85–99 | SAFE | All major gaps closed, only minor polish left |
| 70–84 | CAUTION | Non-blocking concerns, plan is implementable |
| 50–69 | RISKY | Significant gaps, human review required |
| 0–49 | BLOCK | Plan is not implementable as written |

Score anchors (from docs/scoring-rubric.md):
95 — Every phase has a measurable gate. Hephaestus has a buildable target at phase-0. I'd ship it.
75 — Correct and implementable but error-path handling is thin. Risk is manageable.
55 — Structural gaps — a named service isn't declared as a dependency. Needs a rewrite.
35 — Plan contradicts itself (phase-2 removes a file phase-4 edits). Not implementable.
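
The score bands can be written as a small lookup. This is an illustrative helper, not part of Temper: the name verdict is hypothetical, and the handling of fractional scores (84.1 is treated as CAUTION because it is below 85) is an assumption, since the table only lists whole numbers.

```python
def verdict(score: float) -> str:
    """Map a 0-100 quality estimate to its verdict band."""
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    if score >= 85:
        return "SAFE"
    if score >= 70:
        return "CAUTION"
    if score >= 50:
        return "RISKY"
    return "BLOCK"


# The four score anchors listed above.
for anchor in (95, 75, 55, 35):
    print(anchor, verdict(anchor))
```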

Three real deepen runs

Example 1 — Unify OIDC and legacy JWT auth

Auth unification is a canonical deepen-friendly task. Pass 1 catches the happy path. Pass 2 catches the migration envelope. Pass 3 catches clock skew and token refresh edge cases — things a single planner pass never surfaces.

bash · invocation
/anneal-temper:anneal --depth 3 "Rewrite the auth middleware to unify OIDC
and legacy JWT flows. Both must continue to work during a 90-day transition.
The unified middleware must expose a single Resolver interface."

| depth | score | what changed | action |
| --- | --- | --- | --- |
| d0 | 68 | Covers OIDC + JWT but merges claim sets without precedence rules. Red-team-scope: 'What if both tokens are present with different user IDs?' No answer. | continue |
| d1 | 82 | Adds claim-precedence rules (OIDC wins, JWT fallback) and dual-token rejection. Red-team-security: 'Clock skew between OIDC issuer and JWT issuer can reject valid tokens during refresh.' No answer. | continue |
| d2 | 89 | Adds ±30s skew tolerance, refresh-path test matrix (early/on-time/late), and telemetry phase tracking skew distribution for 30 days before narrowing the tolerance. | cap → exit |

Why not Cast: Cast produced a 65-score plan on the same task — the skew-tolerance gap was not visible to a single planner pass. Temper's depth-1 rewrite surfaced it via Red-Team-Security; depth-2 fixed it.
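
To make the depth-1 and depth-2 additions concrete, here is a rough sketch of what claim precedence plus skew tolerance could look like. None of it comes from the generated plan. The Claims type, the resolve function and TokenRejected are hypothetical names. The sketch also assumes that dual-token rejection applies only when the two tokens carry different user IDs, and that an expired OIDC token falls back to a valid JWT.

```python
from dataclasses import dataclass
from typing import Optional

SKEW_TOLERANCE_S = 30.0  # the ±30s tolerance named in the depth-2 plan


@dataclass(frozen=True)
class Claims:
    user_id: str
    not_before: float  # epoch seconds
    expires_at: float  # epoch seconds


class TokenRejected(Exception):
    """Raised when no token can be accepted."""


def is_time_valid(claims: Claims, now: float, skew: float = SKEW_TOLERANCE_S) -> bool:
    # Widen the validity window on both sides to absorb issuer clock skew.
    return claims.not_before - skew <= now <= claims.expires_at + skew


def resolve(oidc: Optional[Claims], jwt: Optional[Claims], now: float) -> Claims:
    # Dual-token rejection (assumed to apply only when the user IDs conflict).
    if oidc is not None and jwt is not None and oidc.user_id != jwt.user_id:
        raise TokenRejected("conflicting user IDs across tokens")
    # Precedence: OIDC wins, legacy JWT is the fallback.
    for candidate in (oidc, jwt):
        if candidate is not None and is_time_valid(candidate, now):
            return candidate
    raise TokenRejected("no valid token")
```

For example, a JWT that expired 20 seconds ago is still accepted, while one that expired 31 seconds ago is rejected.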

Example 2 — Redesign the event bus

The problem is bounded (event bus, not general messaging) but the design space is rich: Kafka vs NATS vs Redis Streams vs a custom Postgres-backed log. Run at depth 5, the plan pivots several times between depth 1 and depth 3, and those pivots would never happen under Cast's single-pass discipline.

bash · invocation
/anneal-temper:anneal --depth 5 "Redesign the event bus: currently a mix of
Postgres LISTEN/NOTIFY, Redis pub/sub, and in-process EventEmitter.
Consolidate into one bus: durable delivery, ordered per-aggregate,
replay from arbitrary offset, cross-service fanout."

| depth | score | what changed | action |
| --- | --- | --- | --- |
| d0 | 52 | Proposes Kafka. Red-team-scope: 'We have no Kafka operator; adding one is a ~6-month project.' | continue |
| d1 | 67 | Pivots to NATS JetStream. Red-team-assumptions: 'We don't operate NATS either; same problem.' | continue |
| d2 | 78 | Pivots to Postgres-backed log + Redis Streams notifications. Red-team-security: 'Redis Streams has no auth.' | continue |
| d3 | 84 | Adopts Postgres-for-everything, Redis removed. Red-team-scope: 'Postgres LISTEN doesn't survive replica failover.' | continue |
| d4 | 86 | Adds leader-only listener pattern with PostgreSQL logical replication as future upgrade path. | delta → exit |

Example 3 — Replace a flaky migration

The task is narrow. Pass 1 covers the obvious phased migration. Pass 2 catches concurrent inserts during backfill, trigger interactions, and replica lag: exactly the "weird edge cases" that cost an incident.

bash · invocation
/anneal-temper:anneal --depth 2 "The 2025-08-12 migration that added
tenant_id locks tables for ~4 minutes in production. Rewrite the migration
to complete with zero downtime. Existing data must be preserved."

| depth | score | what changed | action |
| --- | --- | --- | --- |
| d0 | 78 | Phases the migration (NULLable column → batch backfill → NOT NULL → RLS policy). Red-team-assumptions: 'Batch backfill holds a long-lived transaction; will bloat WAL on a busy replica.' | continue |
| d1 | 88 | Adds explicit batch commit boundaries (every 10k rows) and monitoring gate (pause if replica lag > 30s). Row-level backfill becomes a state-machine with resumable checkpoint table. | cap → exit |
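
The depth-1 plan's backfill shape (commit per batch, pause on replica lag, resume from a checkpoint) might look roughly like the sketch below. It is illustrative only. It uses Python's sqlite3 as a stand-in so it runs anywhere, whereas the real migration would target the production database. The items table, the backfill_checkpoint table, and the replica_lag_s callback are all hypothetical.

```python
import sqlite3
import time
from typing import Callable


def backfill_tenant_id(
    conn: sqlite3.Connection,
    default_tenant: int,
    batch_size: int = 10_000,
    replica_lag_s: Callable[[], float] = lambda: 0.0,  # hypothetical lag probe
    max_lag_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Backfill tenant_id in primary-key order; return the number of rows updated."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS backfill_checkpoint "
        "(job TEXT PRIMARY KEY, last_id INTEGER NOT NULL)"
    )
    row = conn.execute(
        "SELECT last_id FROM backfill_checkpoint WHERE job = 'tenant_id'"
    ).fetchone()
    last_id = row[0] if row else 0  # resume from the checkpoint if one exists
    total = 0
    while True:
        while replica_lag_s() > max_lag_s:  # monitoring gate: pause while lagging
            sleep(1.0)
        ids = [r[0] for r in conn.execute(
            "SELECT id FROM items WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        )]
        if not ids:
            return total
        cur = conn.execute(
            "UPDATE items SET tenant_id = ? "
            "WHERE id > ? AND id <= ? AND tenant_id IS NULL",
            (default_tenant, last_id, ids[-1]),
        )
        total += cur.rowcount
        last_id = ids[-1]
        conn.execute(
            "INSERT OR REPLACE INTO backfill_checkpoint (job, last_id) "
            "VALUES ('tenant_id', ?)",
            (last_id,),
        )
        conn.commit()  # explicit batch commit boundary: no long-lived transaction
```

Because the checkpoint is written in the same transaction as the batch, an interrupted run restarts at the first unprocessed id instead of rescanning the table.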

Right tool, right problem

Use Temper when: the problem rewards iteration (auth, event bus, migrations, retry semantics); the solution space is bounded; you have budget for depth 3+; you want a reproducible convergence trail (depth-history.json).
Don't use Temper when: the task needs breadth (multiple plausible architectures — use Alloy); it's a simple bug fix (Cast is 3× cheaper); "better" is genuinely subjective (naming, UX copy, API ergonomics — convergence rules don't apply to subjective taste).

What Temper writes to disk

| File | What's in it |
| --- | --- |
| depth-history.json | Per-depth plans, envelopes, scores, convergence check output (diff-able) |
| plan-depth-0.md | Raw seed plan (preserved, not deleted) |
| plan-depth-1.md, plan-depth-2.md… | Raw intermediate plans per depth |
| reviews/momus-envelope-depth-N.yaml | One Momus envelope per depth with score 0–100 |
| reviews/redteam-*-envelope-depth-N.yaml | Three red-team envelopes per depth |
| plan/plan.md | Final blended plan overview (≤80 lines) |
| plan/phase-NN-*.md | Final phase files with success criteria |
| {variant}-{run_id}.xml | Opus 4.7 semantic-XML prompt for one-shot execution |
| rollup.yaml | convergence_reason, depth_scores, simultaneous_pass, emission decision |

depth-history.json is designed to be diff-able. Run diff -u on consecutive depth plans (for example plan-depth-0.md and plan-depth-1.md) to see exactly what each rewrite changed.
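
The same comparison can be scripted with Python's standard difflib module. This is a small convenience sketch: the helper name diff_depth_plans is hypothetical, and it assumes the plan-depth-N.md files sit together in one run directory.

```python
import difflib
from pathlib import Path


def diff_depth_plans(plan_dir: Path, depth_a: int, depth_b: int) -> str:
    """Return a unified diff between two per-depth plan files."""
    a = plan_dir / f"plan-depth-{depth_a}.md"
    b = plan_dir / f"plan-depth-{depth_b}.md"
    return "".join(
        difflib.unified_diff(
            a.read_text().splitlines(keepends=True),
            b.read_text().splitlines(keepends=True),
            fromfile=a.name,
            tofile=b.name,
        )
    )
```

Calling diff_depth_plans(Path("."), 0, 1) returns the same kind of output as diff -u plan-depth-0.md plan-depth-1.md.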
Temper's defining property: the Red-Team Trinity runs at every depth, not just once. That's what makes the deepen loop converge on a genuinely stronger plan instead of oscillating.