Anneal · Variant

Temper — Fixed-Point Deepen

One plan. Heated and cooled repeatedly. The Red-Team Trinity attacks it at every depth, Momus scores it 0–100, and a deterministic Python script decides when to stop. The loop exits when the numbers say stop — not when the LLM says stop.

01 Iteration Structure

The deepen loop in detail

Depth 0 is the seed. Every subsequent depth gets the prior plan, the full red-team envelopes, and Momus's score. Every rewrite is a full rewrite — not a patch.

pseudocode · deepen loop

Seed (depth 0):
  1. Prometheus-Temper writes plan_0 from (Metis directives, probe report)
  2. Red-Team Trinity attacks plan_0 in parallel
  3. Momus scores plan_0 → score_0
  4. convergence-check.py(depth=0, scores=[score_0], cap=hard_cap) → continue or exit

Iteration (depth N, N ≥ 1):
  1. Prometheus-Temper rewrites plan_{N-1} with:
       (plan_{N-1}, momus_envelope_{N-1}, redteam_envelopes_{N-1},
        metis_directives, depth_scores)
  2. Red-Team Trinity attacks plan_N in parallel
  3. Momus scores plan_N → score_N
  4. convergence-check.py(depth=N, scores=depth_scores, cap=hard_cap) → continue or exit

Exit: plan_final = plan_N where loop exited

Prometheus-Temper reads the prior plan and its critiques as context, then writes a new plan from scratch. This is load-bearing: patching propagates defects; rewriting reconsiders the whole structure.
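The loop above can be sketched in Python as follows. This is a minimal illustration, not Temper's actual code: the four callables (`rewrite`, `attack`, `score`, `should_exit`) and their signatures are hypothetical stand-ins for Prometheus-Temper, the Red-Team Trinity, Momus, and convergence-check.py, and the Momus envelope and Metis directives are left out for brevity.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures, assumed for this sketch only.
Planner = Callable[[str, list, List[float]], str]  # (prior_plan, envelopes, depth_scores) -> new plan
RedTeam = Callable[[str], list]                    # plan -> red-team envelopes
Scorer = Callable[[str], float]                    # plan -> score in 0-100
Check = Callable[[int, List[float], int], bool]    # (depth, depth_scores, hard_cap) -> exit?


def deepen(seed_plan: str, rewrite: Planner, attack: RedTeam,
           score: Scorer, should_exit: Check,
           hard_cap: int = 3) -> Tuple[str, List[float]]:
    """Score the seed, then do full rewrites until the check says exit."""
    plan = seed_plan
    depth_scores: List[float] = []
    depth = 0
    while True:
        envelopes = attack(plan)          # red team attacks plan_N
        depth_scores.append(score(plan))  # scorer rates plan_N
        if should_exit(depth, depth_scores, hard_cap):
            return plan, depth_scores     # plan_final = plan_N where the loop exited
        # Full rewrite: the planner receives the prior plan and its
        # critiques as context and returns a whole new plan, not a diff.
        plan = rewrite(plan, envelopes, depth_scores)
        depth += 1
```

Termination here depends entirely on `should_exit`, which is why the hard cap is part of the check rather than of the planner.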

02 Convergence Detection

Three rules. First one wins.

The loop exits when any one of these is true. The script is the spec — not the LLM, not an implicit judgment.

Rule 1 — Variance of top-3 depth scores < 0.3
Scores have stabilized. The three highest-scoring depths (in the example below, also the last three) are producing substantively the same plan, and further iteration won't change that.
```
depth_scores = [72, 81, 87, 86, 87]
top_3_scores = [87, 87, 86]
variance = 0.22  → < 0.3 → CONVERGED
```
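A minimal sketch of Rule 1, assuming population variance: that choice reproduces the 0.22 in the example, whereas sample variance for the same three scores would be about 0.33 and would not pass. The function names are hypothetical, and the guard for fewer than three scores is an assumption.

```python
from statistics import pvariance
from typing import List


def top3_variance(depth_scores: List[float]) -> float:
    """Population variance of the three highest scores seen so far."""
    top_3 = sorted(depth_scores, reverse=True)[:3]
    return pvariance(top_3)


def rule_1_converged(depth_scores: List[float], threshold: float = 0.3) -> bool:
    # Assumption: with fewer than three scores the rule cannot fire.
    if len(depth_scores) < 3:
        return False
    return top3_variance(depth_scores) < threshold
```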
Rule 2 — |Δ score| < 0.15 across last 2 depths
Marginal improvement, diminishing returns. The plan is still changing but the changes are small enough that further depth won't produce qualitatively better output.
```
depth_scores = [65, 78, 83, 84.1, 84.0]
delta = |84.0 - 84.1| = 0.1  → < 0.15 → CONVERGED
```
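Rule 2 reduces to a single comparison on the last two scores. The sketch below uses a hypothetical function name and assumes the rule cannot fire until two depths have been scored.

```python
from typing import List


def rule_2_converged(depth_scores: List[float], threshold: float = 0.15) -> bool:
    # Assumption: the rule needs at least two depths to compare.
    if len(depth_scores) < 2:
        return False
    delta = abs(depth_scores[-1] - depth_scores[-2])
    return delta < threshold
```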
Rule 3 — depth == hard_cap
Runaway iteration guard. Default cap is 3; user-configurable 1–5. This rule guarantees no run spins forever.
```
depth_scores = [45, 52, 58, 61, 63]
depth = 5, cap = 5  → CONVERGED (cap reached)
```

03 Depth Cap & Cost Budget

How deep do you need to go?

Depth	Spawns	Wall-clock	Use when
1	~8	~3 min	You want the deepen discipline but trust the seed
2	~16	~5 min	Plan has one known weakness the seed won't catch
3 (default)	~24	~7 min	Typical complex-but-scoped task
4	~32	~10 min	Plan requires three substantive rewrites
5	~40	~13 min	Really melt the rock
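One way to use the table is to pick the deepest cap that fits a time budget. The helper below is a hypothetical illustration. Its figures are copied from the table and are approximate, so treat the result as a starting point.

```python
from typing import Optional

# Approximate figures from the table above: cap -> (spawns, wall-clock minutes).
BUDGET = {1: (8, 3), 2: (16, 5), 3: (24, 7), 4: (32, 10), 5: (40, 13)}


def deepest_cap_within(minutes: float) -> Optional[int]:
    """Largest depth cap whose approximate wall-clock fits the budget."""
    fitting = [cap for cap, (_, wall) in BUDGET.items() if wall <= minutes]
    return max(fitting) if fitting else None
```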

04 Momus Scoring Rubric

The 0–100 quality estimate

Momus's score is not a sum of findings — it's a direct quality estimate: "would I, as a senior engineer, sign off on this plan for implementation right now?"

Range	Verdict	Meaning
100	SAFE	Ship it now — no remaining concerns
85–99	SAFE	All major gaps closed, only minor polish left
70–84	CAUTION	Non-blocking concerns, plan is implementable
50–69	RISKY	Significant gaps, human review required
0–49	BLOCK	Plan is not implementable as written

Score anchors (from docs/scoring-rubric.md):
95 — Every phase has a measurable gate. Hephaestus has a buildable target at phase-0. I'd ship it.
75 — Correct and implementable but error-path handling is thin. Risk is manageable.
55 — Structural gaps — a named service isn't declared as a dependency. Needs a rewrite.
35 — Plan contradicts itself (phase-2 removes a file phase-4 edits). Not implementable.
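The range table can be expressed as a threshold lookup. The function name is hypothetical. The table lists integer ranges, so the sketch assumes a fractional score such as 84.1 falls in the band whose lower bound it has reached.

```python
def verdict(score: float) -> str:
    """Map a 0-100 score to the verdict band from the range table."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 85:
        return "SAFE"
    if score >= 70:
        return "CAUTION"
    if score >= 50:
        return "RISKY"
    return "BLOCK"
```

Applied to the four score anchors, this gives SAFE for 95, CAUTION for 75, RISKY for 55, and BLOCK for 35.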

05 Worked Examples

Three real deepen runs

Example 1 — Unify OIDC and legacy JWT auth

Auth unification is a canonical deepen-friendly task. Pass 1 catches the happy path. Pass 2 catches the migration envelope. Pass 3 catches clock skew and token refresh edge cases — things a single planner pass never surfaces.

bash · invocation

/anneal-temper:anneal --depth 3 "Rewrite the auth middleware to unify OIDC
and legacy JWT flows. Both must continue to work during a 90-day transition.
The unified middleware must expose a single Resolver interface."

d0 · 68 · Covers OIDC + JWT but merges claim sets without precedence rules. Red-team-scope: 'What if both tokens are present with different user IDs?' No answer. · continue

d1 · 82 · Adds claim-precedence rules (OIDC wins, JWT fallback) and dual-token rejection. Red-team-security: 'Clock skew between OIDC issuer and JWT issuer can reject valid tokens during refresh.' No answer. · continue

d2 · 89 · Adds ±30s skew tolerance, refresh-path test matrix (early/on-time/late), and telemetry phase tracking skew distribution for 30 days before narrowing the tolerance. · cap→exit

Why not Cast: Cast produced a 65-score plan on the same task — the skew-tolerance gap was not visible to a single planner pass. Temper's depth-1 rewrite surfaced it via Red-Team-Security; depth-2 fixed it.

Example 2 — Redesign the event bus

The problem is bounded (event bus, not general messaging) but the design space is rich: Kafka vs NATS vs Redis Streams vs custom Postgres-backed log. With a depth-5 run, Temper makes pivots at depth 1 and depth 3 that would never happen under Cast's single-pass discipline.

bash · invocation

/anneal-temper:anneal --depth 5 "Redesign the event bus: currently a mix of
Postgres LISTEN/NOTIFY, Redis pub/sub, and in-process EventEmitter.
Consolidate into one bus: durable delivery, ordered per-aggregate,
replay from arbitrary offset, cross-service fanout."

d0 · 52 · Proposes Kafka. Red-team-scope: 'We have no Kafka operator; adding one is a ~6-month project.' · continue

d1 · 67 · Pivots to NATS JetStream. Red-team-assumptions: 'We don't operate NATS either; same problem.' · continue

d2 · 78 · Pivots to Postgres-backed log + Redis Streams notifications. Red-team-security: 'Redis Streams has no auth.' · continue

d3 · 84 · Adopts Postgres-for-everything, Redis removed. Red-team-scope: 'Postgres LISTEN doesn't survive replica failover.' · continue

d4 · 86 · Adds leader-only listener pattern with PostgreSQL logical replication as future upgrade path. · delta→exit

Example 3 — Replace a flaky migration

The task is narrow. Depth 1 covers the obvious phased migration. Depth 2 catches concurrent inserts during backfill, trigger interactions, and replica lag: exactly the "weird edge cases" that cost an incident.

bash · invocation

/anneal-temper:anneal --depth 2 "The 2025-08-12 migration that added
tenant_id locks tables for ~4 minutes in production. Rewrite the migration
to complete with zero downtime. Existing data must be preserved."

d0 · 78 · Phases the migration (NULLable column → batch backfill → NOT NULL → RLS policy). Red-team-assumptions: 'Batch backfill holds a long-lived transaction; will bloat WAL on a busy replica.' · continue

d1 · 88 · Adds explicit batch commit boundaries (every 10k rows) and monitoring gate (pause if replica lag > 30s). Row-level backfill becomes a state-machine with resumable checkpoint table. · cap→exit

06 When Temper Excels

Right tool, right problem

Use Temper when: the problem rewards iteration (auth, event bus, migrations, retry semantics); the solution space is bounded; you have budget for depth 3+; you want a reproducible convergence trail (depth-history.json).

Don't use Temper when: the task needs breadth (multiple plausible architectures — use Alloy); it's a simple bug fix (Cast is 3× cheaper); "better" is genuinely subjective (naming, UX copy, API ergonomics — convergence rules don't apply to subjective taste).

07 Run Artifacts

What Temper writes to disk

File	What's in it
`depth-history.json`	Per-depth plans, envelopes, scores, convergence check output — diff-able
`plan-depth-0.md`	Raw seed plan (preserved, not deleted)
`plan-depth-1.md, plan-depth-2.md…`	Raw intermediate plans per depth
`reviews/momus-envelope-depth-N.yaml`	One Momus envelope per depth with score 0-100
`reviews/redteam-*-envelope-depth-N.yaml`	Three red-team envelopes per depth
`plan/plan.md`	Final blended plan overview ≤80 lines
`plan/phase-NN-*.md`	Final phase files with success criteria
`{variant}-{run_id}.xml`	Opus 4.7 semantic-XML prompt for one-shot execution
`rollup.yaml`	convergence_reason, depth_scores, simultaneous_pass, emission decision

depth-history.json is designed to be diff-able. Run diff -u on consecutive depth plans (for example plan-depth-0.md and plan-depth-1.md) to see exactly what each rewrite changed.
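The same comparison can be scripted with Python's standard `difflib`. This is a minimal sketch: the function names are hypothetical, and it assumes the `plan-depth-N.md` files sit directly in one run directory, which the artifact table does not state.

```python
import difflib
from pathlib import Path
from typing import List


def diff_plans(old_text: str, new_text: str, old_name: str, new_name: str) -> str:
    """Unified diff between two plan texts, similar to `diff -u old new`."""
    lines = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=old_name,
        tofile=new_name,
    )
    return "".join(lines)


def diff_run(run_dir: Path) -> List[str]:
    """Diff each plan-depth-N.md against the next depth's plan in run_dir."""
    # Sort numerically on the trailing depth so depth 10 follows depth 9.
    plans = sorted(run_dir.glob("plan-depth-*.md"),
                   key=lambda p: int(p.stem.rsplit("-", 1)[1]))
    return [
        diff_plans(a.read_text(), b.read_text(), a.name, b.name)
        for a, b in zip(plans, plans[1:])
    ]
```

An empty string from `diff_plans` means two consecutive depths produced identical text.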

Temper's defining property: the Red-Team Trinity runs at every depth, not just once. That's what makes the deepen loop converge on a genuinely stronger plan instead of oscillating.
