Anneal · Docs

Shared Contract

Three architectures. One contract. The difference is stage 4. This is the cross-variant reference — the seven-stage spine, eight invariants, the envelope schema that glues the pipeline together, and the stage-4 divergence that makes Cast, Alloy, and Temper distinct while staying fully interchangeable.

01 The Seven-Stage Spine

Every run walks seven stages

Stages 1–3 and 5–7 are identical across Cast, Alloy, and Temper. Only stage 4 differs. The command files in each variant describe the same agents — Metis at stage 3, Red-Team Trinity at stage 5, Hephaestus at stage 6, Atlas at stage 7.

Stage	Name	What happens
01	Intent Gate	Classify the task; reject unsafe inputs.
02	Probe	Scan the codebase; enumerate skills; read docs.
03	Enrich	Metis flags ambiguity and emits directives.
04	Plan	Variant-specific: Cast / Alloy / Temper.
05	Review	Red-Team Trinity (parallel) → Oracle.
06	Validate	Hephaestus builds and exercises the artifact.
07	Emit / Re-loop	Atlas writes the XML prompt and plan directory, or re-loops.

Why only stage 4? Keeping stages 1–3, 5–7 identical means a run's quality is comparable across variants. Same probe, same Metis, same Red Team, same Oracle, same Hephaestus. If Alloy produces a stronger plan than Cast on the same task, the difference is genuinely in the tournament — not in upstream noise. New variants can be added at stage 4 alone, with no changes to stages 1–3 or 5–7.

Variant	Stage-4 shape
Cast	One Prometheus call. Momus audits the single plan.
Alloy	N parallel Prometheus-Alloy calls (biased) → Synthesizer blends → Momus audits the blend.
Temper	Prometheus-Temper rewrites once per depth; Red-Team Trinity runs at every depth; Momus scores 0–100 per depth; convergence-check.py decides exit.
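
The claim that only stage 4 varies can be pictured as a fixed spine with one pluggable slot. The following is a minimal sketch under stated assumptions: the planner functions, the `STAGE_4` table, and `run_spine` are hypothetical names invented for illustration, and the stage bodies are placeholders rather than Anneal's real agents.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins: each planner takes a task and returns a plan label.
Planner = Callable[[str], str]


def cast_plan(task: str) -> str:
    return f"cast-plan({task})"


def alloy_plan(task: str) -> str:
    return f"alloy-plan({task})"


def temper_plan(task: str) -> str:
    return f"temper-plan({task})"


# The only variant-specific piece: a table from architecture name to stage-4 planner.
STAGE_4: Dict[str, Planner] = {
    "cast": cast_plan,
    "alloy": alloy_plan,
    "temper": temper_plan,
}


def run_spine(task: str, architecture: str) -> List[str]:
    """Walk the seven stages; only stage 4 depends on the architecture."""
    trace = ["intent_gate", "probe", "enrich"]         # stages 1-3, shared
    trace.append(STAGE_4[architecture](task))          # stage 4, variant-specific
    trace += ["review", "validate", "emit_or_reloop"]  # stages 5-7, shared
    return trace
```

In this sketch, adding a new variant means adding one entry to `STAGE_4`; nothing else in `run_spine` changes.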

02 The Eight Shared Invariants

Non-negotiable constraints

The commands refuse flags that would break these invariants. Invalid plans are re-looped, not accepted. These rules apply identically to every variant; none can be disabled.

INVARIANT 01

Red Team Trinity Always Runs

Security · Scope · Assumptions — three parallel adversaries. No flag disables this.

Cast — runs once at stage 5.
Alloy — runs once at stage 5 on the blend.
Temper — runs at every depth of the deepen loop.

Dispatch mechanics are load-bearing: in a single assistant message, emit three Task tool calls (one per adversary). Do not set run_in_background: true — that breaks the pipeline.

INVARIANT 02

Functional Validation Always Runs

Hephaestus builds the real artifact in a scratch worktree, captures real build output, real runtime output (CLI stdout/stderr, API responses, screenshots if UI), compares against the plan's success criteria, and returns PASS or FAIL with evidence cited.

Evidence quality rules — empty files are INVALID; "Build succeeded" without the log line is INVALID; screenshots of blank pages are INVALID. No mocks, no test files, no stubs. Ever.
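
Two of the evidence rules above are mechanical enough to sketch. The function below is an illustration only, not Hephaestus's actual checker: `evidence_problems` and its parameters are hypothetical, artifacts are modelled as in-memory bytes, and the blank-screenshot rule is left out because it needs image inspection.

```python
from typing import Dict, List


def evidence_problems(
    artifacts: Dict[str, bytes],  # evidence file name -> captured content
    cited_log_line: str,          # the log line quoted to back a success claim
    build_log: str,               # the full captured build output
) -> List[str]:
    """Return reasons the evidence is INVALID; an empty list means no rule tripped."""
    problems: List[str] = []
    for name, content in artifacts.items():
        if len(content) == 0:
            problems.append(f"{name}: empty file")
    if not cited_log_line.strip():
        # "Build succeeded" with nothing quoted from the log is not evidence.
        problems.append("success claimed without a cited log line")
    elif cited_log_line not in build_log:
        problems.append("cited log line not found in captured build output")
    return problems
```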

INVARIANT 03

Dual Output — XML Prompt + Plan Directory

Every successful run produces two artifacts:

XML prompt — an Opus 4.7 semantic-XML prompt per _shared/opus-47-xml-schema.md. One-shot, designed to be pasted into a fresh Claude Code session.

Plan directory — plan/plan.md + plan/phase-NN-*.md for humans to review, edit, and share.

Both ship. The XML is for machines. The plan directory is for humans.

INVARIANT 04

Skill Enrichment

The probe stage scans ~/.claude/skills/ and the project's .claude/skills/. Matching skills inject automatically — if the project has a functional-validation skill, Hephaestus uses it; if the user has a scout skill, probe uses it.

Injection is name-based: skill names match semantic roles. No config required.
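
Name-based injection could look roughly like the sketch below. It assumes each skill is a directory named after the skill, that the two roots are the ones named above, and that a project-level skill overrides a user-level skill of the same name. The `ROLE_TO_SKILL` table and both function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable

# Hypothetical role table: semantic role -> skill directory name.
ROLE_TO_SKILL = {"validate": "functional-validation", "probe": "scout"}


def discover_skills(roots: Iterable[Path]) -> Dict[str, Path]:
    """Map skill name -> directory. Later roots override earlier ones (assumption)."""
    found: Dict[str, Path] = {}
    for root in roots:
        if not root.is_dir():
            continue  # a missing skills directory simply contributes nothing
        for child in sorted(root.iterdir()):
            if child.is_dir():
                found[child.name] = child
    return found


def inject(roots: Iterable[Path]) -> Dict[str, Path]:
    """Return the roles that have a matching skill, keyed by role."""
    skills = discover_skills(roots)
    return {role: skills[name] for role, name in ROLE_TO_SKILL.items() if name in skills}
```

A call such as `inject([Path.home() / ".claude" / "skills", Path(".claude") / "skills"])` would scan the user-level root first and the project root second.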

INVARIANT 05

Unbounded Re-loop on FAIL

A failure folds into the next run's constraints rather than counting as a terminal state. The re-loop shape is variant-specific:

Cast — failure folds as a new Metis directive; Prometheus re-plans.
Alloy — full re-loop through Intent Gate; tournament re-runs with new directives.
Temper — reset depth = 0; deepen loop re-runs with augmented directives.

The default cap is 3 iterations; the --loop flag lifts the cap and makes the re-loop unbounded.
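
The fold-and-retry behaviour can be sketched as a loop in which each failure becomes a directive for the next attempt. This is a simplified model, not the orchestrator's code: `run_with_reloop` and the `attempt` callback are hypothetical, and `cap=None` stands in for the --loop flag.

```python
from typing import Callable, List, Optional, Tuple


def run_with_reloop(
    attempt: Callable[[List[str]], Tuple[bool, str]],
    cap: Optional[int] = 3,  # None models the --loop flag (no cap)
) -> Tuple[bool, int, List[str]]:
    """Re-run until PASS; each FAIL is folded into the directives for the next run."""
    directives: List[str] = []
    iteration = 0
    while cap is None or iteration < cap:
        iteration += 1
        passed, failure = attempt(directives)
        if passed:
            return True, iteration, directives
        # The failure is not terminal: it becomes a constraint on the next run.
        directives.append(f"Avoid repeating: {failure}")
    return False, iteration, directives
```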

INVARIANT 06

Parallelization by Default

Parallelism is the default mode, not an optimisation:

Red Team Trinity always fans out — three Task calls in one message.
Alloy's N planners fan out via xargs -P $(sysctl -n hw.ncpu || nproc).
Temper's Red Team fans out even inside the deepen loop.

Sequential execution is a fallback, not a design choice.

INVARIANT 07

Category Routing, Not Model Picking

The user specifies --type ultrabrain | deep | quick. The harness maps the category to a model at runtime. Plugins do not hardcode model identifiers like claude-opus-4-7 — they declare category requirements, and the runtime resolves them.

This means model upgrades propagate automatically without touching plugin code.
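
As a rough illustration of category routing, the runtime can be thought of as owning a lookup table that plugins never see. The table contents below are placeholders rather than real model identifiers, and `resolve_model` is a hypothetical name.

```python
from typing import Dict

# Hypothetical runtime-owned table; the values are placeholders, not real model IDs.
CATEGORY_TO_MODEL: Dict[str, str] = {
    "ultrabrain": "model-large",
    "deep": "model-medium",
    "quick": "model-small",
}


def resolve_model(category: str, table: Dict[str, str] = CATEGORY_TO_MODEL) -> str:
    """Resolve a --type category to a model at runtime."""
    try:
        return table[category]
    except KeyError:
        raise ValueError(f"unknown --type category: {category!r}") from None
```

Swapping in an upgraded table changes what every plugin gets without editing any plugin, which is the propagation property described above.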

INVARIANT 08

Dual Prompts by Model Family

Agents ship Claude-flavored and GPT-flavored prompts in the same SKILL.md / agent.md file. The runtime picks at dispatch time based on which model family the category resolves to.

This isolates prompt-format differences (tool-call syntax, reasoning-tag conventions) from the agent's semantic role. Agents are portable; formats are not.
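
Prompt selection at dispatch time might be sketched as follows. The way the family is derived from a model identifier (the prefix before the first hyphen) is purely an assumption for illustration, as are the function names; the source does not specify how the runtime detects the family.

```python
from typing import Dict


def model_family(model_id: str) -> str:
    """Assumed convention: the family is the prefix before the first hyphen."""
    return model_id.split("-", 1)[0].lower()


def pick_prompt(prompts: Dict[str, str], model_id: str) -> str:
    """Choose the prompt flavor that matches the resolved model's family."""
    family = model_family(model_id)
    if family not in prompts:
        raise KeyError(f"no prompt flavor for model family {family!r}")
    return prompts[family]
```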

03 The Envelope Schema

The shared data contract between agents

Every reviewer in Anneal — Metis, Momus, each Red-Team Trinity member, Oracle — returns an envelope in a shared schema. The schema lives in _shared/plan-reviewer-schema.md.

envelope schema · plan-reviewer-schema.md · yaml

agent: <agent-name>                   # e.g. "metis", "redteam-security"
run_id: <run-id>
depth: <int>                          # Temper only; optional
verdict: SAFE | CAUTION | RISKY | BLOCK
score: <int 0-100>                    # Temper/Momus only; optional
summary: <string, ≤240 chars>
findings:
  - location: <file-path or section>
    severity: critical | high | medium | low
    concern: <string>
    demand: <imperative sentence>     # what the planner must do
directives:                           # Metis only; imperative sentences
  - <string>
clarifying_questions:                 # BLOCK verdicts only
  - <string>
metadata:
  timestamp: <ISO-8601>
  reviewer_model: <string>
  token_cost: <int>
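
For readers who prefer types to YAML, the envelope can be modelled as a pair of dataclasses with a small validity check. This is a sketch derived only from the field comments in the schema; the class names and the `problems` method are hypothetical, and the `metadata` block is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional

VERDICTS = ("SAFE", "CAUTION", "RISKY", "BLOCK")
SEVERITIES = ("critical", "high", "medium", "low")


@dataclass
class Finding:
    location: str
    severity: str
    concern: str
    demand: str  # imperative sentence: what the planner must do


@dataclass
class Envelope:
    agent: str
    run_id: str
    verdict: str
    summary: str
    findings: List[Finding] = field(default_factory=list)
    directives: List[str] = field(default_factory=list)
    clarifying_questions: List[str] = field(default_factory=list)
    depth: Optional[int] = None  # Temper only
    score: Optional[int] = None  # Temper/Momus only

    def problems(self) -> List[str]:
        """List schema violations; an empty list means the envelope looks valid."""
        out: List[str] = []
        if self.verdict not in VERDICTS:
            out.append(f"unknown verdict {self.verdict!r}")
        if len(self.summary) > 240:
            out.append("summary exceeds 240 chars")
        if self.score is not None and not 0 <= self.score <= 100:
            out.append("score outside 0-100")
        if self.clarifying_questions and self.verdict != "BLOCK":
            out.append("clarifying_questions on a non-BLOCK verdict")
        out += [
            f"unknown severity {f.severity!r}"
            for f in self.findings
            if f.severity not in SEVERITIES
        ]
        return out
```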

Verdict semantics

Verdict	Gate behavior
`SAFE`	Proceed. No concerns.
`CAUTION`	Proceed. Record findings; downstream reviewers and Oracle aggregate them.
`RISKY`	Proceed with explicit human override at the Oracle stage.
`BLOCK`	Do not proceed. If clarifying questions are present, surface and ABORT. Otherwise re-loop with findings folded as new Metis directives.

Overall verdict derivation: The overall verdict for a run is the worst across all envelopes in the final iteration: BLOCK > RISKY > CAUTION > SAFE. Hephaestus's PASS | FAIL maps to SAFE | BLOCK and participates in the same worst-of aggregation.
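
The worst-of rule is short enough to write out directly. The sketch below follows the ordering and the PASS/FAIL mapping stated above; the function and constant names are hypothetical.

```python
from typing import Iterable

# Ordered from best to worst, matching BLOCK > RISKY > CAUTION > SAFE.
SEVERITY_ORDER = ["SAFE", "CAUTION", "RISKY", "BLOCK"]
HEPHAESTUS_MAP = {"PASS": "SAFE", "FAIL": "BLOCK"}


def overall_verdict(reviewer_verdicts: Iterable[str], hephaestus: str) -> str:
    """Return the worst verdict across all envelopes plus the mapped Hephaestus result."""
    verdicts = list(reviewer_verdicts) + [HEPHAESTUS_MAP[hephaestus]]
    return max(verdicts, key=SEVERITY_ORDER.index)
```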

04 Rollup & Emission Decision

Atlas computes the final rollup

At stage 7, Atlas aggregates all envelopes from the current iteration into a rollup document. The rollup drives the emission decision: EMIT, RE_LOOP, or ABORT.

rollup schema · atlas stage-7 output · yaml

rollup:
  run_id: <run-id>
  architecture: cast | alloy | temper
  overall_verdict: SAFE | CAUTION | RISKY | BLOCK
  gate_status:
    metis: SAFE | CAUTION | RISKY | BLOCK
    momus: SAFE | CAUTION | RISKY | BLOCK
    red_team_trinity: "N/3 PASS"          # e.g. "3/3 PASS"
    oracle: SAFE | CAUTION | RISKY | BLOCK
    hephaestus: PASS | FAIL
  simultaneous_pass: <bool>
  emission_decision: EMIT | RE_LOOP | ABORT
  iteration_count: <int>
  # variant-specific fields:
  depth_final: <int>                    # Temper only
  depth_scores: [<int>, ...]            # Temper only
  convergence_reason: variance | delta | cap  # Temper only
  bias_set: [<string>, ...]             # Alloy only
  synthesis_provenance: <path>          # Alloy only

Emission logic

emission decision algorithm · pseudocode

if simultaneous_pass == true
   AND overall_verdict in {SAFE, CAUTION}:
    emission_decision = EMIT

elif overall_verdict == BLOCK
     AND Metis.clarifying_questions is non-empty:
    emission_decision = ABORT   # surface questions, stop

else:
    emission_decision = RE_LOOP  # fold findings, run again

simultaneous_pass is a coherence check, not a sign-off accumulator. A Momus that green-lit iteration 1 and a Hephaestus that passed iteration 2 do not combine to simultaneous_pass: true. Every gate must land green in the same iteration. If the plan drifted between iterations, the drift shows up here as a re-loop trigger.
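
The emission rule and the same-iteration requirement can be sketched together. The decision function mirrors the pseudocode above. The `simultaneous_pass` helper rests on an assumption about what counts as green (SAFE, CAUTION, PASS, or "3/3 PASS"), which the source does not spell out, and its input shape (gate name mapped to an iteration number and a status) is hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: these statuses count as "green" for the coherence check.
GREEN = {"SAFE", "CAUTION", "PASS", "3/3 PASS"}


def simultaneous_pass(gates: Dict[str, Tuple[int, str]], iteration: int) -> bool:
    """True only if every gate reported green in this same iteration."""
    return bool(gates) and all(
        seen_in == iteration and status in GREEN
        for seen_in, status in gates.values()
    )


def decide_emission(simultaneous: bool, overall: str, metis_questions: List[str]) -> str:
    """Mirror of the emission pseudocode: EMIT, ABORT, or RE_LOOP."""
    if simultaneous and overall in {"SAFE", "CAUTION"}:
        return "EMIT"
    if overall == "BLOCK" and metis_questions:
        return "ABORT"    # surface the questions and stop
    return "RE_LOOP"      # fold findings and run again
```

A Momus result from iteration 1 paired with a Hephaestus result from iteration 2 fails the check for iteration 2, so sign-offs cannot accumulate across runs.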

05 Dispatch Mechanics

Parallel dispatches are load-bearing

The most common mistake in Anneal pipelines is misunderstanding how parallelism works. Dispatch mechanics appear in every command file for good reason.

The canonical dispatch note (paraphrased from every command file): In a SINGLE assistant message, emit three Task tool calls (one per Red-Team adversary). Do NOT set run_in_background: true on any of them — that makes them fire-and-forget and breaks the pipeline. The Task tool already executes multiple calls in one message concurrently; that is where the parallelism comes from. Wait for ALL THREE envelope responses before invoking Oracle. No partial reviews.

correct · parallel fan-out

# One message, three Task calls
# Runtime executes concurrently

s  = Task(agent="redteam-security", ...)
sc = Task(agent="redteam-scope", ...)
a  = Task(agent="redteam-assumptions", ...)

# wait for all three
Oracle(envelopes=[s, sc, a])

wrong · fire-and-forget

# run_in_background: true breaks pipeline
# dispatches return immediately (no result)

Task("redteam-security",
     run_in_background=true)  # ← WRONG
Task("redteam-scope",
     run_in_background=true)  # ← WRONG
Task("redteam-assumptions",
     run_in_background=true)  # ← WRONG

# Oracle gets empty inputs
Oracle()  # reports 3/3 PASS, all empty

The guard against the wrong pattern is twofold: the explicit dispatch note in every command spec, and the emission gate's simultaneous_pass check. Empty envelopes fail the simultaneous-pass check because they cannot produce a valid verdict — the rollup triggers a re-loop, not an emit, surfacing the pipeline error rather than silently shipping an unreviewed plan.
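
The "dispatch together, wait for all" shape has a close analogue in Python's asyncio, shown here only as an analogy: it is not how the Task tool is implemented. The `review` coroutine is a hypothetical stand-in for one Task call, and the envelope is reduced to two fields.

```python
import asyncio
from typing import Dict, List

ADVERSARIES = ["redteam-security", "redteam-scope", "redteam-assumptions"]


async def review(agent: str, plan: str) -> Dict[str, str]:
    """Hypothetical stand-in for one Task call; returns a minimal envelope."""
    await asyncio.sleep(0)  # yield control, as a real dispatch would while waiting
    return {"agent": agent, "verdict": "SAFE"}


async def red_team(plan: str) -> List[Dict[str, str]]:
    """Start all three reviews at once and return only when every one has answered."""
    envelopes = await asyncio.gather(*(review(a, plan) for a in ADVERSARIES))
    # Refuse to hand Oracle a partial or empty set of reviews.
    if len(envelopes) != 3 or any(not e.get("verdict") for e in envelopes):
        raise RuntimeError("partial or empty review; do not invoke Oracle")
    return list(envelopes)
```

Running `asyncio.run(red_team("plan text"))` yields three envelopes in the same order as `ADVERSARIES`, because `asyncio.gather` preserves argument order.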

The same single-message pattern applies to Alloy's N-variant fan-out. The orchestrator uses xargs -P for CLI parallelism at the shell level, but inside the agent execution context, multiple Task calls in one message is the primitive.

06 Stage-4 Divergence in Detail

Where the variants diverge

Cast — single pass

cast stage-4 · pseudocode

# Stage 4: single planner, single auditor
Prometheus-Cast(task, metis_directives, probe_report)
  → plan.md + phase-*.md

Momus(plan) → envelope

if momus.verdict == BLOCK:
    fold findings as Metis directive
    re-loop once                          # then escalate

Alloy — tournament

alloy stage-4 · pseudocode

# Stage 4: N biased planners run in parallel
bias_set = select_biases(N)
# e.g. ["correctness","minimalist","defensive","performance","ux"]

# parallel via xargs -P $(sysctl -n hw.ncpu || nproc)
for i, bias in enumerate(bias_set):
    Prometheus-Alloy(task, metis_directives, probe_report,
                     bias=bias)
      → variant-{i}-{bias}.md

# wait for all N variants

Synthesizer(variants, metis_directives, probe_report)
  → plan.md + phase-*.md + synthesis-provenance.md

Momus(plan) → envelope    # audits the blend, NOT the variants

if momus.verdict == BLOCK:
    regenerate tournament with Momus findings as constraints
    max 2 stage-4 re-loops, then escalate to full re-loop

Temper — deepen loop

temper stage-4 · pseudocode

# Stage 4: fixed-point deepen loop
# N below denotes the current value of depth
depth = 0
depth_scores = []

loop:
    if depth == 0:
        plan_0 = Prometheus-Temper(task, metis_directives, probe_report)
    else:
        plan_N = Prometheus-Temper(
            task, metis_directives, probe_report,
            prior_plan=plan_{N-1},
            prior_momus=momus_envelope_{N-1},
            prior_redteam=redteam_envelopes_{N-1},
            depth_scores=depth_scores
        )

    # Red Team fans out INSIDE the loop (3 Task calls, one message)
    redteam_envelopes_N = [redteam-security,
                           redteam-scope,
                           redteam-assumptions](plan_N)

    momus_envelope_N = Momus(plan_N)      # includes score 0-100
    depth_scores.append(momus_envelope_N.score)

    # depth_cap: the configured hard depth limit
    exit_code = convergence-check.py(depth, depth_scores, cap=depth_cap)

    if exit_code == 0:
        break    # converged; exit is DETERMINISTIC, not LLM-decided

    depth += 1

plan_final = plan_N

# On Hephaestus FAIL: reset depth = 0, route back to stage 3

Temper's exit is deterministic: convergence-check.py decides, not an LLM. The loop exits as soon as any one of three conditions holds: score variance ≤ 3 over the last two depths (stable plateau), score delta ≤ 2 between consecutive depths (marginal gain), or depth ≥ cap (hard limit). The condition that fired is recorded in the rollup as convergence_reason. On Hephaestus FAIL, Temper resets depth = 0 and routes back to stage 3 (Enrich) with the failure folded into Metis directives.
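
The three exit conditions can be sketched as a pure function. This is not the real convergence-check.py, whose source is not shown here. It assumes that "variance" means population variance over the last two scores, that "delta" is the signed gain (latest minus previous), that conditions are tested in the order listed, and that a non-converged run returns exit code 1. With only two data points the variance and delta tests overlap heavily, so under these assumptions the delta branch fires mainly on regressions.

```python
from statistics import pvariance
from typing import List, Optional


def convergence_reason(depth: int, scores: List[int], cap: int) -> Optional[str]:
    """Return 'variance', 'delta', or 'cap' when the loop should exit, else None."""
    if len(scores) >= 2:
        prev, curr = scores[-2], scores[-1]
        if pvariance([prev, curr]) <= 3:   # stable plateau
            return "variance"
        if curr - prev <= 2:               # marginal (or negative) gain
            return "delta"
    if depth >= cap:                       # hard limit
        return "cap"
    return None


def exit_code(depth: int, scores: List[int], cap: int) -> int:
    """0 means converged (stop deepening); 1 is an assumed 'keep going' code."""
    return 0 if convergence_reason(depth, scores, cap) is not None else 1
```

Because the function depends only on its numeric inputs, the same scores always produce the same exit, which is the determinism property the loop relies on.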

07 Prior Art

Credit where it's due

Anneal's architecture pulls from several sources. None of this was invented from scratch.

Source	What Anneal borrowed
oh-my-openagent	The Greek-god agent taxonomy (Metis, Momus, Oracle, Prometheus, Hephaestus, Atlas). Verdict tiers (SAFE / CAUTION / RISKY / BLOCK) and the parallel-agent review pattern are borrowed wholesale.
Aider	Terminal-first ergonomics. Zero-ceremony invocation. Anneal is plan-first rather than edit-first but shares the "just type and go" philosophy.
Ralph	The unbounded-re-loop discipline. "The boulder never stops." Anneal's stage-7 simultaneous-pass gate is Ralph-shaped: never emit a partial result, always loop until coherence.
SADD (context-engineering-kit)	The primitive vocabulary (`launch-sub-agent`, `do-in-parallel`, `do-and-judge`, `tree-of-thoughts`) that Temper in particular composes. The deepen loop is SADD's `do-and-judge` wrapped in a convergence check.
ValidationForge	Hephaestus is a ValidationForge runner. The evidence quality rules, the no-mocks mandate, and the preflight discipline all come from VF.
multi-agent-consensus	Alloy's tournament is intellectually adjacent to multi-agent-consensus. Where consensus runs three agents as a unanimous gate at execution time, Alloy runs N agents as a consensus-blend at planning time.

"New variants can be added at stage 4 alone, with no changes to stages 1–3 or 5–7. This is the same philosophy that makes Unix pipelines composable."
Shared Contract · Anneal v0.1.0
