Anneal · Docs

Shared Contract

Three architectures. One contract. The difference is stage 4. This is the cross-variant reference — the seven-stage spine, eight invariants, the envelope schema that glues the pipeline together, and the stage-4 divergence that makes Cast, Alloy, and Temper distinct while staying fully interchangeable.

Every run walks seven stages

Stages 1–3 and 5–7 are identical across Cast, Alloy, and Temper. Only stage 4 differs. The command files in each variant describe the same agents — Metis at stage 3, Red-Team Trinity at stage 5, Hephaestus at stage 6, Atlas at stage 7.

Stage 01
Intent Gate
Classify task · Reject unsafe inputs
Stage 02
Probe
Scan codebase · Enumerate skills · Read docs
Stage 03
Enrich
Metis flags ambiguity · Emits directives
Stage 04
Plan
VARIANT-SPECIFIC · Cast / Alloy / Temper
Stage 05
Review
Red-Team Trinity (parallel) → Oracle
Stage 06
Validate
Hephaestus builds + exercises artifact
Stage 07
Emit / Re-loop
Atlas writes XML + plan dir · or · re-loops
Why only stage 4? Keeping stages 1–3, 5–7 identical means a run's quality is comparable across variants. Same probe, same Metis, same Red Team, same Oracle, same Hephaestus. If Alloy produces a stronger plan than Cast on the same task, the difference is genuinely in the tournament — not in upstream noise. New variants can be added at stage 4 alone, with no changes to stages 1–3 or 5–7.
Variant | Stage-4 shape
--- | ---
Cast | One Prometheus call. Momus audits the single plan.
Alloy | N parallel Prometheus-Alloy calls (biased) → Synthesizer blends → Momus audits the blend.
Temper | Prometheus-Temper rewrites once per depth; Red-Team Trinity runs at every depth; Momus scores 0–100 per depth; convergence-check.py decides exit.
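The shape described above, six shared stages around one swappable stage 4, can be sketched as a small dispatch table. This is only an illustration: `make_stage`, `STAGE_4`, and `run_pipeline` are hypothetical names, and each stage here just records that it ran rather than doing real work.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in: a stage takes a context dict and returns a new one.
Stage = Callable[[Dict], Dict]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only appends its name to the trace."""
    def run(ctx: Dict) -> Dict:
        return {**ctx, "trace": ctx.get("trace", []) + [name]}
    return run


# Stages 1-3 and 5-7 are shared by every variant.
SHARED_BEFORE: List[Stage] = [make_stage(n) for n in ("intent_gate", "probe", "enrich")]
SHARED_AFTER: List[Stage] = [make_stage(n) for n in ("review", "validate", "emit")]

# Adding a variant means adding one entry here and nothing else.
STAGE_4: Dict[str, Stage] = {
    "cast": make_stage("plan:cast"),
    "alloy": make_stage("plan:alloy"),
    "temper": make_stage("plan:temper"),
}


def run_pipeline(variant: str, task: str) -> List[str]:
    """Walk all seven stages and return the names of the stages that ran."""
    ctx: Dict = {"task": task}
    for stage in [*SHARED_BEFORE, STAGE_4[variant], *SHARED_AFTER]:
        ctx = stage(ctx)
    return ctx["trace"]
```

Because only the `STAGE_4` entry differs between runs, two traces for the same task differ in exactly one position, which is the comparability property the contract relies on.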

Non-negotiable constraints

The commands refuse flags that would break these invariants. Invalid plans are re-looped, not accepted. These rules apply identically to every variant; none can be disabled.

INVARIANT 01
Red Team Trinity Always Runs
Security · Scope · Assumptions — three parallel adversaries. No flag disables this.

Cast — runs once at stage 5.
Alloy — runs once at stage 5 on the blend.
Temper — runs at every depth of the deepen loop.

Dispatch mechanics are load-bearing: in a single assistant message, emit three Task tool calls (one per adversary). Do not set run_in_background: true — that breaks the pipeline.
INVARIANT 02
Functional Validation Always Runs
Hephaestus builds the real artifact in a scratch worktree, captures real build output, real runtime output (CLI stdout/stderr, API responses, screenshots if UI), compares against the plan's success criteria, and returns PASS or FAIL with evidence cited.

Evidence quality rules — empty files are INVALID; "Build succeeded" without the log line is INVALID; screenshots of blank pages are INVALID. No mocks, no test files, no stubs. Ever.
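Two of the evidence rules above (no empty files, no success claim without the supporting log line) are mechanical enough to sketch. The function name, its parameters, and the message strings are assumptions made for illustration; the blank-screenshot rule would need image inspection and is not covered here.

```python
from pathlib import Path
from typing import List, Optional


def evidence_problems(claim: str, log_path: Path, quoted_line: Optional[str]) -> List[str]:
    """Return reasons the evidence is INVALID; an empty list means none were found."""
    problems: List[str] = []
    # Rule: empty (or missing) evidence files are invalid.
    if not log_path.is_file() or log_path.stat().st_size == 0:
        problems.append("evidence file is missing or empty")
        return problems
    # Rule: a claim such as "Build succeeded" must quote a line from the real log.
    if not quoted_line:
        problems.append(f"claim {claim!r} cites no log line")
    elif quoted_line not in log_path.read_text(errors="replace"):
        problems.append("cited log line does not appear in the captured log")
    return problems
```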
INVARIANT 03
Dual Output — XML Prompt + Plan Directory
Every successful run produces two artifacts:

XML prompt — an Opus 4.7 semantic-XML prompt per _shared/opus-47-xml-schema.md. One-shot, designed to be pasted into a fresh Claude Code session.

Plan directory: plan/plan.md + plan/phase-NN-*.md for humans to review, edit, and share.

Both ship. The XML is for machines. The plan directory is for humans.
INVARIANT 04
Skill Enrichment
The probe stage scans ~/.claude/skills/ and the project's .claude/skills/. Matching skills inject automatically — if the project has a functional-validation skill, Hephaestus uses it; if the user has a scout skill, probe uses it.

Injection is name-based: skill names match semantic roles. No config required.
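A minimal sketch of name-based injection follows. It assumes each skill is a directory named after the skill, and that the project directory overrides the user directory when both define the same name; neither detail is stated above. The `roles` mapping from agent to skill name is hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable


def discover_skills(roots: Iterable[Path]) -> Dict[str, Path]:
    """Map skill name to its directory; later roots override earlier ones (assumed)."""
    found: Dict[str, Path] = {}
    for root in roots:
        if not root.is_dir():
            continue  # a missing skills directory is not an error
        for entry in sorted(root.iterdir()):
            if entry.is_dir():
                found[entry.name] = entry
    return found


def inject(roles: Dict[str, str], skills: Dict[str, Path]) -> Dict[str, Path]:
    """Give each agent the skill whose name matches its role, if one exists."""
    return {agent: skills[name] for agent, name in roles.items() if name in skills}


if __name__ == "__main__":
    roots = [Path.home() / ".claude" / "skills", Path(".claude") / "skills"]
    roles = {"hephaestus": "functional-validation", "probe": "scout"}  # hypothetical
    print(inject(roles, discover_skills(roots)))
```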
INVARIANT 05
Unbounded Re-loop on FAIL
A failure folds into the next run's constraints; it is not counted as a terminal state. The re-loop shape is variant-specific:

Cast — failure folds as a new Metis directive; Prometheus re-plans.
Alloy — full re-loop through Intent Gate; tournament re-runs with new directives.
Temper — reset depth = 0; deepen loop re-runs with augmented directives.

The default cap is 3 iterations; the --loop flag removes the cap and makes the re-loop unbounded.
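The re-loop discipline can be sketched as a loop that carries failures forward as constraints. `run_with_reloop` and the `attempt` callback are hypothetical; `unbounded` stands in for the --loop flag, and what happens once the default cap is reached is not specified above, so this sketch simply reports failure.

```python
from typing import Callable, List, Tuple


def run_with_reloop(
    attempt: Callable[[List[str]], Tuple[bool, str]],
    unbounded: bool = False,  # stands in for the --loop flag
    cap: int = 3,             # default cap from the text above
) -> Tuple[bool, int, List[str]]:
    """Retry until the attempt passes, folding each failure into the next run."""
    constraints: List[str] = []
    iteration = 0
    while unbounded or iteration < cap:
        iteration += 1
        passed, failure = attempt(constraints)
        if passed:
            return True, iteration, constraints
        constraints.append(failure)  # the failure becomes a constraint, not an end state
    return False, iteration, constraints
```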
INVARIANT 06
Parallelization by Default
Parallelism is the default mode, not an optimisation:

Red Team Trinity always fans out — three Task calls in one message.
Alloy's N planners fan out via xargs -P $(sysctl -n hw.ncpu || nproc).
Temper's Red Team fans out even inside the deepen loop.

Sequential execution is a fallback, not a design choice.
INVARIANT 07
Category Routing, Not Model Picking
The user specifies --type ultrabrain | deep | quick. The harness maps the category to a model at runtime. Plugins do not hardcode model identifiers like claude-opus-4-7 — they declare category requirements, and the runtime resolves them.

This means model upgrades propagate automatically without touching plugin code.
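To illustrate why upgrades propagate without plugin changes, here is a sketch in which plugins name only a category and a separate routing table owns the model identifiers. The table contents, the per-agent category assignments, and `resolve_model` are all hypothetical.

```python
from typing import Dict, Mapping

CATEGORIES = ("ultrabrain", "deep", "quick")

# Plugin side: declares a category, never a model identifier (assignments are hypothetical).
PLUGIN_REQUIREMENTS: Dict[str, str] = {"prometheus": "ultrabrain", "metis": "deep"}


def resolve_model(category: str, routing_table: Mapping[str, str]) -> str:
    """Runtime side: map a category to whatever model the table currently names."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown --type {category!r}; expected one of {CATEGORIES}")
    return routing_table[category]
```

Swapping the routing table changes what `resolve_model` returns for every plugin at once, while `PLUGIN_REQUIREMENTS` stays untouched.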
INVARIANT 08
Dual Prompts by Model Family
Agents ship Claude-flavored and GPT-flavored prompts in the same SKILL.md / agent.md file. The runtime picks at dispatch time based on which model family the category resolves to.

This isolates prompt-format differences (tool-call syntax, reasoning-tag conventions) from the agent's semantic role. Agents are portable; formats are not.
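Dispatch-time prompt selection might look like the following. How the runtime actually detects a model family is not described above; the prefix lookup, the `AGENT_PROMPTS` structure, and the placeholder prompt strings are assumptions.

```python
from typing import Dict

# Hypothetical in-memory form of one agent file carrying both prompt flavors.
AGENT_PROMPTS: Dict[str, Dict[str, str]] = {
    "momus": {"claude": "<claude-flavored prompt>", "gpt": "<gpt-flavored prompt>"},
}

# Assumption: the family can be read off an identifier prefix.
FAMILY_PREFIXES = {"claude": "claude", "gpt": "gpt"}


def model_family(model_id: str) -> str:
    for prefix, family in FAMILY_PREFIXES.items():
        if model_id.startswith(prefix):
            return family
    raise ValueError(f"no prompt flavor known for model {model_id!r}")


def pick_prompt(agent: str, model_id: str) -> str:
    """Same agent, same role; only the prompt format follows the model family."""
    return AGENT_PROMPTS[agent][model_family(model_id)]
```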

The shared data contract between agents

Every reviewer in Anneal — Metis, Momus, each Red-Team Trinity member, Oracle — returns an envelope in a shared schema. The schema lives in _shared/plan-reviewer-schema.md.

envelope schema · plan-reviewer-schema.md · yaml
agent: <agent-name>                       # e.g. "metis", "redteam-security"
run_id: <run-id>
depth: <int>                              # Temper only; optional
verdict: SAFE | CAUTION | RISKY | BLOCK
score: <int 0-100>                        # Temper/Momus only; optional
summary: <string, ≤240 chars>
findings:
  - location: <file-path or section>
    severity: critical | high | medium | low
    concern: <string>
    demand: <imperative sentence>         # what the planner must do
directives:                               # Metis only; imperative sentences
  - <string>
clarifying_questions:                     # BLOCK verdicts only
  - <string>
metadata:
  timestamp: <ISO-8601>
  reviewer_model: <string>
  token_cost: <int>
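A consumer of these envelopes could check the constraints the schema spells out (verdict and severity enumerations, the 0-100 score range, the 240-character summary limit). The sketch below works on an already-parsed dict; treating `agent`, `run_id`, and `summary` as required is an assumption, since the schema marks only `depth` and `score` as optional.

```python
from typing import Any, Dict, List

VERDICTS = {"SAFE", "CAUTION", "RISKY", "BLOCK"}
SEVERITIES = {"critical", "high", "medium", "low"}


def envelope_errors(env: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the envelope looks valid."""
    errors: List[str] = []
    for key in ("agent", "run_id", "summary"):  # assumed required
        if not env.get(key):
            errors.append(f"missing field: {key}")
    if env.get("verdict") not in VERDICTS:
        errors.append("verdict must be one of SAFE, CAUTION, RISKY, BLOCK")
    if len(env.get("summary") or "") > 240:
        errors.append("summary exceeds 240 characters")
    score = env.get("score")
    if score is not None and not (isinstance(score, int) and 0 <= score <= 100):
        errors.append("score must be an integer in 0-100")
    for i, finding in enumerate(env.get("findings") or []):
        if finding.get("severity") not in SEVERITIES:
            errors.append(f"findings[{i}].severity is not a known level")
    return errors
```

An empty dict fails several checks at once, which is the property that lets a missing reviewer response be distinguished from a clean one.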

Verdict semantics

Verdict | Gate behavior
--- | ---
SAFE | Proceed. No concerns.
CAUTION | Proceed. Record findings; downstream reviewers and Oracle aggregate them.
RISKY | Proceed with explicit human override at the Oracle stage.
BLOCK | Do not proceed. If clarifying questions are present, surface and ABORT. Otherwise re-loop with findings folded as new Metis directives.
Overall verdict derivation: The overall verdict for a run is the worst across all envelopes in the final iteration: BLOCK > RISKY > CAUTION > SAFE. Hephaestus's PASS | FAIL maps to SAFE | BLOCK and participates in the same worst-of aggregation.
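The worst-of derivation reduces to a maximum over a fixed ordering. A small sketch, with `overall_verdict` as a hypothetical helper name:

```python
from typing import Iterable

# Ordered from best to worst: BLOCK > RISKY > CAUTION > SAFE.
SEVERITY_ORDER = ["SAFE", "CAUTION", "RISKY", "BLOCK"]
HEPHAESTUS_MAP = {"PASS": "SAFE", "FAIL": "BLOCK"}


def overall_verdict(reviewer_verdicts: Iterable[str], hephaestus: str) -> str:
    """Worst verdict across all reviewer envelopes plus the mapped Hephaestus result."""
    verdicts = list(reviewer_verdicts) + [HEPHAESTUS_MAP[hephaestus]]
    return max(verdicts, key=SEVERITY_ORDER.index)
```

For example, three reviewers returning SAFE, CAUTION, and SAFE with a Hephaestus PASS roll up to CAUTION, while any Hephaestus FAIL rolls up to BLOCK regardless of the reviewers.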

Atlas computes the final rollup

At stage 7, Atlas aggregates all envelopes from the current iteration into a rollup document. The rollup drives the emission decision: EMIT, RE_LOOP, or ABORT.

rollup schema · atlas stage-7 output · yaml
rollup:
  run_id: <run-id>
  architecture: cast | alloy | temper
  overall_verdict: SAFE | CAUTION | RISKY | BLOCK
  gate_status:
    metis: SAFE | CAUTION | RISKY | BLOCK
    momus: SAFE | CAUTION | RISKY | BLOCK
    red_team_trinity: "N/3 PASS"          # e.g. "3/3 PASS"
    oracle: SAFE | CAUTION | RISKY | BLOCK
    hephaestus: PASS | FAIL
  simultaneous_pass: <bool>
  emission_decision: EMIT | RE_LOOP | ABORT
  iteration_count: <int>
  # variant-specific fields:
  depth_final: <int>                      # Temper only
  depth_scores: [<int>, ...]              # Temper only
  convergence_reason: variance | delta | cap  # Temper only
  bias_set: [<string>, ...]               # Alloy only
  synthesis_provenance: <path>            # Alloy only

Emission logic

emission decision algorithm · pseudocode
if simultaneous_pass == true
   AND overall_verdict in {SAFE, CAUTION}:
    emission_decision = EMIT

elif overall_verdict == BLOCK
     AND Metis.clarifying_questions is non-empty:
    emission_decision = ABORT   # surface questions, stop

else:
    emission_decision = RE_LOOP  # fold findings, run again
simultaneous_pass is a coherence check, not a sign-off accumulator. A Momus that green-lit iteration 1 and a Hephaestus that passed iteration 2 do not combine to simultaneous_pass: true. Every gate must land green in the same iteration. If the plan drifted between iterations, the drift shows up here as a re-loop trigger.
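The emission rule and the same-iteration requirement can be sketched together. What counts as a "green" gate status is not defined above; this sketch assumes SAFE, CAUTION, PASS, and "3/3 PASS" are green, and that a gate with no status recorded in the current iteration is not green. All function and constant names are hypothetical.

```python
from typing import Dict, List

REQUIRED_GATES = ("metis", "momus", "red_team_trinity", "oracle", "hephaestus")
GREEN = {"SAFE", "CAUTION", "PASS", "3/3 PASS"}  # assumed definition of "green"


def simultaneous_pass(gates_this_iteration: Dict[str, str]) -> bool:
    """True only if every gate reported green within this single iteration."""
    return all(gates_this_iteration.get(g) in GREEN for g in REQUIRED_GATES)


def emission_decision(sim_pass: bool, overall: str, metis_questions: List[str]) -> str:
    if sim_pass and overall in {"SAFE", "CAUTION"}:
        return "EMIT"
    if overall == "BLOCK" and metis_questions:
        return "ABORT"    # surface questions, stop
    return "RE_LOOP"      # fold findings, run again
```

Passing only the current iteration's statuses into `simultaneous_pass` is what prevents sign-offs from different iterations from being combined.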

Parallel dispatches are load-bearing

The most common mistake in Anneal pipelines is misunderstanding how parallelism works. Dispatch mechanics appear in every command file for good reason.

The canonical dispatch note (paraphrased from every command file): In a SINGLE assistant message, emit three Task tool calls (one per Red-Team adversary). Do NOT set run_in_background: true on any of them — that makes them fire-and-forget and breaks the pipeline. The Task tool already executes multiple calls in one message concurrently; that is where the parallelism comes from. Wait for ALL THREE envelope responses before invoking Oracle. No partial reviews.
correct · parallel fan-out
# One message, three Task calls
# Runtime executes concurrently

s  = Task(agent="redteam-security", ...)
sc = Task(agent="redteam-scope", ...)
a  = Task(agent="redteam-assumptions", ...)

# wait for all three
Oracle(envelopes=[s, sc, a])
wrong · fire-and-forget
# run_in_background: true breaks pipeline
# dispatches return immediately (no result)

Task(agent="redteam-security",
     run_in_background=true)  # ← WRONG
Task(agent="redteam-scope",
     run_in_background=true)  # ← WRONG
Task(agent="redteam-assumptions",
     run_in_background=true)  # ← WRONG

# Oracle gets empty inputs
Oracle()  # reports 3/3 PASS, all empty

The guard against the wrong pattern is twofold: the explicit dispatch note in every command spec, and the emission gate's simultaneous_pass check. Empty envelopes fail the simultaneous-pass check because they cannot produce a valid verdict — the rollup triggers a re-loop, not an emit, surfacing the pipeline error rather than silently shipping an unreviewed plan.

The same single-message pattern applies to Alloy's N-variant fan-out. The orchestrator uses xargs -P for CLI parallelism at the shell level, but inside the agent execution context, multiple Task calls in one message is the primitive.
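As an analogy only, the fan-out-then-join discipline resembles `asyncio.gather` in Python: all three reviews start together, and nothing downstream runs until every result is in. This is not the Task tool's API; `review`, `oracle`, and `stage_5` are hypothetical stand-ins, and counting SAFE or CAUTION as a pass is an assumption.

```python
import asyncio
from typing import Dict, List

ADVERSARIES = ("redteam-security", "redteam-scope", "redteam-assumptions")


async def review(agent: str, plan: str) -> Dict[str, str]:
    """Hypothetical stand-in for one Task call; a real reviewer would inspect the plan."""
    await asyncio.sleep(0.01)  # simulate work so the three reviews overlap
    return {"agent": agent, "verdict": "SAFE", "summary": f"reviewed {len(plan)} chars"}


def oracle(envelopes: List[Dict[str, str]]) -> str:
    """Refuse partial reviews instead of reporting on empty inputs."""
    if len(envelopes) != len(ADVERSARIES) or not all(e.get("verdict") for e in envelopes):
        raise ValueError("partial review: Oracle needs all three envelopes")
    passed = sum(e["verdict"] in {"SAFE", "CAUTION"} for e in envelopes)
    return f"{passed}/3 PASS"


async def stage_5(plan: str) -> str:
    # gather() schedules all three coroutines and returns only when every one is done.
    envelopes = await asyncio.gather(*(review(a, plan) for a in ADVERSARIES))
    return oracle(list(envelopes))
```

The fire-and-forget failure mode corresponds to calling `oracle` without awaiting the gathered results, which this sketch turns into an explicit error rather than a false "3/3 PASS".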

Where the variants diverge

Cast — single pass

cast stage-4 · pseudocode
# Stage 4: single planner, single auditor
Prometheus-Cast(task, metis_directives, probe_report)
  → plan.md + phase-*.md

Momus(plan) → envelope

if momus.verdict == BLOCK:
    fold findings as Metis directive
    re-loop once                          # then escalate

Alloy — tournament

alloy stage-4 · pseudocode
# Stage 4: N biased planners run in parallel
bias_set = select_biases(N)
# e.g. ["correctness","minimalist","defensive","performance","ux"]

# parallel via xargs -P $(sysctl -n hw.ncpu || nproc)
for i, bias in enumerate(bias_set):
    Prometheus-Alloy(task, metis_directives, probe_report,
                     bias=bias)
      → variant-{i}-{bias}.md

# wait for all N variants

Synthesizer(variants, metis_directives, probe_report)
  → plan.md + phase-*.md + synthesis-provenance.md

Momus(plan) → envelope    # audits the blend, NOT the variants

if momus.verdict == BLOCK:
    regenerate tournament with Momus findings as constraints
    max 2 stage-4 re-loops, then escalate to full re-loop
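The bounded stage-4 retry at the end of the Alloy pseudocode can be sketched as follows. `alloy_stage_4`, `run_tournament`, and `momus` are hypothetical callables; the sketch assumes "max 2 stage-4 re-loops" means one initial tournament plus at most two regenerations, and that Momus findings are carried forward through their `demand` fields.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Envelope = Dict[str, Any]


def alloy_stage_4(
    run_tournament: Callable[[List[str]], str],
    momus: Callable[[str], Envelope],
    max_reloops: int = 2,
) -> Tuple[str, Optional[str]]:
    """Run the tournament, regenerating on BLOCK up to max_reloops times."""
    constraints: List[str] = []
    for _ in range(1 + max_reloops):
        plan = run_tournament(constraints)  # fan-out plus synthesis, abstracted away
        envelope = momus(plan)              # Momus audits the blend
        if envelope["verdict"] != "BLOCK":
            return "proceed", plan
        # Fold Momus findings into the next tournament as constraints.
        constraints.extend(f["demand"] for f in envelope.get("findings", []))
    return "escalate", None                 # hand over to the full re-loop
```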

Temper — deepen loop

temper stage-4 · pseudocode
# Stage 4: fixed-point deepen loop
# N denotes the current value of depth; depth_cap is the hard depth limit
depth = 0
depth_scores = []

loop:
    if depth == 0:
        plan_0 = Prometheus-Temper(task, metis_directives, probe_report)
    else:
        plan_N = Prometheus-Temper(
            task, metis_directives, probe_report,
            prior_plan=plan_{N-1},
            prior_momus=momus_envelope_{N-1},
            prior_redteam=redteam_envelopes_{N-1},
            depth_scores=depth_scores
        )

    # Red Team fans out INSIDE the loop (3 Task calls, one message)
    redteam_envelopes_N = [redteam-security,
                           redteam-scope,
                           redteam-assumptions](plan_N)

    momus_envelope_N = Momus(plan_N)      # includes score 0-100
    depth_scores.append(momus_envelope_N.score)

    exit_code = convergence-check.py(depth, depth_scores, cap=depth_cap)

    if exit_code == 0:
        break    # converged: exit is DETERMINISTIC, not LLM-decided

    depth += 1

plan_final = plan_N

# On Hephaestus FAIL: reset depth = 0, route back to stage 3
Temper's exit is deterministic: convergence-check.py decides, not an LLM. Any one of three exit conditions ends the loop: score variance ≤ 3 over the last two depths (stable plateau), score delta ≤ 2 between iterations (marginal gain), or depth ≥ cap (hard limit). The condition that fired is recorded as convergence_reason in the rollup. On Hephaestus FAIL, Temper resets depth = 0 and routes back to stage 3 (Enrich) with the failure folded into Metis directives.
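The real convergence-check.py is not reproduced here; the following is a guess at its core logic under stated assumptions. Variance is taken as the population variance of the last two scores, delta is read as the signed gain between the last two depths (so a regression also stops the loop), the conditions are checked in the order variance, delta, cap, and a non-converged run returns exit code 1. Only "0 means converged" comes from the pseudocode above.

```python
import statistics
from typing import List, Optional


def convergence_reason(depth: int, scores: List[int], cap: int) -> Optional[str]:
    """Return which exit condition fired, or None if the loop should continue."""
    if len(scores) >= 2:
        previous, latest = scores[-2], scores[-1]
        if statistics.pvariance([previous, latest]) <= 3:
            return "variance"  # stable plateau
        if latest - previous <= 2:
            return "delta"     # marginal (or negative) gain
    if depth >= cap:
        return "cap"           # hard limit
    return None


def exit_code(depth: int, scores: List[int], cap: int) -> int:
    """0 means converged; the non-zero value used here is an assumption."""
    return 0 if convergence_reason(depth, scores, cap) else 1
```

Since the function depends only on integers, the same inputs always produce the same exit, which is the determinism the contract asks for.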

Credit where it's due

Anneal's architecture pulls from several sources. None of this was invented from scratch.

Source | What Anneal borrowed
--- | ---
oh-my-openagent | The Greek-god agent taxonomy (Metis, Momus, Oracle, Prometheus, Hephaestus, Atlas). Verdict tiers (SAFE / CAUTION / RISKY / BLOCK) and the parallel-agent review pattern are borrowed wholesale.
Aider | Terminal-first ergonomics. Zero-ceremony invocation. Anneal is plan-first rather than edit-first but shares the "just type and go" philosophy.
Ralph | The unbounded-re-loop discipline. "The boulder never stops." Anneal's stage-7 simultaneous-pass gate is Ralph-shaped: never emit a partial result, always loop until coherence.
SADD (context-engineering-kit) | The primitive vocabulary (launch-sub-agent, do-in-parallel, do-and-judge, tree-of-thoughts) that Temper in particular composes. The deepen loop is SADD's do-and-judge wrapped in a convergence check.
ValidationForge | Hephaestus is a ValidationForge runner. The evidence quality rules, the no-mocks mandate, and the preflight discipline all come from VF.
multi-agent-consensus | Alloy's tournament is intellectually adjacent to multi-agent-consensus. Where consensus runs three agents as a unanimous gate at execution time, Alloy runs N agents as a consensus-blend at planning time.

"New variants can be added at stage 4 alone, with no changes to stages 1–3 or 5–7. This is the same philosophy that makes Unix pipelines composable."
Shared Contract · Anneal v0.1.0