Anneal · Variant

Alloy — Tournament Consensus

N biased planners compete in parallel. A synthesizer blends their best material into one plan, surfaces every tradeoff, and records which variant contributed which phase. Everything downstream is identical to Cast.

01 Bias Selection — The N Lenses

Seven lenses, one task

Every Prometheus-Alloy variant receives the same Metis directives and the same probe report — but a different lens. The lens is a single word that shifts what the planner optimizes for.

Bias	Optimizes for	Tends to produce
`correctness`	Exhaustive gate tests, phased rollout, every success criterion measurable	Long plans with explicit acceptance tests at every phase
`minimalist`	Smallest viable plan, strip ceremony, YAGNI to the bone	Short plans, few files, minimum viable phases
`defensive`	Rollback at every phase, checkpoint before risk, fail-safe defaults	Plans with backwards-compat envelopes and feature flags
`performance`	Vendor only what's used, prune speculative infra, hot-path awareness	Plans that remove things as often as they add things
`ux`	Status-line progress, helpful error messages, friendly failure paths	Plans with explicit error-message phases and telemetry
`verification`	Instrument-before-theorize, reproducibility, every claim a metric	Plans that add observability as a phase-0 prerequisite
`migration`	Every breaking change has a migration step, backwards compat, version gates	Plans with dual-read / dual-write transition phases

N selection: configurable via --versions N in range [2, 7]. Default is 5. The plugin refuses --versions 1 ("use Cast for single-planner work") and --versions 8+ ("synthesizer signal-to-noise collapses beyond 7").

N	Biases included
2	correctness, minimalist
3	correctness, minimalist, defensive
5 (default)	correctness, minimalist, defensive, performance, ux
7	all five plus verification and migration
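
As a minimal sketch of the selection rule above: the function below validates `--versions` and returns the bias set. The ordering comes from the bias table and the refusal messages are quoted from the text. The names `BIAS_ORDER` and `select_biases` are hypothetical, and the prefix rule for N=4 and N=6 (which the table does not list) is an assumption.

```python
# Order taken from the bias table; applying a simple prefix for N=4 and N=6
# is an assumption, since the table only lists N=2, 3, 5 and 7.
BIAS_ORDER = [
    "correctness", "minimalist", "defensive",
    "performance", "ux", "verification", "migration",
]
DEFAULT_VERSIONS = 5


def select_biases(versions: int = DEFAULT_VERSIONS) -> list:
    """Return the biases for a tournament of size `versions` (hypothetical helper)."""
    if versions < 2:
        raise ValueError("use Cast for single-planner work")
    if versions > 7:
        raise ValueError("synthesizer signal-to-noise collapses beyond 7")
    return BIAS_ORDER[:versions]
```

Calling `select_biases(3)` under these assumptions yields correctness, minimalist, and defensive, matching the N=3 row.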

02 Scoring Rubric

What the Synthesizer evaluates

The Synthesizer reads all N variant plans and scores each section against a shared rubric. It does not pick one variant as the winner; for each section, it picks the strongest version across the variants.

Dimension	What counts as strong
Correctness	Every phase has an explicit success criterion and a measurable gate
Completeness	No implicit dependencies; every required service, env var, permission named
Risk posture	Breaking changes flagged, rollback strategy explicit, feature flags where risk is real
Scope hygiene	Out-of-scope section present and honest. No "we could also…" creep
Evidence plan	Hephaestus can actually validate Phase N — it has a buildable target
Sequencing	Phases topologically ordered by dependency; no forward references
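
One way to picture per-section selection against this rubric is the sketch below. It is illustrative only: the 0-5 numeric scale, the equal weighting of dimensions, and the names `SectionScore` and `strongest_per_section` are assumptions, not documented Synthesizer behavior.

```python
from dataclasses import dataclass

# Dimension names mirror the rubric table.
DIMENSIONS = (
    "correctness", "completeness", "risk_posture",
    "scope_hygiene", "evidence_plan", "sequencing",
)


@dataclass
class SectionScore:
    variant: str    # e.g. "variant-3-defensive"
    section: str    # e.g. "phase-00-preflight"
    scores: dict    # assumed 0-5 score per rubric dimension

    @property
    def total(self) -> int:
        # Equal weighting is an assumption; the rubric states no weights.
        return sum(self.scores.get(d, 0) for d in DIMENSIONS)


def strongest_per_section(scored: list) -> dict:
    """Map each section to the variant whose version of it scored highest.

    On a tie, the first variant seen keeps the section.
    """
    best = {}
    for item in scored:
        current = best.get(item.section)
        if current is None or item.total > current.total:
            best[item.section] = item
    return {section: item.variant for section, item in best.items()}
```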

03 Synthesis Algorithm

How the blend happens

The Synthesizer folds N variants in three passes. Each pass has a distinct goal: align structure, resolve contradictions, record provenance.

Pass 1 — Structural alignment
All N variants share the same section structure because they all received the same Metis directives. The Synthesizer builds a phase-by-phase comparison matrix and picks the strongest version of each phase across variants. Example: for Phase 0 (preflight), the defensive variant's structure is adopted, with the UX variant's user-facing error path folded in.
Pass 2 — Contradiction resolution
When variants disagree — e.g. "use Redis for presence" vs. "use Postgres LISTEN/NOTIFY" — the Synthesizer falls back to Metis directives. If Metis named a service, pick that. If Metis was silent, prefer the variant that matches the probe report's detected stack. If still tied, surface the contradiction in synthesis-provenance.md and let the human decide.
Pass 3 — Provenance attribution
Every phase in the blended plan carries a synthesis-provenance annotation. It names the primary source variant, any secondary source, and lists contradictions with their resolution rationale. Provenance exists for audit: if the blended plan is wrong, you need to know which variant introduced the defect.
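
The Pass 2 fallback order can be sketched as a small function. This is a sketch under stated assumptions: `Resolution` and `resolve` are hypothetical names, options are compared as plain strings, and the case where Metis names more than one competing option (not covered in the text) falls through to the probe report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resolution:
    choice: Optional[str]   # None means surfaced for a human, not resolved
    rationale: str


def resolve(options: list, metis_named: set, probe_stack: set) -> Resolution:
    """Apply the Pass 2 fallback order to competing options from the variants."""
    # 1. If Metis named exactly one of the options, pick it.
    named = [o for o in options if o in metis_named]
    if len(named) == 1:
        return Resolution(named[0], "named in Metis directives")
    # 2. Otherwise prefer the option matching the probe report's detected stack.
    detected = [o for o in options if o in probe_stack]
    if len(detected) == 1:
        return Resolution(detected[0], "matches probe report's detected stack")
    # 3. Still tied: record the contradiction and let the human decide.
    return Resolution(None, "surfaced in synthesis-provenance.md for human decision")
```

For the Redis versus Postgres example, `resolve(["redis", "postgres"], set(), {"postgres"})` would choose Postgres because Metis was silent and the probe report detected it.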

synthesis-provenance.md (excerpt, sidecar file)

phase-00-preflight:
  primary_source: variant-3-defensive
  secondary_source: variant-5-ux (error-path section)
  contradictions: []

phase-03-tenant-isolation-migration:
  primary_source: variant-7-migration
  secondary_source: variant-1-correctness (gate-test section)
  contradictions:
    - variant-2-minimalist recommended single-transaction migration;
      rejected per Metis directive "phase the migration to avoid table lock"
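
To illustrate how provenance supports an audit, the sketch below models one entry of the excerpt and lists the variants to inspect when a blended phase turns out to be wrong. The field names follow the excerpt; the class `PhaseProvenance` and the helper `audit_trail` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PhaseProvenance:
    phase: str
    primary_source: str
    secondary_source: Optional[str] = None
    contradictions: list = field(default_factory=list)


def audit_trail(record: PhaseProvenance) -> list:
    """Variants to inspect, primary first, when a blended phase is defective."""
    trail = [record.primary_source]
    if record.secondary_source:
        # Drop the parenthetical note, e.g. "variant-5-ux (error-path section)".
        trail.append(record.secondary_source.split(" (", 1)[0])
    return trail
```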

04 Cost Profile

Budget before you commit

Metric	N=3	N=5 (default)	N=7
Agent spawns	~14	~18	~22
Wall-clock	~5 min	~6 min	~8 min
Worst case (re-loops)	~18 min	~22 min	~28 min
Token cost (approx)	~$0.80	~$1.20	~$1.70
Disk per run	~3–7 MB	~5–10 MB	~7–14 MB

Alloy is 3–5× more expensive than Cast per run. The cost is justified when the plan shape is genuinely non-obvious. If the scope is clear and tradeoffs are known, run Cast instead.
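
A rough way to budget before committing is to look up the table and keep the largest N that fits. The figures below are copied from the table and are approximate; the helper `largest_affordable_n` is hypothetical, and it only knows the three tabulated values of N.

```python
from typing import Optional

# Approximate figures copied from the cost table (N=3, 5, 7 only).
COST_PROFILE = {
    3: {"agent_spawns": 14, "wall_clock_min": 5, "worst_case_min": 18, "token_usd": 0.80},
    5: {"agent_spawns": 18, "wall_clock_min": 6, "worst_case_min": 22, "token_usd": 1.20},
    7: {"agent_spawns": 22, "wall_clock_min": 8, "worst_case_min": 28, "token_usd": 1.70},
}


def largest_affordable_n(max_usd: float, max_minutes: float) -> Optional[int]:
    """Largest tabulated N whose token cost and worst-case time fit the budget."""
    fitting = [
        n for n, row in COST_PROFILE.items()
        if row["token_usd"] <= max_usd and row["worst_case_min"] <= max_minutes
    ]
    return max(fitting) if fitting else None
```

Budgeting against the worst-case row rather than the typical wall-clock is a conservative choice made for this sketch, not something the table prescribes.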

05 Worked Examples

Three real Alloy runs

Example 1 — Design a plugin system for a CLI

Plugin-system design has at least four reasonable shapes — in-process, out-of-process subprocess, WASM sandbox, or a manifest-and-registry approach. A single planner will pick one and commit; the tournament surfaces the tradeoffs.

Invocation (bash)

/anneal-alloy:anneal "Design a plugin system for the CLI with versioned
lifecycle hooks, sandboxed execution, and a plugin discovery marketplace.
Plugins should be installable by name, scoped to a user or project, and
should expose commands, skills, and hooks."

synthesis-provenance.md (excerpt)

phase-04-sandboxed-execution:
  primary_source: variant-3-defensive (WASM-based sandbox)
  secondary_source: variant-4-performance (in-process when plugin is trusted)
  contradictions:
    - variant-1-correctness: mandatory sandbox
    - variant-4-performance: trust-level escape hatch
    → resolved: WASM is default, opt-in escape hatch via manifest flag,
      Oracle will flag escape hatch as deployment risk

phase-06-marketplace-discovery:
  primary_source: variant-5-ux
  secondary_source: variant-1-correctness (signature verification)
  contradictions: []

Output quality vs Cast: Cast on the same task produces a 5-phase plan with an in-process architecture. Alloy's blend produces a 7-phase plan with WASM sandboxing, a signed manifest, and an explicit escape hatch — a shape no single bias would have produced alone.

Example 2 — Replace REST API with GraphQL incrementally

High-stakes, low-reversibility, multiple reasonable approaches (strangler fig, dual-read/dual-write, facade-over-REST). The migration and verification biases specifically earn their keep at N=7.

Invocation (bash)

/anneal-alloy:anneal --versions 7 "Replace the existing REST API with GraphQL
incrementally. We have 43 REST endpoints in production serving 2.1M req/day,
no downtime tolerance, and six client teams that each control their own
migration timeline."

Momus auditing the blend (yaml), verdict: CAUTION

verdict: CAUTION
findings:
  - issue: dual-read phase duration is underspecified
    severity: high
    demand: "Name the exact metric and value: 'cutover when GraphQL p95
             latency < REST p95 latency AND error_rate_delta < 0.1%
             over 7 days.'"
  - issue: client teams' rollout order is not sequenced
    severity: medium
    demand: "Add phase-N-client-rollout-sequencing.md — pin the order
             (start with lowest-traffic client, ratchet to highest)."
  - issue: schema diff tooling is not named
    severity: medium
    demand: "Pin the tool, or declare the contract-test approach."

The blend produces a 9-phase plan with explicit dual-read metrics, client rollout sequencing, and schema-diff tooling pinned to graphql-inspector. Temper on the same task would have converged after two or three depths on a similar shape; Alloy reaches it faster because the biases force the tradeoffs to surface in parallel.
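
The cutover condition Momus demanded can be written as an explicit check. The sketch below is one possible reading: it treats "over 7 days" as "every one of the last 7 daily samples", which is an assumption, and `DailySample` and `ready_for_cutover` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class DailySample:
    graphql_p95_ms: float
    rest_p95_ms: float
    error_rate_delta_pct: float   # GraphQL minus REST error rate, in percentage points


def ready_for_cutover(samples: list, window_days: int = 7) -> bool:
    """True when each of the last `window_days` samples meets both conditions."""
    if len(samples) < window_days:
        return False   # not enough history to judge
    recent = samples[-window_days:]
    return all(
        s.graphql_p95_ms < s.rest_p95_ms and s.error_rate_delta_pct < 0.1
        for s in recent
    )
```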

Example 3 — Build a workflow orchestration engine

Pure greenfield. No existing infrastructure in the probe report to anchor on. The plan shape is open — event-sourced vs log-structured, SQL vs KV store, single-process vs distributed. This is where Alloy's breadth advantage is widest.

Invocation (bash)

/anneal-alloy:anneal "Build a workflow orchestration engine: durable execution,
step-level retries with backoff, human-in-the-loop pauses, and a web UI to
inspect runs. Inspired by Temporal but we want to own the code."

Variant	Plan phases	Shape produced
Cast (single planner)	6	Event-sourced, Postgres-backed, single-process, Next.js UI
Alloy (N=5)	9	Event-sourced + SINGLE_NODE mode + cluster-aware + distinct UI app — a shape no single bias would have produced alone
Temper (depth=3)	7	Event-sourced, Postgres-backed, single-process — progressively hardened retry semantics

06 Re-loop Behavior

On FAIL, route to Intent Gate

Alloy's re-loop is smarter than Cast's. On FAIL, it routes to Intent Gate, not to the Synthesizer. A failed synthesis suggests the bias mix was wrong — re-synthesizing the same N variants produces the same blend. Re-looping through Intent Gate gives Metis a chance to refine directives and lets the orchestrator pick a different N or a different bias set.

Re-loop type	Trigger	Max iterations
Stage-4 re-loop	Momus returns BLOCK on the blend; tournament re-runs with findings as constraints	2 stage-4 re-loops before escalating to full re-loop
Full re-loop	Hephaestus FAIL; routes through Intent Gate with failure folded as directives	3 by default; `--loop` lifts to unbounded
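
The two re-loop budgets in the table can be sketched as a routing decision. This is an illustration, not the orchestrator's code: the event strings, the function `next_route`, and the `"stop"` outcome once the full re-loop budget is spent are all assumptions.

```python
MAX_STAGE4_RELOOPS = 2          # from the table
DEFAULT_MAX_FULL_RELOOPS = 3    # from the table; --loop lifts this limit


def next_route(event: str, stage4_used: int, full_used: int,
               unbounded: bool = False) -> str:
    """Decide where a failed run goes next.

    event is "momus_block" or "hephaestus_fail" (hypothetical labels).
    """
    if event not in ("momus_block", "hephaestus_fail"):
        raise ValueError(f"unknown event: {event}")
    if event == "momus_block" and stage4_used < MAX_STAGE4_RELOOPS:
        return "rerun_tournament"   # Momus findings become constraints
    # Hephaestus failed, or the stage-4 budget is spent: full re-loop.
    if unbounded or full_used < DEFAULT_MAX_FULL_RELOOPS:
        return "intent_gate"        # failure folded in as directives
    return "stop"                   # assumed outcome; not specified in the table
```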

07 Limitations

Know the edges

N is capped at 7 for a reason
The synthesizer's attention budget is finite; beyond 7 variants the blend starts to average instead of integrate. If your task needs N=10 you're probably using the wrong variant — try Temper with depth 5 instead.
The synthesizer is a single agent
If it makes an integration mistake, all N variants' quality is wasted. Momus is the backstop, but it only audits the blend, not the synthesizer's reasoning. Read synthesis-provenance.md whenever a blended plan feels off.
Bias selection is heuristic, not optimal
The N=5 default is a good general-purpose set, but some tasks would benefit from a custom bias list (e.g. "correctness + migration + verification" for a schema migration). Custom bias lists are a v0.3.0 feature.
Tournament parallelism requires a real shell
xargs -P is portable but some minimal container images omit it. If both nproc and sysctl -n hw.ncpu fail, the orchestrator falls back to sequential — and Alloy's wall-clock doubles.
The synthesizer can invent phases
In ~3% of runs the synthesizer produces a phase not present in any variant — it invented it from contradictions. These phases are flagged in synthesis-provenance.md as primary_source: synthesizer with no secondary source. Read them skeptically.
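
Since invented phases are marked with `primary_source: synthesizer`, they are easy to list once the sidecar has been loaded. The sketch below assumes the provenance has already been parsed into a dict keyed by phase name; the loader is out of scope and the function name `invented_phases` is hypothetical.

```python
def invented_phases(provenance: dict) -> list:
    """Return phases whose primary source is the synthesizer itself.

    `provenance` maps phase name -> entry dict with a "primary_source" key,
    mirroring the fields in synthesis-provenance.md.
    """
    return sorted(
        phase for phase, entry in provenance.items()
        if entry.get("primary_source") == "synthesizer"
    )
```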

08 Run Artifacts

What Alloy writes to disk

Alloy writes everything Cast writes, plus the variant files and provenance sidecar. Preserve the variants/ directory — it's the richest learning signal in any Alloy run.

File	What's in it
`plan/plan.md`	Overview ≤80 lines, status, effort, dependencies
`plan/phase-NN-*.md`	Detailed phase files with success criteria and Hephaestus targets
`rollup.yaml`	All envelopes, gate statuses, simultaneous_pass, emission decision
`{variant}-{run_id}.xml`	Opus 4.7 semantic-XML prompt for one-shot execution
`variants/variant-1-correctness.md`	Raw output from the correctness-biased planner
`variants/variant-2-minimalist.md`	Raw output from the minimalist-biased planner
`variants/variant-N-{bias}.md`	One file per bias
`synthesis-provenance.md`	Per-phase attribution, contradictions, resolution rationale
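
Because the variants/ directory is worth preserving, a quick check that every expected variant file exists can be useful before archiving a run. The sketch below assumes variant files are numbered by bias position and live under a run directory; `run_dir` and `missing_variant_files` are hypothetical.

```python
from pathlib import Path


def missing_variant_files(run_dir: Path, biases: list) -> list:
    """Return expected variants/variant-N-{bias}.md paths absent from run_dir."""
    expected = [
        f"variants/variant-{i}-{bias}.md"
        for i, bias in enumerate(biases, start=1)
    ]
    return [rel for rel in expected if not (run_dir / rel).is_file()]
```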

The tournament exists to reduce planning bias. N biased planners stretch the solution space on purpose. The synthesizer's job is to pick the strongest element from each and fold them into one coherent plan.
