Anneal · Variant

Alloy — Tournament Consensus

N biased planners compete in parallel. A synthesizer blends their best material into one plan, surfaces every tradeoff, and records which variant contributed which phase. Everything downstream is identical to Cast.

[Diagram: METIS → Stage 3 → five Prometheus-Alloy planners (bias=correctness, bias=minimalist, bias=defensive, bias=performance, bias=ux) run via xargs -P (parallel) → SYNTHESIZER (blend + provenance) → MOMUS (audits blend) → ATLAS (emit / re-loop)]

Seven lenses, one task

Every Prometheus-Alloy variant receives the same Metis directives and the same probe report — but a different lens. The lens is a single word that shifts what the planner optimizes for.

| Bias | Optimizes for | Tends to produce |
| --- | --- | --- |
| correctness | Exhaustive gate tests, phased rollout, every success criterion measurable | Long plans with explicit acceptance tests at every phase |
| minimalist | Smallest viable plan, strip ceremony, YAGNI to the bone | Short plans, few files, minimum viable phases |
| defensive | Rollback at every phase, checkpoint before risk, fail-safe defaults | Plans with backwards-compat envelopes and feature flags |
| performance | Vendor only what's used, prune speculative infra, hot-path awareness | Plans that remove things as often as they add things |
| ux | Status-line progress, helpful error messages, friendly failure paths | Plans with explicit error-message phases and telemetry |
| verification | Instrument-before-theorize, reproducibility, every claim a metric | Plans that add observability as a phase-0 prerequisite |
| migration | Every breaking change has a migration step, backwards compat, version gates | Plans with dual-read / dual-write transition phases |

N selection: configurable via --versions N in range [2, 7]. Default is 5. The plugin refuses --versions 1 ("use Cast for single-planner work") and --versions 8+ ("synthesizer signal-to-noise collapses beyond 7").

| N | Biases included |
| --- | --- |
| 2 | correctness, minimalist |
| 3 | correctness, minimalist, defensive |
| 5 (default) | correctness, minimalist, defensive, performance, ux |
| 7 | all five plus verification and migration |
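
The selection rule above can be sketched as a small validator. This is a minimal illustration, not the plugin's code: `select_biases` is a hypothetical helper, and taking a prefix of the bias order for N=4 and N=6 is an assumption, since the table only lists N = 2, 3, 5, and 7.

```python
# Bias order as listed in the table above. The prefix rule for N=4 and N=6
# is an assumption; the docs only define N = 2, 3, 5, 7.
BIAS_ORDER = [
    "correctness", "minimalist", "defensive",
    "performance", "ux", "verification", "migration",
]
MIN_N, MAX_N, DEFAULT_N = 2, 7, 5


def select_biases(n: int = DEFAULT_N) -> list[str]:
    """Return the bias lenses for a tournament of size n (hypothetical helper)."""
    if n < MIN_N:
        # Refusal message quoted from the docs for --versions 1.
        raise ValueError("use Cast for single-planner work")
    if n > MAX_N:
        # Refusal message quoted from the docs for --versions 8+.
        raise ValueError("synthesizer signal-to-noise collapses beyond 7")
    return BIAS_ORDER[:n]


if __name__ == "__main__":
    for n in (2, 3, 5, 7):
        print(n, select_biases(n))
```

Under this assumed rule, each documented N reproduces the matching row of the table.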

What the Synthesizer evaluates

The Synthesizer reads all N variant plans and scores each section against a shared rubric. It does not pick one variant as the winner; it picks the strongest version of each section across the variants.

| Dimension | What counts as strong |
| --- | --- |
| Correctness | Every phase has an explicit success criterion and a measurable gate |
| Completeness | No implicit dependencies; every required service, env var, permission named |
| Risk posture | Breaking changes flagged, rollback strategy explicit, feature flags where risk is real |
| Scope hygiene | Out-of-scope section present and honest. No "we could also…" creep |
| Evidence plan | Hephaestus can actually validate Phase N: it has a buildable target |
| Sequencing | Phases topologically ordered by dependency; no forward references |
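
One way to picture per-section scoring is the sketch below. The 0 to 5 scale, the unweighted sum, and the names `SectionScore` and `strongest_per_phase` are all assumptions made for illustration; the docs do not say how the Synthesizer weighs the dimensions.

```python
from dataclasses import dataclass

# The six rubric dimensions from the table above.
DIMENSIONS = (
    "correctness", "completeness", "risk_posture",
    "scope_hygiene", "evidence_plan", "sequencing",
)


@dataclass
class SectionScore:
    variant: str            # e.g. "variant-3-defensive"
    phase: str              # e.g. "phase-00-preflight"
    scores: dict[str, int]  # dimension -> 0..5 (the scale is an assumption)

    @property
    def total(self) -> int:
        # Unweighted sum; any real weighting is not documented.
        return sum(self.scores.get(d, 0) for d in DIMENSIONS)


def strongest_per_phase(sections: list[SectionScore]) -> dict[str, SectionScore]:
    """Keep the top-scoring variant for each phase, not one overall winner."""
    best: dict[str, SectionScore] = {}
    for section in sections:
        current = best.get(section.phase)
        if current is None or section.total > current.total:
            best[section.phase] = section
    return best
```

The point of the sketch is the grouping key: the comparison happens per phase, so different phases can be won by different variants.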

How the blend happens

The Synthesizer folds N variants in three passes. Each pass has a distinct goal: align structure, resolve contradictions, record provenance.

  1. Pass 1 — Structural alignment
    All N variants share the same section structure because they all received the same Metis directives. The Synthesizer builds a phase-by-phase comparison matrix and picks the strongest version of each phase across variants. Example: for Phase 0 (preflight), the defensive variant's structure is adopted, with the UX variant's user-facing error path folded in.
  2. Pass 2 — Contradiction resolution
    When variants disagree — e.g. "use Redis for presence" vs. "use Postgres LISTEN/NOTIFY" — the Synthesizer falls back to Metis directives. If Metis named a service, pick that. If Metis was silent, prefer the variant that matches the probe report's detected stack. If still tied, surface the contradiction in synthesis-provenance.md and let the human decide.
  3. Pass 3 — Provenance attribution
    Every phase in the blended plan carries a synthesis-provenance annotation. It names the primary source variant, any secondary source, and lists contradictions with their resolution rationale. Provenance exists for audit: if the blended plan is wrong, you need to know which variant introduced the defect.
synthesis-provenance.md (excerpt), sidecar file:
phase-00-preflight:
  primary_source: variant-3-defensive
  secondary_source: variant-5-ux (error-path section)
  contradictions: []

phase-03-tenant-isolation-migration:
  primary_source: variant-7-migration
  secondary_source: variant-1-correctness (gate-test section)
  contradictions:
    - variant-2-minimalist recommended single-transaction migration;
      rejected per Metis directive "phase the migration to avoid table lock"
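
The Pass 2 fallback order can be sketched as a short cascade. This is an illustration under stated assumptions: `resolve` and `Resolution` are hypothetical names, options are compared as plain service names, and anything other than exactly one match at a step is treated as "not settled here".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resolution:
    choice: Optional[str]  # None means "escalate to the human"
    rationale: str


def resolve(options: list[str], metis_named: set[str],
            detected_stack: set[str]) -> Resolution:
    """Apply the Pass 2 fallback order to one contradiction (hypothetical)."""
    # 1. Metis directives win when they name exactly one of the options.
    named = [o for o in options if o in metis_named]
    if len(named) == 1:
        return Resolution(named[0], "named in Metis directives")

    # 2. Otherwise prefer the option matching the probe report's detected stack.
    matching = [o for o in options if o in detected_stack]
    if len(matching) == 1:
        return Resolution(matching[0], "matches probe report's detected stack")

    # 3. Still tied: record the contradiction and let the human decide.
    return Resolution(None, "unresolved; surfaced in synthesis-provenance.md")


if __name__ == "__main__":
    # Metis is silent, the probe report detected Postgres.
    print(resolve(["redis", "postgres"], set(), {"postgres"}))
```

Returning a rationale with every choice mirrors the provenance requirement: the reason for a resolution is recorded alongside the resolution itself.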

Budget before you commit

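| Metric | N=3 | N=5 (default) | N=7 |
| --- | --- | --- | --- |
| Agent spawns | ~14 | ~18 | ~22 |
| Wall-clock | ~5 min | ~6 min | ~8 min |
| Worst case (re-loops) | ~18 min | ~22 min | ~28 min |
| Token cost (approx) | ~$0.80 | ~$1.20 | ~$1.70 |
| Disk per run | ~3–7 MB | ~5–10 MB | ~7–14 MB |
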
Alloy is 3–5× more expensive than Cast per run. The cost is justified when the plan shape is genuinely non-obvious. If the scope is clear and tradeoffs are known, run Cast instead.

Three real Alloy runs

Example 1 — Design a plugin system for a CLI

Plugin-system design has at least four reasonable shapes — in-process, out-of-process subprocess, WASM sandbox, or a manifest-and-registry approach. A single planner will pick one and commit; the tournament surfaces the tradeoffs.

invocation (bash):
/anneal-alloy:anneal "Design a plugin system for the CLI with versioned
lifecycle hooks, sandboxed execution, and a plugin discovery marketplace.
Plugins should be installable by name, scoped to a user or project, and
should expose commands, skills, and hooks."

synthesis-provenance.md (excerpt):
phase-04-sandboxed-execution:
  primary_source: variant-3-defensive (WASM-based sandbox)
  secondary_source: variant-4-performance (in-process when plugin is trusted)
  contradictions:
    - variant-1-correctness: mandatory sandbox
    - variant-4-performance: trust-level escape hatch
    → resolved: WASM is default, opt-in escape hatch via manifest flag,
      Oracle will flag escape hatch as deployment risk

phase-06-marketplace-discovery:
  primary_source: variant-5-ux
  secondary_source: variant-1-correctness (signature verification)
  contradictions: []

Output quality vs Cast: Cast on the same task produces a 5-phase plan with an in-process architecture. Alloy's blend produces a 7-phase plan with WASM sandboxing, a signed manifest, and an explicit escape hatch — a shape no single bias would have produced alone.

Example 2 — Replace REST API with GraphQL incrementally

High-stakes, low-reversibility, multiple reasonable approaches (strangler fig, dual-read/dual-write, facade-over-REST). The migration and verification biases specifically earn their keep at N=7.

invocation (bash):
/anneal-alloy:anneal --versions 7 "Replace the existing REST API with GraphQL
incrementally. We have 43 REST endpoints in production serving 2.1M req/day,
no downtime tolerance, and six client teams that each control their own
migration timeline."

Momus auditing the blend, verdict: CAUTION (yaml):
verdict: CAUTION
findings:
  - issue: dual-read phase duration is underspecified
    severity: high
    demand: "Name the exact metric and value: 'cutover when GraphQL p95
             latency < REST p95 latency AND error_rate_delta < 0.1%
             over 7 days.'"
  - issue: client teams' rollout order is not sequenced
    severity: medium
    demand: "Add phase-N-client-rollout-sequencing.md — pin the order
             (start with lowest-traffic client, ratchet to highest)."
  - issue: schema diff tooling is not named
    severity: medium
    demand: "Pin the tool, or declare the contract-test approach."

The blend produces a 9-phase plan with explicit dual-read metrics, client rollout sequencing, and schema-diff tooling pinned to graphql-inspector. Temper on the same task would have converged after two or three depths on a similar shape; Alloy reaches it faster because the biases force the tradeoffs to surface in parallel.

Example 3 — Build a workflow orchestration engine

Pure greenfield. No existing infrastructure in the probe report to anchor on. The plan shape is open — event-sourced vs log-structured, SQL vs KV store, single-process vs distributed. This is where Alloy's breadth advantage is widest.

invocation (bash):
/anneal-alloy:anneal "Build a workflow orchestration engine: durable execution,
step-level retries with backoff, human-in-the-loop pauses, and a web UI to
inspect runs. Inspired by Temporal but we want to own the code."

| Variant | Plan phases | Shape produced |
| --- | --- | --- |
| Cast (single planner) | 6 | Event-sourced, Postgres-backed, single-process, Next.js UI |
| Alloy (N=5) | 9 | Event-sourced + SINGLE_NODE mode + cluster-aware + distinct UI app; a shape no single bias would have produced alone |
| Temper (depth=3) | 7 | Event-sourced, Postgres-backed, single-process, with progressively hardened retry semantics |

On FAIL, route to Intent Gate

Alloy's re-loop is smarter than Cast's. On FAIL, it routes to Intent Gate, not to the Synthesizer. A failed synthesis suggests the bias mix was wrong — re-synthesizing the same N variants produces the same blend. Re-looping through Intent Gate gives Metis a chance to refine directives and lets the orchestrator pick a different N or a different bias set.

| Re-loop type | Trigger | Max iterations |
| --- | --- | --- |
| Stage-4 re-loop | Momus returns BLOCK on the blend; tournament re-runs with findings as constraints | 2 stage-4 re-loops before escalating to full re-loop |
| Full re-loop | Hephaestus FAIL; routes through Intent Gate with failure folded as directives | 3 by default; --loop lifts to unbounded |
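
The two re-loop budgets can be read as a small routing function. The sketch below is an assumption-laden illustration: the event names, the `LoopState` type, the reset of the stage-4 counter on each full re-loop, and the "halt" outcome once the cap is reached are not specified in the docs.

```python
from dataclasses import dataclass

MAX_STAGE4_RELOOPS = 2  # from the table above
MAX_FULL_RELOOPS = 3    # default; --loop lifts the cap


@dataclass
class LoopState:
    stage4_reloops: int = 0
    full_reloops: int = 0


def route(event: str, state: LoopState, unbounded: bool = False) -> str:
    """Return the next destination after a failure (hypothetical sketch).

    event is "momus_block" or "hephaestus_fail" (illustrative names);
    unbounded models the --loop flag.
    """
    if event not in ("momus_block", "hephaestus_fail"):
        raise ValueError(f"unknown event: {event}")

    # Momus BLOCK: re-run the tournament while the stage-4 budget lasts.
    if event == "momus_block" and state.stage4_reloops < MAX_STAGE4_RELOOPS:
        state.stage4_reloops += 1
        return "tournament"

    # Hephaestus FAIL, or stage-4 budget spent: full re-loop via Intent Gate.
    if unbounded or state.full_reloops < MAX_FULL_RELOOPS:
        state.full_reloops += 1
        state.stage4_reloops = 0  # assumption: budget resets per full re-loop
        return "intent_gate"

    return "halt"  # assumption: behavior at the cap is not documented
```

A third consecutive Momus BLOCK therefore escalates to Intent Gate instead of re-running the tournament again.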

Know the edges

  1. N is capped at 7 for a reason
    The synthesizer's attention budget is finite; beyond 7 variants the blend starts to average instead of integrate. If your task needs N=10 you're probably using the wrong variant — try Temper with depth 5 instead.
  2. The synthesizer is a single agent
    If it makes an integration mistake, all N variants' quality is wasted. Momus is the backstop, but it only audits the blend, not the synthesizer's reasoning. Read synthesis-provenance.md whenever a blended plan feels off.
  3. Bias selection is heuristic, not optimal
    The N=5 default is a good general-purpose set, but some tasks would benefit from a custom bias list (e.g. "correctness + migration + verification" for a schema migration). Custom bias lists are a v0.3.0 feature.
  4. Tournament parallelism requires a real shell
    xargs -P is supported by both GNU and BSD xargs, but some minimal container images omit it. If both nproc and sysctl -n hw.ncpu fail, the orchestrator falls back to sequential execution, and Alloy's wall-clock doubles.
  5. The synthesizer can invent phases
    In ~3% of runs the synthesizer produces a phase not present in any variant — it invented it from contradictions. These phases are flagged in synthesis-provenance.md as primary_source: synthesizer with no secondary source. Read them skeptically.
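
For the last edge, a quick scan of the sidecar can list the phases to read skeptically. This sketch assumes the layout shown in the excerpts (a `phase-...:` line followed by an indented `primary_source:` line); `invented_phases` is a hypothetical helper and the sample phase names are made up for illustration.

```python
import re

# Matches a top-level phase key such as "phase-00-preflight:".
PHASE_RE = re.compile(r"^(phase-[\w-]+):\s*$")
# Matches an indented "primary_source: <value>" line and captures the first token.
SOURCE_RE = re.compile(r"^\s+primary_source:\s*(\S+)")


def invented_phases(provenance_text: str) -> list[str]:
    """List phases whose primary_source is the synthesizer itself."""
    current = None
    flagged = []
    for line in provenance_text.splitlines():
        phase = PHASE_RE.match(line)
        if phase:
            current = phase.group(1)
            continue
        source = SOURCE_RE.match(line)
        if source and current and source.group(1) == "synthesizer":
            flagged.append(current)
    return flagged


if __name__ == "__main__":
    sample = (
        "phase-00-preflight:\n"
        "  primary_source: variant-3-defensive\n"
        "  contradictions: []\n"
        "\n"
        "phase-05-audit-log:\n"          # hypothetical phase name
        "  primary_source: synthesizer\n"
        "  contradictions: []\n"
    )
    print(invented_phases(sample))
```

A line-based scan is used instead of a YAML parser because the excerpts include free-text lines that a strict parser may reject.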

What Alloy writes to disk

Alloy writes everything Cast writes, plus the variant files and provenance sidecar. Preserve the variants/ directory — it's the richest learning signal in any Alloy run.

| File | What's in it |
| --- | --- |
| plan/plan.md | Overview ≤80 lines, status, effort, dependencies |
| plan/phase-NN-*.md | Detailed phase files with success criteria and Hephaestus targets |
| rollup.yaml | All envelopes, gate statuses, simultaneous_pass, emission decision |
| {variant}-{run_id}.xml | Opus 4.7 semantic-XML prompt for one-shot execution |
| variants/variant-1-correctness.md | Raw output from the correctness-biased planner |
| variants/variant-2-minimalist.md | Raw output from the minimalist-biased planner |
| variants/variant-N-{bias}.md | One file per bias |
| synthesis-provenance.md | Per-phase attribution, contradictions, resolution rationale |

The tournament exists to reduce planning bias. N biased planners stretch the solution space on purpose. The synthesizer's job is to pick the strongest element from each and fold them into one coherent plan.