Alloy — Tournament Consensus
N biased planners compete in parallel. A synthesizer blends their best material into one plan, surfaces every tradeoff, and records which variant contributed which phase. Everything downstream is identical to Cast.
Seven lenses, one task
Every Prometheus-Alloy variant receives the same Metis directives and the same probe report — but a different lens. The lens is a single word that shifts what the planner optimizes for.
| Bias | Optimizes for | Tends to produce |
|---|---|---|
| correctness | Exhaustive gate tests, phased rollout, every success criterion measurable | Long plans with explicit acceptance tests at every phase |
| minimalist | Smallest viable plan, strip ceremony, YAGNI to the bone | Short plans, few files, minimum viable phases |
| defensive | Rollback at every phase, checkpoint before risk, fail-safe defaults | Plans with backwards-compat envelopes and feature flags |
| performance | Vendor only what's used, prune speculative infra, hot-path awareness | Plans that remove things as often as they add things |
| ux | Status-line progress, helpful error messages, friendly failure paths | Plans with explicit error-message phases and telemetry |
| verification | Instrument-before-theorize, reproducibility, every claim a metric | Plans that add observability as a phase-0 prerequisite |
| migration | Every breaking change has a migration step, backwards compat, version gates | Plans with dual-read / dual-write transition phases |
--versions N in range [2, 7]. Default is 5. The plugin refuses --versions 1 ("use Cast for single-planner work") and --versions 8+ ("synthesizer signal-to-noise collapses beyond 7").
| N | Biases included |
|---|---|
| 2 | correctness, minimalist |
| 3 | correctness, minimalist, defensive |
| 5 (default) | correctness, minimalist, defensive, performance, ux |
| 7 | all five plus verification and migration |
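The range check and the N-to-bias mapping above can be sketched in a few lines. This is a minimal illustration, not the plugin's code: `select_biases` and `BIAS_ORDER` are hypothetical names, and the treatment of N = 4 and N = 6 as prefixes of the same ordering is an assumption, since the table only documents N = 2, 3, 5 and 7.

```python
# Bias order as listed in the table above.
BIAS_ORDER = [
    "correctness", "minimalist", "defensive",
    "performance", "ux", "verification", "migration",
]

MIN_VERSIONS, MAX_VERSIONS, DEFAULT_VERSIONS = 2, 7, 5


def select_biases(n: int = DEFAULT_VERSIONS) -> list[str]:
    """Return the biases for a tournament of size n (hypothetical helper)."""
    if n < MIN_VERSIONS:
        raise ValueError("use Cast for single-planner work")
    if n > MAX_VERSIONS:
        raise ValueError("synthesizer signal-to-noise collapses beyond 7")
    # Assumption: every N is a prefix of BIAS_ORDER. Only N = 2, 3, 5, 7
    # are documented; N = 4 and N = 6 are extrapolated here.
    return BIAS_ORDER[:n]


if __name__ == "__main__":
    for n in (2, 3, 5, 7):
        print(n, select_biases(n))
```

The two error messages are the ones quoted above; everything else about how the plugin parses the flag is left open.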
What the Synthesizer evaluates
The Synthesizer reads all N variant plans and scores each section against a shared rubric. It does not pick one variant as the winner; for each section, it picks the strongest version across variants.
| Dimension | What counts as strong |
|---|---|
| Correctness | Every phase has an explicit success criterion and a measurable gate |
| Completeness | No implicit dependencies; every required service, env var, permission named |
| Risk posture | Breaking changes flagged, rollback strategy explicit, feature flags where risk is real |
| Scope hygiene | Out-of-scope section present and honest. No "we could also…" creep |
| Evidence plan | Hephaestus can actually validate Phase N — it has a buildable target |
| Sequencing | Phases topologically ordered by dependency; no forward references |
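To make per-section selection concrete, here is a minimal sketch. It assumes each variant's section has already been scored per rubric dimension on a 0 to 1 scale and that dimensions are weighted equally; the source specifies neither, and `ScoredSection` and `pick_strongest` are hypothetical names.

```python
from dataclasses import dataclass, field

# Rubric dimensions from the table above, as identifier-style keys.
DIMENSIONS = (
    "correctness", "completeness", "risk_posture",
    "scope_hygiene", "evidence_plan", "sequencing",
)


@dataclass
class ScoredSection:
    variant: str                      # e.g. "variant-3-defensive"
    phase: str                        # e.g. "phase-00-preflight"
    scores: dict = field(default_factory=dict)  # dimension -> score (assumed 0..1)

    def total(self) -> float:
        # Assumption: unweighted sum; missing dimensions count as 0.
        return sum(self.scores.get(d, 0.0) for d in DIMENSIONS)


def pick_strongest(sections):
    """Map each phase to the variant whose section has the highest total."""
    best = {}
    for section in sections:
        current = best.get(section.phase)
        # Ties keep the first section seen.
        if current is None or section.total() > current.total():
            best[section.phase] = section
    return {phase: s.variant for phase, s in best.items()}
```

The point of the sketch is the shape of the result: a mapping from phase to winning variant, not a single overall winner.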
How the blend happens
The Synthesizer folds N variants in three passes. Each pass has a distinct goal: align structure, resolve contradictions, record provenance.
- Pass 1: Structural alignment. All N variants share the same section structure because they all received the same Metis directives. The Synthesizer builds a phase-by-phase comparison matrix and picks the strongest version of each phase across variants. Example: for Phase 0 (preflight), the defensive variant's structure is adopted, with the UX variant's user-facing error path folded in.
- Pass 2: Contradiction resolution. When variants disagree (e.g. "use Redis for presence" vs. "use Postgres LISTEN/NOTIFY"), the Synthesizer falls back to Metis directives. If Metis named a service, pick that. If Metis was silent, prefer the variant that matches the probe report's detected stack. If still tied, surface the contradiction in synthesis-provenance.md and let the human decide.
- Pass 3: Provenance attribution. Every phase in the blended plan carries a synthesis-provenance annotation. It names the primary source variant, any secondary source, and lists contradictions with their resolution rationale. Provenance exists for audit: if the blended plan is wrong, you need to know which variant introduced the defect.
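The Pass 2 fallback order (Metis directive, then probe report, then human) can be sketched as a small function. This is an illustration under assumptions: `resolve_contradiction` is a hypothetical name, and modelling each variant's proposal as a single service string is a simplification of whatever the Synthesizer actually compares.

```python
def _unique_match(options, allowed):
    """Return the only variant whose proposal is in `allowed`, else None."""
    hits = [variant for variant, service in options.items() if service in allowed]
    return hits[0] if len(hits) == 1 else None


def resolve_contradiction(options, metis_services, probe_stack):
    """Resolve one disagreement between variants.

    options:        dict of variant name -> proposed service
    metis_services: set of services named in the Metis directives
    probe_stack:    set of services detected in the probe report
    Returns (winning variant or None, rationale string).
    """
    # 1. Metis named a service: pick the variant proposing it.
    winner = _unique_match(options, metis_services)
    if winner is not None:
        return winner, f"Metis directive names {options[winner]}"
    # 2. Metis was silent: prefer the variant matching the detected stack.
    winner = _unique_match(options, probe_stack)
    if winner is not None:
        return winner, f"matches probe report stack ({options[winner]})"
    # 3. Still tied: leave it for a human to decide.
    return None, "unresolved: surface in synthesis-provenance.md"


if __name__ == "__main__":
    # Hypothetical variant-to-proposal pairing for the presence example.
    options = {"variant-2-minimalist": "postgres", "variant-4-performance": "redis"}
    print(resolve_contradiction(options, set(), {"postgres"}))
```

Returning the rationale alongside the winner mirrors the provenance requirement in Pass 3: every resolution should be explainable after the fact.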
phase-00-preflight:
  primary_source: variant-3-defensive
  secondary_source: variant-5-ux (error-path section)
  contradictions: []
phase-03-tenant-isolation-migration:
  primary_source: variant-7-migration
  secondary_source: variant-1-correctness (gate-test section)
  contradictions:
    - variant-2-minimalist recommended single-transaction migration;
      rejected per Metis directive "phase the migration to avoid table lock"
Budget before you commit
| Metric | N=3 | N=5 (default) | N=7 |
|---|---|---|---|
| Agent spawns | ~14 | ~18 | ~22 |
| Wall-clock | ~5 min | ~6 min | ~8 min |
| Worst case (re-loops) | ~18 min | ~22 min | ~28 min |
| Token cost (approx) | ~$0.80 | ~$1.20 | ~$1.70 |
| Disk per run | ~3–7 MB | ~5–10 MB | ~7–14 MB |
Three real Alloy runs
Example 1 — Design a plugin system for a CLI
Plugin-system design has at least four reasonable shapes — in-process, out-of-process subprocess, WASM sandbox, or a manifest-and-registry approach. A single planner will pick one and commit; the tournament surfaces the tradeoffs.
/anneal-alloy:anneal "Design a plugin system for the CLI with versioned lifecycle hooks, sandboxed execution, and a plugin discovery marketplace. Plugins should be installable by name, scoped to a user or project, and should expose commands, skills, and hooks."
phase-04-sandboxed-execution:
  primary_source: variant-3-defensive (WASM-based sandbox)
  secondary_source: variant-4-performance (in-process when plugin is trusted)
  contradictions:
    - variant-1-correctness: mandatory sandbox
    - variant-4-performance: trust-level escape hatch
      → resolved: WASM is default, opt-in escape hatch via manifest flag,
        Oracle will flag escape hatch as deployment risk
phase-06-marketplace-discovery:
  primary_source: variant-5-ux
  secondary_source: variant-1-correctness (signature verification)
  contradictions: []
Output quality vs Cast: Cast on the same task produces a 5-phase plan with an in-process architecture. Alloy's blend produces a 7-phase plan with WASM sandboxing, a signed manifest, and an explicit escape hatch, a shape no single bias would have produced alone.
Example 2 — Replace REST API with GraphQL incrementally
High-stakes, low-reversibility, multiple reasonable approaches (strangler fig, dual-read/dual-write, facade-over-REST). The migration and verification biases specifically earn their keep at N=7.
/anneal-alloy:anneal --versions 7 "Replace the existing REST API with GraphQL incrementally. We have 43 REST endpoints in production serving 2.1M req/day, no downtime tolerance, and six client teams that each control their own migration timeline."
verdict: CAUTION
findings:
  - issue: dual-read phase duration is underspecified
    severity: high
    demand: "Name the exact metric and value: 'cutover when GraphQL p95
             latency < REST p95 latency AND error_rate_delta < 0.1%
             over 7 days.'"
  - issue: client teams' rollout order is not sequenced
    severity: medium
    demand: "Add phase-N-client-rollout-sequencing.md — pin the order
             (start with lowest-traffic client, ratchet to highest)."
  - issue: schema diff tooling is not named
    severity: medium
    demand: "Pin the tool, or declare the contract-test approach."
The blend produces a 9-phase plan with explicit dual-read metrics, client rollout sequencing, and schema-diff tooling pinned to graphql-inspector. Temper on the same task would have converged after two or three depths on a similar shape; Alloy reaches it faster because the biases force the tradeoffs to surface in parallel.
Example 3 — Build a workflow orchestration engine
Pure greenfield. No existing infrastructure in the probe report to anchor on. The plan shape is open — event-sourced vs log-structured, SQL vs KV store, single-process vs distributed. This is where Alloy's breadth advantage is widest.
/anneal-alloy:anneal "Build a workflow orchestration engine: durable execution, step-level retries with backoff, human-in-the-loop pauses, and a web UI to inspect runs. Inspired by Temporal but we want to own the code."
| Variant | Plan phases | Shape produced |
|---|---|---|
| Cast (single planner) | 6 | Event-sourced, Postgres-backed, single-process, Next.js UI |
| Alloy (N=5) | 9 | Event-sourced + SINGLE_NODE mode + cluster-aware + distinct UI app — a shape no single bias would have produced alone |
| Temper (depth=3) | 7 | Event-sourced, Postgres-backed, single-process — progressively hardened retry semantics |
On FAIL, route to Intent Gate
Alloy's re-loop is smarter than Cast's. On FAIL, it routes to Intent Gate, not to the Synthesizer. A failed synthesis suggests the bias mix was wrong — re-synthesizing the same N variants produces the same blend. Re-looping through Intent Gate gives Metis a chance to refine directives and lets the orchestrator pick a different N or a different bias set.
| Re-loop type | Trigger | Max iterations |
|---|---|---|
| Stage-4 re-loop | Momus returns BLOCK on the blend; tournament re-runs with findings as constraints | 2 stage-4 re-loops before escalating to full re-loop |
| Full re-loop | Hephaestus FAIL; routes through Intent Gate with failure folded as directives | 3 by default; --loop lifts to unbounded |
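The two re-loop budgets in the table can be read as a small routing function. The sketch below is an assumption-laden illustration: `next_route`, the event strings, and the route names are hypothetical, and the `"halt"` outcome once full re-loops are exhausted is a guess, since the source does not say what happens at that point.

```python
import math

MAX_STAGE4_RELOOPS = 2          # from the table above
DEFAULT_MAX_FULL_RELOOPS = 3    # from the table above; --loop lifts the cap


def next_route(event, stage4_count, full_count, loop_unbounded=False):
    """Decide where a failed run goes next.

    event:          "momus_block" or "hephaestus_fail" (hypothetical labels)
    stage4_count:   stage-4 re-loops already used in this pass
    full_count:     full re-loops already used
    loop_unbounded: True when --loop was passed
    """
    if event not in ("momus_block", "hephaestus_fail"):
        raise ValueError(f"unknown event: {event}")
    if event == "momus_block" and stage4_count < MAX_STAGE4_RELOOPS:
        # Tournament re-runs with Momus findings as constraints.
        return "stage4_reloop"
    # Hephaestus FAIL, or the stage-4 budget is spent: escalate.
    max_full = math.inf if loop_unbounded else DEFAULT_MAX_FULL_RELOOPS
    if full_count < max_full:
        return "intent_gate"
    # Assumption: the run stops once the full re-loop budget is exhausted.
    return "halt"
```

Note that neither branch returns to the Synthesizer directly, which matches the rationale above: re-synthesizing the same variants would reproduce the same blend.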
Know the edges
- N is capped at 7 for a reason. The synthesizer's attention budget is finite; beyond 7 variants the blend starts to average instead of integrate. If your task needs N=10 you're probably using the wrong variant; try Temper with depth 5 instead.
- The synthesizer is a single agent. If it makes an integration mistake, all N variants' quality is wasted. Momus is the backstop, but it only audits the blend, not the synthesizer's reasoning. Read synthesis-provenance.md whenever a blended plan feels off.
- Bias selection is heuristic, not optimal. The N=5 default is a good general-purpose set, but some tasks would benefit from a custom bias list (e.g. "correctness + migration + verification" for a schema migration). Custom bias lists are a v0.3.0 feature.
- Tournament parallelism requires a real shell. xargs -P is portable but some minimal container images omit it. If both nproc and sysctl -n hw.ncpu fail, the orchestrator falls back to sequential, and Alloy's wall-clock doubles.
- The synthesizer can invent phases. In ~3% of runs the synthesizer produces a phase not present in any variant; it invented it from contradictions. These phases are flagged in synthesis-provenance.md as primary_source: synthesizer with no secondary source. Read them skeptically.
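Synthesizer-invented phases are easy to miss in a long provenance file, so a quick scan can help. The sketch below assumes the file uses the same layout as the provenance excerpts on this page (a `phase-NN-...:` key followed by an indented `primary_source:` line); the real file format may differ, and `invented_phases` is a hypothetical helper.

```python
import re

# Assumed layout: "phase-NN-name:" on its own line, then indented fields.
PHASE_RE = re.compile(r"^(phase-\d+[\w-]*):\s*$")
SOURCE_RE = re.compile(r"^\s*primary_source:\s*(\S+)")


def invented_phases(provenance_text):
    """Return phase names whose primary_source is the synthesizer itself."""
    invented = []
    current = None
    for line in provenance_text.splitlines():
        phase = PHASE_RE.match(line.strip())
        if phase:
            current = phase.group(1)
            continue
        source = SOURCE_RE.match(line)
        if source and current and source.group(1) == "synthesizer":
            invented.append(current)
    return invented


if __name__ == "__main__":
    # "phase-05-audit-log" is a made-up phase for illustration.
    sample = (
        "phase-00-preflight:\n"
        "  primary_source: variant-3-defensive\n"
        "phase-05-audit-log:\n"
        "  primary_source: synthesizer\n"
        "  contradictions: []\n"
    )
    print(invented_phases(sample))
```

A line-based scan avoids a YAML dependency, at the cost of breaking if the layout changes; treat it as a review aid, not a validator.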
What Alloy writes to disk
Alloy writes everything Cast writes, plus the variant files and provenance sidecar. Preserve the variants/ directory — it's the richest learning signal in any Alloy run.
| File | What's in it |
|---|---|
| plan/plan.md | Overview ≤80 lines, status, effort, dependencies |
| plan/phase-NN-*.md | Detailed phase files with success criteria and Hephaestus targets |
| rollup.yaml | All envelopes, gate statuses, simultaneous_pass, emission decision |
| {variant}-{run_id}.xml | Opus 4.7 semantic-XML prompt for one-shot execution |
| variants/variant-1-correctness.md | Raw output from the correctness-biased planner |
| variants/variant-2-minimalist.md | Raw output from the minimalist-biased planner |
| variants/variant-N-{bias}.md | One file per bias |
| synthesis-provenance.md | Per-phase attribution, contradictions, resolution rationale |