Usage Examples
Three complete end-to-end runs — one per variant. Each example shows the full console transcript, the resulting plan artifacts, and a cost breakdown. These are real outputs from real tasks, not curated demos.
| Variant | Shape | Use when |
|---|---|---|
| Cast | Linear 9-agent pipeline | You know the scope; you want a deep, validated plan fast |
| Alloy | Tournament of N parallel planners → synthesizer | The problem has tradeoffs; one planner will miss dimensions |
| Temper | Iterative convergence loop | The target is moving; you need to harden against flakiness |
All three write to plans/<timestamp>-<slug>/ with a canonical shape: plan.md (overview ≤80 lines), phase-XX-<name>.md (detailed phases), and supporting research/ and reports/ subdirectories. The output contract is identical; the journey to get there differs.
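The contract is simple enough to check mechanically. As a minimal sketch (not part of Anneal; the script name and invocation are assumptions), a few lines of TypeScript can verify a generated directory against the shape described above:

```ts
// Sketch only: a quick contract check for a generated plan directory, based on the
// shape described above (plan.md ≤80 lines, phase-XX-<name>.md, research/, reports/).
// Not part of Anneal; the script name and argument handling are assumptions.
import { existsSync, readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

const planDir = process.argv[2]; // e.g. a plans/<timestamp>-<slug>/ directory
if (!planDir) throw new Error("usage: check-plan <plan-dir>");

// plan.md overview must stay within the ≤80-line budget
const overview = readFileSync(join(planDir, "plan.md"), "utf8");
const lines = overview.trimEnd().split("\n").length;
console.log(`plan.md: ${lines} lines ${lines <= 80 ? "(within budget)" : "(over budget)"}`);

// detailed phases follow phase-XX or phase-XX-<name> naming
const phases = readdirSync(planDir).filter((f) => /^phase-\d{2}(-[\w-]+)?\.md$/.test(f));
console.log(`phase files: ${phases.length}`, phases);

// supporting subdirectories
for (const sub of ["research", "reports"]) {
  console.log(`${sub}/: ${existsSync(join(planDir, sub)) ? "present" : "missing"}`);
}
```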
“Add email verification to user signup”
Cast wins when the problem is well-scoped, the tradeoffs are known, and you want depth rather than breadth. This is a bounded problem with a clear happy path — no tournament needed.
Console transcript
<span class="cmd">> /anneal cast "Add email verification to user signup"</span>
<span class="c">[cast] Plan slug: 260423-2140-add-email-verification-to-signup</span>
<span class="c">[cast] Spawning agent pipeline (9 agents, sequential)</span>
<span class="var">[1/9]</span> <span class="kw">discovery-agent</span> (sonnet, 42s)
→ Scanned: src/app/signup/, src/lib/auth/, drizzle/schema.ts
→ Found: next-auth v5, Drizzle ORM, Resend for transactional email
→ Output: plans/.../research/discovery.md (2.1 KB)
<span class="var">[2/9]</span> <span class="kw">researcher-01</span> (sonnet, 1m 08s)
→ Topic: NextAuth v5 email verification patterns
→ Cited: next-auth docs /providers/email, adapter schema requirements
→ Output: plans/.../research/researcher-01-nextauth-email.md
<span class="var">[3/9]</span> <span class="kw">researcher-02</span> (sonnet, 58s)
→ Topic: Resend transactional templates + React Email
→ Output: plans/.../research/researcher-02-resend-templates.md
<span class="var">[4/9]</span> <span class="kw">researcher-03</span> (sonnet, 1m 14s)
→ Topic: token TTL, single-use enforcement, replay protection
→ Output: plans/.../research/researcher-03-token-lifecycle.md
<span class="var">[5/9]</span> <span class="kw">architect-agent</span> (opus, 2m 30s)
→ Synthesized research into 5-phase architecture
→ Output: plans/.../research/architecture.md
<span class="var">[6/9]</span> <span class="kw">decomposition-agent</span> (sonnet, 1m 02s)
→ 5 phases → 23 TODOs with owner + effort + risk
→ Output: plans/.../research/decomposition.md
<span class="var">[7/9]</span> <span class="kw">phase-writer-batch</span> (sonnet × 5 parallel, 2m 15s wall)
→ Wrote phase-01..phase-05 in parallel
→ Output: plans/.../phase-01..phase-05.md
<span class="var">[8/9]</span> <span class="kw">self-critique-agent</span> (opus, 1m 44s)
→ Flagged: phase-03 missing DB migration rollback
→ Flagged: phase-04 missing email rate limiting
→ Patched inline; regenerated phase-03 + phase-04
<span class="var">[9/9]</span> <span class="kw">finalize-agent</span> (sonnet, 22s)
→ Wrote plan.md overview (68 lines)
→ Cross-linked phases, validated file graph
→ Output: plans/260423-2140-add-email-verification-to-signup/plan.md
<span class="cmd">[cast] Complete. Total wall: 11m 13s. Tokens used: 187,422. Cost: ~$2.84</span>Resulting plan.md (preview)
<span class="cmd"># Add Email Verification to User Signup</span> <span class="kw">Status</span>: READY (all phases written, self-critique applied) <span class="kw">Created</span>: 2026-04-23 21:40 <span class="kw">Effort</span>: ~6 hours across 5 phases <span class="cmd">## Phases</span> - [ ] Phase 01 — Schema migration: add verification_tokens table - [ ] Phase 02 — Token generation + Resend integration - [ ] Phase 03 — Verification route /api/auth/verify?token=... - [ ] Phase 04 — Signup flow: send email, gate dashboard - [ ] Phase 05 — Functional validation (sweep journey) <span class="cmd">## Key dependencies</span> - NEXTAUTH_SECRET rotation (affects existing sessions) - RESEND_API_KEY env var (production only; dev uses log transport) - Drizzle migration must run before deploy <span class="cmd">## Validation</span> Every phase exits only when evidence captured under e2e-evidence/. Final gate: /validate-sweep on the email-verification-signup journey.
Resulting phase-03.md (first 20 lines)
<span class="cmd"># Phase 03 — Verification route /api/auth/verify</span> <span class="kw">Priority</span>: HIGH <span class="kw">Status</span>: PENDING <span class="kw">Depends</span>: phase-01 (schema), phase-02 (token service) <span class="cmd">## Requirements</span> - Route: GET /api/auth/verify?token=<span class="var"><jwt></span> - On valid token: update user.email_verified_at, redirect /dashboard - On expired token: render /verify/expired with resend CTA - On replay (token already consumed): 410 Gone, not 404 <span class="cmd">## Architecture</span> - Use next-auth adapter.useVerificationToken (enforces single-use) - Wrap in serverAction, not API route (CSRF via next-auth session) - Rollback: migration drops verification_tokens table (down.sql) <span class="cmd">## Files to modify</span> - src/app/api/auth/verify/route.ts (new) - src/lib/auth/tokens.ts (new) - drizzle/schema.ts (new table) ...
“Choose database for multi-tenant SaaS”
Cast is one planner making one plan. If the problem has genuine tradeoffs — database choice, auth strategy, architectural boundary — one planner will pick a direction and rationalize around it. Alloy forces tradeoffs to the surface by spawning planners with opposing biases and making the synthesizer reconcile them.
Console transcript
<span class="cmd">> /anneal alloy "Choose database for multi-tenant SaaS, 10k-100k tenants, B2B"</span>
<span class="c">[alloy] Plan slug: 260423-2150-choose-database-multi-tenant-saas</span>
<span class="c">[alloy] Spawning 10 planner agents (parallel, biased prompts)</span>
<span class="var">[planner-01</span> <span class="kw">bias=schema-per-tenant</span><span class="var">]</span> → Postgres + row-level security + schema namespacing
<span class="var">[planner-02</span> <span class="kw">bias=db-per-tenant</span><span class="var">]</span> → Postgres-per-tenant via Neon branch API
<span class="var">[planner-03</span> <span class="kw">bias=row-level-shared</span><span class="var">]</span> → Single Postgres, tenant_id column + RLS policies
<span class="var">[planner-04</span> <span class="kw">bias=nosql</span><span class="var">]</span> → DynamoDB with composite PK (tenant#entity)
<span class="var">[planner-05</span> <span class="kw">bias=planetscale-cost</span><span class="var">]</span> → Vitess (PlanetScale) horizontal sharding by tenant
<span class="var">[planner-06</span> <span class="kw">bias=sqlite-edge</span><span class="var">]</span> → LiteFS / Turso, sqlite-per-tenant on edge
<span class="var">[planner-07</span> <span class="kw">bias=supabase-opinionated</span><span class="var">]</span> → Supabase + RLS, ride the platform
<span class="var">[planner-08</span> <span class="kw">bias=enterprise-compliance</span><span class="var">]</span> → Per-tenant Postgres in dedicated VPCs
<span class="var">[planner-09</span> <span class="kw">bias=read-replica-heavy</span><span class="var">]</span> → Aurora Postgres, cross-region replicas
<span class="var">[planner-10</span> <span class="kw">bias=event-sourced</span><span class="var">]</span> → EventStore + CQRS projections per tenant
<span class="c">[alloy] All 10 planners complete (3m 12s wall, longest 2m 58s)</span>
<span class="c">[alloy] Synthesizer agent (opus) reading 10 plans + 10 rationales</span>
<span class="var">[synth]</span> Clustering votes on primary dimension: <span class="kw">isolation model</span>
→ strong_isolation (db-per-tenant): planners 02, 08 (2)
→ moderate_isolation (schema-per-tenant): planners 01, 06 (2)
→ logical_isolation (row-level / RLS): planners 03, 05, 07 (3)
→ non_relational_isolation: planners 04, 10 (2)
→ cross_region_replica_focus: planner 09 (1)
<span class="var">[synth]</span> Agreement points (all 10 planners agree):
→ tenant_id must be on every row, even where RLS enforces it
→ backups must be per-tenant restorable (10+ agree explicitly)
→ migrations must be online (no full-table locks)
<span class="var">[synth]</span> Key tradeoff surfaced: <span class="kw">blast radius vs operational cost</span>
→ db-per-tenant: lowest blast radius, highest ops cost (10–100k DBs)
→ row-level-shared: highest blast radius, lowest ops cost
→ schema-per-tenant: middle ground, operationally complex at >10k tenants
<span class="var">[synth]</span> <span class="cmd">Recommendation: row-level-shared with Postgres RLS + tenant_id on every row</span>
Reasoning: at 10-100k tenants the db-per-tenant ops cost is prohibitive
without a platform like Neon (planner-02). Supabase (planner-07) carries
vendor lock-in. Aurora cross-region (planner-09) is for read-heavy, not
isolation. Planners 03/07's logical model is the pragmatic floor.
Dissent: compliance-heavy customers may require planner-08's VPC-per-tenant.
Plan includes migration path to hybrid: shared DB for standard tiers,
dedicated DB for enterprise.
<span class="c">[alloy] Writing plan.md + phase-01..phase-06.md + decision-matrix.md</span>
<span class="cmd">[alloy] Complete. Total wall: 6m 48s. Tokens: 514,930. Cost: ~$9.20</span>Synthesizer output — decision matrix
| Option | Blast radius | Ops cost | Compliance | Vendor lock-in | Decision |
|---|---|---|---|---|---|
| 1. Shared Postgres + RLS | HIGH | LOW | MEDIUM | LOW | ✓ Default (chosen) |
| 2. Schema-per-tenant | MEDIUM | MEDIUM | MEDIUM | LOW | Alternative |
| 3. DB-per-tenant (Neon branch) | LOW | HIGH | HIGH | MEDIUM | Enterprise tier |
| 4. DynamoDB composite PK | HIGH | MEDIUM | LOW | HIGH (AWS) | Rejected |
| 5. LiteFS / Turso edge | LOW | LOW | LOW | MEDIUM | Deferred |
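The chosen default rests on two of the consensus points above: tenant_id on every row, and RLS as the enforcement layer. A minimal sketch of what that looks like in application code with Drizzle, assuming a current_setting-based policy; the table, column, and helper names are illustrative, not taken from the generated plan:

```ts
// Sketch only. Assumes each tenant-scoped table carries tenant_id and a policy like:
//   ALTER TABLE invoices ENABLE ROW LEVEL SECURITY;
//   CREATE POLICY tenant_isolation ON invoices
//     USING (tenant_id = current_setting('app.current_tenant')::uuid);
// "invoices", "db", and "withTenant" are hypothetical names for illustration.
import { eq, sql } from "drizzle-orm";
import { db } from "@/lib/db";          // Drizzle client (assumed)
import { invoices } from "@/db/schema"; // tenant-scoped table (assumed)

export function withTenant<T>(tenantId: string, fn: (tx: any) => Promise<T>) {
  return db.transaction(async (tx) => {
    // set_config(..., true) scopes the setting to this transaction, so the RLS
    // policy filters every query below to the current tenant's rows
    await tx.execute(
      sql`select set_config('app.current_tenant', ${tenantId}, true)`,
    );
    return fn(tx);
  });
}

// Every query still reads and writes tenant_id explicitly (the point all 10
// planners agreed on); RLS is the backstop when a query forgets the filter.
export function listOpenInvoices(tenantId: string) {
  return withTenant(tenantId, (tx) =>
    tx.select().from(invoices).where(eq(invoices.status, "open")),
  );
}
```

The enterprise escape hatch in the plan (dedicated DB per tenant) leaves code like this unchanged: in a single-tenant database the policy simply never filters anything out.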
“Debug and harden flaky payment webhook”
Temper is for problems where the target is moving. You don't want a plan that solves what you think the problem is; you want a plan that hardens against what the problem actually turns out to be, after successive rounds of adversarial review. Cast and Alloy produce plans from a fixed understanding. Temper evolves its understanding as a side effect of iterating.
Console transcript
<span class="cmd">> /anneal temper "Debug and harden flaky payment webhook, Stripe, ~2% failure rate"</span>
<span class="c">[temper] Plan slug: 260423-2205-debug-harden-flaky-payment-webhook</span>
<span class="c">[temper] Convergence threshold: 10% phase delta between iterations</span>
<span class="c">[temper] Max depth: 5</span>
<span class="var">--- Depth 0 (initial plan, 2m 18s, 68k tokens) ---</span>
Hypothesis: webhook endpoint is slow; Stripe retries and we double-process.
Phases:
01. Add idempotency keys to webhook handler
02. Add DB unique constraint on (stripe_event_id)
03. Add structured logs for every webhook delivery
04. Re-deploy + monitor for 48h
<span class="c">[temper] Writing depth-0/ snapshot. Running depth-1 critique.</span>
<span class="var">--- Depth 1 (critique + revise, 3m 04s, 104k tokens) ---</span>
Critique of depth 0:
<span class="cmd">(+)</span> Idempotency fix is correct direction
<span class="kw">(-)</span> Hypothesis assumed double-processing; no evidence yet. Could be
single-process failing silently instead.
<span class="kw">(-)</span> No instrumentation to distinguish double-process from silent-fail
<span class="kw">(-)</span> No consideration of Stripe signature verification drift across regions
Revised plan:
01. Add structured logs + trace IDs BEFORE any fix (24h observation)
02. Only after logs land: classify the 2% — is it retry collision, handler
failure, or signature verification?
03. Fix per classification:
3a. If retry collision → idempotency keys + DB unique constraint
3b. If handler failure → surface exception, fix handler
3c. If signature drift → verify Stripe-Signature timestamp tolerance
04. Re-deploy + monitor
05. Post-mortem doc, not just a fix
<span class="kw">Phase delta: 4/4 phases changed (100%). Continue.</span>
<span class="var">--- Depth 2 (critique + revise, 2m 41s, 89k tokens) ---</span>
Critique of depth 1:
<span class="cmd">(+)</span> Observation-before-fix is the right move
<span class="cmd">(+)</span> Classification tree is correct
<span class="kw">(-)</span> 24h window may be too short for a 2% event; need statistical power
<span class="kw">(-)</span> Missing: what happens when Stripe goes down 10+ min and replays
1000+ events on recovery? Idempotency alone won't save you.
<span class="kw">(-)</span> Missing: DLQ for unrecoverable events after 3 retries
Revised plan:
01. Instrument: trace IDs + structured logs + event replay counter (48h)
02. Classify from logs (48h ≈ 2000 events × 2% = 40 failure samples)
03. Fix per classification (3a/3b/3c as depth 1)
04. Add bounded-concurrency queue for replay bursts (Upstash QStash)
05. Add DLQ: after 3 failed handler attempts, event → DLQ table
06. Alert on DLQ depth > 0 (PagerDuty)
07. Re-deploy + monitor
08. Post-mortem + runbook (runbook is deliverable, not afterthought)
<span class="kw">Phase delta: 3/8 phases added or materially changed (37%). Continue.</span>
<span class="var">--- Depth 3 (critique + revise, 2m 32s, 81k tokens) ---</span>
Critique of depth 2:
<span class="cmd">(+)</span> DLQ + alert is the right hardening
<span class="kw">(-)</span> Minor: Upstash QStash has 1MB payload limit; Stripe events can exceed
<span class="kw">(-)</span> Minor: runbook should include "how to replay from DLQ safely"
Revised plan: same 8 phases; phase-04 footnotes QStash payload limit;
phase-08 expanded to include DLQ replay runbook.
<span class="kw">Phase delta: 0/8 phases added, 2/8 materially amended (25%). Continue.</span>
<span class="var">--- Depth 4 (critique + revise, 1m 57s, 52k tokens) ---</span>
Critique of depth 3:
<span class="cmd">(+)</span> Plan is complete end-to-end
<span class="cmd">(+)</span> Observation-first protocol removes the single biggest risk (fixing
the wrong thing)
<span class="kw">(-)</span> Nothing material to change.
<span class="cmd">Phase delta: 0/8 phases changed (0%). BELOW THRESHOLD (<10%).</span>
<span class="cmd">[temper] Convergence signal: phase delta 0% at depth 4. Stopping loop.</span>
<span class="c">[temper] Depth 4 is final. Writing plan.md + depth-history.json + phase-*.md</span>
<span class="cmd">[temper] Complete. Total wall: 12m 32s. Tokens: 394,012. Cost: ~$6.70</span>Iteration delta trace
Choosing the right variant
| Signal | Use | Reason |
|---|---|---|
| Problem is well-scoped, happy path is clear | Cast | Tournament and convergence waste cycles on known dimensions |
| Decision involves irreversible choices (DB, auth, infra) | Alloy | One planner will pick a direction and rationalize; you need opposing biases surfaced |
| Problem has been open a while; previous fixes missed | Temper | You need the hypothesis to evolve, not a plan built on a fixed understanding |
| You want a plan in under 15 minutes and under $5 | Cast | Cheapest; self-critique at step 8 catches the critical gaps |
| You need an explicit decision matrix with dissent captured | Alloy | Synthesizer writes the matrix and cites which planners dissented and why |
| Target is moving: flakiness, intermittent bugs, slippery reproductions | Temper | Convergence loop is designed to harden against what the problem actually is |
| N planners disagreed: you want the strongest consensus plan | Alloy (N=7) | Higher N increases coverage at the cost of ~2× tokens per added planner |
| You want maximum depth hardening, cost is secondary | Temper (depth=5) | More depths = more adversarial passes; convergence may not trigger until cap |
"Depth 1 rewrites 100% of the phases. That's evidence the initial hypothesis was wrong — not that the tool failed. Temper is designed to survive the initial plan being wrong."Usage Examples · Anneal v0.1.0