
Usage Examples

Three complete end-to-end runs — one per variant. Each example shows the full console transcript, the resulting plan artifacts, and a cost breakdown. These are real outputs from real tasks, not curated demos.

| Variant | Shape | Use when |
| --- | --- | --- |
| Cast | Linear 9-agent pipeline | You know the scope; you want a deep, validated plan fast |
| Alloy | Tournament of N parallel planners → synthesizer | The problem has tradeoffs; one planner will miss dimensions |
| Temper | Iterative convergence loop | The target is moving; you need to harden against flakiness |

All three write to plans/<timestamp>-<slug>/ with a canonical shape: plan.md (overview ≤80 lines), phase-XX-<name>.md (detailed phases), and supporting research/ and reports/ subdirectories. The output contract is identical; the journey to get there differs.
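
For orientation, this is the shape of one finished run's directory, taken from the Cast example below. File names other than plan.md and the paths shown in the transcripts are illustrative:

```
plans/260423-2140-add-email-verification-to-signup/
├── plan.md                            # overview, ≤80 lines
├── phase-01-schema-migration.md       # one file per phase
├── ...
├── phase-05-functional-validation.md
├── research/                          # discovery, researcher, and architecture notes
│   ├── discovery.md
│   └── researcher-01-nextauth-email.md
└── reports/                           # supporting reports
```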

Cast: "Add email verification to user signup"

Cast wins when the problem is well-scoped, the tradeoffs are known, and you want depth rather than breadth. This is a bounded problem with a clear happy path — no tournament needed.

Console transcript

cast · 11m 13s · ~$2.84
<span class="cmd">&gt; /anneal cast "Add email verification to user signup"</span>

<span class="c">[cast] Plan slug: 260423-2140-add-email-verification-to-signup</span>
<span class="c">[cast] Spawning agent pipeline (9 agents, sequential)</span>

<span class="var">[1/9]</span> <span class="kw">discovery-agent</span>        (sonnet, 42s)
      → Scanned: src/app/signup/, src/lib/auth/, drizzle/schema.ts
      → Found: next-auth v5, Drizzle ORM, Resend for transactional email
      → Output: plans/.../research/discovery.md (2.1 KB)

<span class="var">[2/9]</span> <span class="kw">researcher-01</span>          (sonnet, 1m 08s)
      → Topic: NextAuth v5 email verification patterns
      → Cited: next-auth docs /providers/email, adapter schema requirements
      → Output: plans/.../research/researcher-01-nextauth-email.md

<span class="var">[3/9]</span> <span class="kw">researcher-02</span>          (sonnet, 58s)
      → Topic: Resend transactional templates + React Email
      → Output: plans/.../research/researcher-02-resend-templates.md

<span class="var">[4/9]</span> <span class="kw">researcher-03</span>          (sonnet, 1m 14s)
      → Topic: token TTL, single-use enforcement, replay protection
      → Output: plans/.../research/researcher-03-token-lifecycle.md

<span class="var">[5/9]</span> <span class="kw">architect-agent</span>        (opus, 2m 30s)
      → Synthesized research into 5-phase architecture
      → Output: plans/.../research/architecture.md

<span class="var">[6/9]</span> <span class="kw">decomposition-agent</span>    (sonnet, 1m 02s)
      → 5 phases → 23 TODOs with owner + effort + risk
      → Output: plans/.../research/decomposition.md

<span class="var">[7/9]</span> <span class="kw">phase-writer-batch</span>     (sonnet × 5 parallel, 2m 15s wall)
      → Wrote phase-01..phase-05 in parallel
      → Output: plans/.../phase-01..phase-05.md

<span class="var">[8/9]</span> <span class="kw">self-critique-agent</span>    (opus, 1m 44s)
      → Flagged: phase-03 missing DB migration rollback
      → Flagged: phase-04 missing email rate limiting
      → Patched inline; regenerated phase-03 + phase-04

<span class="var">[9/9]</span> <span class="kw">finalize-agent</span>         (sonnet, 22s)
      → Wrote plan.md overview (68 lines)
      → Cross-linked phases, validated file graph
      → Output: plans/260423-2140-add-email-verification-to-signup/plan.md

<span class="cmd">[cast] Complete. Total wall: 11m 13s. Tokens used: 187,422. Cost: ~$2.84</span>

Resulting plan.md (preview)

plans/260423-2140-add-email-verification-to-signup/plan.md
<span class="cmd"># Add Email Verification to User Signup</span>

<span class="kw">Status</span>:  READY (all phases written, self-critique applied)
<span class="kw">Created</span>: 2026-04-23 21:40
<span class="kw">Effort</span>:  ~6 hours across 5 phases

<span class="cmd">## Phases</span>
- [ ] Phase 01 — Schema migration: add verification_tokens table
- [ ] Phase 02 — Token generation + Resend integration
- [ ] Phase 03 — Verification route /api/auth/verify?token=...
- [ ] Phase 04 — Signup flow: send email, gate dashboard
- [ ] Phase 05 — Functional validation (sweep journey)

<span class="cmd">## Key dependencies</span>
- NEXTAUTH_SECRET rotation (affects existing sessions)
- RESEND_API_KEY env var (production only; dev uses log transport)
- Drizzle migration must run before deploy

<span class="cmd">## Validation</span>
Every phase exits only when evidence captured under e2e-evidence/.
Final gate: /validate-sweep on the email-verification-signup journey.
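
Phase 01's verification_tokens table is, roughly, the shape the NextAuth Drizzle adapter expects. A hedged sketch (column names follow the adapter's documented defaults; your plan may rename or extend them):

```typescript
// drizzle/schema.ts (excerpt): illustrative verification_tokens table for phase 01.
import { pgTable, text, timestamp, primaryKey } from "drizzle-orm/pg-core";

export const verificationTokens = pgTable(
  "verification_tokens",
  {
    identifier: text("identifier").notNull(),          // the email being verified
    token: text("token").notNull(),                    // single-use token value
    expires: timestamp("expires", { mode: "date" }).notNull(),
  },
  (vt) => ({
    // Composite key: one live token per (email, token) pair.
    pk: primaryKey({ columns: [vt.identifier, vt.token] }),
  })
);
```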

Resulting phase-03.md (first 20 lines)

plans/.../phase-03-verification-route.md
<span class="cmd"># Phase 03 — Verification route /api/auth/verify</span>

<span class="kw">Priority</span>: HIGH
<span class="kw">Status</span>:   PENDING
<span class="kw">Depends</span>:  phase-01 (schema), phase-02 (token service)

<span class="cmd">## Requirements</span>
- Route: GET /api/auth/verify?token=<span class="var">&lt;jwt&gt;</span>
- On valid token: update user.email_verified_at, redirect /dashboard
- On expired token: render /verify/expired with resend CTA
- On replay (token already consumed): 410 Gone, not 404

<span class="cmd">## Architecture</span>
- Use next-auth adapter.useVerificationToken (enforces single-use)
- Wrap in serverAction, not API route (CSRF via next-auth session)
- Rollback: migration drops verification_tokens table (down.sql)

<span class="cmd">## Files to modify</span>
- src/app/api/auth/verify/route.ts  (new)
- src/lib/auth/tokens.ts             (new)
- drizzle/schema.ts                  (new table)
...
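
To make the requirements concrete, here is a minimal sketch of the route's control flow. The consumeVerificationToken and markEmailVerified helpers are hypothetical stand-ins for what phase-02 builds (in the plan they would wrap next-auth's adapter.useVerificationToken); this shows the contract, not the plan's final code.

```typescript
// src/app/api/auth/verify/route.ts: illustrative sketch of the phase-03 contract.
// consumeVerificationToken and markEmailVerified are hypothetical phase-02 helpers.
import { NextRequest, NextResponse } from "next/server";
import { consumeVerificationToken } from "@/lib/auth/tokens";
import { markEmailVerified } from "@/lib/auth/users";

export async function GET(req: NextRequest) {
  const token = req.nextUrl.searchParams.get("token");
  if (!token) return new NextResponse("Missing token", { status: 400 });

  // Single-use consumption: returns "valid" exactly once, "consumed" on any replay.
  const result = await consumeVerificationToken(token);

  if (result.status === "consumed") {
    // Replay: 410 Gone, not 404. The link existed but has already been used.
    return new NextResponse("Verification link already used", { status: 410 });
  }
  if (result.status === "expired") {
    // Expired: send the user to the page with the "resend verification email" CTA.
    return NextResponse.redirect(new URL("/verify/expired", req.url));
  }

  // Valid: stamp user.email_verified_at, then continue to the dashboard.
  await markEmailVerified(result.email);
  return NextResponse.redirect(new URL("/dashboard", req.url));
}
```

The part worth keeping is the status mapping: consumed tokens return 410 rather than 404, so clients and monitoring can distinguish a replayed link from a bad one.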
Cost profile: Cast is the cheapest variant per plan produced — ~$2–5 for a medium-complexity feature, ~10 minutes wall. The self-critique pass at step 8 is the highest-leverage step. Killing it to save tokens is a false economy: it consistently catches missing rollbacks, missing rate limits, and missing validation phases.

Alloy: "Choose database for multi-tenant SaaS"

Cast is one planner making one plan. If the problem has genuine tradeoffs — database choice, auth strategy, architectural boundary — one planner will pick a direction and rationalize around it. Alloy forces tradeoffs to the surface by spawning planners with opposing biases and making the synthesizer reconcile them.

Console transcript

alloy · 10 planners · 6m 48s · ~$9.20
<span class="cmd">&gt; /anneal alloy "Choose database for multi-tenant SaaS, 10k-100k tenants, B2B"</span>

<span class="c">[alloy] Plan slug: 260423-2150-choose-database-multi-tenant-saas</span>
<span class="c">[alloy] Spawning 10 planner agents (parallel, biased prompts)</span>

<span class="var">[planner-01</span> <span class="kw">bias=schema-per-tenant</span><span class="var">]</span>     → Postgres + row-level security + schema namespacing
<span class="var">[planner-02</span> <span class="kw">bias=db-per-tenant</span><span class="var">]</span>         → Postgres-per-tenant via Neon branch API
<span class="var">[planner-03</span> <span class="kw">bias=row-level-shared</span><span class="var">]</span>      → Single Postgres, tenant_id column + RLS policies
<span class="var">[planner-04</span> <span class="kw">bias=nosql</span><span class="var">]</span>                 → DynamoDB with composite PK (tenant#entity)
<span class="var">[planner-05</span> <span class="kw">bias=planetscale-cost</span><span class="var">]</span>      → Vitess (PlanetScale) horizontal sharding by tenant
<span class="var">[planner-06</span> <span class="kw">bias=sqlite-edge</span><span class="var">]</span>           → LiteFS / Turso, sqlite-per-tenant on edge
<span class="var">[planner-07</span> <span class="kw">bias=supabase-opinionated</span><span class="var">]</span>  → Supabase + RLS, ride the platform
<span class="var">[planner-08</span> <span class="kw">bias=enterprise-compliance</span><span class="var">]</span> → Per-tenant Postgres in dedicated VPCs
<span class="var">[planner-09</span> <span class="kw">bias=read-replica-heavy</span><span class="var">]</span>    → Aurora Postgres, cross-region replicas
<span class="var">[planner-10</span> <span class="kw">bias=event-sourced</span><span class="var">]</span>         → EventStore + CQRS projections per tenant

<span class="c">[alloy] All 10 planners complete (3m 12s wall, longest 2m 58s)</span>
<span class="c">[alloy] Synthesizer agent (opus) reading 10 plans + 10 rationales</span>

<span class="var">[synth]</span> Clustering votes on primary dimension: <span class="kw">isolation model</span>
        → strong_isolation (db-per-tenant):       planners 02, 08           (2)
        → moderate_isolation (schema-per-tenant): planners 01, 06           (2)
        → logical_isolation (row-level / RLS):    planners 03, 05, 07       (3)
        → non_relational_isolation:               planners 04, 10           (2)
        → cross_region_replica_focus:             planner 09                (1)

<span class="var">[synth]</span> Agreement points (all 10 planners agree):
        → tenant_id must be on every row, even where RLS enforces it
        → backups must be per-tenant restorable (10+ agree explicitly)
        → migrations must be online (no full-table locks)

<span class="var">[synth]</span> Key tradeoff surfaced: <span class="kw">blast radius vs operational cost</span>
        → db-per-tenant:       lowest blast radius, highest ops cost (10–100k DBs)
        → row-level-shared:    highest blast radius, lowest ops cost
        → schema-per-tenant:   middle ground, operationally complex at &gt;10k tenants

<span class="var">[synth]</span> <span class="cmd">Recommendation: row-level-shared with Postgres RLS + tenant_id on every row</span>
        Reasoning: at 10-100k tenants the db-per-tenant ops cost is prohibitive
          without a platform like Neon (planner-02). Supabase (planner-07) carries
          vendor lock-in. Aurora cross-region (planner-09) is for read-heavy, not
          isolation. Planners 03/07's logical model is the pragmatic floor.
        Dissent: compliance-heavy customers may require planner-08's VPC-per-tenant.
          Plan includes migration path to hybrid: shared DB for standard tiers,
          dedicated DB for enterprise.

<span class="c">[alloy] Writing plan.md + phase-01..phase-06.md + decision-matrix.md</span>
<span class="cmd">[alloy] Complete. Total wall: 6m 48s. Tokens: 514,930. Cost: ~$9.20</span>

Synthesizer output — decision matrix

| Option | Blast radius | Ops cost | Compliance | Vendor lock-in | Decision |
| --- | --- | --- | --- | --- | --- |
| 1. Shared Postgres + RLS | HIGH | LOW | MEDIUM | LOW | ✓ Default (chosen) |
| 2. Schema-per-tenant | MEDIUM | MEDIUM | MEDIUM | LOW | Alternative |
| 3. DB-per-tenant (Neon branch) | LOW | HIGH | HIGH | MEDIUM | Enterprise tier |
| 4. DynamoDB composite PK | HIGH | MEDIUM | LOW | HIGH (AWS) | Rejected |
| 5. LiteFS / Turso edge | LOW | LOW | LOW | MEDIUM | Deferred |
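
To ground the chosen default: "tenant_id on every row + RLS" means each table carries a tenant_id column and a Postgres policy filters every query by it. A minimal Drizzle sketch, assuming the policy lives in a SQL migration and the app sets the current tenant per connection (table and setting names are illustrative):

```typescript
// drizzle/schema.ts (excerpt): illustrative shape for the shared-Postgres + RLS option.
import { pgTable, uuid, text, timestamp, index } from "drizzle-orm/pg-core";

export const projects = pgTable(
  "projects",
  {
    id: uuid("id").primaryKey().defaultRandom(),
    tenantId: uuid("tenant_id").notNull(), // on every row, the planners' unanimous point
    name: text("name").notNull(),
    createdAt: timestamp("created_at").defaultNow().notNull(),
  },
  (t) => ({
    byTenant: index("projects_tenant_idx").on(t.tenantId),
  })
);

// The isolation itself is a Postgres policy applied in a SQL migration, e.g.:
//   ALTER TABLE projects ENABLE ROW LEVEL SECURITY;
//   CREATE POLICY tenant_isolation ON projects
//     USING (tenant_id = current_setting('app.current_tenant')::uuid);
// The application sets app.current_tenant per request/connection before querying.
```
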
Why tournament beats single-planner here: A single planner spawned with "choose a database for multi-tenant SaaS" will almost certainly propose shared Postgres + RLS. It's the correct answer most of the time. But it won't surface the enterprise-tier case, the edge-latency case, or the compliance case, because one agent won't prompt itself with opposing biases. Alloy does. Ship the chosen option without the decision matrix and the first enterprise customer who asks "can you isolate our data at the DB level?" finds you with no written rationale for what was considered.
Cost profile: 3–4× Cast's token cost, 1.5–2× wall time. Worth it for decisions you cannot easily reverse (database, auth strategy, runtime, deployment topology). Not worth it for bounded features where Cast's self-critique catches what you need.

Temper: "Debug and harden flaky payment webhook"

Temper is for problems where the target is moving. You don't want a plan that solves what you think the problem is; you want a plan that hardens against what the problem actually turns out to be, after three iterations of adversarial review. Cast and Alloy produce plans from a fixed understanding. Temper evolves understanding as a side-effect of iterating.

Console transcript

temper · depth 0–4 · 12m 32s · ~$6.70
<span class="cmd">&gt; /anneal temper "Debug and harden flaky payment webhook, Stripe, ~2% failure rate"</span>

<span class="c">[temper] Plan slug: 260423-2205-debug-harden-flaky-payment-webhook</span>
<span class="c">[temper] Convergence threshold: 10% phase delta between iterations</span>
<span class="c">[temper] Max depth: 5</span>

<span class="var">--- Depth 0 (initial plan, 2m 18s, 68k tokens) ---</span>
Hypothesis: webhook endpoint is slow; Stripe retries and we double-process.
Phases:
  01. Add idempotency keys to webhook handler
  02. Add DB unique constraint on (stripe_event_id)
  03. Add structured logs for every webhook delivery
  04. Re-deploy + monitor for 48h

<span class="c">[temper] Writing depth-0/ snapshot. Running depth-1 critique.</span>

<span class="var">--- Depth 1 (critique + revise, 3m 04s, 104k tokens) ---</span>
Critique of depth 0:
  <span class="cmd">(+)</span> Idempotency fix is correct direction
  <span class="kw">(-)</span> Hypothesis assumed double-processing; no evidence yet. Could be
      single-process failing silently instead.
  <span class="kw">(-)</span> No instrumentation to distinguish double-process from silent-fail
  <span class="kw">(-)</span> No consideration of Stripe signature verification drift across regions

Revised plan:
  01. Add structured logs + trace IDs BEFORE any fix (24h observation)
  02. Only after logs land: classify the 2% — is it retry collision, handler
      failure, or signature verification?
  03. Fix per classification:
      3a. If retry collision → idempotency keys + DB unique constraint
      3b. If handler failure → surface exception, fix handler
      3c. If signature drift → verify Stripe-Signature timestamp tolerance
  04. Re-deploy + monitor
  05. Post-mortem doc, not just a fix

<span class="kw">Phase delta: 4/4 phases changed (100%). Continue.</span>

<span class="var">--- Depth 2 (critique + revise, 2m 41s, 89k tokens) ---</span>
Critique of depth 1:
  <span class="cmd">(+)</span> Observation-before-fix is the right move
  <span class="cmd">(+)</span> Classification tree is correct
  <span class="kw">(-)</span> 24h window may be too short for a 2% event; need statistical power
  <span class="kw">(-)</span> Missing: what happens when Stripe goes down 10+ min and replays
      1000+ events on recovery? Idempotency alone won't save you.
  <span class="kw">(-)</span> Missing: DLQ for unrecoverable events after 3 retries

Revised plan:
  01. Instrument: trace IDs + structured logs + event replay counter (48h)
  02. Classify from logs (48h ≈ 2000 events × 2% = 40 failure samples)
  03. Fix per classification (3a/3b/3c as depth 1)
  04. Add bounded-concurrency queue for replay bursts (Upstash QStash)
  05. Add DLQ: after 3 failed handler attempts, event → DLQ table
  06. Alert on DLQ depth &gt; 0 (PagerDuty)
  07. Re-deploy + monitor
  08. Post-mortem + runbook (runbook is deliverable, not afterthought)

<span class="kw">Phase delta: 3/8 phases added or materially changed (37%). Continue.</span>

<span class="var">--- Depth 3 (critique + revise, 2m 32s, 81k tokens) ---</span>
Critique of depth 2:
  <span class="cmd">(+)</span> DLQ + alert is the right hardening
  <span class="kw">(-)</span> Minor: Upstash QStash has 1MB payload limit; Stripe events can exceed
  <span class="kw">(-)</span> Minor: runbook should include "how to replay from DLQ safely"

Revised plan: same 8 phases; phase-04 footnotes QStash payload limit;
phase-08 expanded to include DLQ replay runbook.

<span class="kw">Phase delta: 0/8 phases added, 2/8 materially amended (25%). Continue.</span>

<span class="var">--- Depth 4 (critique + revise, 1m 57s, 52k tokens) ---</span>
Critique of depth 3:
  <span class="cmd">(+)</span> Plan is complete end-to-end
  <span class="cmd">(+)</span> Observation-first protocol removes the single biggest risk (fixing
      the wrong thing)
  <span class="kw">(-)</span> Nothing material to change.

<span class="cmd">Phase delta: 0/8 phases changed (0%). BELOW THRESHOLD (&lt;10%).</span>

<span class="cmd">[temper] Convergence signal: phase delta 0% at depth 4. Stopping loop.</span>
<span class="c">[temper] Depth 4 is final. Writing plan.md + depth-history.json + phase-*.md</span>
<span class="cmd">[temper] Complete. Total wall: 12m 32s. Tokens: 394,012. Cost: ~$6.70</span>

Iteration delta trace

| Depth | Phases | Delta | Action |
| --- | --- | --- | --- |
| depth 0 | 4 phases | Initial plan. Hypothesis: double-processing. No instrumentation. Δ n/a | initial |
| depth 1 | 5 phases | Wrong hypothesis detected. Full rewrite: observe before fixing, classification tree. Δ 100% | continue |
| depth 2 | 8 phases | DLQ, bounded concurrency queue, statistical window extended to 48h. Δ 37% | continue |
| depth 3 | 8 phases | Minor: QStash payload footnote, DLQ replay runbook added. Δ 25% | continue |
| depth 4 | 8 phases | Nothing material to change. Plan complete end-to-end. Δ 0% | CONVERGED |

Why depth-0 being wrong isn't a failure. Depth 1 rewrites 100% of the phases — that's evidence the initial hypothesis was wrong. Temper is designed to survive the initial plan being wrong. What you should worry about is depth 1 having 0% delta: that means the critique agent isn't actually critiquing, and the loop will converge on whatever depth 0 said. If you see 0% delta at depth 1, kill the run and re-prompt with a more adversarial critique bias.
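
A sketch of the convergence check itself, assuming "materially changed" is judged per phase. In practice that judgment and the critique/revise steps are agent calls, not string comparison; everything here is illustrative:

```typescript
// Illustrative convergence loop for Temper, not Anneal's actual implementation.
type Phase = { name: string; body: string };
type Plan = Phase[];

// Fraction of the revised plan's phases that are new or materially changed
// relative to the previous depth. A real implementation would use an agent's
// judgment of "materially changed" rather than string equality.
function phaseDelta(prev: Plan, next: Plan): number {
  const prevByName = new Map(prev.map((p) => [p.name, p.body]));
  let changed = 0;
  for (const phase of next) {
    const before = prevByName.get(phase.name);
    if (before === undefined || before !== phase.body) changed++;
  }
  return changed / Math.max(next.length, 1);
}

async function temper(
  initialPlan: () => Promise<Plan>,
  critiqueAndRevise: (plan: Plan, depth: number) => Promise<Plan>,
  threshold = 0.10,   // "10% phase delta between iterations"
  maxDepth = 5
): Promise<Plan> {
  let plan = await initialPlan();                // depth 0
  for (let depth = 1; depth <= maxDepth; depth++) {
    const revised = await critiqueAndRevise(plan, depth);
    const delta = phaseDelta(plan, revised);
    plan = revised;
    if (delta < threshold) break;                // converged, e.g. 0% at depth 4
  }
  return plan;                                   // otherwise stop at the depth cap
}
```
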
Cost profile: Temper can be the most expensive variant when it runs long; this run took 12 minutes wall and ~$6.70 for a mid-complexity debugging plan. It costs more than Cast (~$2.84) but produces a plan that has survived four rounds of adversarial review. For production-incident hardening, that's cheap relative to a bad fix that silently lets the 2% failure rate persist.

Choosing the right variant

| Signal | Use | Reason |
| --- | --- | --- |
| Problem is well-scoped, happy path is clear | Cast | Tournament and convergence waste cycles on known dimensions |
| Decision involves irreversible choices (DB, auth, infra) | Alloy | One planner will pick a direction and rationalize; you need opposing biases surfaced |
| Problem has been open a while; previous fixes missed | Temper | You need the hypothesis to evolve, not a plan built on a fixed understanding |
| You want a plan in under 15 minutes and under $5 | Cast | Cheapest; self-critique at step 8 catches the critical gaps |
| You need an explicit decision matrix with dissent captured | Alloy | Synthesizer writes the matrix and cites which planners dissented and why |
| Target is moving: flakiness, intermittent bugs, slippery reproductions | Temper | Convergence loop is designed to harden against what the problem actually is |
| N planners disagreed: you want the strongest consensus plan | Alloy (N=7) | Higher N increases coverage at the cost of ~2× tokens per added planner |
| You want maximum depth hardening, cost is secondary | Temper (depth=5) | More depths = more adversarial passes; convergence may not trigger until cap |
"Depth 1 rewrites 100% of the phases. That's evidence the initial hypothesis was wrong — not that the tool failed. Temper is designed to survive the initial plan being wrong."Usage Examples · Anneal v0.1.0