Anneal · Docs

Roadmap

Three variants today. One unified runtime tomorrow. Cross-plugin integration the year after. This is the living roadmap for Anneal — what's landed, what's next, and what the long arc looks like. Dates are approximate targets; priorities shift based on real user signal.

shipped 2026-04-22

v0.1.0

Three variants shipping separately

planned 2026-06

v0.2.0

Unified /anneal command + shared state

planned 2026-09

v0.3.0

Rubric customization per project

planned 2026-12

v1.0.0

ValidationForge round-trip + CLI-first

01 v0.1.0 — Three Variants Shipping Separately

Current release

released 2026-04-22

All three variants are structurally valid. Cast is E2E-verified in a live Claude Code session; Alloy and Temper have passed load verification.

Three installable plugins
anneal-cast, anneal-alloy, anneal-temper — each a complete Claude Code plugin with its own manifest, commands, skills, agents, hooks, and validation scripts.
Eight-invariant contract
Red Team Trinity always runs, Hephaestus validation always runs, dual output, skill enrichment, unbounded re-loop, parallelization by default, category routing, dual prompts by model family.
Greek-god agent roster
Metis, Prometheus (per-variant), Synthesizer (Alloy), Deepen-Loop (Temper), Momus, Red Team Trinity, Oracle, Hephaestus, Atlas.
Umbrella marketplace
anneal-umbrella-dev lists all three plugins; install any subset.
Opus 4.7 semantic-XML output schema
_shared/opus-47-xml-schema.md is the canonical format; validator ships as scripts/validate-xml.py.
Per-variant validation scripts
scripts/validate-plugin.py in each variant; runs at install time and on-demand.

Known issues at v0.1.0

Alloy's full E2E run is still in progress — load verification passed, real-plan emission pending.
Temper's convergence check is deterministic but its scoring rubric is calibrated against a small reference task set; scores on genuinely novel shapes may need adjustment.
The three plugins share a _shared/ tree but each vendors its own copy. Divergence risk is real; v0.2.0 addresses this.

02 v0.2.0 — Unified Command + Shared State

Stop typing three different commands

targeting 2026-06

v0.2.0 stops making users type /anneal-cast:anneal, /anneal-alloy:anneal, or /anneal-temper:anneal. One /anneal command fronts all three backends and auto-selects a variant.

Single /anneal command

command syntax (bash)

# Old (v0.1.0) — three separate command namespaces
/anneal-cast:anneal "task"
/anneal-alloy:anneal "task" --versions 5
/anneal-temper:anneal "task" --depth 3

# New (v0.2.0) — one command, three backends
/anneal "task"                       # auto-select variant from task shape
/anneal --cast "task"               # force Cast
/anneal --alloy "task" --versions 5  # force Alloy, N=5
/anneal --temper "task" --depth 3    # force Temper, depth cap 3

The auto-select heuristic reads the probe report and Metis's preliminary classification:

Task classification	Auto-selected variant
`bug-fix | scoped-refactor | documentation`	Cast
`new-feature | infra-change with clear spec`	Cast
`greenfield-architecture | novel-design`	Alloy
`complex-scoped | iteration-friendly`	Temper (default depth 3)
`unknown`	Surface the decision to the user
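The table above can be read as a plain lookup. The following is a minimal sketch of that reading, not Anneal's actual heuristic: the function name, the dictionary, and the return shape are all hypothetical, and the real selector also reads the probe report, which this sketch ignores.

```python
from typing import Optional

# Mapping copied from the table above. Names here are illustrative only.
VARIANT_BY_CLASSIFICATION = {
    "bug-fix": "cast",
    "scoped-refactor": "cast",
    "documentation": "cast",
    "new-feature": "cast",
    "infra-change": "cast",  # the table qualifies this row with "with clear spec"
    "greenfield-architecture": "alloy",
    "novel-design": "alloy",
    "complex-scoped": "temper",
    "iteration-friendly": "temper",
}

DEFAULT_TEMPER_DEPTH = 3  # "Temper (default depth 3)" in the table


def auto_select(classification: str) -> Optional[dict]:
    """Return the variant choice, or None when the user has to decide."""
    variant = VARIANT_BY_CLASSIFICATION.get(classification)
    if variant is None:
        # "unknown" (or any unrecognized label): surface the decision to the user.
        return None
    choice = {"variant": variant}
    if variant == "temper":
        choice["depth"] = DEFAULT_TEMPER_DEPTH
    return choice
```

Returning None for anything unrecognized keeps the "surface the decision to the user" row as the fallback, so a new classification label never silently routes to a variant.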

Shared .anneal/runs/ state directory

directory structure (tree)

.anneal/
├── runs/
│   ├── {variant}-{run_id}/         # one dir per run, all variants
│   └── ...
├── manifest.json               # index of all runs with metadata
└── comparisons/                # cross-variant comparisons
    └── {task-slug}/
        ├── cast-output/
        ├── alloy-output/
        └── temper-output/

anneal compare <task-slug> renders a side-by-side diff of the three outputs on the same task — the canonical tool for deciding which variant produces stronger plans in your project's context.
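For tooling that wants to locate these directories, the layout above maps onto a couple of path helpers. This is a sketch under the assumption that the tree is followed literally; the helper names are hypothetical and nothing here reads or writes manifest.json, whose schema this page does not specify.

```python
from pathlib import Path

VARIANTS = ("cast", "alloy", "temper")


def run_dir(root: Path, variant: str, run_id: str) -> Path:
    """Path of one run, following the {variant}-{run_id} naming in the tree."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    return root / "runs" / f"{variant}-{run_id}"


def comparison_dirs(root: Path, task_slug: str) -> dict:
    """Per-variant output directories for one cross-variant comparison."""
    base = root / "comparisons" / task_slug
    return {variant: base / f"{variant}-output" for variant in VARIANTS}
```

For example, `comparison_dirs(Path(".anneal"), "auth")["alloy"]` resolves to `.anneal/comparisons/auth/alloy-output`.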

03 v0.3.0 — Rubric Customization Per Project

Your project's plan-quality standards

targeting 2026-09

Different projects have different plan-quality standards. A fintech codebase needs security-weighted scoring; a greenfield startup needs minimalism-weighted scoring. v0.3.0 lets users override Momus's scoring rubric and Alloy's bias set per project.

.anneal/rubric.yaml — project-local override

.anneal/rubric.yaml (yaml)

# Project-local rubric — overrides global defaults

momus:
  anchors:
    safe_min: 85        # default: 85
    caution_min: 70     # default: 70
    risky_min: 50       # default: 50
  weights:
    correctness: 1.0
    evidence_plan: 1.5  # this project rewards strong evidence planning
    sequencing: 0.8     # this project is ok with parallel phases

alloy:
  bias_set:
    - correctness
    - minimalist
    - defensive
    - domain-security    # custom domain bias
    - domain-compliance  # custom domain bias

temper:
  depth_cap_default: 4
  convergence:
    variance_threshold: 0.25    # tighter than default 0.30
    delta_threshold: 0.10       # tighter than default 0.15
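One plausible reading of the momus.anchors block is that each *_min value is an inclusive lower bound on a 0 to 100 score. The sketch below assumes exactly that; the band labels are derived from the key names, the "below-risky" label is invented for illustration, and how the weights combine into a score is not specified on this page, so it is left out.

```python
# Defaults as stated in the comments of the rubric file above.
DEFAULT_ANCHORS = {"safe_min": 85, "caution_min": 70, "risky_min": 50}


def band(score: float, anchors: dict = DEFAULT_ANCHORS) -> str:
    """Classify a score, treating each anchor as an inclusive lower bound (assumption)."""
    if score >= anchors["safe_min"]:
        return "safe"
    if score >= anchors["caution_min"]:
        return "caution"
    if score >= anchors["risky_min"]:
        return "risky"
    return "below-risky"  # hypothetical label for scores under risky_min
```

Under these assumptions a project that raises safe_min only changes which scores count as "safe"; the other two boundaries are unaffected.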

Custom Alloy biases as skill files

Ship biases as skill-style files under ~/.claude/skills/anneal-biases/{bias-name}/SKILL.md. Anneal picks them up at tournament time. This enables domain-specific biases: security-first for a fintech codebase, compliance-first for healthcare, performance-first for game development.
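Discovery of bias files could look roughly like the following. This is a sketch that assumes one directory per bias, each containing a SKILL.md, as the path pattern above suggests; the function name is hypothetical and the contents of SKILL.md are not inspected.

```python
from pathlib import Path


def discover_biases(biases_root: Path) -> dict:
    """Map bias name to its SKILL.md path for every directory that contains one."""
    found = {}
    if not biases_root.is_dir():
        return found  # no custom biases installed
    for child in sorted(biases_root.iterdir()):
        skill = child / "SKILL.md"
        if child.is_dir() and skill.is_file():
            found[child.name] = skill
    return found
```

A caller would pass `Path.home() / ".claude" / "skills" / "anneal-biases"` as the root. Directories without a SKILL.md are skipped rather than treated as errors.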

Rubric composition: Multiple rubrics compose. A user-global rubric in ~/.anneal/rubric.yaml gets overridden by a project-local one; both get overridden by a CLI flag. Precedence: CLI flag > project-local > user-global > Anneal defaults.
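The precedence chain can be sketched as a layered merge. The source states only the ordering; merging nested keys one by one (rather than replacing whole sections) is an assumption, and both function names are hypothetical.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied key by key (nested dicts merged)."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve_rubric(defaults, user_global=None, project_local=None, cli_flags=None):
    """Apply layers from lowest to highest precedence:
    Anneal defaults < user-global < project-local < CLI flag."""
    resolved = deepcopy(defaults)
    for layer in (user_global, project_local, cli_flags):
        if layer:
            resolved = deep_merge(resolved, layer)
    return resolved
```

With this reading, a project-local file that sets only temper.convergence.variance_threshold leaves delta_threshold at whatever the lower layers supplied.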

04 v1.0.0 — ValidationForge Round-Trip + CLI-First

The feedback loop closes

targeting 2026-12

Anneal emits plans. ValidationForge runs them and reports back. v1.0.0 wires the round-trip: VF verdict folds into Anneal's next run as a new Metis directive, and the loop terminates when VF reports overall PASS or Anneal emits three consecutive ABORTs.

Plan-to-verdict round-trip

.anneal/runs/{run_id}/vf-verdict.yaml (yaml)

vf_verdict:
  run_id: <anneal-run-id>
  vf_run_id: <vf-run-id>
  plan_path: .anneal/runs/.../plan/
  phase_verdicts:
    phase-00-preflight:           PASS
    phase-01-database-schema:     PASS
    phase-02-sign-in-flow:        FAIL
    phase-03-magic-link-fallback: SKIPPED
    phase-04-session-persistence: SKIPPED
    phase-05-logout-invalidation: SKIPPED
    phase-06-functional-validation: BLOCKED_BY_UPSTREAM
  overall_verdict: FAIL
  first_failing_phase: phase-02-sign-in-flow
  failure_evidence:
    - e2e-evidence/auth/phase-02-oauth-callback.png
    - e2e-evidence/auth/phase-02-sign-in-response.json
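A consumer of this file might want to check that first_failing_phase agrees with phase_verdicts. The sketch below works on an already-parsed mapping (parsing the YAML is left out to stay within the standard library) and assumes phases are listed in execution order; the function name is hypothetical.

```python
from typing import Optional


def first_failing_phase(phase_verdicts: dict) -> Optional[str]:
    """Return the first phase whose verdict is FAIL, in listed order, or None."""
    for phase, verdict in phase_verdicts.items():
        if verdict == "FAIL":
            return phase
    return None


# The phase_verdicts mapping from the example verdict above.
example_verdicts = {
    "phase-00-preflight": "PASS",
    "phase-01-database-schema": "PASS",
    "phase-02-sign-in-flow": "FAIL",
    "phase-03-magic-link-fallback": "SKIPPED",
    "phase-04-session-persistence": "SKIPPED",
    "phase-05-logout-invalidation": "SKIPPED",
    "phase-06-functional-validation": "BLOCKED_BY_UPSTREAM",
}
```

Python dictionaries preserve insertion order, so iterating the mapping visits phases in the order they were listed.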

`anneal learn` — closing the loop

anneal learn --from-verdict <vf-verdict.yaml> replays the Anneal run with the phase-level failure folded in as a new Metis directive. The resulting re-run is constrained to avoid the failure mode.

The full circle: Anneal plans → VF validates → Anneal learns from VF's verdict → VF re-validates the new plan. The loop terminates when VF reports overall_verdict: PASS or Anneal emits three consecutive ABORT decisions.
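The termination rule can be written down as a small predicate. This is a sketch under an assumed encoding: the loop's history is a chronological list in which "PASS" and "FAIL" stand for VF's overall_verdict and "ABORT" stands for an Anneal ABORT decision. The function name and that encoding are hypothetical.

```python
MAX_CONSECUTIVE_ABORTS = 3  # "three consecutive ABORT decisions"


def loop_should_terminate(history: list) -> bool:
    """True when the latest outcome is PASS or the last three outcomes are all ABORT."""
    if not history:
        return False
    if history[-1] == "PASS":
        return True
    tail = history[-MAX_CONSECUTIVE_ABORTS:]
    return len(tail) == MAX_CONSECUTIVE_ABORTS and all(o == "ABORT" for o in tail)
```

Because only the trailing run is inspected, a FAIL between two ABORTs resets the count, which matches the word "consecutive" in the rule.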

CLI-first invocation

headless + CI invocation (bash)

# Non-interactive CI-friendly invocation
anneal cast "task" --output .anneal/runs/custom-id/
anneal alloy "task" --versions 5 --output .anneal/runs/custom-id/
anneal temper "task" --depth 3 --output .anneal/runs/custom-id/

# Full round-trip: plan → validate → learn if fail
anneal cast "task" | vf-validate --auto-learn

# Cost estimator (pre-flight, no run)
anneal estimate "task"
# → estimated 9 agents, ~$2.50–3.20, ~10-12 min

05 Post-v1.0 — The Long Arc

Directional arcs, soft dates

These are directional targets, not commitments. Priority is demand-driven.

Arc 1

Multi-language Hephaestus runners

Today, Anneal can plan for projects in any language, but Hephaestus is best tested against Node.js, Python, Go, and Rust. Post-v1.0 targets formal runners for JVM languages (Scala, Kotlin, and Java via Gradle or Maven), C and C++ (via CMake), mobile (iOS via simctl, Android via emulator snapshots), and data/ML (notebooks, dataset validation, model-artifact capture). Each language gets a hephaestus-runners/{lang}/ directory.

Arc 2

Cloud execution

Today, all Anneal runs are local, and a Temper depth-5 run on a large codebase can take 15+ minutes. anneal --cloud would dispatch the run to a sandboxed remote runner. Cached probe reports would share codebase state across runs, and opt-in team telemetry would let the auto-select heuristic learn from a team's shared patterns. This requires a hosted backend at anneal.withagents.dev. Local execution stays free forever.

Arc 3

Custom variant authoring

Ship a variant-authoring-kit: anneal variant new my-variant scaffolds stages 1–3 and 5–7 pre-wired, plus a conformance test suite that validates the eight invariants. Likely first user-authored variants: anneal-orchestra (3 planners across Claude/GPT/Gemini), anneal-foundry (fine-tune local model on codebase before planning), anneal-forge (Anneal inside a nested Anneal — plan the plan).

Arc 4

Integration with deepest-plan and plugin-dev

Anneal, deepest-plan, and plugin-dev have overlapping goals. Post-v1.0 targets a unified invocation where /deepest and /plugin-dev:create-plugin call Anneal under the hood. The short version: Anneal is the planning primitive. Other commands compose it for their domain.

06 Community Backlog

Open requests as of 2026-04-23

Priority is rough; exact order depends on implementation effort and requester volume. To influence order, open an issue at github.com/krzemienski/anneal with a concrete use case.

Feature	Priority	Notes
Streaming stage output	High	Currently stages emit after completion; users want live progress
Dry-run mode	High	Run the pipeline without emitting XML / plan files
Cost estimator pre-flight	High	anneal estimate 'task' — before committing to a run
Plan-as-issue export	Medium	anneal export --format github-issue plan/
Plan editing between stages	Medium	User edits plan_N before Red Team attacks it
Variant cross-pollination	Medium	Alloy variant-N as a Temper seed
Per-phase cost breakdown in rollup	Medium	Currently aggregate-only
Slack/Discord notifications on FAIL	Low	Plugin-authored webhooks

07 What Won’t Ship

Explicit non-goals

These are recurring asks that don't fit the design. The decisions are deliberate, not deferrals.

Code generation from the plan. Anneal stops at the plan. Execution is a separate concern — the emitted XML drives a fresh Claude Code session. Bundling execution into Anneal would bloat scope and fight with ValidationForge's territory.

Interactive plan editing UI. Markdown + YAML is the API. Users who want a UI should build one on top; the plan directory and rollup.yaml are stable contracts designed to be consumed.

Unit-test generation. The eight-invariant contract explicitly forbids mocks, stubs, and test files. Hephaestus validation is the intended quality signal. Anneal will never emit unit tests.

Model-specific optimization. Category routing (ultrabrain/deep/quick) means plans are model-neutral. Anneal will not ship a "works best on Claude 4.7" flag; if a model produces weaker plans, the fix is in the prompt, not Anneal's control plane.

A GUI installer. Claude Code's plugin marketplace is the install path. Anneal will not ship a standalone installer separate from the Claude Code plugin ecosystem.

08 How to Track Progress

Staying up to date

Channel	What it contains	Link
GitHub releases	Every tagged release with release notes since the prior tag	github.com/krzemienski/anneal/releases
CHANGELOG.md	Conventional commits, one entry per user-visible change	repo root
Blog	Per-release writeups; v0.1.0 post is post-19-anneal	withagents.dev
Issue tracker	Milestones map to the versions in this document	github.com/krzemienski/anneal/issues

"Anneal is the planning primitive. Other commands compose it for their domain."
Roadmap · Anneal v0.1.0
