Roadmap
Three variants today. One unified runtime tomorrow. Cross-plugin integration the year after. This is the living roadmap for Anneal — what's landed, what's next, and what the long arc looks like. Dates are approximate targets; priorities shift based on real user signal.
Current release
Released 2026-04-22. All three variants are structurally valid; Cast is E2E-verified in a live Claude Code session, and Alloy and Temper passed load verification.
- Three installable plugins: anneal-cast, anneal-alloy, anneal-temper, each a complete Claude Code plugin with its own manifest, commands, skills, agents, hooks, and validation scripts.
- Eight-invariant contract: Red Team Trinity always runs, Hephaestus validation always runs, dual output, skill enrichment, unbounded re-loop, parallelization by default, category routing, dual prompts by model family.
- Greek-god agent roster: Metis, Prometheus (per-variant), Synthesizer (Alloy), Deepen-Loop (Temper), Momus, Red-Team Trinity, Oracle, Hephaestus, Atlas.
- Umbrella marketplace: anneal-umbrella-dev lists all three plugins; install any subset.
- Opus 4.7 semantic-XML output schema: _shared/opus-47-xml-schema.md is the canonical format; the validator ships as scripts/validate-xml.py.
- Per-variant validation scripts: scripts/validate-plugin.py in each variant; runs at install time and on demand.
Known issues at v0.1.0
- Alloy's full E2E run is still in progress — load verification passed, real-plan emission pending.
- Temper's convergence check is deterministic but its scoring rubric is calibrated against a small reference task set; scores on genuinely novel shapes may need adjustment.
- The three plugins share a _shared/ tree but each vendors its own copy. Divergence risk is real; v0.2.0 addresses this.
Stop typing three different commands
Targeting 2026-06. Stop making users type anneal-cast:anneal, anneal-alloy:anneal, or anneal-temper:anneal. One command, three backends, auto-select.
Single /anneal command
    # Old (v0.1.0): three separate command namespaces
    /anneal-cast:anneal "task"
    /anneal-alloy:anneal "task" --versions 5
    /anneal-temper:anneal "task" --depth 3

    # New (v0.2.0): one command, three backends
    /anneal "task"                        # auto-select variant from task shape
    /anneal --cast "task"                 # force Cast
    /anneal --alloy "task" --versions 5   # force Alloy, N=5
    /anneal --temper "task" --depth 3     # force Temper, depth cap 3
The auto-select heuristic reads the probe report and Metis's preliminary classification:
| Task classification | Auto-selected variant |
|---|---|
| bug-fix, scoped-refactor, documentation | Cast |
| new-feature, infra-change with clear spec | Cast |
| greenfield-architecture, novel-design | Alloy |
| complex-scoped, iteration-friendly | Temper (default depth 3) |
| unknown | Surface the decision to the user |
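The table above can be read as a simple lookup. The following is a minimal sketch of that reading, not Anneal's actual implementation: the names `AUTO_SELECT`, `TEMPER_DEFAULT_DEPTH`, and `select_variant` are hypothetical, and the "with clear spec" qualifier from the table is noted in a comment rather than checked.

```python
from typing import Optional

# Mapping copied from the auto-select table; all names here are hypothetical.
AUTO_SELECT = {
    "bug-fix": "cast",
    "scoped-refactor": "cast",
    "documentation": "cast",
    "new-feature": "cast",
    "infra-change": "cast",  # the table qualifies this row as "with clear spec"
    "greenfield-architecture": "alloy",
    "novel-design": "alloy",
    "complex-scoped": "temper",
    "iteration-friendly": "temper",
}

TEMPER_DEFAULT_DEPTH = 3  # "Temper (default depth 3)" in the table


def select_variant(classification: str) -> Optional[dict]:
    """Return the auto-selected variant, or None when the user should decide."""
    variant = AUTO_SELECT.get(classification)
    if variant is None:
        # "unknown" (or anything unrecognized): surface the decision to the user.
        return None
    choice = {"variant": variant}
    if variant == "temper":
        choice["depth"] = TEMPER_DEFAULT_DEPTH
    return choice
```

Returning `None` instead of guessing keeps the "surface the decision to the user" row explicit at the call site.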
Shared .anneal/runs/ state directory
    .anneal/
    ├── runs/
    │   ├── {variant}-{run_id}/    # one dir per run, all variants
    │   └── ...
    ├── manifest.json              # index of all runs with metadata
    └── comparisons/               # cross-variant comparisons
        └── {task-slug}/
            ├── cast-output/
            ├── alloy-output/
            └── temper-output/
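As an illustration of how a run might be registered in this layout, here is a small sketch. The manifest shape (`{"runs": [...]}`), the field names, and the function `register_run` are assumptions made for the example; the roadmap only says that manifest.json indexes all runs with metadata.

```python
import json
from pathlib import Path


def register_run(root: Path, variant: str, run_id: str, task: str) -> Path:
    """Create .anneal/runs/{variant}-{run_id}/ and index it in manifest.json."""
    run_dir = root / ".anneal" / "runs" / f"{variant}-{run_id}"
    run_dir.mkdir(parents=True, exist_ok=True)

    manifest_path = root / ".anneal" / "manifest.json"
    # Assumed manifest shape: {"runs": [{...}, ...]}
    manifest = {"runs": []}
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text())

    manifest["runs"].append(
        {
            "variant": variant,
            "run_id": run_id,
            "task": task,
            "path": run_dir.relative_to(root).as_posix(),
        }
    )
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return run_dir
```

Because every variant writes into the same `runs/` directory and the same manifest, a comparison tool only needs to read one index to find all outputs for a task.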
anneal compare <task-slug> renders a side-by-side diff of the three outputs on the same task: the canonical tool for deciding which variant produces stronger plans in your project's context.
Your project's plan-quality standards
Targeting 2026-09. Different projects have different plan-quality standards. A fintech codebase needs security-weighted scoring; a greenfield startup needs minimalism-weighted scoring. v0.3.0 lets users override Momus's scoring rubric and Alloy's bias set per project.
.anneal/rubric.yaml — project-local override
    # Project-local rubric: overrides global defaults
    momus:
      anchors:
        safe_min: 85        # default: 85
        caution_min: 70     # default: 70
        risky_min: 50       # default: 50
      weights:
        correctness: 1.0
        evidence_plan: 1.5  # this project rewards strong evidence planning
        sequencing: 0.8     # this project is ok with parallel phases
    alloy:
      bias_set:
        - correctness
        - minimalist
        - defensive
        - domain-security    # custom domain bias
        - domain-compliance  # custom domain bias
    temper:
      depth_cap_default: 4
      convergence:
        variance_threshold: 0.25  # tighter than default 0.30
        delta_threshold: 0.10     # tighter than default 0.15
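Rubric overrides are layered: Anneal defaults, then a user-global file, then the project-local file, then CLI flags, with later layers winning. One plausible way to resolve that is a recursive merge, sketched below. The function names are hypothetical, the default thresholds come from the comments in the example rubric, and treating lists such as `bias_set` as replaced wholesale (rather than appended) is an assumption.

```python
from copy import deepcopy
from typing import Optional


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied; nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale (an assumption for lists).
            merged[key] = deepcopy(value)
    return merged


def resolve_rubric(
    defaults: dict,
    user_global: Optional[dict] = None,
    project_local: Optional[dict] = None,
    cli_flags: Optional[dict] = None,
) -> dict:
    """Apply layers from lowest to highest precedence."""
    resolved = deepcopy(defaults)
    for layer in (user_global, project_local, cli_flags):
        if layer:
            resolved = deep_merge(resolved, layer)
    return resolved


# Defaults taken from the comments in the example rubric.
DEFAULTS = {
    "temper": {"convergence": {"variance_threshold": 0.30, "delta_threshold": 0.15}}
}
PROJECT = {
    "temper": {
        "depth_cap_default": 4,
        "convergence": {"variance_threshold": 0.25, "delta_threshold": 0.10},
    }
}
```

A CLI layer such as `{"temper": {"convergence": {"delta_threshold": 0.05}}}` would then change only that one threshold and leave the project's other settings in place.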
Custom Alloy biases as skill files
Ship biases as SKILL-like files under ~/.claude/skills/anneal-biases/{bias-name}/SKILL.md. Anneal picks them up at tournament time. Enables domain-specific biases: security-first for a fintech codebase, compliance-first for healthcare, performance-first for gamedev.
~/.anneal/rubric.yaml gets overridden by a project-local one; both get overridden by a CLI flag. Precedence: CLI flag > project-local > user-global > Anneal defaults.
The feedback loop closes
Targeting 2026-12. Anneal emits plans. ValidationForge runs them and reports back. v1.0.0 wires the round-trip: VF verdict folds into Anneal's next run as a new Metis directive, and the loop terminates when VF reports overall PASS or Anneal emits three consecutive ABORTs.
Plan-to-verdict round-trip
    vf_verdict:
      run_id: <anneal-run-id>
      vf_run_id: <vf-run-id>
      plan_path: .anneal/runs/.../plan/
      phase_verdicts:
        phase-00-preflight: PASS
        phase-01-database-schema: PASS
        phase-02-sign-in-flow: FAIL
        phase-03-magic-link-fallback: SKIPPED
        phase-04-session-persistence: SKIPPED
        phase-05-logout-invalidation: SKIPPED
        phase-06-functional-validation: BLOCKED_BY_UPSTREAM
      overall_verdict: FAIL
      first_failing_phase: phase-02-sign-in-flow
      failure_evidence:
        - e2e-evidence/auth/phase-02-oauth-callback.png
        - e2e-evidence/auth/phase-02-sign-in-response.json
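To make the relationship between the per-phase results and the summary fields concrete, here is a small sketch that derives `overall_verdict` and `first_failing_phase` from `phase_verdicts`. It assumes the verdict has already been parsed into a dict, that phases are listed in execution order, and that the overall result is PASS only when no phase reports FAIL; ValidationForge's real aggregation rule is not specified in this roadmap.

```python
from typing import Optional

# Phase results copied from the example verdict above.
PHASES = {
    "phase-00-preflight": "PASS",
    "phase-01-database-schema": "PASS",
    "phase-02-sign-in-flow": "FAIL",
    "phase-03-magic-link-fallback": "SKIPPED",
    "phase-04-session-persistence": "SKIPPED",
    "phase-05-logout-invalidation": "SKIPPED",
    "phase-06-functional-validation": "BLOCKED_BY_UPSTREAM",
}


def summarize(phase_verdicts: dict) -> dict:
    """Derive the summary fields from per-phase results (hypothetical helper)."""
    # Dicts preserve insertion order, so the first FAIL is the earliest failing phase.
    first_fail: Optional[str] = next(
        (name for name, verdict in phase_verdicts.items() if verdict == "FAIL"),
        None,
    )
    return {
        "overall_verdict": "FAIL" if first_fail else "PASS",
        "first_failing_phase": first_fail,
    }
```

The first failing phase is what a follow-up planning run would most plausibly target, since later phases are skipped or blocked once an upstream phase fails.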
anneal learn — closing the loop
anneal learn --from-verdict <vf-verdict.yaml> replays the Anneal run with the phase-level failure folded in as a new Metis directive. The resulting re-run is constrained to avoid the failure mode.
The full circle: Anneal plans → VF validates → Anneal learns from VF's verdict → VF re-validates the new plan. The loop terminates when VF reports overall_verdict: PASS or Anneal emits three consecutive ABORT decisions.
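The termination rule can be sketched as a loop with two exits. Everything named here is hypothetical: `plan_once` stands in for an Anneal run (returning `None` to model an ABORT), `validate` stands in for ValidationForge, and `max_rounds` is a safety cap added for the sketch rather than part of the described design.

```python
from typing import Any, Callable, Optional

MAX_CONSECUTIVE_ABORTS = 3  # from the termination rule described above


def run_round_trip(
    plan_once: Callable[[Optional[dict]], Any],
    validate: Callable[[Any], dict],
    max_rounds: int = 20,
) -> str:
    """Plan, validate, and re-plan until PASS or three consecutive ABORTs."""
    consecutive_aborts = 0
    directive: Optional[dict] = None  # the last failing verdict, fed back to planning

    for _ in range(max_rounds):
        plan = plan_once(directive)
        if plan is None:  # the planner aborted this round
            consecutive_aborts += 1
            if consecutive_aborts >= MAX_CONSECUTIVE_ABORTS:
                return "ABORTED"
            continue

        consecutive_aborts = 0  # any emitted plan resets the streak
        verdict = validate(plan)
        if verdict["overall_verdict"] == "PASS":
            return "PASS"
        directive = verdict  # fold the failure into the next run

    return "ROUND_LIMIT"
```

Resetting the abort counter whenever a plan is emitted is what makes the aborts "consecutive" rather than cumulative.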
CLI-first invocation
    # Non-interactive CI-friendly invocation
    anneal cast "task" --output .anneal/runs/custom-id/
    anneal alloy "task" --versions 5 --output .anneal/runs/custom-id/
    anneal temper "task" --depth 3 --output .anneal/runs/custom-id/

    # Full round-trip: plan → validate → learn if fail
    anneal cast "task" | vf-validate --auto-learn

    # Cost estimator (pre-flight, no run)
    anneal estimate "task"
    # → estimated 9 agents, ~$2.50–3.20, ~10-12 min
Directional arcs, soft dates
These are directional targets, not commitments. Priority is demand-driven.
hephaestus-runners/{lang}/ directory.
anneal --cloud dispatches to a sandboxed remote runner. Cached probe reports share codebase state across runs. Team runs share telemetry (opt-in) so the auto-select heuristic learns from a team's shared patterns. Requires a hosted backend at anneal.withagents.dev. Local execution stays free-forever.
variant-authoring-kit: anneal variant new my-variant scaffolds stages 1–3 and 5–7 pre-wired, plus a conformance test suite that validates the eight invariants. Likely first user-authored variants: anneal-orchestra (3 planners across Claude/GPT/Gemini), anneal-foundry (fine-tune local model on codebase before planning), anneal-forge (Anneal inside a nested Anneal: plan the plan).
deepest-plan, and plugin-dev have overlapping goals. Post-v1.0 targets a unified invocation where /deepest and /plugin-dev:create-plugin call Anneal under the hood. The short version: Anneal is the planning primitive. Other commands compose it for their domain.
Open requests as of 2026-04-23
Priority is rough; exact order depends on implementation effort and requester volume. To influence order, open an issue at github.com/krzemienski/anneal with a concrete use case.
| Feature | Priority | Notes |
|---|---|---|
| Streaming stage output | High | Currently stages emit after completion; users want live progress |
| Dry-run mode | High | Run the pipeline without emitting XML / plan files |
| Cost estimator pre-flight | High | anneal estimate 'task' — before committing to a run |
| Plan-as-issue export | Medium | anneal export --format github-issue plan/ |
| Plan editing between stages | Medium | User edits plan_N before Red Team attacks it |
| Variant cross-pollination | Medium | Alloy variant-N as a Temper seed |
| Per-phase cost breakdown in rollup | Medium | Currently aggregate-only |
| Slack/Discord notifications on FAIL | Low | Plugin-authored webhooks |
Explicit non-goals
These are recurring asks that don't fit the design. The decisions are deliberate, not deferrals.
Staying up to date
| Channel | What it contains | Link |
|---|---|---|
| GitHub releases | Every tagged release with release notes since the prior tag | github.com/krzemienski/anneal/releases |
| CHANGELOG.md | Conventional commits, one entry per user-visible change | repo root |
| Blog | Per-release writeups; v0.1.0 post is post-19-anneal | withagents.dev |
| Issue tracker | Milestones map to the versions in this document | github.com/krzemienski/anneal/issues |
"Anneal is the planning primitive. Other commands compose it for their domain."
Roadmap · Anneal v0.1.0