Anneal · Docs

Roadmap

Three variants today. One unified runtime tomorrow. Cross-plugin integration the year after. This is the living roadmap for Anneal — what's landed, what's next, and what the long arc looks like. Dates are approximate targets; priorities shift based on real user signal.

shipped   2026-04-22   v0.1.0   Three variants shipping separately
planned   2026-06      v0.2.0   Unified /anneal command + shared state
planned   2026-09      v0.3.0   Rubric customization per project
planned   2026-12      v1.0.0   ValidationForge round-trip + CLI-first

Current release

released 2026-04-22

All three variants are structurally valid. Cast is E2E-verified in a live Claude Code session; Alloy and Temper have passed load verification.

  • Three installable plugins
    anneal-cast, anneal-alloy, anneal-temper — each a complete Claude Code plugin with its own manifest, commands, skills, agents, hooks, and validation scripts.
  • Eight-invariant contract
    Red Team Trinity always runs, Hephaestus validation always runs, dual output, skill enrichment, unbounded re-loop, parallelization by default, category routing, dual prompts by model family.
  • Greek-god agent roster
    Metis, Prometheus (per-variant), Synthesizer (Alloy), Deepen-Loop (Temper), Momus, Red-Team Trinity, Oracle, Hephaestus, Atlas.
  • Umbrella marketplace
    anneal-umbrella-dev lists all three plugins; install any subset.
  • Opus 4.7 semantic-XML output schema
    _shared/opus-47-xml-schema.md is the canonical format; validator ships as scripts/validate-xml.py.
  • Per-variant validation scripts
    scripts/validate-plugin.py in each variant; runs at install time and on-demand.

Known issues at v0.1.0

  • Alloy's full E2E run is still in progress — load verification passed, real-plan emission pending.
  • Temper's convergence check is deterministic but its scoring rubric is calibrated against a small reference task set; scores on genuinely novel shapes may need adjustment.
  • The three plugins share a _shared/ tree but each vendors its own copy. Divergence risk is real; v0.2.0 addresses this.

Stop typing three different commands

targeting 2026-06

Stop making users type /anneal-cast:anneal, /anneal-alloy:anneal, and /anneal-temper:anneal. v0.2.0 provides one command with three backends and automatic variant selection.

Single /anneal command

command syntax (bash)
# Old (v0.1.0) — three separate command namespaces
/anneal-cast:anneal "task"
/anneal-alloy:anneal "task" --versions 5
/anneal-temper:anneal "task" --depth 3

# New (v0.2.0) — one command, three backends
/anneal "task"                       # auto-select variant from task shape
/anneal --cast "task"                # force Cast
/anneal --alloy "task" --versions 5  # force Alloy, N=5
/anneal --temper "task" --depth 3    # force Temper, depth cap 3

The auto-select heuristic reads the probe report and Metis's preliminary classification:

Task classification                           Auto-selected variant
bug-fix | scoped-refactor | documentation     Cast
new-feature | infra-change with clear spec    Cast
greenfield-architecture | novel-design        Alloy
complex-scoped | iteration-friendly           Temper (default depth 3)
unknown                                       Surface the decision to the user
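
The table above can be read as a simple lookup. The following is a minimal sketch, assuming Metis's preliminary classification arrives as one of the label strings shown in the table; the names `VARIANT_BY_CLASSIFICATION`, `TEMPER_DEFAULT_DEPTH`, and `auto_select` are hypothetical and not part of any shipped Anneal API.

```python
from typing import Optional

# Transcribed from the roadmap table. All names here are hypothetical;
# v0.2.0 has not shipped and its real heuristic also reads the probe report.
VARIANT_BY_CLASSIFICATION = {
    "bug-fix": "cast",
    "scoped-refactor": "cast",
    "documentation": "cast",
    "new-feature": "cast",
    "infra-change": "cast",  # the table qualifies this row with "clear spec"
    "greenfield-architecture": "alloy",
    "novel-design": "alloy",
    "complex-scoped": "temper",
    "iteration-friendly": "temper",
}

TEMPER_DEFAULT_DEPTH = 3  # "Temper (default depth 3)" in the table


def auto_select(classification: str) -> Optional[dict]:
    """Return the variant choice, or None when the user should decide."""
    variant = VARIANT_BY_CLASSIFICATION.get(classification)
    if variant is None:
        # "unknown" (or any unrecognized label): surface the decision to the user.
        return None
    choice = {"variant": variant}
    if variant == "temper":
        choice["depth"] = TEMPER_DEFAULT_DEPTH
    return choice
```

Returning `None` rather than a fallback variant mirrors the last table row: an unclassifiable task is escalated to the user instead of being guessed.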

Shared .anneal/runs/ state directory

directory structure (tree)
.anneal/
├── runs/
│   ├── {variant}-{run_id}/         # one dir per run, all variants
│   └── ...
├── manifest.json               # index of all runs with metadata
└── comparisons/                # cross-variant comparisons
    └── {task-slug}/
        ├── cast-output/
        ├── alloy-output/
        └── temper-output/

anneal compare <task-slug> renders a side-by-side diff of the three outputs on the same task. It is the canonical tool for deciding which variant produces stronger plans in your project's context.
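
Because every run directory follows the {variant}-{run_id} naming shown in the tree, tooling can index runs without reading manifest.json. The sketch below assumes only that naming convention and that the variant name itself contains no hyphen; `group_runs_by_variant` is a hypothetical helper, and the manifest.json schema is deliberately not modeled because the roadmap does not specify it.

```python
from collections import defaultdict
from pathlib import Path


def group_runs_by_variant(runs_dir: Path) -> dict:
    """Group run directories named {variant}-{run_id} by their variant prefix."""
    if not runs_dir.is_dir():
        return {}
    grouped = defaultdict(list)
    for entry in sorted(runs_dir.iterdir()):
        if not entry.is_dir():
            continue
        # Split on the first hyphen only, so run ids may contain hyphens.
        variant, sep, run_id = entry.name.partition("-")
        if sep and run_id:
            grouped[variant].append(run_id)
    return dict(grouped)
```

For example, `group_runs_by_variant(Path(".anneal/runs"))` would return a mapping such as variant name to a list of run ids for whatever runs exist locally.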

Your project's plan-quality standards

targeting 2026-09

Different projects have different plan-quality standards. A fintech codebase needs security-weighted scoring; a greenfield startup needs minimalism-weighted scoring. v0.3.0 lets users override Momus's scoring rubric and Alloy's bias set per project.

.anneal/rubric.yaml — project-local override

.anneal/rubric.yaml (yaml)
# Project-local rubric — overrides global defaults

momus:
  anchors:
    safe_min: 85        # default: 85
    caution_min: 70     # default: 70
    risky_min: 50       # default: 50
  weights:
    correctness: 1.0
    evidence_plan: 1.5  # this project rewards strong evidence planning
    sequencing: 0.8     # this project is ok with parallel phases

alloy:
  bias_set:
    - correctness
    - minimalist
    - defensive
    - domain-security    # custom domain bias
    - domain-compliance  # custom domain bias

temper:
  depth_cap_default: 4
  convergence:
    variance_threshold: 0.25    # tighter than default 0.30
    delta_threshold: 0.10       # tighter than default 0.15
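
To make the momus section concrete, here is one way the anchors and weights could interact. This is a sketch under stated assumptions, not Momus's actual scoring: it assumes per-dimension scores on a 0-100 scale, a weighted mean as the combination rule, and band names derived from the anchor keys. The functions `weighted_score` and `band`, and the label "below-risky", are hypothetical.

```python
def weighted_score(scores: dict, weights: dict) -> float:
    """Weighted mean of per-dimension scores; unlisted dimensions weigh 1.0.

    Assumption: the roadmap does not say how weights combine, so a weighted
    mean is used here purely for illustration.
    """
    total_weight = sum(weights.get(dim, 1.0) for dim in scores)
    if total_weight == 0:
        raise ValueError("weights sum to zero")
    weighted_sum = sum(score * weights.get(dim, 1.0) for dim, score in scores.items())
    return weighted_sum / total_weight


def band(score: float, anchors: dict) -> str:
    """Map a score to a band using the safe_min / caution_min / risky_min anchors."""
    if score >= anchors["safe_min"]:
        return "safe"
    if score >= anchors["caution_min"]:
        return "caution"
    if score >= anchors["risky_min"]:
        return "risky"
    return "below-risky"  # hypothetical label; the roadmap names no band under risky_min
```

With the weights from the example file, a plan scoring 90 on correctness, 60 on evidence_plan, and 80 on sequencing averages lower than it would unweighted, because the weak evidence plan counts 1.5 times.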

Custom Alloy biases as skill files

Ship biases as SKILL-like files under ~/.claude/skills/anneal-biases/{bias-name}/SKILL.md. Anneal picks them up at tournament time. This enables domain-specific biases: security-first for a fintech codebase, compliance-first for healthcare, performance-first for gamedev.
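
Discovery of those files could be as simple as a directory glob. The sketch below assumes only the {bias-name}/SKILL.md layout given above; `discover_biases` is a hypothetical helper, and how Anneal actually loads biases at tournament time is not specified in this roadmap.

```python
from pathlib import Path


def discover_biases(skills_root: Path) -> dict:
    """Map each bias name to its SKILL.md path under skills_root.

    A directory without a SKILL.md file is ignored.
    """
    return {
        skill_file.parent.name: skill_file
        for skill_file in sorted(skills_root.glob("*/SKILL.md"))
    }
```

A caller would typically pass `Path.home() / ".claude" / "skills" / "anneal-biases"` as `skills_root`.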

Rubric composition: Multiple rubrics compose. A user-global rubric in ~/.anneal/rubric.yaml gets overridden by a project-local one; both get overridden by a CLI flag. Precedence: CLI flag > project-local > user-global > Anneal defaults.
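
The precedence chain can be sketched as a layered merge applied from lowest to highest priority. This assumes a key-by-key deep merge in which lists such as bias_set are replaced wholesale; the roadmap states the precedence order but not the merge granularity, so `deep_merge` and `resolve_rubric` are illustrative names only.

```python
from typing import Optional


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override wins key by key; nested dicts merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value  # scalars and lists are replaced, not combined
    return merged


def resolve_rubric(
    defaults: dict,
    user_global: Optional[dict] = None,
    project_local: Optional[dict] = None,
    cli_flags: Optional[dict] = None,
) -> dict:
    """Apply layers so that CLI flag > project-local > user-global > defaults."""
    resolved = defaults
    for layer in (user_global, project_local, cli_flags):
        if layer:
            resolved = deep_merge(resolved, layer)
    return resolved
```

Applying the layers in ascending priority means each later layer only needs to state the keys it changes, which matches the partial override shown in the example rubric file.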

The feedback loop closes

targeting 2026-12

Anneal emits plans. ValidationForge runs them and reports back. v1.0.0 wires the round-trip: VF verdict folds into Anneal's next run as a new Metis directive, and the loop terminates when VF reports overall PASS or Anneal emits three consecutive ABORTs.

Plan-to-verdict round-trip

.anneal/runs/{run_id}/vf-verdict.yaml (yaml)
vf_verdict:
  run_id: <anneal-run-id>
  vf_run_id: <vf-run-id>
  plan_path: .anneal/runs/.../plan/
  phase_verdicts:
    phase-00-preflight:           PASS
    phase-01-database-schema:     PASS
    phase-02-sign-in-flow:        FAIL
    phase-03-magic-link-fallback: SKIPPED
    phase-04-session-persistence: SKIPPED
    phase-05-logout-invalidation: SKIPPED
    phase-06-functional-validation: BLOCKED_BY_UPSTREAM
  overall_verdict: FAIL
  first_failing_phase: phase-02-sign-in-flow
  failure_evidence:
    - e2e-evidence/auth/phase-02-oauth-callback.png
    - e2e-evidence/auth/phase-02-sign-in-response.json
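
A consumer of this file may want to confirm that the summary fields agree with the per-phase results before acting on them. The sketch below assumes the verdict has already been parsed into a dict with the keys shown above and that phase order is preserved; `first_failing_phase` and `consistency_problems` are hypothetical helpers, not part of ValidationForge or Anneal.

```python
from typing import Optional


def first_failing_phase(phase_verdicts: dict) -> Optional[str]:
    """Return the first phase (in file order) whose verdict is FAIL, if any."""
    for phase, verdict in phase_verdicts.items():
        if verdict == "FAIL":
            return phase
    return None


def consistency_problems(vf_verdict: dict) -> list:
    """List mismatches between the summary fields and phase_verdicts."""
    problems = []
    derived = first_failing_phase(vf_verdict["phase_verdicts"])
    if derived != vf_verdict.get("first_failing_phase"):
        problems.append(
            f"first_failing_phase is {vf_verdict.get('first_failing_phase')!r}, "
            f"but phase_verdicts implies {derived!r}"
        )
    if derived is not None and vf_verdict.get("overall_verdict") == "PASS":
        problems.append("overall_verdict is PASS although a phase has FAIL")
    return problems
```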

anneal learn: closing the loop

anneal learn --from-verdict <vf-verdict.yaml> replays the Anneal run with the phase-level failure folded in as a new Metis directive. The resulting re-run is constrained to avoid the failure mode.

The full circle: Anneal plans → VF validates → Anneal learns from VF's verdict → VF re-validates the new plan. The loop terminates when VF reports overall_verdict: PASS or Anneal emits three consecutive ABORT decisions.
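
The termination rule can be expressed as a small control loop. This is a sketch only: `plan_once` and `validate` are hypothetical callables standing in for an Anneal run and a ValidationForge run, and the decision strings other than ABORT and PASS are assumptions.

```python
from typing import Callable, Optional, Tuple


def run_round_trip(
    plan_once: Callable[[Optional[dict]], Tuple[str, object]],
    validate: Callable[[object], dict],
    max_consecutive_aborts: int = 3,
) -> Tuple[str, int]:
    """Loop plan -> validate -> learn until VF passes or Anneal aborts repeatedly.

    Returns (outcome, rounds), where outcome is "PASS" or "ABORTED".
    """
    directive = None  # the previous VF verdict, folded in as a Metis directive
    consecutive_aborts = 0
    rounds = 0
    while True:
        rounds += 1
        decision, plan = plan_once(directive)
        if decision == "ABORT":
            consecutive_aborts += 1
            if consecutive_aborts >= max_consecutive_aborts:
                return ("ABORTED", rounds)
            continue
        consecutive_aborts = 0  # only consecutive ABORTs count toward the limit
        verdict = validate(plan)
        if verdict["overall_verdict"] == "PASS":
            return ("PASS", rounds)
        directive = verdict  # constrain the next run to avoid this failure mode
```

Resetting the abort counter after any emitted plan is what makes the limit apply to consecutive ABORT decisions rather than to the total across the loop.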

CLI-first invocation

headless + CI invocation (bash)
# Non-interactive CI-friendly invocation
anneal cast "task" --output .anneal/runs/custom-id/
anneal alloy "task" --versions 5 --output .anneal/runs/custom-id/
anneal temper "task" --depth 3 --output .anneal/runs/custom-id/

# Full round-trip: plan → validate → learn if fail
anneal cast "task" | vf-validate --auto-learn

# Cost estimator (pre-flight, no run)
anneal estimate "task"
# → estimated 9 agents, ~$2.50–3.20, ~10-12 min

Directional arcs, soft dates

These are directional targets, not commitments. Priority is demand-driven.

Arc 1
Multi-language Hephaestus runners
Today Anneal plans target projects in any language, but Hephaestus is best-tested against Node.js/Python/Go/Rust. Post-v1.0 targets formal runners for JVM languages (Scala/Kotlin/Java via Gradle/Maven), C/C++/CMake, mobile (iOS via simctl, Android via emulator snapshots), and Data/ML (notebooks, dataset validation, model-artifact capture). Each language gets a hephaestus-runners/{lang}/ directory.
Arc 2
Cloud execution
Today all Anneal runs are local. A Temper depth-5 run on a large codebase can take 15+ minutes. anneal --cloud dispatches to a sandboxed remote runner. Cached probe reports share codebase state across runs. Team runs share telemetry (opt-in) so the auto-select heuristic learns from a team's shared patterns. Requires a hosted backend at anneal.withagents.dev. Local execution stays free-forever.
Arc 3
Custom variant authoring
Ship a variant-authoring-kit: anneal variant new my-variant scaffolds stages 1–3 and 5–7 pre-wired, plus a conformance test suite that validates the eight invariants. Likely first user-authored variants: anneal-orchestra (3 planners across Claude/GPT/Gemini), anneal-foundry (fine-tune local model on codebase before planning), anneal-forge (Anneal inside a nested Anneal — plan the plan).
Arc 4
Integration with deepest-plan and plugin-dev
Anneal, deepest-plan, and plugin-dev have overlapping goals. Post-v1.0 targets a unified invocation where /deepest and /plugin-dev:create-plugin call Anneal under the hood. The short version: Anneal is the planning primitive. Other commands compose it for their domain.

Open requests as of 2026-04-23

Priority is rough; exact order depends on implementation effort and requester volume. To influence order, open an issue at github.com/krzemienski/anneal with a concrete use case.

Feature                               Priority   Notes
Streaming stage output                High       Currently stages emit after completion; users want live progress
Dry-run mode                          High       Run the pipeline without emitting XML / plan files
Cost estimator pre-flight             High       anneal estimate 'task' (before committing to a run)
Plan-as-issue export                  Medium     anneal export --format github-issue plan/
Plan editing between stages           Medium     User edits plan_N before Red Team attacks it
Variant cross-pollination             Medium     Alloy variant-N as a Temper seed
Per-phase cost breakdown in rollup    Medium     Currently aggregate-only
Slack/Discord notifications on FAIL   Low        Plugin-authored webhooks

Explicit non-goals

These are recurring asks that don't fit the design. The decisions are deliberate, not deferrals.

Code generation from the plan. Anneal stops at the plan. Execution is a separate concern — the emitted XML drives a fresh Claude Code session. Bundling execution into Anneal would bloat scope and fight with ValidationForge's territory.
Interactive plan editing UI. Markdown + YAML is the API. Users who want a UI should build one on top; the plan directory and rollup.yaml are stable contracts designed to be consumed.
Unit-test generation. The eight-invariant contract explicitly forbids mocks, stubs, and test files. Hephaestus validation is the intended quality signal. Anneal will never emit unit tests.
Model-specific optimization. Category routing (ultrabrain/deep/quick) means plans are model-neutral. Anneal will not ship a "works best on Claude 4.7" flag; if a model produces weaker plans, the fix is in the prompt, not Anneal's control plane.
A GUI installer. Claude Code's plugin marketplace is the install path. Anneal will not ship a standalone installer separate from the Claude Code plugin ecosystem.

Staying up to date

Channel           What it contains                                              Link
GitHub releases   Every tagged release with release notes since the prior tag   github.com/krzemienski/anneal/releases
CHANGELOG.md      Conventional commits, one entry per user-visible change       repo root
Blog              Per-release writeups; v0.1.0 post is post-19-anneal           withagents.dev
Issue tracker     Milestones map to the versions in this document               github.com/krzemienski/anneal/issues