HealthFlow
HealthFlow is a system that helps AI agents learn and improve how they complete tasks, particularly those involving complex data like medical records. It works by repeatedly planning, executing, evaluating, and reflecting on each attempt, using past experiences to guide future actions. This addresses the challenge of getting AI to reliably handle intricate processes where mistakes can have serious consequences. Researchers and developers working on AI applications in fields like healthcare would find it valuable, as it provides a structured way to build more robust and adaptable systems. What sets HealthFlow apart is its focus on detailed analysis of both successful and unsuccessful attempts, allowing the AI to learn from its errors and refine its approach over time.
# HealthFlow: A Self-Evolving MERF Runtime for CodeAct Analysis

[arXiv](https://arxiv.org/abs/2508.02621) [Project page](https://healthflow-agent.netlify.app)

HealthFlow is a research framework for **self-evolving task execution with a four-stage Meta -> Executor -> Evaluator -> Reflector loop**. The core runtime is organized around planning, CodeAct-style execution, structured evaluation, per-task runtime artifacts, and long-term reflective memory. Dataset preparation and benchmark evaluation workflows can still live in the repository under `data/`, but they are intentionally decoupled from the `healthflow/` runtime package.

- structured `Meta` planning with EHR-adaptive memory retrieval
- `Executor` as a CodeAct runtime over external executor backends
- `Evaluator`-driven retry and failure diagnosis
- `Reflector` writeback from both successful and failed trajectories
- inspectable workspace artifacts and run telemetry

HealthFlow compares external coding agents through a shared executor abstraction. The maintained built-in backends are `claude_code`, `codex`, `opencode`, and `pi`, with `opencode` as the default.

The current release surface is intentionally **backend and CLI only**. A frontend is not shipped in this repo at this stage.

## Core Runtime

HealthFlow runs a lean **Meta -> Executor -> Evaluator -> Reflector** loop.

1. **Meta**: retrieve relevant safeguard, workflow, dataset, and execution memories, then emit a structured execution plan.
2. **Executor**: interpret the plan as a CodeAct brief and act through code, commands, and workspace artifacts using whatever tools are already configured in the outer executor.
3. **Evaluator**: review the execution trace and produced artifacts, classify the outcome as `success`, `needs_retry`, or `failed`, and provide repair instructions for the next attempt.
4. **Reflector**: synthesize reusable safeguard, workflow, dataset, or execution memories from the full trajectory after the task session ends.
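The four stages can be sketched as a minimal control loop. This is a hypothetical simplification for orientation only; the names `Verdict` and `run_task` are illustrative, and the real loop in `healthflow/system.py` is more involved:

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    status: str          # "success" | "needs_retry" | "failed"
    repair_hint: str = ""  # repair instructions for the next attempt

def run_task(task, meta, executor, evaluator, reflector, max_attempts=3):
    """Sketch of the Meta -> Executor -> Evaluator -> Reflector loop."""
    trajectory = []
    verdict = Verdict("failed")
    for attempt in range(1, max_attempts + 1):
        plan = meta(task, trajectory)       # plan conditioned on retrieved memories
        trace = executor(plan)              # CodeAct-style execution via a backend
        verdict = evaluator(trace)          # structured outcome, not a scalar score
        trajectory.append((plan, trace, verdict))
        if verdict.status == "success":
            break
    reflector(trajectory)                   # write back reusable memories
    return verdict
```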
The task-level self-correction budget is controlled by `system.max_attempts`, which counts total full attempts through the loop rather than "retries plus one".

## What HealthFlow Contributes

- **MERF core runtime**: the framework definition is the four-stage Meta, Executor, Evaluator, Reflector loop rather than an outer benchmark-evaluation pipeline.
- **Lean execution contract**: HealthFlow defines workspace rules, execution-environment defaults, and workflow recommendations without becoming a tool-hosting framework.
- **Inspectable memory**: safeguard, workflow, dataset, and execution memories are stored in JSONL, routed through adaptive retrieval lanes, and exposed through a saved retrieval audit.
- **Evaluator-centered recovery**: retries are driven by structured failure diagnosis and repair instructions instead of a single scalar score alone.
- **Reproducibility contract**: every task workspace writes structured runtime artifacts instead of only human-readable logs.
- **Executor telemetry**: run artifacts capture executor metadata, backend versions when available, LLM usage, executor usage, and stage-level estimated cost summaries.
- **Role-specific runtime models**: planner, evaluator, reflector, and executor can be configured against different model entries to reduce single-model coupling.

## Workspace Artifacts

Runtime state lives under `workspace/` by default:

- task artifacts: `workspace/tasks/<task_id>/`
- long-term memory: `workspace/memory/experience.jsonl`

Dataset preparation and benchmark evaluation assets remain under `data/`; they are outside the `healthflow/` package boundary.
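The documented layout can be navigated with a small path helper. This is a hypothetical convenience function mirroring the paths listed above, not part of the HealthFlow API:

```python
from pathlib import Path

def task_paths(workspace: str, task_id: str) -> dict:
    """Resolve the documented per-task and shared-memory locations."""
    root = Path(workspace) / "tasks" / task_id
    return {
        "sandbox": root / "sandbox",                              # deliverables
        "events": root / "runtime" / "events.jsonl",              # event stream
        "summary": root / "runtime" / "run" / "summary.json",     # run summary
        "memory": Path(workspace) / "memory" / "experience.jsonl" # long-term memory
    }
```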
Each task creates a workspace under `workspace/tasks/<task_id>/` and writes:

- `sandbox/` - executor-visible inputs and produced deliverables only
- `runtime/index.json`
- `runtime/events.jsonl`
- `runtime/run/summary.json`
- `runtime/run/trajectory.json`
- `runtime/run/costs.json`
- `runtime/run/final_evaluation.json`
- `runtime/attempts/attempt_*/`
  - `planner/`: input messages, raw output, parsed output, call metadata, repair trace, plan markdown
  - `executor/`: prompt, command, stdout, stderr, combined log, telemetry, usage, artifact index
  - `evaluator/`: input messages, raw output, parsed output, call metadata, repair trace

When `healthflow run ... --report` is enabled, the same workspace also writes:

- `runtime/report.md`

These files are the main source of truth for rebuttal-oriented inspection. `runtime/report.md` is a standard HealthFlow-generated markdown report that summarizes the run, links sandbox deliverables with relative paths, embeds a small number of images inline, and keeps runtime JSON/log files in a separate audit section.

## Runtime Boundary

- **Core runtime**: the MERF loop in `healthflow/system.py`.
- **Domain specialization**: EHR-specific helpers under `healthflow/ehr/`.
- **Dataset prep and benchmark evaluation**: repository-level workflows under `data/`, intentionally decoupled from `healthflow/`.

The framework package is focused on taking a task, executing it, improving task success rate across attempts, and writing inspectable artifacts and reports for each task run.
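Because `runtime/events.jsonl` is one JSON object per line, it can be inspected with a few lines of Python. This is a hedged sketch: the event schema is not specified here, so the helper only parses lines and makes no assumptions about their fields:

```python
import json
from pathlib import Path

def load_events(task_dir: str) -> list:
    """Parse a task's runtime/events.jsonl into a list of dicts."""
    events_path = Path(task_dir) / "runtime" / "events.jsonl"
    with open(events_path) as f:
        # one JSON object per line; skip blank lines
        return [json.loads(line) for line in f if line.strip()]
```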
## Memory Behavior

HealthFlow uses four memory classes:

- `safeguard`
- `workflow`
- `dataset`
- `execution`

Retrieval is inspectable:

- retrieval is conditioned on task family, dataset signature, schema tags, and EHR risk tags
- safeguard memories are prioritized for elevated-risk EHR tasks
- contradictory memories are tracked by `conflict_slot`
- safeguard memories suppress conflicting workflow or execution memories before planning
- dataset memories act as anchors without replacing workflow guidance
- the retrieval audit is saved per attempt under `runtime/attempts/attempt_*/memory/retrieval_result.json`

Writeback behavior:

- failed runs and near-miss recoveries can produce `safeguard` memory
- successful reusable procedures can produce `workflow` memory
- stable schema observations can produce `dataset` memory
- reusable task-completion habits can produce `execution` memory

## Supported Execution Backends

HealthFlow keeps the executor layer backend-agnostic, but the public surface is intentionally small:

- `opencode` (default)
- `claude_code`
- `codex`
- `pi`

You can still define additional CLI backends in `config.toml`, but the harness logic stays in HealthFlow rather than being baked into one external backend. Executor-specific repository instruction files are intentionally avoided at the repo root so backend comparisons use the same injected prompt guidance.

## External CLI Workflows

HealthFlow does not implement an internal MCP registry, plugin framework, or large CLI catalog. Tool availability belongs to the outer executor layer such as Claude Code, OpenCode, Pi, or Codex. HealthFlow only supplies:

- a lightweight execution-environment contract
- small workflow recommendations
- documentation recipes for selected external CLIs

When external CLIs are part of the supported workflow, prefer declaring them in this project's `pyproject.toml` and installing them into the shared repo `.venv`.
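The suppression rule described under "Memory Behavior" (safeguard memories win their `conflict_slot` before planning) can be sketched as a filter. The `kind` and `conflict_slot` field names are assumptions based on the description above, not the actual JSONL schema:

```python
def apply_safeguard_suppression(memories: list) -> list:
    """Drop workflow/execution memories whose conflict_slot is claimed by a safeguard."""
    safeguarded_slots = {
        m["conflict_slot"]
        for m in memories
        if m["kind"] == "safeguard" and m.get("conflict_slot")
    }
    return [
        m for m in memories
        # safeguards always survive; others survive only if their slot
        # (if any) is not claimed by a safeguard memory
        if m["kind"] == "safeguard"
        or m.get("conflict_slot") not in safeguarded_slots
    ]
```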
Executor backends should use that same project environment rather than ad hoc global tool installs.

Executor defaults are configured for normal text output. HealthFlow does not require external backends to finish in JSON. Structured event streams remain optional backend-specific telemetry modes.

`run_benchmark.py` always forces `memory.write_policy = "freeze"` so benchmark evaluation remains decoupled from the framework's self-evolving writeback behavior.

## Quick Start

### Prerequisites

- Python 3.12+
- `uv`
- one execution backend available in `PATH`
  - default: `opencode`
  - alternatives: `claude`, `codex`, `pi`

### Setup

```bash
uv sync
source .venv/bin/activate
export ZENMUX_API_KEY="your_zenmux_key_here"
export DEEPSEEK_API_KEY="your_deepseek
```

[truncated…]