
HealthFlow

provenance:github:yhzhu99/HealthFlow
WHAT THIS AGENT DOES

HealthFlow is a system that helps AI agents learn and improve how they complete tasks, particularly those involving complex data like medical records. It works by repeatedly planning, executing, evaluating, and reflecting on each attempt, using past experiences to guide future actions. This addresses the challenge of getting AI to reliably handle intricate processes where mistakes can have serious consequences. Researchers and developers working on AI applications in fields like healthcare would find it valuable, as it provides a structured way to build more robust and adaptable systems. What sets HealthFlow apart is its focus on detailed analysis of both successful and unsuccessful attempts, allowing the AI to learn from its errors and refine its approach over time.

README
# HealthFlow: A Self-Evolving MERF Runtime for CodeAct Analysis

[![arXiv](https://img.shields.io/badge/arXiv-2508.02621-b31b1b.svg)](https://arxiv.org/abs/2508.02621)
[![Project Website](https://img.shields.io/badge/Project%20Website-HealthFlow-0066cc.svg)](https://healthflow-agent.netlify.app)

HealthFlow is a research framework for **self-evolving task execution with a four-stage Meta -> Executor -> Evaluator -> Reflector loop**. The core runtime is organized around planning, CodeAct-style execution, structured evaluation, per-task runtime artifacts, and long-term reflective memory. Dataset preparation and benchmark evaluation workflows can still live in the repository under `data/`, but they are intentionally decoupled from the `healthflow/` runtime package.

- structured `Meta` planning with EHR-adaptive memory retrieval
- `Executor` as a CodeAct runtime over external executor backends
- `Evaluator`-driven retry and failure diagnosis
- `Reflector` writeback from both successful and failed trajectories
- inspectable workspace artifacts and run telemetry

HealthFlow compares external coding agents through a shared executor abstraction. The maintained built-in backends are `claude_code`, `codex`, `opencode`, and `pi`, with `opencode` as the default.

The current release surface is intentionally **backend and CLI only**. A frontend is not shipped in this repo at this stage.

## Core Runtime

HealthFlow runs a lean **Meta -> Executor -> Evaluator -> Reflector** loop.

1. **Meta**: retrieve relevant safeguard, workflow, dataset, and execution memories, then emit a structured execution plan.
2. **Executor**: interpret the plan as a CodeAct brief and act through code, commands, and workspace artifacts using whatever tools are already configured in the outer executor.
3. **Evaluator**: review the execution trace and produced artifacts, classify the outcome as `success`, `needs_retry`, or `failed`, and provide repair instructions for the next attempt.
4. **Reflector**: synthesize reusable safeguard, workflow, dataset, or execution memories from the full trajectory after the task session ends.

The task-level self-correction budget is controlled by `system.max_attempts`, which counts total full attempts through the loop rather than "retries plus one".
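The four stages and the attempt budget above can be sketched as a minimal control loop. This is an illustrative, self-contained sketch, not HealthFlow's actual API: every class and function name here (`Evaluation`, `MemoryStore`, `run_task`) is hypothetical; the real loop lives in `healthflow/system.py`.

```python
# Illustrative sketch of the Meta -> Executor -> Evaluator -> Reflector loop.
# All names are hypothetical; HealthFlow's real implementation differs.
from dataclasses import dataclass, field


@dataclass
class Evaluation:
    outcome: str        # "success" | "needs_retry" | "failed"
    repair: str = ""    # repair instructions folded into the next plan


@dataclass
class MemoryStore:
    entries: list = field(default_factory=list)

    def retrieve(self, task):            # Meta-stage memory retrieval
        return [e for e in self.entries if e.get("task_family") == task["family"]]

    def write_back(self, new_entries):   # Reflector-stage writeback
        self.entries.extend(new_entries)


def run_task(task, memory, executor, evaluator, max_attempts=3):
    """Run up to max_attempts full passes; max_attempts counts total attempts."""
    trajectory, evaluation = [], None
    for _attempt in range(1, max_attempts + 1):
        # Meta: retrieve memories and emit a plan, including repair
        # instructions from the previous failed attempt, if any.
        plan = {
            "task": task,
            "memories": memory.retrieve(task),
            "repair": evaluation.repair if evaluation else "",
        }
        trace = executor(plan)           # Executor: CodeAct-style run
        evaluation = evaluator(trace)    # Evaluator: classify + diagnose
        trajectory.append((plan, trace, evaluation))
        if evaluation.outcome == "success":
            break
    # Reflector: distill reusable memories from the full trajectory,
    # whether or not the task ultimately succeeded.
    memory.write_back([{"task_family": task["family"], "attempts": len(trajectory)}])
    return evaluation, trajectory
```

Note that the Reflector runs after the attempt loop ends, so failed trajectories also produce memory, matching the writeback behavior described below.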

## What HealthFlow Contributes

- **MERF core runtime**: the framework definition is the four-stage Meta, Executor, Evaluator, Reflector loop rather than an outer benchmark-evaluation pipeline.
- **Lean execution contract**: HealthFlow defines workspace rules, execution-environment defaults, and workflow recommendations without becoming a tool-hosting framework.
- **Inspectable memory**: safeguard, workflow, dataset, and execution memories are stored in JSONL, routed through adaptive retrieval lanes, and exposed through a saved retrieval audit.
- **Evaluator-centered recovery**: retries are driven by structured failure diagnosis and repair instructions instead of a single scalar score alone.
- **Reproducibility contract**: every task workspace writes structured runtime artifacts instead of only human-readable logs.
- **Executor telemetry**: run artifacts capture executor metadata, backend versions when available, LLM usage, executor usage, and stage-level estimated cost summaries.
- **Role-specific runtime models**: planner, evaluator, reflector, and executor can be configured against different model entries to reduce single-model coupling.

## Workspace Artifacts

Runtime state lives under `workspace/` by default:

- task artifacts: `workspace/tasks/<task_id>/`
- long-term memory: `workspace/memory/experience.jsonl`

Dataset preparation and benchmark evaluation assets remain under `data/`; they are outside the `healthflow/` package boundary.

Each task creates a workspace under `workspace/tasks/<task_id>/` and writes:

- `sandbox/`
  - executor-visible inputs and produced deliverables only
- `runtime/index.json`
- `runtime/events.jsonl`
- `runtime/run/summary.json`
- `runtime/run/trajectory.json`
- `runtime/run/costs.json`
- `runtime/run/final_evaluation.json`
- `runtime/attempts/attempt_*/`
  - `planner/`: input messages, raw output, parsed output, call metadata, repair trace, plan markdown
  - `executor/`: prompt, command, stdout, stderr, combined log, telemetry, usage, artifact index
  - `evaluator/`: input messages, raw output, parsed output, call metadata, repair trace

When `healthflow run ... --report` is enabled, the same workspace also writes:

- `runtime/report.md`

These files are the main source of truth for rebuttal-oriented inspection.
`runtime/report.md` is a standard HealthFlow-generated markdown report that summarizes the run, links sandbox deliverables with relative paths, embeds a small number of images inline, and keeps runtime JSON/log files in a separate audit section.
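Because these artifacts are plain JSON and JSONL, they can be inspected with a few lines of Python. The helper below assumes only the standard one-JSON-object-per-line JSONL convention for `runtime/events.jsonl`; the event field names themselves are not documented here.

```python
# Hedged sketch: iterate over a task's runtime event log.
# Assumes only the JSONL convention (one JSON object per non-empty line).
import json
from pathlib import Path


def iter_events(task_dir: str):
    """Yield parsed events from <task_dir>/runtime/events.jsonl, skipping blank lines."""
    path = Path(task_dir) / "runtime" / "events.jsonl"
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)
```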

## Runtime Boundary

- **Core runtime**: the MERF loop in `healthflow/system.py`.
- **Domain specialization**: EHR-specific helpers under `healthflow/ehr/`.
- **Dataset prep and benchmark evaluation**: repository-level workflows under `data/`, intentionally decoupled from `healthflow/`.

The framework package is focused on taking a task, executing it, improving task success rate across attempts, and writing inspectable artifacts and reports for each task run.

## Memory Behavior

HealthFlow uses four memory classes:

- `safeguard`
- `workflow`
- `dataset`
- `execution`

Retrieval is inspectable:

- retrieval is conditioned on task family, dataset signature, schema tags, and EHR risk tags
- safeguard memories are prioritized for elevated-risk EHR tasks
- contradictory memories are tracked by `conflict_slot`
- safeguard memories suppress conflicting workflow or execution memories before planning
- dataset memories act as anchors without replacing workflow guidance
- the retrieval audit is saved per attempt under `runtime/attempts/attempt_*/memory/retrieval_result.json`
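The `conflict_slot` suppression and safeguard prioritization rules above can be illustrated with a small sketch. The record shape (`kind`, optional `conflict_slot`) is an assumption for illustration; HealthFlow's actual memory schema and retrieval lanes are richer.

```python
# Illustrative sketch of two retrieval rules: safeguard memories suppress
# conflicting workflow/execution memories sharing a conflict_slot, and
# safeguards are surfaced first for elevated-risk EHR tasks.
# Record fields are assumed, not HealthFlow's real schema.

def select_memories(candidates, elevated_risk=False):
    safeguards = [m for m in candidates if m["kind"] == "safeguard"]
    blocked = {m["conflict_slot"] for m in safeguards if m.get("conflict_slot")}
    # Drop non-safeguard memories that contradict an active safeguard.
    others = [
        m for m in candidates
        if m["kind"] != "safeguard" and m.get("conflict_slot") not in blocked
    ]
    # Prioritize safeguards for elevated-risk tasks.
    return safeguards + others if elevated_risk else others + safeguards
```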

Writeback behavior:

- failed runs and near-miss recoveries can produce `safeguard` memory
- successful reusable procedures can produce `workflow` memory
- stable schema observations can produce `dataset` memory
- reusable task-completion habits can produce `execution` memory
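For illustration, writeback of the kinds above lands in `workspace/memory/experience.jsonl` as one JSON object per line. The field names below are hypothetical, chosen only to mirror the memory classes and `conflict_slot` concept described in this section; the actual record schema may differ.

```json
{"kind": "safeguard", "task_family": "ehr_cohort", "conflict_slot": "dedup_strategy", "note": "Deduplicate encounters by patient_id before aggregating lab values."}
{"kind": "dataset", "task_family": "ehr_cohort", "note": "Admissions table uses ICD-10 codes; timestamps are UTC."}
```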

## Supported Execution Backends

HealthFlow keeps the executor layer backend-agnostic, but the public surface is intentionally small:

- `opencode` (default)
- `claude_code`
- `codex`
- `pi`

You can still define additional CLI backends in `config.toml`, but the harness logic stays in HealthFlow rather than being baked into one external backend.
Executor-specific repository instruction files are intentionally avoided at the repo root so backend comparisons use the same injected prompt guidance.
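A hypothetical additional backend entry in `config.toml` might look like the following. The table layout and key names here are assumptions made for illustration only; consult the shipped `config.toml` for the real schema.

```toml
# Illustrative only: table and key names are NOT taken from HealthFlow's config.
[executor.backends.mycli]
command = "mycli"                 # executable resolved from PATH
args = ["--dir", "{workspace}"]   # harness-supplied task workspace path
```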

## External CLI Workflows

HealthFlow does not implement an internal MCP registry, plugin framework, or large CLI catalog. Tool availability belongs to the outer executor layer such as Claude Code, OpenCode, Pi, or Codex.

HealthFlow only supplies:

- a lightweight execution-environment contract
- small workflow recommendations
- documentation recipes for selected external CLIs

When external CLIs are part of the supported workflow, prefer declaring them in this project's `pyproject.toml` and installing them into the shared repo `.venv`. Executor backends should use that same project environment rather than ad hoc global tool installs.

Executor defaults are configured for normal text output. HealthFlow does not require external backends to finish in JSON. Structured event streams remain optional backend-specific telemetry modes.

`run_benchmark.py` always forces `memory.write_policy = "freeze"` so benchmark evaluation remains decoupled from the framework's self-evolving writeback behavior.
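Based on the `memory.write_policy` key named above, the setting forced by `run_benchmark.py` corresponds to a config fragment like this (the surrounding `[memory]` table name is assumed):

```toml
[memory]
write_policy = "freeze"  # no Reflector writeback during benchmark evaluation
```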

## Quick Start

### Prerequisites

- Python 3.12+
- `uv`
- one execution backend available in `PATH`
  - default: `opencode`
  - alternatives: `claude`, `codex`, `pi`

### Setup

```bash
uv sync
source .venv/bin/activate
export ZENMUX_API_KEY="your_zenmux_key_here"
export DEEPSEEK_API_KEY="your_deepseek_key_here"
```
[truncated…]

PUBLIC HISTORY

First discovered: Apr 1, 2026

IDENTITY

inferred

Identity inferred from code signals. No PROVENANCE.yml found.


METADATA

platform: github
first seen: May 28, 2025
last updated: Mar 31, 2026
last crawled: today
version

README BADGE

Add to your README:

![Provenance](https://getprovenance.dev/api/badge?id=provenance:github:yhzhu99/HealthFlow)