adaptive-harness

provenance:github:SeongwoongCho/adaptive-harness

WHAT THIS AGENT DOES

adaptive-harness is a Python-based, self-improving harness router for Claude Code, hosted on GitHub. It classifies each incoming task, routes it to a workflow (a "harness"), scores the result, and adjusts its routing based on those scores. It is aimed at developers and researchers who use Claude Code in multi-agent workflows and want workflow selection automated rather than done by hand.

PROBLEM IT SOLVES

adaptive-harness automates the choice of workflow for each task given to Claude Code. Picking a workflow by hand for every task in a complex AI pipeline is slow and error-prone; the agent makes that choice automatically.


TECH & STACK

python, claude-code, ai-agent, automation, ai-routing, developer-tools

README

<p align="center">
  <h1 align="center">adaptive-harness</h1>
  <p align="center">
    <strong>A self-improving harness router for Claude Code.</strong><br/>
    It watches every task, picks the best workflow, scores the result, and evolves — automatically.
  </p>
  <p align="center">
    <a href="#installation">Install</a> &nbsp;&bull;&nbsp;
    <a href="#how-it-works">How It Works</a> &nbsp;&bull;&nbsp;
    <a href="#built-in-harnesses">Harnesses</a> &nbsp;&bull;&nbsp;
    <a href="#contributing">Contributing</a>
  </p>
</p>

---

> **Unlike static skill packs, adaptive-harness gets smarter the more you use it.**

```
You: Fix the login bug where empty email crashes the server

[adaptive-harness]
  Classified:  bugfix | low uncertainty | local | backend
  Selected:    tdd-driven (score 0.92)  >  systematic-debugging (0.81)

[tdd-driven subagent]
  1. Write failing test for empty-email path   ✓
  2. Implement null guard in validateEmail()   ✓
  3. Run test suite (47/47 pass)               ✓

[evaluator]
  correctness: 1.00 | completeness: 1.00 | quality: 0.91
  robustness: 0.88 | clarity: 0.95 | verifiability: 0.92
  overall: 0.94  ← harness weight updated: 1.00 → 1.02
```

After 8 sessions on similar tasks, the router **learns your codebase's patterns** and consistently picks the highest-scoring workflow.

---

## How It Works

```
User Task
    │
    ▼
┌─────────────────────────────────────┐
│  1. Classify    6-axis taxonomy     │
│  2. Route       best harness(es)    │
│  3. Execute     subagent pipeline   │
│  4. Evaluate    6-dim scoring       │
│  5. Evolve      update weights      │
└─────────────────────────────────────┘
```
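
The five steps above can be sketched as a small loop. This is an illustration only, not the plugin's implementation: the `Harness` and `Router` classes, the `fit` function, the `learning_rate` and `baseline` values, and the additive weight update are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class Harness:
    """Hypothetical harness: a fit function plus an executor."""
    fit: Callable[[dict], float]   # how well the harness suits a classification
    run: Callable[[str], str]      # executes the task and returns a result


@dataclass
class Router:
    """Hypothetical router holding one learned weight per harness."""
    harnesses: Dict[str, Harness]
    weights: Dict[str, float] = field(default_factory=dict)
    learning_rate: float = 0.1     # assumed step size
    baseline: float = 0.7          # assumed "neutral" overall score

    def route(self, labels: dict) -> str:
        # Pick the harness with the highest fit scaled by its learned weight.
        def weighted_fit(name: str) -> float:
            return self.harnesses[name].fit(labels) * self.weights.get(name, 1.0)
        return max(self.harnesses, key=weighted_fit)

    def evolve(self, name: str, overall: float) -> float:
        # Assumed rule: raise the weight for above-baseline results, lower it otherwise.
        old = self.weights.get(name, 1.0)
        self.weights[name] = old + self.learning_rate * (overall - self.baseline)
        return self.weights[name]


def run_task(task: str, router: Router,
             classify: Callable[[str], dict],
             evaluate: Callable[[str, str], float]) -> Tuple[str, float]:
    labels = classify(task)                       # 1. Classify
    chosen = router.route(labels)                 # 2. Route
    result = router.harnesses[chosen].run(task)   # 3. Execute
    overall = evaluate(task, result)              # 4. Evaluate
    router.evolve(chosen, overall)                # 5. Evolve
    return chosen, overall
```

In this sketch only the harness that actually ran has its weight changed, so a harness that is never selected keeps its default weight of 1.0.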

**Three levels of self-improvement:**

| Level | What improves | How |
|-------|--------------|-----|
| **Routing** | Which harness gets picked | Weights adjust after every evaluation |
| **Content** | What the harness actually does | Evolution manager rewrites agent personas and `skill.md` via A/B testing |
| **Genesis** | Which harnesses exist | Evolution manager creates new harnesses by combining existing ones |

Hard tasks (`uncertainty=high` **and** `verifiability=hard` or `blast_radius=repo-wide`) automatically trigger **ensemble mode**: two harnesses run in parallel, and a synthesizer merges the best of both.
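
The ensemble trigger can be written as a simple predicate. The README does not say how the `and`/`or` condition is grouped, so the sketch below assumes it means "high uncertainty, and either hard verifiability or repo-wide blast radius". The `Classification` type and `needs_ensemble` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    """Hypothetical subset of the task taxonomy relevant to ensemble mode."""
    uncertainty: str     # "low" | "medium" | "high"
    verifiability: str   # "easy" | "moderate" | "hard"
    blast_radius: str    # "local" | "cross-module" | "repo-wide"


def needs_ensemble(c: Classification) -> bool:
    # Assumed grouping: high AND (hard OR repo-wide).
    return c.uncertainty == "high" and (
        c.verifiability == "hard" or c.blast_radius == "repo-wide"
    )
```

Under the other possible reading, `(high and hard) or repo-wide`, any repo-wide task would trigger ensemble mode regardless of uncertainty.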

---

## Installation

```bash
claude plugin marketplace add https://github.com/SeongwoongCho/adaptive-harness
claude plugin install adaptive-harness@adaptive-harness
```

Then start a new Claude Code session.

---

## Quick Start

```bash
cd your-project
claude                              # new session — hooks auto-initialize with --general defaults
```

That's it. Every task is now routed through the adaptive-harness pipeline automatically.

```
# Or run explicitly with options
/adaptive-harness:run "Refactor the payment module"
/adaptive-harness:run "Build a new feature"              # interview runs by default
/adaptive-harness:run --skip-interview "Build a new feature"  # skip interview
/adaptive-harness:run --harness=tdd-driven "Fix the login bug"
```

---

## Built-in Harnesses

| Harness | Best For | Model |
|---------|----------|-------|
| **tdd-driven** | Strict red-green-refactor cycles with enforced test coverage gates | Sonnet |
| **systematic-debugging** | Root cause analysis through structured reproduce-isolate-fix-verify phases | Sonnet |
| **rapid-prototype** | Fast MVP building with speed as the primary constraint | Sonnet |
| **research-iteration** | Hypothesis-driven cycles for high-uncertainty problems with rigorous measurement | Opus |
| **careful-refactor** | Safe refactoring via Mikado method without changing observable behavior | Sonnet |
| **code-review** | Multi-perspective review across security, quality, performance, and maintainability | Opus |
| **migration-safe** | Schema, dependency, and API migrations with audit trails and rollback plans | Sonnet |
| **ralplan-consensus** | Implementation planning with self-review — analyzes, plans, then challenges its own assumptions | Opus |
| **ralph-loop** | Persistent execution loop until all acceptance criteria pass (max iterations bounded) | Sonnet |
| **engineering-retro** | Weekly retrospective with commit history analysis, contributor metrics, trend tracking, and growth coaching | Sonnet |
| **plan-review** | Challenges scope and reviews architecture, quality, tests, and performance one issue at a time with failure mode analysis | Opus |
| **qa-testing** | Tests applications like a real user, computes a health score, and produces a structured report with screenshot evidence | Sonnet |
| **pre-landing-review** | Pre-merge diff review with critical (blocking) and informational (advisory) passes and interactive resolution | Sonnet |
| **ship-workflow** | Automated release: merges main, runs tests, bumps version, generates changelog, creates bisectable commits, and opens a PR | Sonnet |
| **deep-interview** | Resolves ambiguous requirements through structured clarifying interviews, builds a confirmed spec, then executes against it | Opus |
| **simple-executor** | Lightweight executor for trivial, well-defined local changes — no planning overhead | Sonnet |
| **documentation-writer** | Reads source truth first, then drafts accurate and well-styled docs, READMEs, API references, and guides | Sonnet |
| **security-audit** | OWASP Top-10 scan, dependency audit, secrets scan, and threat modeling with a prioritized findings report | Opus |
| **performance-optimization** | Measurement-driven optimization cycles: baseline → profile → hypothesize → implement → measure → verify | Sonnet |

### Experimental Harnesses

| Harness | Best For | Model |
|---------|----------|-------|
| **progressive-refinement** | Iterative quality improvement — rough solution first, then targets weakest dimension each pass | Sonnet |
| **divide-and-conquer** | Splits large tasks into independent sub-tasks, solves in isolation, integrates and verifies | Sonnet |
| **adversarial-review** | Implements a solution, then deliberately tries to break it with adversarial tests and edge-case attacks | Sonnet |
| **spike-then-harden** | Two-phase: fast throwaway prototype to learn the problem space, then production-quality rewrite | Sonnet |

The router supports **harness chaining** — e.g. `plan → execute → review` for complex tasks. Chains are **adaptive**: if a harness discovers mid-execution that the next planned step is wrong, it emits a `next_harness_hint` and the orchestrator reroutes dynamically.
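
A minimal sketch of adaptive chaining follows. It assumes each step returns its output together with an optional `next_harness_hint`, and that a hint replaces the next planned step. The `run_chain` function, the `Step` signature, and the `max_steps` bound are hypothetical and not taken from the plugin.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple

# A step returns (output, next_harness_hint); the hint is None when the plan still holds.
Step = Callable[[str], Tuple[str, Optional[str]]]


def run_chain(task: str, plan: List[str], steps: Dict[str, Step],
              max_steps: int = 10) -> List[str]:
    """Run harnesses in order, letting a hint replace the next planned step."""
    queue = deque(plan)
    executed: List[str] = []
    while queue and len(executed) < max_steps:   # bound guards against hint loops
        name = queue.popleft()
        _output, hint = steps[name](task)
        executed.append(name)
        if hint is not None and hint in steps:
            if queue:
                queue.popleft()                  # drop the step the hint overrides
            queue.appendleft(hint)               # run the hinted harness next
    return executed
```

For example, if the planning step hints at `systematic-debugging`, the planned execute step is skipped and the chain continues with the hinted harness before the review step.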

---

## Task Taxonomy (6 Axes)

Every task is classified by LLM reasoning (not keyword matching):

| Axis | Values |
|------|--------|
| `task_type` | bugfix / feature / refactor / research / migration / incident / benchmark |
| `uncertainty` | low / medium / high |
| `blast_radius` | local / cross-module / repo-wide |
| `verifiability` | easy / moderate / hard |
| `latency_sensitivity` | low / high |
| `domain` | backend / frontend / mobile / ml-research / data-engineering / devops / security / infra / docs |
| `domain_hint` | *(optional)* free-text hint for mixed-domain tasks — logged for analytics, not used in routing (e.g., `"also touches devops"`, `"Spark ETL pipeline"`) |
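
The taxonomy maps naturally onto a validated record. The sketch below is one possible representation, not the plugin's data model: the `AXES` mapping and the `TaskClassification` class are hypothetical, while the axis names and allowed values are copied from the table above.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values per axis, copied from the taxonomy table.
AXES = {
    "task_type": {"bugfix", "feature", "refactor", "research",
                  "migration", "incident", "benchmark"},
    "uncertainty": {"low", "medium", "high"},
    "blast_radius": {"local", "cross-module", "repo-wide"},
    "verifiability": {"easy", "moderate", "hard"},
    "latency_sensitivity": {"low", "high"},
    "domain": {"backend", "frontend", "mobile", "ml-research",
               "data-engineering", "devops", "security", "infra", "docs"},
}


@dataclass(frozen=True)
class TaskClassification:
    task_type: str
    uncertainty: str
    blast_radius: str
    verifiability: str
    latency_sensitivity: str
    domain: str
    domain_hint: Optional[str] = None  # free text; logged only, so not validated

    def __post_init__(self) -> None:
        # Reject any value that is not listed for its axis.
        for axis, allowed in AXES.items():
            value = getattr(self, axis)
            if value not in allowed:
                raise ValueError(f"{axis}={value!r} is not one of {sorted(allowed)}")
```

Keeping `domain_hint` outside `AXES` mirrors the table: it is recorded alongside the six axes but plays no part in validation or routing.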

---

## Evaluation Dimensions

Every task result is scored on **6 fixed dimensions** with fixed weights:

| Dimension | Weight | What it measures |
|-----------|--------|-----------------|
| **correctness** | 0.25 | Does the output satisfy stated requirements? |
| **completeness** | 0.20 | Does the output cover the full scope? |
| **quality** | 0.20 | Structural and stylistic quality |
| **robustness** | 0.10 | Edge case and failure mode handling |
| **clarity** | 0.15 | Clear communication of intent |
| **verifiability** | 0.10 | Can the output be independently verified? |

These dimensions apply universally to all task types — code, research, planning, writing, documentation. The evaluator model is auto-routed: Sonnet for simple tasks, Opus for complex ones.
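
The weights in the table sum to 1.00, which suggests a weighted average. The README does not state how the overall score is aggregated, so the plain weighted sum below is an assumption and may differ from the evaluator's actual calculation. `WEIGHTS` and `overall_score` are hypothetical names.

```python
# Dimension weights copied from the table above.
WEIGHTS = {
    "correctness": 0.25,
    "completeness": 0.20,
    "quality": 0.20,
    "robustness": 0.10,
    "clarity": 0.15,
    "verifiability": 0.10,
}


def overall_score(scores: dict) -> float:
    """Assumed aggregation: weighted sum of per-dimension scores in [0, 1]."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)
```

With this rule, a result that is perfect on correctness but scores zero everywhere else gets an overall score of 0.25.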

---

## Evolution System

The evolution ma

[truncated…]

PUBLIC HISTORY

First discovered: Mar 21, 2026

IDENTITY

inferred

Identity inferred from code signals. No PROVENANCE.yml found.

METADATA

platform: github

first seen: Mar 14, 2026

last updated: Mar 19, 2026

last crawled: 1 month ago

version: not specified

README BADGE

Add to your README:

![Provenance](https://getprovenance.dev/api/badge?id=provenance:github:SeongwoongCho/adaptive-harness)