<p align="center">
<h1 align="center">adaptive-harness</h1>
<p align="center">
<strong>A self-improving harness router for Claude Code.</strong><br/>
It watches every task, picks the best workflow, scores the result, and evolves — automatically.
</p>
<p align="center">
<a href="#installation">Install</a> •
<a href="#how-it-works">How It Works</a> •
<a href="#built-in-harnesses">Harnesses</a> •
<a href="#contributing">Contributing</a>
</p>
</p>
---
> **Unlike static skill packs, adaptive-harness gets smarter the more you use it.**
```
You: Fix the login bug where empty email crashes the server
[adaptive-harness]
Classified: bugfix | low uncertainty | local | backend
Selected: tdd-driven (score 0.92) > systematic-debugging (0.81)
[tdd-driven subagent]
1. Write failing test for empty-email path ✓
2. Implement null guard in validateEmail() ✓
3. Run test suite (47/47 pass) ✓
[evaluator]
correctness: 1.00 | completeness: 1.00 | quality: 0.91
robustness: 0.88 | clarity: 0.95 | verifiability: 0.92
overall: 0.94 ← harness weight updated: 1.00 → 1.02
```
After 8 sessions on similar tasks, the router **learns your codebase's patterns** and consistently picks the highest-scoring workflow.
---
## How It Works
```
User Task
│
▼
┌─────────────────────────────────────┐
│ 1. Classify 6-axis taxonomy │
│ 2. Route best harness(es) │
│ 3. Execute subagent pipeline │
│ 4. Evaluate 6-dim scoring │
│ 5. Evolve update weights │
└─────────────────────────────────────┘
```
**Three levels of self-improvement:**
| Level | What improves | How |
|-------|--------------|-----|
| **Routing** | Which harness gets picked | Weights adjust after every evaluation |
| **Content** | What the harness actually does | Evolution manager rewrites agent personas and `skill.md` via A/B testing |
| **Genesis** | Which harnesses exist | Evolution manager creates new harnesses by combining existing ones |
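
The Routing level can be pictured as a weight table that drifts toward harnesses that score well. The following is a minimal sketch, not the plugin's actual update rule: the learning rate, the 0.5 baseline, and the names `update_weight` and `pick_harness` are assumptions made for illustration.

```python
def update_weight(weight: float, overall: float,
                  baseline: float = 0.5, lr: float = 0.05) -> float:
    """Nudge a harness weight up when a run scores above baseline, down otherwise.

    Hypothetical rule: the README only says weights adjust after every evaluation.
    """
    return weight + lr * (overall - baseline)


def pick_harness(fit: dict[str, float], weights: dict[str, float]) -> str:
    """Pick the harness with the highest weighted fit (unknown harnesses weigh 1.0)."""
    return max(fit, key=lambda name: fit[name] * weights.get(name, 1.0))


# Illustrative values taken from the example session near the top of this README.
weights = {"tdd-driven": 1.00, "systematic-debugging": 1.00}
fit = {"tdd-driven": 0.92, "systematic-debugging": 0.81}

chosen = pick_harness(fit, weights)
weights[chosen] = update_weight(weights[chosen], overall=0.94)
print(chosen, round(weights[chosen], 3))
```
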
Hard tasks (`uncertainty=high` **and** `verifiability=hard` or `blast_radius=repo-wide`) automatically trigger **ensemble mode** — two harnesses run in parallel, a synthesizer merges the best of both.
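
The ensemble trigger can be read as a small predicate. The sketch below assumes the condition groups as `uncertainty=high` and (`verifiability=hard` or `blast_radius=repo-wide`); the text above does not state the precedence, and the function name `needs_ensemble` is hypothetical.

```python
def needs_ensemble(uncertainty: str, verifiability: str, blast_radius: str) -> bool:
    """Return True when a task should run two harnesses in parallel.

    Assumption: high uncertainty AND (hard verifiability OR repo-wide blast radius).
    """
    return uncertainty == "high" and (
        verifiability == "hard" or blast_radius == "repo-wide"
    )


# A risky repo-wide change qualifies; a low-uncertainty local bugfix does not.
print(needs_ensemble("high", "moderate", "repo-wide"))
print(needs_ensemble("low", "easy", "local"))
```
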
---
## Installation
```bash
claude plugin marketplace add https://github.com/SeongwoongCho/adaptive-harness
claude plugin install adaptive-harness@adaptive-harness
```
Then start a new Claude Code session.
---
## Quick Start
```bash
cd your-project
claude # new session — hooks auto-initialize with --general defaults
```
That's it. Every task is now routed through the adaptive-harness pipeline automatically.
```
# Or run explicitly with options
/adaptive-harness:run "Refactor the payment module"
/adaptive-harness:run "Build a new feature" # interview runs by default
/adaptive-harness:run --skip-interview "Build a new feature" # skip interview
/adaptive-harness:run --harness=tdd-driven "Fix the login bug"
```
---
## Built-in Harnesses
| Harness | Best For | Model |
|---------|----------|-------|
| **tdd-driven** | Strict red-green-refactor cycles with enforced test coverage gates | Sonnet |
| **systematic-debugging** | Root cause analysis through structured reproduce-isolate-fix-verify phases | Sonnet |
| **rapid-prototype** | Fast MVP building with speed as the primary constraint | Sonnet |
| **research-iteration** | Hypothesis-driven cycles for high-uncertainty problems with rigorous measurement | Opus |
| **careful-refactor** | Safe refactoring via Mikado method without changing observable behavior | Sonnet |
| **code-review** | Multi-perspective review across security, quality, performance, and maintainability | Opus |
| **migration-safe** | Schema, dependency, and API migrations with audit trails and rollback plans | Sonnet |
| **ralplan-consensus** | Implementation planning with self-review — analyzes, plans, then challenges its own assumptions | Opus |
| **ralph-loop** | Persistent execution loop until all acceptance criteria pass (max iterations bounded) | Sonnet |
| **engineering-retro** | Weekly retrospective with commit history analysis, contributor metrics, trend tracking, and growth coaching | Sonnet |
| **plan-review** | Challenges scope and reviews architecture, quality, tests, and performance one issue at a time with failure mode analysis | Opus |
| **qa-testing** | Tests applications like a real user, computes a health score, and produces a structured report with screenshot evidence | Sonnet |
| **pre-landing-review** | Pre-merge diff review with critical (blocking) and informational (advisory) passes and interactive resolution | Sonnet |
| **ship-workflow** | Automated release: merges main, runs tests, bumps version, generates changelog, creates bisectable commits, and opens a PR | Sonnet |
| **deep-interview** | Resolves ambiguous requirements through structured clarifying interviews, builds a confirmed spec, then executes against it | Opus |
| **simple-executor** | Lightweight executor for trivial, well-defined local changes — no planning overhead | Sonnet |
| **documentation-writer** | Reads source truth first, then drafts accurate and well-styled docs, READMEs, API references, and guides | Sonnet |
| **security-audit** | OWASP Top-10 scan, dependency audit, secrets scan, and threat modeling with a prioritized findings report | Opus |
| **performance-optimization** | Measurement-driven optimization cycles: baseline → profile → hypothesize → implement → measure → verify | Sonnet |
### Experimental Harnesses
| Harness | Best For | Model |
|---------|----------|-------|
| **progressive-refinement** | Iterative quality improvement — rough solution first, then targets weakest dimension each pass | Sonnet |
| **divide-and-conquer** | Splits large tasks into independent sub-tasks, solves in isolation, integrates and verifies | Sonnet |
| **adversarial-review** | Implements a solution, then deliberately tries to break it with adversarial tests and edge-case attacks | Sonnet |
| **spike-then-harden** | Two-phase: fast throwaway prototype to learn the problem space, then production-quality rewrite | Sonnet |
The router supports **harness chaining** — e.g. `plan → execute → review` for complex tasks. Chains are **adaptive**: if a harness discovers mid-execution that the next planned step is wrong, it emits a `next_harness_hint` and the orchestrator reroutes dynamically.
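
Adaptive chaining can be sketched as a loop that follows the planned chain unless a step returns a hint. Only the `next_harness_hint` field name comes from the text above; the `run_chain` function, the result dictionary shape, the `max_steps` bound, and the stub harnesses are hypothetical.

```python
from typing import Callable, Optional

# A harness takes the task text and returns a result dict.
# Hypothetical shape: {"output": str, "next_harness_hint": Optional[str]}
Harness = Callable[[str], dict]


def run_chain(task: str, plan: list[str], harnesses: dict[str, Harness],
              max_steps: int = 10) -> list[str]:
    """Run harnesses in planned order, rerouting when a step emits a hint."""
    queue = list(plan)
    executed: list[str] = []
    while queue and len(executed) < max_steps:
        name = queue.pop(0)
        result = harnesses[name](task)
        executed.append(name)
        hint: Optional[str] = result.get("next_harness_hint")
        if hint and hint in harnesses:
            # Swap the next planned step for the hinted harness.
            queue = [hint] + queue[1:]
    return executed


# Stub harnesses: the planning step decides that debugging should come next.
harnesses = {
    "ralplan-consensus": lambda t: {"output": "plan",
                                    "next_harness_hint": "systematic-debugging"},
    "tdd-driven": lambda t: {"output": "code"},
    "systematic-debugging": lambda t: {"output": "fix"},
    "code-review": lambda t: {"output": "review"},
}
order = run_chain("Fix the login bug",
                  ["ralplan-consensus", "tdd-driven", "code-review"], harnesses)
print(order)
```
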
---
## Task Taxonomy (6 Axes)
Every task is classified along six axes by LLM reasoning, not keyword matching. An optional `domain_hint` field carries free-text context:
| Axis | Values |
|------|--------|
| `task_type` | bugfix / feature / refactor / research / migration / incident / benchmark |
| `uncertainty` | low / medium / high |
| `blast_radius` | local / cross-module / repo-wide |
| `verifiability` | easy / moderate / hard |
| `latency_sensitivity` | low / high |
| `domain` | backend / frontend / mobile / ml-research / data-engineering / devops / security / infra / docs |
| `domain_hint` | *(optional)* free-text hint for mixed-domain tasks — logged for analytics, not used in routing (e.g., `"also touches devops"`, `"Spark ETL pipeline"`) |
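
The taxonomy maps naturally onto a typed record. In the sketch below, only the axis names and allowed values come from the table; the class name `TaskClassification`, the `ALLOWED` lookup, and the validation step are assumptions, and the `verifiability` and `latency_sensitivity` values in the example are picked purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED = {
    "task_type": {"bugfix", "feature", "refactor", "research",
                  "migration", "incident", "benchmark"},
    "uncertainty": {"low", "medium", "high"},
    "blast_radius": {"local", "cross-module", "repo-wide"},
    "verifiability": {"easy", "moderate", "hard"},
    "latency_sensitivity": {"low", "high"},
    "domain": {"backend", "frontend", "mobile", "ml-research",
               "data-engineering", "devops", "security", "infra", "docs"},
}


@dataclass(frozen=True)
class TaskClassification:
    task_type: str
    uncertainty: str
    blast_radius: str
    verifiability: str
    latency_sensitivity: str
    domain: str
    domain_hint: Optional[str] = None  # free text, logged only, not used in routing

    def __post_init__(self) -> None:
        # Reject any value that the taxonomy table does not list.
        for axis, allowed in ALLOWED.items():
            value = getattr(self, axis)
            if value not in allowed:
                raise ValueError(f"{axis}={value!r} is not one of {sorted(allowed)}")


login_bug = TaskClassification(
    task_type="bugfix", uncertainty="low", blast_radius="local",
    verifiability="easy", latency_sensitivity="low", domain="backend",
)
print(login_bug)
```
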
---
## Evaluation Dimensions
Every task result is scored on **6 fixed dimensions**, each with a fixed weight (the weights sum to 1.00):
| Dimension | Weight | What it measures |
|-----------|--------|-----------------|
| **correctness** | 0.25 | Does the output satisfy stated requirements? |
| **completeness** | 0.20 | Does the output cover the full scope? |
| **quality** | 0.20 | Structural and stylistic quality |
| **robustness** | 0.10 | Edge case and failure mode handling |
| **clarity** | 0.15 | Clear communication of intent |
| **verifiability** | 0.10 | Can the output be independently verified? |
These dimensions apply universally to all task types — code, research, planning, writing, documentation. The evaluator model is auto-routed: Sonnet for simple tasks, Opus for complex ones.
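
To make the weighting concrete, here is a minimal sketch that combines per-dimension scores into one number. It assumes the overall score is a plain weighted sum, which the table suggests but does not state; the plugin's evaluator may aggregate or round differently. The names `WEIGHTS` and `overall_score` are hypothetical.

```python
WEIGHTS = {
    "correctness": 0.25,
    "completeness": 0.20,
    "quality": 0.20,
    "robustness": 0.10,
    "clarity": 0.15,
    "verifiability": 0.10,
}


def overall_score(scores: dict[str, float]) -> float:
    """Weighted sum of the six dimension scores (each expected in [0, 1])."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[dim] * scores[dim] for dim in WEIGHTS)


# Per-dimension scores from the example session near the top of this README.
example = {
    "correctness": 1.00, "completeness": 1.00, "quality": 0.91,
    "robustness": 0.88, "clarity": 0.95, "verifiability": 0.92,
}
print(round(overall_score(example), 4))
```
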
---
## Evolution System
The evolution ma
[truncated…]