GenoClaw is an AI system that learns and improves itself over time. It acts as a digital assistant, capable of completing tasks by using various tools like searching the web, writing code, or accessing files. This system is designed to solve complex business problems that require ongoing adaptation and refinement, such as automating research, managing projects, or generating reports. Business leaders, researchers, and anyone needing a consistently improving digital helper would find it valuable. What sets GenoClaw apart is its ability to analyze its own mistakes, identify areas for improvement, and automatically adjust its approach to become more effective each day.
# GenoClaw Meta-Harness
```
GenoClaw Meta-Harness
A self-improving AI agent scaffold.
Inspired by the Meta-Harness paper.
```
## What Is This?
**GenoClaw Meta-Harness** is a scaffolding system that turns Claude Code into a persistent, self-improving AI agent. It implements the principles from the [Meta-Harness paper](https://yoonholee.com/meta-harness/) by Yoonho Lee et al. — treating the agent's own behavior (prompts, memory strategies, skill definitions, tool-call patterns) as the optimization target.
The core insight from the paper: give the optimizer **full access to raw execution traces** (not compressed summaries) so it can trace failures to specific decisions and propose targeted improvements. GenoClaw implements this with:
- **Execution trace storage** — every tool call, error, timing, and outcome is recorded
- **Skeptical evaluator** — a second LLM pass that rejects bad memories/skills before they're saved
- **Skill outcome tracking** — counterfactual diagnosis: "did this skill actually help?"
- **Harness candidate versioning** — BEHAVIORS.md snapshots with fitness scores and auto-rollback
- **Nightly self-evaluation** — automated scoring with stagnation detection
The result: an agent that examines its own failures, proposes improvements to its own behavior, tests them, and keeps only what works. The goal is measurable improvement from one nightly evaluation to the next.
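
As a rough illustration of execution trace storage, the sketch below appends every tool call to a JSONL file without summarizing it, then pulls the raw error events for one exchange. The `TraceEvent` fields, the `TraceRecorder` class, and the file layout are assumptions made for this example; they are not the actual `trace_recorder.py` API.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class TraceEvent:
    """One raw tool call. Field names are illustrative, not GenoClaw's schema."""
    exchange_id: str
    tool: str
    arguments: dict
    outcome: str                      # e.g. "ok" or "error"
    error: Optional[str] = None
    duration_ms: float = 0.0
    timestamp: float = field(default_factory=time.time)


class TraceRecorder:
    """Appends raw events to a JSONL file so no detail is summarized away."""

    def __init__(self, path):
        self.path = Path(path)

    def record(self, event: TraceEvent) -> None:
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(asdict(event)) + "\n")

    def failures(self, exchange_id: str) -> list:
        """Return the raw error events for one exchange, in recorded order."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            events = [json.loads(line) for line in fh if line.strip()]
        return [e for e in events
                if e["exchange_id"] == exchange_id and e["outcome"] == "error"]
```

Keeping one JSON object per line means an optimizer can read the unmodified sequence of calls that led to a failure, which is the property the paper's core insight depends on.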
> **Paper reference:** [Meta-Harness: End-to-End Optimization of Model Harnesses](https://yoonholee.com/meta-harness/) — Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, Chelsea Finn (2026)
>
> Also incorporates concepts from [Natural-Language Agent Harnesses](https://arxiv.org/abs/2603.25723) (NLAH) — contracts, roles, stage structure, failure taxonomy.
---
## Architecture
```
┌──────────────────────────────────────────────────────────────────────┐
│                         CLAUDE CODE SESSION                          │
│  ┌────────────┐  ┌──────────┐  ┌──────────┐    ┌──────────────────┐  │
│  │  Discord   │  │  Hooks   │  │   MCP    │    │   Agent Tools    │  │
│  │  Plugin    │  │   (13)   │  │ Servers  │    │ (Bash,Read,Edit) │  │
│  │ (channels) │  │          │  │          │    │                  │  │
│  └─────┬──────┘  └────┬─────┘  └────┬─────┘    └────────┬─────────┘  │
│        │              │             │                   │            │
│  ┌─────┴──────────────┴─────────────┴───────────────────┴─────────┐  │
│  │                       META-HARNESS LOOP                        │  │
│  │                                                                │  │
│  │  ┌──────────┐    ┌───────────┐    ┌──────────┐    ┌──────────┐ │  │
│  │  │ PROPOSE  │───>│ EVALUATE  │───>│  TRACE   │───>│  SCORE   │ │  │
│  │  │ (nudge   │    │ (skeptic  │    │ (record  │    │ (heat +  │ │  │
│  │  │  loop)   │    │ evaluator)│    │ outcomes)│    │ fitness) │ │  │
│  │  └──────────┘    └───────────┘    └──────────┘    └──────────┘ │  │
│  │       │                                                │       │  │
│  │       └──────── VERSION ── ROLLBACK IF WORSE ──────────┘       │  │
│  └────────────────────────────────────────────────────────────────┘  │
└──────────────────────────────────┬───────────────────────────────────┘
                                   │
                     ┌─────────────┴─────────────┐
                     │    SCAFFOLDING DAEMON     │
                     │  localhost:8097 (v2.0.0)  │
                     │                           │
                     │ Meta-Harness Modules:     │
                     │  • trace_recorder.py      │
                     │  • nudge_loop.py          │
                     │  • skill_manager.py       │
                     │  • skill_outcome_tracker  │
                     │  • harness_snapshot.py    │
                     │  • harness_evaluator.py   │
                     │  • failure_store.py       │
                     │  • role_manager.py        │
                     │  • unified_search.py      │
                     │  • heat_scoring.py        │
                     │  • memory_scanner.py      │
                     │                           │
                     │ Background Loops:         │
                     │  • Nudge review (30 min)  │
                     │  • Voice monitor (60s)    │
                     │  • Memory audit (6hr)     │
                     │  • Session archive (30m)  │
                     │  • Heat decay (1hr)       │
                     │  • Nightly eval (3 AM)    │
                     │                           │
                     │ 30+ REST API Endpoints    │
                     │ 468 Tests                 │
                     └───────────────────────────┘
```
---
## Meta-Harness Features
### Phase 1 — The Foundation (v1.6.0)
| Feature | Module | Description |
|---------|--------|-------------|
| Execution Traces | `trace_recorder.py` | Records every tool call, error, timing, and outcome per exchange |
| Full-History Context | `nudge_loop.py` | Nudge loop queries ALL prior history before each review, not just the most recent N items |
| Failure Modes | `skill_manager.py` | Skills document known failure patterns with detection regexes and recovery procedures |
| Standing Contracts | `contracts/standing.json` | Machine-readable behavioral contracts (TTS, promises, corrections) |
| Failure Taxonomy | `contracts/failure_taxonomy.json` | 8 named failure modes with recovery verbs and escalation rules |
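
To make the "detection regexes and recovery procedures" idea concrete, here is a minimal sketch of matching an error message against a list of named failure modes. The three entries, their patterns, and the recovery verbs are hypothetical placeholders; the real definitions live in `contracts/failure_taxonomy.json`, whose schema is not shown in this README.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    detection: str   # regular expression matched against the error text
    recovery: str    # recovery verb to apply when the pattern matches


# Hypothetical entries for illustration only.
FAILURE_MODES = [
    FailureMode("timeout", r"timed? ?out|deadline exceeded", "retry"),
    FailureMode("missing_file", r"no such file|file not found", "replan"),
    FailureMode("permission", r"permission denied|EACCES", "escalate"),
]


def classify(error_text, modes=FAILURE_MODES):
    """Return the first failure mode whose regex matches, or None."""
    for mode in modes:
        if re.search(mode.detection, error_text, flags=re.IGNORECASE):
            return mode
    return None
```

First-match-wins keeps the behavior predictable; a real taxonomy might instead rank matches or apply escalation rules after repeated hits.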
### Phase 2 — The Feedback Loop (v1.7.0)
| Feature | Module | Description |
|---------|--------|-------------|
| Skeptical Evaluator | `nudge_loop.py` | Second LLM pass gates all memory/skill writes with strict rubric |
| Skill Outcomes | `skill_outcome_tracker.py` | Tracks invocations and outcomes for counterfactual diagnosis |
| Heat Fitness | `heat_scoring.py` | Gap analysis steers nudge loop toward under-documented areas |
| Per-Channel Memory | `nudge_loop.py` | Discord channel-specific memory files in `warm/channels/` |
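
One simple way to approach the "did this skill actually help?" question is to compare success rates for the same kind of task with and without the skill. The sketch below does that; the class name, method names, and the use of `None` to mean "no skill invoked" are assumptions for this example, and the comparison is observational rather than a true counterfactual.

```python
from collections import defaultdict


class SkillOutcomeTracker:
    """Counts successes and attempts per (task type, skill) pair."""

    def __init__(self):
        # task_type -> skill name (None = no skill used) -> [successes, attempts]
        self._counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def record(self, task_type, skill, success):
        entry = self._counts[task_type][skill]
        entry[0] += int(success)
        entry[1] += 1

    def success_rate(self, task_type, skill):
        successes, attempts = self._counts[task_type][skill]
        return successes / attempts if attempts else None

    def lift(self, task_type, skill):
        """Success rate with the skill minus the rate without it.

        Returns None when either side has no observations.
        """
        with_skill = self.success_rate(task_type, skill)
        baseline = self.success_rate(task_type, None)
        if with_skill is None or baseline is None:
            return None
        return with_skill - baseline
```

A negative or near-zero lift is a signal that a skill may be worth revising or retiring, though small sample sizes make the number noisy.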
### Phase 3 — Self-Evolution (v2.0.0)
| Feature | Module | Description |
|---------|--------|-------------|
| Harness Versioning | `harness_snapshot.py` | Snapshots BEHAVIORS.md with fitness scores; rollback if quality drops |
| Failure Store | `failure_store.py` | Live failure event database with pattern detection |
| Role Structure | `role_manager.py` | SOLVER / VERIFIER / RESEARCHER / ORCHESTRATOR with permission matrices |
| Nightly Evaluation | `harness_evaluator.py` | Automated scoring at 3 AM with stagnation detection and alerts |
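
The versioning and rollback behavior can be sketched as follows, assuming each candidate is the full text of BEHAVIORS.md paired with a single fitness number. The `HarnessHistory` class, the `tolerance` parameter, and the stagnation rule are illustrative guesses, not the logic in `harness_snapshot.py` or `harness_evaluator.py`.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    version: int
    content: str     # full text of BEHAVIORS.md at snapshot time
    fitness: float


class HarnessHistory:
    def __init__(self, tolerance=0.0):
        self.snapshots = []
        self.tolerance = tolerance   # allowed drop before rolling back

    def best(self):
        return max(self.snapshots, key=lambda s: s.fitness, default=None)

    def submit(self, content, fitness):
        """Record a candidate and return the content that should stay active."""
        best = self.best()
        self.snapshots.append(Snapshot(len(self.snapshots) + 1, content, fitness))
        if best is not None and fitness < best.fitness - self.tolerance:
            return best.content      # candidate is worse: roll back
        return content

    def is_stagnant(self, window=3, min_gain=0.01):
        """True when the last `window` scores fail to beat the earlier best by min_gain."""
        if len(self.snapshots) <= window:
            return False
        earlier_best = max(s.fitness for s in self.snapshots[:-window])
        recent_best = max(s.fitness for s in self.snapshots[-window:])
        return recent_best < earlier_best + min_gain
```

Rejected candidates are still stored, so the history remains a complete record of what was tried and how it scored.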
---
## Memory System
GenoClaw uses a multi-tier memory architecture ported from [Nous Research's Hermes Agent](https://github.com/NousResearch/hermes-agent):
- **Hot tier** — always loaded into session context (~5 files)
- **Warm tier** — consolidated references, loaded on demand
- **Cold tier** — archived, searchable via FTS5
- **Semantic** — mem0 with ChromaDB + Ollama embeddings
- **Per-channel** — Discord channel-specific learnings in `warm/channels/`
- **Heat decay** — exponential decay scoring by fact type, auto-tier transitions
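
Exponential decay by fact type could look roughly like the sketch below, where each fact type has its own half-life and the decayed score decides the tier. The fact types, half-lives, and thresholds are invented for illustration; the values used by `heat_scoring.py` are not given in this README.

```python
import math

# Hypothetical half-lives in hours, one per fact type.
HALF_LIFE_HOURS = {
    "preference": 720.0,
    "project_state": 72.0,
    "transient": 6.0,
}

# Hypothetical tier thresholds on the decayed score.
HOT_THRESHOLD = 0.6
WARM_THRESHOLD = 0.2


def heat(initial, age_hours, fact_type):
    """Exponential decay: the score halves every half-life for the fact type."""
    half_life = HALF_LIFE_HOURS[fact_type]
    return initial * math.exp(-math.log(2) * age_hours / half_life)


def tier_for(score):
    """Map a decayed score to a memory tier."""
    if score >= HOT_THRESHOLD:
        return "hot"
    if score >= WARM_THRESHOLD:
        return "warm"
    return "cold"
```

Under these assumed numbers, a `transient` fact with initial heat 1.0 drops to 0.5 after 6 hours (warm) and to 0.0625 after 24 hours (cold), while a `preference` stays hot for weeks.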
The **nudge loop** runs every 30 minutes:
1. Queries full history (FTS5 + mem0 + prior reviews)
2. Identifies gap areas via heat fitness analysis
3. Proposes memories and skills (generator LLM)
4. Gates through skeptical evaluator (rejects junk)
5. Tracks skill outcomes for counterfactual diagnosis
6. Saves per-channel so context doesn't bleed
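
The propose, gate, and save steps above can be sketched as a single review pass. In this example the generator and the skeptical evaluator are plain callables standing in for LLM calls, history and gap areas are passed in already computed, and skill outcome tracking is left out. `Proposal`, `run_nudge_review`, and the dictionary store are hypothetical names, not the interface of `nudge_loop.py`.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    channel: str   # channel the learning belongs to
    kind: str      # "memory" or "skill"
    text: str


def run_nudge_review(history, gaps, generate, evaluate, store):
    """Run one review pass and return (accepted, rejected) proposals.

    generate(history, gaps) -> iterable of Proposal   (generator pass)
    evaluate(proposal, history) -> bool               (skeptical evaluator pass)
    store: dict mapping channel name -> list of accepted proposals
    """
    accepted, rejected = [], []
    for proposal in generate(history, gaps):
        if evaluate(proposal, history):
            # Save under the proposal's own channel so context does not bleed.
            store.setdefault(proposal.channel, []).append(proposal)
            accepted.append(proposal)
        else:
            rejected.append(proposal)
    return accepted, rejected
```

Nothing reaches the store without passing the evaluator, which is the point of the second pass: the generator can be liberal because the gate is strict.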
---
## Discord Integration
Native real-time delivery via Claude Code's `--channels` flag:
```bash
claude --channels plugin:discord@claude-plugins-official
```
Mess
