autospec

provenance:github:murzua7/autospec

WHAT THIS AGENT DOES

What it does: autospec is like a tireless bug hunter for your software. It automatically analyzes your code and tries to find hidden errors that traditional testing might miss. It does this by repeatedly checking every possible scenario, ensuring your software behaves correctly in all situations. What problem it solves: Many software systems have subtle, hard-to-find bugs that can cause serious problems – think unexpected outages or data corruption. These bugs are often too complex for standard testing to uncover. autospec helps eliminate these risks by finding these "edge case" errors before they impact your customers. Who would use it: Anyone responsible for the reliability and quality

View Source ↗First seen 4mo agoNot yet hireable

USE CASES

Testing DevOps Automation

README

<p align="center">
  <img src="https://img.shields.io/badge/TLA%2B-Model%20Checked-blue?style=for-the-badge" />
  <img src="https://img.shields.io/badge/Python-3.10%2B-green?style=for-the-badge&logo=python&logoColor=white" />
  <img src="https://img.shields.io/badge/Claude-Anthropic-orange?style=for-the-badge" />
  <img src="https://img.shields.io/badge/Tests-29%20passed-brightgreen?style=for-the-badge" />
</p>

# autospec

**A self-supervising formal verification loop that uses an LLM agent + the TLC model checker to find bugs no test suite ever will.**

Inspired by [Karpathy's autoresearch](https://github.com/karpathy/autoresearch) — but instead of optimizing training loss, we're eliminating bugs. The model checker is the fixed evaluator: mathematically exhaustive, deterministic, and impossible to game.

---

## The Problem

Tests check the cases you think of. Formal verification checks **every** case.

AWS found a bug in a distributed protocol that required a [35-step error trace](https://cacm.acm.org/research/how-amazon-web-services-uses-formal-methods/) — a sequence so specific that no human would write a test for it, and no fuzzer would find it in a reasonable time. TLA+ and its model checker TLC found it by exploring **every reachable state**.

The barrier? Writing TLA+ specifications is hard. It requires translating code into a formal state machine, defining invariants, and iterating when the spec is wrong.

**autospec automates this entire process.**

## How It Works

An LLM agent sits in a tight loop with the TLC model checker — the same tool AWS uses to verify S3, DynamoDB, and EBS:

```
         ┌─────────────────────────────────────────────┐
         │              program.md                      │
         │        (strategy — human-curated)            │
         └──────────────────┬──────────────────────────┘
                            │
    ┌───────────────────────▼───────────────────────────┐
    │                                                    │
    │   1. READ   target code                            │
    │   2. WRITE  TLA+ spec (.tla + .cfg)                │
    │   3. CHECK  run TLC model checker  ◄── IMMUTABLE   │
    │   4. PARSE  violations + counterexample traces     │
    │   5. DECIDE                                        │
    │      ├── Real bug?  → fix the code                 │
    │      ├── Spec error? → refine the spec             │
    │      └── Abstraction gap? → adjust level           │
    │   6. COMMIT  git (keep if improved, revert if not) │
    │   7. GOTO 1  ── NEVER STOP                         │
    │                                                    │
    └────────────────────────────────────────────────────┘
```

### The Autoresearch Analogy

| [autoresearch](https://github.com/karpathy/autoresearch) | autospec | Why it matters |
|---|---|---|
| `train.py` (mutable genome) | TLA+ specs + target code | Agent iterates on both |
| `prepare.py` (fixed evaluator) | `prepare.py` (TLC harness) | SHA256-verified each iteration |
| `val_bpb` (scalar fitness) | Violation count + **counterexample traces** | Structured feedback >> scalar |
| `program.md` (strategy) | `program.md` (strategy) | Encode verification intuition |
| `git reset` on regression | `git reset --hard HEAD~1` | Only improvements survive |
| 5-min training budget | 5-min TLC timeout | Bounded evaluation window |

**But autospec is strictly stronger** — TLC is a *complete* evaluator. It explores every reachable state, not a sample. A training loss can be gamed; an exhaustive model checker cannot.

## Quick Start

```bash
# Clone
git clone https://github.com/murzua7/autospec.git
cd autospec

# Install
pip install -e .

# TLC downloads automatically on first run (requires Java 11+)
java -version

# Run TLC on the example spec (no API key needed)
python -m autospec --check specs/BankTransfer.tla

# Run the full loop (queue mode — no API key needed, works with Claude Code Max)
python -m autospec --target /path/to/your/code --max-iters 100

# Or with direct API mode (requires credits)
python -m autospec --target /path/to/your/code --mode api
```

### Agent Modes

| Mode | Flag | Requires | How it works |
|------|------|----------|-------------|
| **Queue** (default) | `--mode queue` | A separate LLM process (e.g. Claude Code) | Writes prompts to `llm_queue/requests/`, polls for responses in `llm_queue/responses/` |
| **API** | `--mode api` | `ANTHROPIC_API_KEY` with credits | Direct Anthropic API call each iteration |

Queue mode is the default because it works with Claude Code Max subscriptions (no API credits needed). A separate Claude Code session or any LLM process reads the queue and writes responses.

## What TLC Finds (That Tests Can't)

### Example: Bank Transfer Race Condition

The included example (`target/example/bank_transfer.py`) has a deliberate TOCTOU bug in a concurrent bank transfer:

```python
def transfer(self, from_id, to_id, amount):
    current_balance = from_acc.balance      # READ  (non-atomic)
    if current_balance < amount:            # CHECK (stale value!)
        return False
    from_acc.balance = current_balance - amount  # WRITE (stale!)
    to_acc.balance += amount
```

You could run the threaded version 10,000 times and maybe see the bug once. **TLC finds it in <2 seconds by exploring all 2,836 reachable states:**

```
============================================================
TLC RESULT: BankTransfer.tla
============================================================
Passed: NO
Violations: 1 (Invariant: MoneyConserved)

VIOLATION 1: [INVARIANT] MoneyConserved
  Trace (5 steps):
    State 1: balance = (a1:0, a2:1), init_total = 1
    State 2: t1 starts transfer from a2→a1
    State 3: t1 reads balance[a2] = 1          ← snapshot
    State 4: t1 checks: 1 >= 1, proceeds
    State 5: t1 writes balance[a2] = 1-1 = 0   ← but total is now 0 ≠ 1
============================================================
```

The counterexample trace tells the agent *exactly* what went wrong — not just "it failed," but the precise 5-step sequence of states that violates money conservation.

## Self-Verification: autospec Verified Itself

We ran autospec on its own codebase. TLC verified 3 critical subsystems across **5,423 states** and **22 invariants**:

| Spec | States | Invariants | What it proved |
|------|--------|-----------|---------------|
| `AcceptRejectGate.tla` | 244 | 8 | Loop never accepts a regression; reverts only on DISCARD; CLEAN = zero violations |
| `FileSandbox.tla` | 2,304 | 7 | Forbidden files never written; path traversal blocked; only allowed dirs writable |
| `IntegrityCheck.tla` | 39 | 4 | Evaluator integrity verified before every trust boundary |

### Bugs Found in autospec by autospec

**1. Defense-in-depth gap** (IntegrityCheck.tla, 10-step trace)

TLC found that the integrity check could be skipped between `apply_changes()` and `evaluate_all_specs()`. Under adversarial conditions, a compromised evaluator could run for one iteration before detection.

*Fixed:* Added second integrity check right before the evaluation trust boundary.

**2. Stale state between iterations** (AcceptRejectGate.tla)

TLC showed the gate decision variable persisted across iteration boundaries — a state machine hygiene issue that could mask bugs in future changes.

*Fixed:* Explicit state reset in the Evaluate action.

## Architecture

```
autospec/
├── prepare.py          # IMMUTABLE evaluator (TLC harness + parser)
│                       #   SHA256 integrity-checked every iteration
├── autospec.py         # Main loop orchestrator (agent + git + accept/reject)
├── program.md          # Agent strategy document (TLA+ guidelines + heuristics)
├── __main__.py         # CLI entry point
├── specs/              # TLA+ specifications (agent-mutable)
│   ├── AcceptRejectGate.tla/.cfg  # Self-spec: loop logic
│   ├── FileSandbox.tla/.cfg       # Self-spec: security sandbox
│   ├── IntegrityCheck.tla/.cfg    # Self-spec: evaluator integri

[truncated…]

PUBLIC HISTORY

First discoveredMar 26, 2026

IDENTITY

inferred

Identity inferred from code signals. No PROVENANCE.yml found.

Is this yours? Claim it →

METADATA

platformgithub

first seenMar 15, 2026

last updatedMar 25, 2026

last crawled3 months ago

version—

RELATED AGENTS

askimo

Askimo is a platform that lets you interact with artificial intelligence in a simple way, whether through chatting, sear

blog-writer-multi-agents

Here's a plain English summary of the blog-writer-multi-agents AI agent: This agent automatically creates professional-

boss-skill

This agent, boss-skill, is designed to help employees navigate challenging workplace dynamics, particularly those involv

strands-multi-engineer-agent

This agent helps businesses understand how different AI models perform when tackling the same engineering task. It runs

J.E.L.L.Y._AI

J.E.L.L.Y._AI is an article writing AI developed by its creator. This repository is publicly available for job seeking p

More Testing agents →

README BADGE

Add to your README:

![Provenance](https://getprovenance.dev/api/badge?id=provenance:github:murzua7/autospec)