eval-view
WHAT THIS AGENT DOES

Eval-View helps ensure your AI agents keep working as expected over time. It identifies subtle changes in agent behavior even when the system appears to be functioning normally, addressing the problem of "silent regressions": updates to AI models or providers can alter an agent's actions without breaking standard tests. Developers, startups, and small AI teams can use Eval-View to track these changes, determine whether they stem from external factors or internal issues, and automatically fix minor problems. What sets it apart is that it not only detects differences but also classifies them, enabling targeted review, automated correction, and ultimately more reliable agent performance.

README
<!-- mcp-name: io.github.hidai25/evalview-mcp -->
<!-- keywords: AI agent testing, regression detection, golden baselines -->

<p align="center">
  <img src="assets/logo.png" alt="EvalView" width="350">
  <br>
  <strong>The open-source behavior regression gate for AI agents.</strong><br>
  Think Playwright, but for tool-calling and multi-turn AI agents.
</p>

<p align="center">
  <a href="https://pypi.org/project/evalview/"><img src="https://img.shields.io/pypi/v/evalview.svg?label=release" alt="PyPI version"></a>
  <a href="https://pypi.org/project/evalview/"><img src="https://img.shields.io/pypi/dm/evalview.svg?label=downloads" alt="PyPI downloads"></a>
  <a href="https://github.com/hidai25/eval-view/stargazers"><img src="https://img.shields.io/github/stars/hidai25/eval-view?style=social" alt="GitHub stars"></a>
  <a href="https://github.com/hidai25/eval-view/actions/workflows/ci.yml"><img src="https://github.com/hidai25/eval-view/actions/workflows/ci.yml/badge.svg" alt="CI"></a>
  <a href="https://opensource.org/licenses/Apache-2.0"><img src="https://img.shields.io/badge/License-Apache_2.0-blue.svg" alt="License"></a>
  <a href="https://github.com/hidai25/eval-view/graphs/contributors"><img src="https://img.shields.io/github/contributors/hidai25/eval-view" alt="Contributors"></a>
</p>

---

Your agent can still return `200` and be wrong. A model or provider update can change tool choice, skip a clarification, or degrade output quality without changing your code or breaking a health check. **EvalView catches those silent regressions before users do.**

**You don't need frontier-lab resources to run a serious agent regression loop.** EvalView gives solo devs, startups, and small AI teams the same core discipline: snapshot behavior, detect drift, classify changes, and review or heal them safely.

**Traditional tests tell you if your agent is up. EvalView tells you if it still behaves correctly.** It tracks drift across outputs, tools, model IDs, and runtime fingerprints, so you can tell "the provider changed" from "my system regressed."

[![demo.gif](assets/demo.gif)](https://github.com/user-attachments/assets/96d8b5f7-3561-44a1-86a4-270fb0d1d8a6)

**30-second live demo.**


Most eval tools stop at detection and comparison. EvalView goes further: it classifies changes, lets you inspect drift, and auto-heals the safe cases.

- Catch silent regressions that normal tests miss
- Separate provider/model drift from real system regressions
- Auto-heal flaky failures with retries, review gates, and audit logs

Built for **frontier-lab rigor, startup-team practicality**:
- targeted behavior runs instead of giant always-on eval suites
- deterministic diffs first, LLM judgment where it adds signal
- faster loops from change → eval → review → ship

[How we run EvalView with this operating model →](docs/OPERATING_MODEL.md)

```
  ✓ login-flow           PASSED
  ⚠ refund-request       TOOLS_CHANGED
      - lookup_order → check_policy → process_refund
      + lookup_order → check_policy → process_refund → escalate_to_human
  ✗ billing-dispute      REGRESSION  -30 pts
      Score: 85 → 55  Output similarity: 35%
```
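The tool-sequence diff shown above can be produced with a plain deterministic comparison, no LLM required. A minimal sketch of the idea (this is an illustration, not EvalView's implementation; the sequences mirror the `refund-request` example):

```python
import difflib

# Tool sequences observed in the baseline and the current run
# (names mirror the "refund-request" example above; purely illustrative).
baseline = ["lookup_order", "check_policy", "process_refund"]
current = ["lookup_order", "check_policy", "process_refund", "escalate_to_human"]

def diff_tool_sequences(old, new):
    """Return only the added/removed lines from a tool-call sequence diff."""
    return [
        line for line in difflib.ndiff(old, new)
        if line.startswith(("- ", "+ "))
    ]

print(diff_tool_sequences(baseline, current))
# The only drift here is one appended tool: ["+ escalate_to_human"]
```

Because this layer is pure string comparison, it is free, offline, and fully reproducible, which is why deterministic diffs come before any LLM judgment.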

## Quick Start

```bash
pip install evalview
```

```bash
evalview init        # Detect agent, auto-configure profile + starter suite
evalview snapshot    # Save current behavior as baseline
evalview check       # Catch regressions after every change
```

That's it. Three commands to regression-test any AI agent. `init` auto-detects your agent type (chat, tool-use, multi-step, RAG, coding) and configures the right evaluators, thresholds, and assertions.

<details>
<summary><strong>Other install methods</strong></summary>

```bash
curl -fsSL https://raw.githubusercontent.com/hidai25/eval-view/main/install.sh | bash
```

</details>

<details>
<summary><strong>No agent yet? Try the demo</strong></summary>

```bash
evalview demo        # See regression detection live (~30 seconds, no API key)
```

Or clone a real working agent with built-in tests:

```bash
git clone https://github.com/hidai25/evalview-support-automation-template
cd evalview-support-automation-template
make run
```

</details>

<details>
<summary><strong>More entry paths</strong></summary>

```bash
evalview generate --agent http://localhost:8000           # Generate tests from a live agent
evalview capture --agent http://localhost:8000/invoke      # Capture real user flows (runs assertion wizard after)
evalview capture --agent http://localhost:8000/invoke --multi-turn  # Multi-turn conversation as one test
evalview generate --from-log traffic.jsonl                # Generate from existing logs
evalview init --profile rag                               # Override auto-detected agent profile
```

</details>

## Why EvalView?

Use LangSmith for observability. Use Braintrust for scoring. **Use EvalView for regression gating.**

|  | LangSmith | Braintrust | Promptfoo | **EvalView** |
|---|:---:|:---:|:---:|:---:|
| **Primary focus** | Observability | Scoring | Prompt comparison | **Regression detection** |
| Tool call + parameter diffing | — | — | — | **Yes** |
| Golden baseline regression | — | Manual | — | **Automatic** |
| Silent model change detection | — | — | — | **Yes** |
| Auto-heal (retry + variant proposal) | — | — | — | **Yes** |
| PR comments with alerts | — | — | — | **Cost, latency, model change** |
| Works without API keys | No | No | Partial | **Yes** |
| Production monitoring | Tracing | — | — | **Check loop + Slack** |

[Detailed comparisons →](docs/COMPARISONS.md)

## What It Catches

| Status | Meaning | Action |
|--------|---------|--------|
| ✅ **PASSED** | Behavior matches baseline | Ship with confidence |
| ⚠️ **TOOLS_CHANGED** | Different tools called | Review the diff |
| ⚠️ **OUTPUT_CHANGED** | Same tools, output shifted | Review the diff |
| ❌ **REGRESSION** | Score dropped significantly | Fix before shipping |
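The statuses in the table can be thought of as a simple precedence over two signals: the score delta and the tool-call diff. A minimal sketch, where the threshold and the precedence order are assumptions for illustration rather than EvalView's actual logic:

```python
def classify(baseline_tools, current_tools, baseline_score, current_score,
             regression_threshold=20, output_changed=False):
    """Map one test result onto PASSED / TOOLS_CHANGED / OUTPUT_CHANGED / REGRESSION.

    The 20-point threshold and the precedence rules here are illustrative
    assumptions, not EvalView's actual decision logic.
    """
    if baseline_score - current_score >= regression_threshold:
        return "REGRESSION"            # score dropped significantly
    if baseline_tools != current_tools:
        return "TOOLS_CHANGED"         # different tools called
    if output_changed:
        return "OUTPUT_CHANGED"        # same tools, output shifted
    return "PASSED"                    # behavior matches baseline

print(classify(["lookup_order"], ["lookup_order"], 85, 85))               # PASSED
print(classify(["lookup_order"], ["lookup_order", "escalate"], 85, 80))   # TOOLS_CHANGED
print(classify(["lookup_order"], ["lookup_order"], 85, 55))               # REGRESSION
```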

### Model / Runtime Change Detection

EvalView does more than compare `model_id`.

- **Declared model change**: adapter-reported model changed from baseline
- **Runtime fingerprint change**: observed model labels in the trace changed, even when the top-level model name is missing
- **Coordinated drift**: multiple tests shift together in the same check run, which often points to a silent provider rollout or runtime change

When detected, `evalview check` surfaces a run-level signal with a classification (`declared` or `suspected`), confidence level, and evidence from fingerprints, retries, and affected tests.

If the new behavior is correct, rerun `evalview snapshot` to accept the updated baseline.
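The coordinated-drift signal can be pictured as a run-level heuristic: when a large fraction of tests shift in the same check run and share the same new runtime fingerprint, a silent provider rollout is the likely cause. A hypothetical sketch of that heuristic (the data shapes, threshold, and fingerprint strings are all assumptions, not EvalView's internals):

```python
from collections import Counter

def coordinated_drift(results, min_fraction=0.5):
    """Flag a run-level 'suspected' model change when many tests drift together.

    `results` maps test name -> observed runtime fingerprint, with "" meaning
    no change from baseline. The threshold and return shape are illustrative.
    """
    changed = {name: fp for name, fp in results.items() if fp}
    if not changed or len(changed) < min_fraction * len(results):
        return None
    # The most common new fingerprint among drifted tests is the evidence.
    fingerprint, count = Counter(changed.values()).most_common(1)[0]
    return {"classification": "suspected", "fingerprint": fingerprint,
            "affected": count, "total": len(results)}

run = {"login-flow": "",
       "refund-request": "gpt-4o-2024-11",      # hypothetical fingerprint
       "billing-dispute": "gpt-4o-2024-11"}
print(coordinated_drift(run))
```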

**Four scoring layers** — the first two are free and offline:

| Layer | What it checks | Cost |
|-------|---------------|------|
| **Tool calls + sequence** | Exact tool names, order, parameters | Free |
| **Code-based checks** | Regex, JSON schema, contains/not_contains | Free |
| **Semantic similarity** | Output meaning via embeddings | ~$0.00004/test |
| **LLM-as-judge** | Output quality scored by LLM (GPT, Claude, Gemini, DeepSeek, Ollama) | ~$0.01/test |

```
Score Breakdown
  Tools 100% ×30%    Output 42/100 ×50%    Sequence ✓ ×20%    = 54/100
  ↑ tools were fine   ↑ this is the problem
```
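One way to read the breakdown above is as layer scores combined under per-layer weights. A minimal sketch of a simple linear blend, with the caveat that the weights and formula here are assumptions for illustration; EvalView's actual aggregation may differ:

```python
def blended_score(tool_score, output_score, sequence_ok,
                  weights=(0.30, 0.50, 0.20)):
    """Blend per-layer scores (0-100) into one overall score.

    A simple weighted sum, shown only to illustrate the weighted-layer
    idea; not EvalView's exact formula.
    """
    w_tools, w_output, w_seq = weights
    seq_score = 100 if sequence_ok else 0
    return round(tool_score * w_tools + output_score * w_output + seq_score * w_seq)

print(blended_score(100, 60, True))   # 30 + 30 + 20 = 80
print(blended_score(100, 60, False))  # sequence mismatch drops it to 60
```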

## CI/CD Integration

Block broken agents in every PR. One step — PR comments, artifacts, and job summary are automatic.

```yaml
# .github/workflows/evalview.yml — copy this, add your secret, done
name: EvalView Agent Check
on: [pull_request, push]

jobs:
  agent-check:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v4

      - name: Check for agent regressions
        uses: hidai25/eval-view@main
        with:
          openai-api-key: ${{ secrets.OPENAI_API_KEY }}
```

<details>
<summary><strong>What lands on your PR</strong></summary>

```
## ✅ EvalView: PASSED

| Metric | Value |
|--------|-------|
| Tests | 5/5 unchanged (100%) |

---
*Generated by EvalView*
```

When something breaks:

```
## ❌ EvalView: REGRESSION

> **A
```

</details>

[truncated…]

PUBLIC HISTORY

First discovered: Mar 21, 2026

IDENTITY

Identity inferred from code signals. No PROVENANCE.yml found.

METADATA

platform: github
first seen: Nov 17, 2025
last updated: Mar 21, 2026

README BADGE

Add to your README:

![Provenance](https://getprovenance.dev/api/badge?id=provenance:github:hidai25/eval-view)