<div align="center">
<h1>ResearchClawBench</h1>
</div>
<div align="center">
[Homepage](https://InternScience.github.io/ResearchClawBench-Home/) |
[GitHub](https://github.com/InternScience/ResearchClawBench) |
[Dataset](https://huggingface.co/datasets/InternScience/ResearchClawBench) |
[License](LICENSE) |
[Python](https://www.python.org/) |
[Domains](#-scientific-domains)
**Evaluating AI Agents for Automated Research from Re-Discovery to New-Discovery**
[Quick Start](#-quick-start) | [Submit Tasks](#-submit-new-tasks) | [How It Works](#%EF%B8%8F-how-it-works) | [Domains](#-scientific-domains) | [Leaderboard](#-leaderboard) | [Add Your Agent](#-add-your-own-agent)
</div>
<p align="center">
<img src="assets/teaser.png" alt="ResearchClawBench Overview" width="600">
</p>
---
ResearchClawBench is a benchmark that measures whether AI coding agents can **independently conduct scientific research** — from reading raw data to producing publication-quality reports — and then rigorously evaluates the results against **real human-authored papers**.
Unlike benchmarks that test coding ability or factual recall, ResearchClawBench asks: *given the same data and tools a human researcher had, can an AI agent arrive at the same (or better) scientific conclusions?*
## Overview
### ✨ Highlights
<table>
<tr>
<td align="center" width="25%">🔄<br/><b>Two-Stage Pipeline</b><br/><sub>Autonomous research + rigorous peer-review-style evaluation</sub></td>
<td align="center" width="25%">🧪<br/><b>40 Real-Science Tasks</b><br/><sub>10 disciplines, complete datasets from published papers</sub></td>
<td align="center" width="25%">👁️<br/><b>Expert-Annotated Data</b><br/><sub>Tasks, checklists & datasets curated by domain experts</sub></td>
<td align="center" width="25%">🤖<br/><b>Multi-Agent Support</b><br/><sub>Claude Code, Codex CLI, OpenClaw, Nanobot & custom agents</sub></td>
</tr>
<tr>
<td align="center">🚀<br/><b>Re-Discovery to New-Discovery</b><br/><sub>50 = match the paper, 70+ = surpass it</sub></td>
<td align="center">📋<br/><b>Fine-Grained Checklist</b><br/><sub>Per-item keywords, weights & reasoning</sub></td>
<td align="center">📡<br/><b>Live Streaming UI</b><br/><sub>Watch agents code, plot & write in real-time</sub></td>
<td align="center">🍃<br/><b>Lightweight Dependencies</b><br/><sub>Pure Flask + vanilla JS, no heavy frameworks</sub></td>
</tr>
</table>
### 🎬 Demo
https://github.com/user-attachments/assets/94829265-80a8-4d61-a744-3800603de6d9
### 💡 Why ResearchClawBench?
Most AI benchmarks evaluate what models **know**. We evaluate what agents can **do**.
- **Real science, not toy problems.** 40 tasks sourced from published papers across 10 disciplines, each with complete experimental datasets.
- **Two-stage pipeline.** Autonomous research first, rigorous evaluation second — just like peer review.
- **Fine-grained, multimodal scoring.** A weighted checklist with text and image criteria, judged by an LLM acting as a strict peer reviewer.
- **Agent-agnostic.** Ships with first-class support for Claude Code, Codex CLI, and OpenClaw. Bring your own agent in one line.
- **From Re-Discovery to New-Discovery.** Scoring above 50 means matching the original paper; above 70 means *surpassing* it. The frontier is wide open.
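
The scoring bands above can be summarized in a few lines. The tier names and thresholds come from this README; the helper function itself is purely illustrative:

```python
def discovery_tier(score: float) -> str:
    """Map an overall benchmark score (0-100) to the tiers named in this README.

    Per the README: scoring above 50 means matching the original paper
    (Re-Discovery); above 70 means surpassing it (New-Discovery).
    """
    if score > 70:
        return "New-Discovery"
    if score > 50:
        return "Re-Discovery"
    return "Below paper baseline"
```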
### 📢 News
- **2026-03-30** 🧬 Added built-in [EvoScientist](https://github.com/EvoScientist/EvoScientist) support and clarified multimodal judge prompting so the first attached image is explicitly treated as the ground-truth figure.
- **2026-03-27** 🤗 Released a Hugging Face dataset mirror at [InternScience/ResearchClawBench](https://huggingface.co/datasets/InternScience/ResearchClawBench), including 10 additional tasks from ResearchClawBench-Self and a task downloader script.
- **2026-03-27** 📨 Opened the [ResearchClawBench submission Space](https://huggingface.co/spaces/InternScience/ResearchClawBench-Task-Submit) for community task uploads. New tasks are validated there and reviewed through Hugging Face dataset PRs instead of being added to this GitHub repository.
- **2026-03-20** 🐈 Added [Nanobot](https://github.com/HKUDS/nanobot) as a new agent — ultra-lightweight OpenClaw alternative with reliable multi-step tool execution. Agent config moved to `agents.json` for easy customization.
- **2026-03-19** 🚀 Initial release with Claude Code, Codex CLI, and OpenClaw support. 40 tasks across 10 scientific domains.
---
## Understanding The Benchmark
### 🏗️ Data Construction
Every task in ResearchClawBench is built through a rigorous, expert-driven pipeline to ensure scientific validity and reproducibility:
```mermaid
flowchart TD
A["📄 High-Quality Paper Collection\n(Target Paper)"] --> B["🧑‍🔬 Human Expert Extraction\n(Core Task Instructions)"]
B --> C["📋 Evaluation Checklist\n(Criteria + Keywords + Weights)"]
B --> D["📂 Data & Related Work Collection\n(Datasets + Reference Papers)"]
C --> E["✅ Human Reproduction & Validation\n(Verify checklist is reproducible)"]
D --> E
style A fill:#e0f2fe,stroke:#0284c7,stroke-width:2px
style B fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style C fill:#fce7f3,stroke:#ec4899,stroke-width:2px
style D fill:#f0fdf4,stroke:#22c55e,stroke-width:2px
style E fill:#f5f3ff,stroke:#8b5cf6,stroke-width:2px
```
1. **High-Quality Paper Collection** — Domain experts select recent, high-impact publications with clear methodology and reproducible results across 10 scientific disciplines.
2. **Expert Task Extraction** — Human experts read each paper and distill the core research task into structured instructions, identifying the key scientific question, input data, and expected outputs.
3. **Checklist Design** — Experts create a fine-grained evaluation checklist with weighted criteria (text and image items), each with specific technical keywords that a judge must verify.
4. **Data & Related Work Collection** — The original datasets used in the paper are gathered, along with relevant reference materials, to form a self-contained research workspace.
5. **Human Reproduction & Validation** — Human researchers independently reproduce the paper's results using only the provided data and instructions, verifying that every checklist item is achievable. This ensures the benchmark is fair and the checklist is grounded in reality.
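
The checklist produced in step 3 can be pictured as a list of weighted text and image criteria, each carrying keywords for the judge to verify. The benchmark's actual on-disk schema is not shown in this README, so the field names below are illustrative only:

```python
# Hypothetical checklist entries: field names are illustrative,
# not the benchmark's actual schema.
checklist = [
    {"type": "text",  "criterion": "Reports the correct effect size",
     "keywords": ["effect size", "confidence interval"], "weight": 0.3},
    {"type": "image", "criterion": "Reproduces the main scaling figure",
     "keywords": ["log-log", "power law"], "weight": 0.7},
]

def weighted_score(judgments: list[float], items: list[dict]) -> float:
    """Aggregate per-item judge scores (each in [0, 1]) into a 0-100 total,
    weighting each item by its checklist weight."""
    total_weight = sum(item["weight"] for item in items)
    raw = sum(j * item["weight"] for j, item in zip(judgments, items))
    return 100.0 * raw / total_weight
```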
### ⚙️ How It Works
ResearchClawBench operates in two distinct stages:
```mermaid
flowchart LR
subgraph Stage1["Stage 1 — Auto Research"]
A["Raw Data\n+ Instructions"] --> B["AI Agent\n(autonomous)"]
B --> C["Code\n+ Figures\n+ Report"]
end
subgraph Stage2["Stage 2 — Evaluation"]
C --> D["LLM Judge"]
E["Target Paper\n+ Checklist"] --> D
D --> F["Per-Item Scores\n+ Reasoning"]
end
style Stage1 fill:#f0f4ff,stroke:#3b82f6,stroke-width:2px
style Stage2 fill:#fff7ed,stroke:#f59e0b,stroke-width:2px
```
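
The two stages in the diagram can be sketched as a simple driver loop. Here `run_agent` and `judge_item` are hypothetical stand-ins for the real agent invocation and LLM-judge call, which this README does not spell out:

```python
from typing import Callable

def evaluate_task(run_agent: Callable[[str], dict],
                  judge_item: Callable[[dict, dict], float],
                  workspace: str,
                  checklist: list[dict]) -> dict:
    """Sketch of the two-stage flow (function names are hypothetical):
    Stage 1 -- the agent researches autonomously in `workspace`, producing
    artifacts (code, figures, report); Stage 2 -- a judge scores each
    checklist item against those artifacts, yielding per-item scores and
    a weighted overall score on a 0-100 scale."""
    artifacts = run_agent(workspace)                              # Stage 1: Auto Research
    scores = [judge_item(item, artifacts) for item in checklist]  # Stage 2: Evaluation
    raw = sum(s * item["weight"] for s, item in zip(scores, checklist))
    total_weight = sum(item["weight"] for item in checklist)
    return {"per_item": scores, "overall": 100.0 * raw / total_weight}
```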
#### Stage 1: Autonomous Research
<div align="center">
<img src="assets/auto-research.png" width="90%" />
<p><em>Auto Research view — file explorer, live code output, and real-time agent conversation</em></p>
</div>
The AI agent receives a workspace containing raw datasets, reference materials, and task instructions. It must independently:
1. **Explore** the data and understand the research question
2. **Write code** to analyze, model, and visualize