<div align="center">
<h1>ResearchClawBench</h1>
</div>
<div align="center">
[Homepage](https://InternScience.github.io/ResearchClawBench-Home/) |
[GitHub](https://github.com/InternScience/ResearchClawBench) |
[Dataset](https://huggingface.co/datasets/InternScience/ResearchClawBench) |
[License](LICENSE) |
[Python](https://www.python.org/) |
[Domains](#-scientific-domains)
**Evaluating AI Agents for Automated Research from Re-Discovery to New-Discovery**
[Quick Start](#-quick-start) | [Submit Tasks](#-submit-new-tasks) | [How It Works](#%EF%B8%8F-how-it-works) | [Domains](#-scientific-domains) | [Leaderboard](#-leaderboard) | [Add Your Agent](#-add-your-own-agent)
</div>
<p align="center">
<img src="assets/teaser.png" alt="ResearchClawBench Overview" width="600">
</p>
---
ResearchClawBench is a benchmark that measures whether AI coding agents can **independently conduct scientific research** — from reading raw data to producing publication-quality reports — and then rigorously evaluates the results against **real human-authored papers**.
Unlike benchmarks that test coding ability or factual recall, ResearchClawBench asks: *given the same data and tools a human researcher had, can an AI agent arrive at the same (or better) scientific conclusions?*
## Overview
### ✨ Highlights
<table>
<tr>
<td align="center" width="25%">🔄<br/><b>Two-Stage Pipeline</b><br/><sub>Autonomous research + rigorous peer-review-style evaluation</sub></td>
<td align="center" width="25%">🧪<br/><b>40 Real-Science Tasks</b><br/><sub>10 disciplines, complete datasets from published papers</sub></td>
<td align="center" width="25%">👁️<br/><b>Expert-Annotated Data</b><br/><sub>Tasks, checklists & datasets curated by domain experts</sub></td>
<td align="center" width="25%">🤖<br/><b>Multi-Agent Support</b><br/><sub>Claude Code, Codex CLI, OpenClaw, Nanobot & custom agents</sub></td>
</tr>
<tr>
<td align="center">🚀<br/><b>Re-Discovery to New-Discovery</b><br/><sub>50 = match the paper, 70+ = surpass it</sub></td>
<td align="center">📋<br/><b>Fine-Grained Checklist</b><br/><sub>Per-item keywords, weights & reasoning</sub></td>
<td align="center">📡<br/><b>Live Streaming UI</b><br/><sub>Watch agents code, plot & write in real-time</sub></td>
<td align="center">🍃<br/><b>Lightweight Dependencies</b><br/><sub>Pure Flask + vanilla JS, no heavy frameworks</sub></td>
</tr>
</table>
### 🎬 Demo
https://github.com/user-attachments/assets/94829265-80a8-4d61-a744-3800603de6d9
### 💡 Why ResearchClawBench?
Most AI benchmarks evaluate what models **know**. We evaluate what agents can **do**.
- **Real science, not toy problems.** 40 tasks sourced from published papers across 10 disciplines, each with complete experimental datasets.
- **Two-stage pipeline.** Autonomous research first, rigorous evaluation second — just like peer review.
- **Fine-grained, multimodal scoring.** A weighted checklist with text and image criteria, judged by an LLM acting as a strict peer reviewer.
- **Agent-agnostic.** Ships with first-class support for Claude Code, Codex CLI, and OpenClaw. Bring your own agent in one line.
- **From Re-Discovery to New-Discovery.** Scoring above 50 means matching the original paper; above 70 means *surpassing* it. The frontier is wide open.
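
The scoring bands above can be summarized in a few lines. The tier names and thresholds come from this README; the helper function itself is purely illustrative:

```python
def discovery_tier(score: float) -> str:
    """Map an overall benchmark score (0-100) to the tiers named in this README.

    Per the README: scoring above 50 means matching the original paper
    (Re-Discovery); above 70 means surpassing it (New-Discovery).
    """
    if score > 70:
        return "New-Discovery"
    if score > 50:
        return "Re-Discovery"
    return "Below paper baseline"
```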
### 📢 News
- **2026-03-30** 🧬 Added built-in [EvoScientist](https://github.com/EvoScientist/EvoScientist) support and clarified multimodal judge prompting so the first attached image is explicitly treated as the ground-truth figure.
- **2026-03-27** 🤗 Released a Hugging Face dataset mirror at [InternScience/ResearchClawBench](https://huggingface.co/datasets/InternScience/ResearchClawBench), including 10 additional tasks from ResearchClawBench-Self and a task downloader script.
- **2026-03-27** 📨 Opened the [ResearchClawBench submission Space](https://huggingface.co/spaces/InternScience/ResearchClawBench-Task-Submit) for community task uploads. New tasks are validated there and reviewed through Hugging Face dataset PRs instead of being added to this GitHub repository.
- **2026-03-20** 🐈 Added [Nanobot](https://github.com/HKUDS/nanobot) as a new agent — ultra-lightweight OpenClaw alternative with reliable multi-step tool execution. Agent config moved to `agents.json` for easy customization.
- **2026-03-19** 🚀 Initial release with Claude Code, Codex CLI, and OpenClaw support. 40 tasks across 10 scientific domains.
---
## Understanding The Benchmark
### 🏗️ Data Construction
Every task in ResearchClawBench is built through a rigorous, expert-driven pipeline to ensure scientific validity and reproducibility:
```mermaid
flowchart TD
A["📄 High-Quality Paper Collection\n(Target Paper)"] --> B["🧑‍🔬 Human Expert Extraction\n(Core Task Instructions)"]
B --> C["📋 Evaluation Checklist\n(Criteria + Keywords + Weights)"]
B --> D["📂 Data & Related Work Collection\n(Datasets + Reference Papers)"]
C --> E["✅ Human Reproduction & Validation\n(Verify checklist is reproducible)"]
D --> E
style A fill:#e0f2fe,stroke:#0284c7,stroke-width:2px
style B fill:#fef3c7,stroke:#f59e0b,stroke-width:2px
style C fill:#fce7f3,stroke:#ec4899,stroke-width:2px
style D fill:#f0fdf4,stroke:#22c55e,stroke-width:2px
style E fill:#f5f3ff,stroke:#8b5cf6,stroke-width:2px
```
1. **High-Quality Paper Collection** — Domain experts select recent, high-impact publications with clear methodology and reproducible results across 10 scientific disciplines.
2. **Expert Task Extraction** — Human experts read each paper and distill the core research task into structured instructions, identifying the key scientific question, input data, and expected outputs.
3. **Checklist Design** — Experts create a fine-grained evaluation checklist with weighted criteria (text and image items), each with specific technical keywords that a judge must verify.
4. **Data & Related Work Collection** — The original datasets used in the paper are gathered, along with relevant reference materials, to form a self-contained research workspace.
5. **Human Reproduction & Validation** — Human researchers independently reproduce the paper's results using only the provided data and instructions, verifying that every checklist item is achievable. This ensures the benchmark is fair and the checklist is grounded in reality.
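
The checklist produced in step 3 can be pictured as a list of weighted text and image criteria, each carrying keywords for the judge to verify. The benchmark's actual on-disk schema is not shown in this README, so the field names below are illustrative only:

```python
# Hypothetical checklist entries: field names are illustrative,
# not the benchmark's actual schema.
checklist = [
    {"type": "text",  "criterion": "Reports the correct effect size",
     "keywords": ["effect size", "confidence interval"], "weight": 0.3},
    {"type": "image", "criterion": "Reproduces the main scaling figure",
     "keywords": ["log-log", "power law"], "weight": 0.7},
]

def weighted_score(judgments: list[float], items: list[dict]) -> float:
    """Aggregate per-item judge scores (each in [0, 1]) into a 0-100 total,
    weighting each item by its checklist weight."""
    total_weight = sum(item["weight"] for item in items)
    raw = sum(j * item["weight"] for j, item in zip(judgments, items))
    return 100.0 * raw / total_weight
```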
### ⚙️ How It Works
ResearchClawBench operates in two distinct stages:
```mermaid
flowchart LR
subgraph Stage1["Stage 1 — Auto Research"]
A["Raw Data\n+ Instructions"] --> B["AI Agent\n(autonomous)"]
B --> C["Code\n+ Figures\n+ Report"]
end
subgraph Stage2["Stage 2 — Evaluation"]
C --> D["LLM Judge"]
E["Target Paper\n+ Checklist"] --> D
D --> F["Per-Item Scores\n+ Reasoning"]
end
style Stage1 fill:#f0f4ff,stroke:#3b82f6,stroke-width:2px
style Stage2 fill:#fff7ed,stroke:#f59e0b,stroke-width:2px
```
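
The two stages in the diagram can be sketched as a simple driver loop. Here `run_agent` and `judge_item` are hypothetical stand-ins for the real agent invocation and LLM-judge call, which this README does not spell out:

```python
from typing import Callable

def evaluate_task(run_agent: Callable[[str], dict],
                  judge_item: Callable[[dict, dict], float],
                  workspace: str,
                  checklist: list[dict]) -> dict:
    """Sketch of the two-stage flow (function names are hypothetical):
    Stage 1 -- the agent researches autonomously in `workspace`, producing
    artifacts (code, figures, report); Stage 2 -- a judge scores each
    checklist item against those artifacts, yielding per-item scores and
    a weighted overall score on a 0-100 scale."""
    artifacts = run_agent(workspace)                              # Stage 1: Auto Research
    scores = [judge_item(item, artifacts) for item in checklist]  # Stage 2: Evaluation
    raw = sum(s * item["weight"] for s, item in zip(scores, checklist))
    total_weight = sum(item["weight"] for item in checklist)
    return {"per_item": scores, "overall": 100.0 * raw / total_weight}
```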
#### Stage 1: Autonomous Research
<div align="center">
<img src="assets/auto-research.png" width="90%" />
<p><em>Auto Research view — file explorer, live code output, and real-time agent conversation</em></p>
</div>
The AI agent receives a workspace containing raw datasets, reference materials, and task instructions. It must independently:
1. **Explore** the data and understand the research question
2. **Write code** to analyze, model, and visualize