# Open-AgentRL
Open-AgentRL is a system that helps AI agents learn to perform complex tasks more effectively. It addresses the challenge of teaching AI to make good decisions in dynamic situations, such as managing a project or handling a customer service interaction. Businesses could use it to improve AI assistants, automate workflows, or build more responsive chatbots. What makes it special is that it continuously refines every part of the learning process (how the agent understands its environment, how it makes choices, and how it evaluates those choices) so that all parts improve together. This leads to faster learning and better outcomes than traditional methods that rely heavily on human input.
<div align="center">
<img src="figs/image2.png" width="330">
</div>
### RLAnything & DemyAgent: Open-Source RL for LLMs and Agentic Scenarios
<details>
<summary>
<b>RLAnything</b>
<a href="https://arxiv.org/abs/2602.02488">
<img src="https://img.shields.io/badge/Paper-Arxiv%202602.02488-red?logo=arxiv&logoColor=red" alt="Paper" height="18" />
</a>
<a href="https://huggingface.co/collections/Gen-Verse/open-agentrl">
<img src="https://img.shields.io/badge/Models-Policy%20&%20Reward-FFCC00?logo=huggingface&logoColor=yellow" alt="Model" height="18" />
</a>
<a href="https://yinjjiew.github.io/projects/rlanything/">
<img src="https://img.shields.io/badge/Blog-RLAnything-blue?logo=rss&logoColor=white" alt="Blog" height="18" />
</a>
<b>(click to expand)</b>
</summary>
<div align="center">
<h3>
RLAnything: Forge Environment, Policy, and Reward Model<br>
in Completely Dynamic RL System
</h3>
</div>
<table class="center"> <tr> <td width=100% style="border: none"><img src="figs/rlanythingoverview.png" style="width:100%"></td> </tr> <tr> <td width="100%" style="border: none; text-align: center; word-wrap: break-word">An overview of our research on RLAnything. </td> </tr> </table>
In this work, we propose RLAnything, a reinforcement learning framework that dynamically optimizes each component through closed-loop optimization, amplifying learning signals and strengthening the overall system:
* The policy is trained with integrated feedback that combines outcome rewards with step-wise signals from the reward model, outperforming training on outcome rewards alone.
* The reward model is jointly optimized via consistency feedback, which in turn further improves policy training.
* Our theory-motivated automatic environment adaptation improves training for both the reward and policy models by leveraging critic feedback from each, enabling learning from experience.
* Through extensive experiments, we demonstrate that each added component consistently improves the overall system.
* We show that step-wise signals from the optimized reward model outperform outcome signals that rely on human labels.
<p align="center">
<img src="figs/rlanythingpaperoverview.png" alt="Figure 1" width="600">
</p>
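The closed loop above can be sketched in a few lines of toy Python. Everything here is illustrative: `closed_loop_step`, the deterministic states, and the 1.1/0.9 difficulty factors are made-up stand-ins, not RLAnything's actual API or hyperparameters.

```python
def closed_loop_step(policy, reward_model, env):
    """One iteration of the closed-loop system (toy sketch, not the real API).

    policy:       maps a state (float) to an action score (float)
    reward_model: maps an action score to a step-wise reward (float)
    env:          dict with a "difficulty" knob
    """
    states = [env["difficulty"] * (i + 1) for i in range(4)]  # toy deterministic states
    actions = [policy(s) for s in states]
    outcome = 1.0 if sum(actions) > 0 else 0.0                # sparse outcome reward
    step_rewards = [reward_model(a) for a in actions]         # dense step-wise rewards

    # 1. Policy signal blends the outcome with step-wise rewards from the reward model.
    policy_signal = outcome + sum(step_rewards) / len(step_rewards)

    # 2. Consistency feedback for the reward model: its mean step score
    #    should agree with the trajectory outcome.
    consistency_loss = abs(sum(step_rewards) / len(step_rewards) - outcome)

    # 3. Environment adaptation: scale difficulty with the policy's success.
    env["difficulty"] *= 1.1 if outcome > 0 else 0.9
    return policy_signal, consistency_loss, env
```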
</details>
<details>
<summary>
<b>DemyAgent</b>
<a href="https://arxiv.org/abs/2510.11701">
<img
src="https://img.shields.io/badge/Paper-Arxiv%202510.11701-red?logo=arxiv&logoColor=red"
alt="Paper"
height="18"
style="vertical-align: middle;"
/>
</a>
<a href="https://huggingface.co/collections/Gen-Verse/open-agentrl-68eda4c05755ca5a8c663656">
<img
src="https://img.shields.io/badge/Datasets-Agent%20RL%20Datasets-orange?logo=huggingface&logoColor=yellow"
alt="Data"
height="18"
style="vertical-align: middle;"
/>
</a>
<a href="https://huggingface.co/Gen-Verse/DemyAgent-4B">
<img
src="https://img.shields.io/badge/DemyAgent%204B-DemyAgent%204B%20Model-FFCC00?logo=huggingface&logoColor=yellow"
alt="Model"
height="18"
style="vertical-align: middle;"
/>
</a>
<b>(click to expand)</b>
</summary>
<div>
<h3>Demystifying Reinforcement Learning in Agentic Reasoning</h3></div>
<table class="center"> <tr> <td width=100% style="border: none"><img src="figs/overview.png" style="width:100%"></td> </tr> <tr> <td width="100%" style="border: none; text-align: center; word-wrap: break-word">An overview of our research on agentic RL. </td> </tr> </table>
In this work, we systematically investigate three dimensions of agentic RL: **data, algorithms, and reasoning modes**. Our findings reveal:
* Real end-to-end trajectories and high-diversity datasets significantly outperform synthetic alternatives;
* Exploration-friendly techniques like reward clipping and entropy maintenance boost training efficiency;
* Deliberative reasoning with selective tool calls surpasses frequent invocation or verbose self-reasoning.
We also contribute [high-quality SFT and RL datasets](https://huggingface.co/collections/Gen-Verse/open-agentrl-68eda4c05755ca5a8c663656), demonstrating that **simple recipes enable even [4B models](https://huggingface.co/Gen-Verse/DemyAgent-4B) to outperform 32B models** on challenging benchmarks including AIME2024/2025, GPQA-Diamond, and LiveCodeBench-v6.
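The exploration-friendly techniques above can be illustrated with a minimal sketch. The clip range and entropy coefficient below are hypothetical defaults for illustration, not DemyAgent's actual hyperparameters or training code.

```python
import math

def clipped_reward(raw_reward, clip_min=-1.0, clip_max=1.0):
    """Reward clipping: bound outsized rewards so that no single
    trajectory dominates the policy gradient."""
    return max(clip_min, min(clip_max, raw_reward))

def entropy_bonus(probs, coef=0.01):
    """Entropy maintenance: reward policy entropy with a small bonus
    so exploration does not collapse early in training."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return coef * entropy
```

A uniform action distribution earns a larger entropy bonus than a nearly deterministic one, which is exactly the pressure that keeps the policy exploring.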
</details>
| | **RLAnything** | **DemyAgent** |
|---|---|---|
| **Focus** | Closed-loop RL optimization | Agentic reasoning |
| **Core Idea** | Joint optimization of policy, reward model & environment | Real trajectories + exploration-friendly techniques + deliberative reasoning |
| **Release** | LLM/GUI/Coding Policy & Reward Model | 3K SFT + 30K RL Data, SOTA-level DemyAgent-4B |
## 🚩 New Updates
- **[2026.2]** 🦞 We release [**OpenClaw-RL**](https://github.com/Gen-Verse/OpenClaw-RL), a new fully asynchronous RL framework built on top of Open-AgentRL, targeting **personalized agentic AI** trained from live conversation feedback. OpenClaw-RL introduces:
- **Binary RL (GRPO):** PRM-based scalar reward from next-state feedback for policy optimization
  - **On-Policy Distillation (OPD):** Token-level directional learning from hindsight hints, richer than any scalar signal
- **Zero API keys & fully self-hosted:** conversation data never leaves your infrastructure
- **[2026.2]** We fully open-source our work [**RLAnything**](https://arxiv.org/abs/2602.02488), including:
- Training code across GUI Agent, LLM Agent, and Coding LLM settings.
- Model checkpoints: both the policy models ([RLAnything-7B/8B](https://huggingface.co/collections/Gen-Verse/open-agentrl)) and reward models ([RLAnything-Reward-8B/14B](https://huggingface.co/collections/Gen-Verse/open-agentrl)) across these settings.
  - Evaluation scripts for our models
- **[2025.10]** We fully open-source our work [**DemyAgent**](https://arxiv.org/abs/2510.11701), including:
- Training code for both SFT and RL stages
- High-quality SFT dataset (3K samples) and RL dataset (30K samples)
- Model checkpoints: SFT models (Qwen2.5-7B-RA-SFT, Qwen3-4B-RA-SFT) and RL-trained model ([DemyAgent-4B](https://huggingface.co/Gen-Verse/DemyAgent-4B))
  - Evaluation scripts for our models
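The GRPO-based binary RL mentioned in the OpenClaw-RL release notes normalizes each scalar reward (e.g. a PRM score) within a group of sampled responses, so no separate value network is needed. A minimal sketch of that advantage computation follows; `grpo_advantages` is an illustrative stand-in, not OpenClaw-RL's implementation.

```python
def grpo_advantages(rewards):
    """GRPO-style advantages: standardize scalar rewards within the
    group of responses sampled for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    if std == 0:
        return [0.0] * len(rewards)  # all responses tied: no learning signal
    return [(r - mean) / std for r in rewards]
```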
## 🧭 Navigation
- **DemyAgent**:
- [Get Started](#demyagent-get-started)
- [Training](#demyagent-train)
- [Cold-Start SFT](#demyagent-cold-sft)
    - [Agentic RL](#demyagent-agent-rl)
- [Evaluation](#demyagent-eval)
- [Results](#demyagent-result)
- **RLAnything**:
- [Get Started](#rlanything-get-started)
- [Training](#rlanything-train)
- [Computer Control](#rlanything-computer-control)
- [Text-based Game](#rlanything-text-game)
- [RLVR Coding](#rlanything-coding)
- [Evaluation](#rlanything-eval)
- [Results](#rlanything-result)
## 🚀 Get Started
<a id="demyagent-get-started"></a>
### DemyAgent
```bash
git clone https://github.com/Gen-Verse/Open-AgentRL.git
conda create -n OpenAgentRL python=3.11
conda activate OpenAgentRL
cd Open-AgentRL
bash scripts/install_vllm_sglang_mcore.sh
pip install -e .[vllm]
```
<a id="rlanything-get-started"></a>
### RLAnything
```bash
conda create --name rlanything python=3.10
conda activate rlanything
pip install -r requirements_rlanything.txt
```
<a id="demyagent-train"></a>
## 🔧 DemyAgent Training
<a id="demyagent-cold-sft"></a>
### Cold-Start SFT
Before you start SFT, make sure you have downloaded the [3K Agentic SFT Data](https://huggingface.co/datasets/Gen-Verse/Open-AgentRL-SFT-3K) and the corresponding base models like [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Q
[truncated…]