# SuperRL: Reinforcement Learning with Supervision to Boost Language Model Reasoning
## Introduction

SuperRL is a unified training framework that adaptively alternates between reinforcement learning and supervised fine-tuning for training large language models, built on top of [verl](https://github.com/volcengine/verl).

**⚠️ Important Notice: This repository only provides core SuperRL component implementations, not a complete standalone training framework. These components must be integrated with the original verl codebase and placed in the corresponding verl project locations to function properly.**

> **📄 Paper**: This repository contains the open-source implementation of **SuperRL: Reinforcement Learning with Supervision to Boost Language Model Reasoning** ([arXiv:2506.01096](https://arxiv.org/abs/2506.01096))
> *Authors: Yihao Liu, Shuocheng Li, Lang Cao, Yuhang Xie, Mengyu Zhou, Haoyu Dong, Xiaojun Ma, Shi Han, Dongmei Zhang*

## 🎯 Overview

SuperRL addresses the challenge of sparse rewards in reinforcement learning by adaptively switching between RL and SFT modes. When every rollout for a given instance receives zero reward (indicating the absence of a learning signal), SuperRL falls back to SFT on curated offline data.

![SuperRL Framework](figure/SuperRLFramework.png)
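As a rough sketch of this per-instance fallback rule (not the repository's actual implementation — the real logic lives in `actor/SuperRLActor.py`; the helper and argument names below are hypothetical, and `reward_eps` mirrors the threshold exposed in the configuration section later in this guide):

```python
import torch


def superrl_loss(pg_loss: torch.Tensor, sft_loss: torch.Tensor,
                 rollout_rewards: torch.Tensor, reward_eps: float = 1e-8) -> torch.Tensor:
    """Hypothetical helper illustrating SuperRL's fallback rule.

    If every rollout reward for this instance is (numerically) zero, there is
    no policy-gradient signal, so the instance is trained on its curated
    offline demonstration with the SFT loss instead of the RL loss.
    """
    has_signal = (rollout_rewards.abs() > reward_eps).any()
    return pg_loss if has_signal else sft_loss
```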
### Key Benefits

- **Higher Sample Efficiency**: Better utilization of both online exploration and offline demonstrations
- **Stronger Generalization**: Improved performance across diverse reasoning benchmarks
- **Improved Robustness**: Enhanced stability under sparse reward conditions

## 🚀 Features

- **Enhanced Actor Implementations**: SuperRL, HybridAdvGated, and HybridLogSigma actors
- **Hybrid Dataset Support**: Custom dataset class with special handling for `tagged_answer` and `superrl_research_response_ids`
- **Dataset Preprocessing**: Standardized preprocessing for mathematical reasoning datasets
- **Universal Reward Functions**: Comprehensive reward system for reasoning tasks
- **Built on verl v0.5.0**: Leverages the latest verl features

## 📁 Repository Structure

```
SuperRL-Opensource/
├── actor/                              # Enhanced actor implementations
│   ├── SuperRLActor.py                 # Main SuperRL actor with OR logic
│   ├── HybridAdvGatedActor.py          # Hybrid actor with advantage gating
│   └── HybridLogSigmaActor.py          # Hybrid actor with log-sigma approach
├── dataset/                            # Dataset components for verl integration
│   ├── hybrid_dataset.py               # HybridDataset class with SuperRL features
│   ├── response_dataproto_compose.py   # Response data composition utilities
│   └── __init__.py                     # Dataset module initialization
├── data_preprocess/                    # Dataset preprocessing utilities
│   ├── common_utils.py                 # Shared utilities for dataset processing
│   ├── aime2024_preprocess.py          # AIME 2024 dataset preprocessing
│   ├── aime2025_preprocess.py          # AIME 2025 dataset preprocessing
│   ├── gsm8k_preprocess.py             # GSM8K dataset preprocessing
│   ├── hitab_preprocess.py             # HiTab dataset preprocessing
│   ├── limo_preprocess.py              # LIMO dataset preprocessing
│   ├── metamath_preprocess.py          # MetaMath dataset preprocessing
│   ├── openr1_preprocess.py            # OpenR1 dataset preprocessing
│   ├── prm12k_preprocess.py            # PRM12K dataset preprocessing
│   ├── batch_process.py                # Batch processing utilities
│   └── preview_parquet.py              # Dataset preview utilities
├── demo_script/                        # Demo and example scripts
│   └── superrl_demo.sh                 # SuperRL training demo script
├── figure/                             # Documentation figures and diagrams
│   └── SuperRLFramework.png            # SuperRL framework illustration
├── reward/                             # Reward function implementations
│   └── superrl.py                      # Universal reward function for reasoning tasks
├── CODE_OF_CONDUCT.md                  # Code of conduct guidelines
├── CONTRIBUTING.md                     # Contribution guidelines
├── LICENSE.txt                         # License information
├── README.md                           # Project documentation
├── SECURITY.md                         # Security policy
└── SUPPORT.md                          # Support information
```

## 🛠️ Quick Start

### Prerequisites

**This project is an extension component for verl and cannot run independently. Please complete the verl installation and setup first.**

Make sure you have [verl v0.5.0](https://github.com/volcengine/verl/releases/tag/v0.5.0) or later installed:

```bash
# 1. First install and set up the complete verl environment
git clone --branch v0.5.0 https://github.com/volcengine/verl.git
cd verl

# 2. Clone this SuperRL components repository
git clone https://github.com/your-username/SuperRL-Opensource.git
```

**⚠️ Version Compatibility Note**: Different versions of verl may require modifications to the actor implementations. If you encounter compatibility issues when using newer or older versions of verl, you may need to adapt the actor code to match the specific verl version's API and interface requirements.

## 🔧 Integration Guide

> **📋 Note**: This guide provides step-by-step instructions for integrating SuperRL components into the verl project. All operations should be performed in the verl project root directory.

### 1. Reward Function Integration

**Step 1: Copy SuperRL reward function to verl directory**

```bash
# Copy SuperRL reward function to verl's reward_score directory
cp SuperRL-Opensource/reward/superrl.py verl/utils/reward_score/superrl.py
```

**Step 2: Modify verl's reward_score/__init__.py file**

Add SuperRL reward function support to the `verl/utils/reward_score/__init__.py` file by extending the `default_compute_score` function:

```python
# Add to the default_compute_score function
elif data_source in [
    "openai/gsm8k",
    "lighteval/MATH",
    "DigitalLearningGmbH/MATH-lighteval",
    "HuggingFaceH4/MATH-500",
    "GAIR/LIMO",
    "meta-math/MetaMathQA",
    "open-r1/OpenR1-Math-220k",
    "horseee/MixChain-Z-PRM12K",
    "hitab",
]:
    # Use SuperRL's universal reward function
    from . import superrl

    res = superrl.compute_score(solution_str, ground_truth)
```

**Step 3: Ensure correct reward function return format**

SuperRL's reward function already returns a `float`, which is compatible with verl's reward-score convention.
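As a quick, hedged sanity check after copying the file, you can call the reward function directly. The exact answer format that `superrl.compute_score` accepts depends on how your data was preprocessed, so the solution string below is only an illustration:

```python
# Hypothetical sanity check; assumes superrl.compute_score follows the
# (solution_str, ground_truth) -> float convention shown above.
from verl.utils.reward_score import superrl

solution_str = "... step-by-step reasoning ... The answer is 42."
score = superrl.compute_score(solution_str, "42")
assert isinstance(score, float)  # verl expects a plain float reward
print(f"reward = {score}")
```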
### 2. Actor Implementation Integration

**Step 1: Copy SuperRL actors to verl workers directory**

```bash
# Copy all SuperRL actor implementations
cp SuperRL-Opensource/actor/* verl/workers/actor/
```

**Step 2: Modify fsdp_workers.py to support SuperRL Actor**

Modify the actor initialization part in the `init_model` method of `verl/workers/fsdp_workers.py`:

```python
# Modify in the init_model method of the ActorRolloutRefWorker class
if self._is_actor:
    actor_cfg = omega_conf_to_dataclass(self.config.actor)

    # Check if using a SuperRL actor
    actor_type = self.config.actor.get("actor_type", "default")

    if actor_type == "superrl":
        from verl.workers.actor.SuperRLActor import SuperRLActor

        self.actor = SuperRLActor(
            config=actor_cfg,
            actor_module=self.actor_module_fsdp,
            tokenizer=self.tokenizer,
            actor_optimizer=self.actor_optimizer,
        )
    elif actor_type == "hybrid_adv_gated":
        from verl.workers.actor.HybridAdvGatedActor import HybridAdvGatedActor

        self.actor = HybridAdvGatedActor(
            config=actor_cfg,
            actor_module=self.actor_module_fsdp,
            tokenizer=self.tokenizer,
            actor_optimizer=self.actor_optimizer,
        )
    elif actor_type == "hybrid_log_sigma":
        from verl.workers.actor.HybridLogSigmaActor import HybridLogSigmaActor

        self.actor = HybridLogSigmaActor(
            config=actor_cfg,
            actor_module=self.actor_module_fsdp,
            tokenizer=self.tokenizer,
            actor_optimizer=self.actor_optimizer,
        )
    else:
        # Default to the original DataParallelPPOActor
        from verl.workers.actor import DataParallelPPOActor

        self.actor = DataParallelPPOActor(
            config=actor_cfg,
            actor_module=self.actor_module_fsdp,
            actor_optimizer=self.actor_optimizer,
        )
```

**Step 3: Configuration file support**

Add the `actor_type` parameter to the training configuration:

```yaml
# Add to the training configuration file
actor_rollout_ref:
  actor:
    actor_type: "default"  # Options: "superrl", "hybrid_adv_gated", "hybrid_log_sigma", "default"
    # SuperRL specific configurations
    sft_micro_batch_size: 2
    pg_signal_eps: 1e-8
    reward_eps: 1e-8
    sft_label_smoothing: 0.0

# Add HybridDataset configuration
data:
  use_hybrid_dataset: false  # Enable HybridDataset for SuperRL
  # Other data configurations...
```
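If you prefer to keep `init_model` compact, the same selection can be written as a lookup table. This is only a stylistic sketch built from the Step 2 snippet above; it assumes the same constructor signatures and is meant to run inside the same method (hence the use of `self` and `actor_cfg`):

```python
# Stylistic alternative to the if/elif chain in Step 2 above; assumes the same
# constructor signatures and runs inside init_model (hence `self`/`actor_cfg`).
from verl.workers.actor import DataParallelPPOActor
from verl.workers.actor.SuperRLActor import SuperRLActor
from verl.workers.actor.HybridAdvGatedActor import HybridAdvGatedActor
from verl.workers.actor.HybridLogSigmaActor import HybridLogSigmaActor

ACTOR_REGISTRY = {
    "superrl": SuperRLActor,
    "hybrid_adv_gated": HybridAdvGatedActor,
    "hybrid_log_sigma": HybridLogSigmaActor,
}

actor_type = self.config.actor.get("actor_type", "default")
actor_cls = ACTOR_REGISTRY.get(actor_type, DataParallelPPOActor)
kwargs = dict(
    config=actor_cfg,
    actor_module=self.actor_module_fsdp,
    actor_optimizer=self.actor_optimizer,
)
if actor_cls is not DataParallelPPOActor:
    kwargs["tokenizer"] = self.tokenizer  # the SuperRL actors also take the tokenizer
self.actor = actor_cls(**kwargs)
```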
### 3. Dataset Integration

**Step 1: Copy HybridDataset to verl's dataset module**

```bash
# Copy dataset components to verl's dataset module
cp SuperRL-Opensource/dataset/hybrid_dataset.py verl/utils/dataset/
cp SuperRL-Opensource/dataset/response_dataproto_compose.py verl/utils/dataset/
```

**Step 2: Add imports to verl's dataset __init__.py**

Add the following imports to `verl/utils/dataset/__init__.py`:

```python
# Add these imports to verl/utils/dataset/__init__.py
from .hybrid_dataset import HybridDataset
from .response_dataproto_compose import *
```

**Step 3: Modify main_ppo.py to support HybridDataset**

Add support for HybridDataset in the `create_rl_dataset` function in `verl/trainer/main_ppo.py`:

```python
from omegaconf import ListConfig  # used below; import it if not already present in main_ppo.py


def create_rl_dataset(data_config, tokenizer, processor=None):
    """Create RL dataset with support for HybridDataset"""
    # Determine dataset class based on configuration
    use_hybrid_dataset = data_config.get("use_hybrid_dataset", False)  # set data.use_hybrid_dataset=true for SuperRL

    if use_hybrid_dataset:
        from verl.utils.dataset import HybridDataset

        dataset_cls = HybridDataset
    else:
        from verl.utils.dataset.rl_dataset import RLHFDataset  # default verl dataset

        dataset_cls = RLHFDataset

    # Extract data file paths
    data_paths = []
    if hasattr(data_config, "train_files") and data_config.train_files:
        if isinstance(data_config.train_files, (list, ListConfig)):
            data_paths.extend(data_config.train_files)
        else:
            data_paths.append(data_config.train_files)

    # Instantiate the dataset using the determined dataset class
    dataset = dataset_cls(
        data_files=data_paths,
        tokenizer=tokenizer,
        processor=processor,
        config=data_config,
    )
    return dataset
```

**Step 4: Modify ray_trainer.py to handle additional batch keys**

Update the batch processing in `verl/trainer/ppo/ray_trainer.py` to include the new keys:

```python
# In ray_trainer.py, find the batch_keys_to_pop definition and update it:
batch_keys_to_pop = ["input_ids", "attention_mask", "position_ids"]
non_tensor_batch_keys_to_pop = ["raw_prompt_ids", "tagged_answer"]


# Make sure the batch processing logic can handle these additional keys
def process_batch(batch):
    # Extract tensor keys
    for key in batch_keys_to_pop:
        if key in batch:
            # Process tensor keys as before
            pass

    # Extract non-tensor keys
    for key in non_tensor_batch_keys_to_pop:
        if key in batch:
            # Handle non-tensor keys (like tagged_answer);
            # these should be preserved in non_tensor_batch
            pass
```

### 4. Data Preprocessing Setup

**Step 1: Copy data preprocessing scripts**

```bash
# Copy data preprocessing scripts to a suitable location
mkdir -p data_preprocessing
cp SuperRL-Opensource/data_preprocess/* data_preprocessing/
```

### 5. Training Script Modifications

**Step 1: Modify demo script to use SuperRL**

Update `demo_script/superrl_demo.sh` to enable the hybrid dataset and the SuperRL-specific actor options; keep the other configurations from the original demo script unchanged:

```bash
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files="$train_files" \
    data.val_files="$test_files" \
    data.use_hybrid_dataset=true \
    actor_rollout_ref.actor.actor_type=superrl \
    actor_rollout_ref.actor.sft_micro_batch_size=2 \
    actor_rollout_ref.actor.pg_signal_eps=1e-8 \
    actor_rollout_ref.actor.reward_eps=1e-8 \
    trainer.total_epochs=100 $@
```
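Before launching a full run, a quick way to confirm that the file copies and `__init__.py` edits took effect is an import smoke test from the verl project root. This is only a minimal sketch: it checks that the SuperRL components are importable, not that training itself works:

```python
# Import smoke test for the integration steps above.
# Run from the verl project root after completing sections 1-3.
from verl.utils.reward_score import superrl                # section 1
from verl.utils.dataset import HybridDataset               # section 3
from verl.workers.actor.SuperRLActor import SuperRLActor   # section 2

print("SuperRL components importable:",
      superrl.__name__, HybridDataset.__name__, SuperRLActor.__name__)
```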
## Citation

If you find this repository useful, please consider giving it a ⭐ or citing:

```
@misc{liu2025superrlreinforcementlearningsupervision,
      title={SuperRL: Reinforcement Learning with Supervision to Boost Language Model Reasoning},
      author={Yihao Liu and Shuocheng Li and Lang Cao and Yuhang Xie and Mengyu Zhou and Haoyu Dong and Xiaojun Ma and Shi Han and Dongmei Zhang},
      year={2025},
      eprint={2506.01096},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.01096},
}
```

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.