Recipe: Self-Play Fine-Tuning (SPIN)

verl provides a recipe inspired by the paper “Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models” (SPIN). SPIN is a language model finetuning algorithm that enables iterative self-improvement through a self-play mechanism inspired by game theory.

Core Idea: Models learn by playing against themselves, reducing reliance on external preference datasets or stronger teacher models:

Synthetic Data Generation: The current model generates responses, creating its own training data from previous iterations.
Two-Player Game Setup: A game involving two players acted by a single LLM.
Iterative Training: The model progressively improves by refining its policy, with each iteration’s model becoming the opponent for the next iteration.

Paper Authors: Zixiang Chen*, Yihe Deng*, Huizhuo Yuan*, Kaixuan Ji, Quanquan Gu

[Webpage] [Huggingface] [Paper] [Original Implementation]

verl Implementation Authors: Chendong Wang, Chenyang Zhao

Our Online DPO Implementation

Our compute_online_dpo_loss function adapts verl’s existing PPO infrastructure (based on verl v0.3.0.post1) for this iterative online DPO. Key aspects of our implementation include:

No Critic: Unlike PPO, we omit the value function critic.
Dynamic Reference Model: An explicit reference policy (ref_policy_wg) is used for DPO loss. This reference model’s weights can be periodically updated from the actor (ref_update_freq), providing a dynamic baseline.
Online Preference Generation: The compute_onlineDPO_pref function (in core_algos.py) dynamically creates chosen/rejected pairs based on a reward source (e.g., rule-based ranking for math problems).
DPO Loss Integration: We replace PPO’s policy loss with our compute_online_dpo_loss (in core_algos.py) within the actor update (dp_actor.py), directly optimizing the policy using the generated preferences.
Iterative Training Orchestration: The SpinTrainer (in spin_trainer.py) manages the entire self-play loop: generation, preference labeling, optional reference model updates, and policy updates, enabling continuous self-improvement aligned with SPIN’s principles.

Algorithm

This recipe implements an Online algorithm adapted to the verl Reinforcement Learning framework, which provides an alternative to PPO for fine-tuning language models.

Online Loop: Instead of maximizing a scalar reward signal in PPO, this approach directly optimizes the policy model to align with preference data generated online during training:

Generation: The current model generates multiple responses for each prompt in a batch.
Preference Labeling: A function evaluates these generated responses to determine which one is preferred (chosen) and which is dispreferred (rejected). This can be done using a reward function or implicit ranking based on specific rules. (In this recipe, we use rule-based ranking on the math problem).
Update: This preference tuple (prompt, chosen_response, rejected_response) is used to update the actor model using compute_online_dpo_loss, comparing against a reference model.

Connection with SPIN: Instead of only using a fixed target data distribution, the online generation loop in step 2 will dynamically change the target data distribution by using a certain Preference Labeling method (rule-based ranking on the math problem by selecting the better one in this recipe). This explores the direction mentioned in SPIN’s paper Section 7 about “dynamically changing target data distribution” to potentially elevate LLM performance beyond the fixed human-annotated data ceiling.

Reproduce the Experiment (Example Setup)

The following steps outline how to set up the environment and run the SPIN recipe, based on the provided test log using GSM8K and Qwen2.5-3B-Instruct.

Setup Environment (Example using Docker):

# Start a container with GPU access and shared memory
docker run -it --name spin_test --gpus all \
    --shm-size=32g \
    --ipc=host \
    -v /path/to/host/.cache:/root/.cache \
    -e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN> \
    lmsysorg/sglang:latest \
    /bin/bash

# Inside the container or on your host machine:
# Ensure /tmp is writable
mkdir -p /tmp
chmod 1777 /tmp

# Install Python 3.10 (if not present) and venv
sudo apt update
sudo apt install -y python3.10 python3.10-venv tmux
python3 -m ensurepip --upgrade

# Create and activate a virtual environment
python3 -m venv ~/.python/spin_env
source ~/.python/spin_env/bin/activate

# Install uv (fast package installer)
python3 -m pip install uv

Install verl and Dependencies:

# Clone the verl repository and checkout the spin branch
cd ~
git clone git@github.com:volcengine/verl.git && cd verl

# Install flash-attn (handle potential build issues)
python3 -m uv pip install wheel packaging
python3 -m uv pip install flash-attn --no-build-isolation --no-deps

# Install verl with sglang extras
python3 -m uv pip install -e ".[sglang]"

Note: If flash-attn installation fails, try the manual steps again or consult its documentation.

Login & Download Data/Model:

# Login to Weights & Biases (optional, for logging)
export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
# wandb login

# Download the GSM8K dataset
python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k # Adjusted path

# Download the base model (Example: Qwen2.5-3B-Instruct)
huggingface-cli download Qwen/Qwen2.5-3B-Instruct --local-dir $HOME/models/Qwen2.5-3B-Instruct

Configure:
- Modify the configuration file (e.g., config/spin_trainer.yaml or the one specified in the run script) with correct paths to your downloaded model, data, desired hyperparameters (dpo_beta, learning rate, etc.), and distributed training settings (nodes, GPUs per node).
- Pay attention to actor_rollout_ref.model_path, data paths, reward_model config (if using one), and trainer.ref_update_freq.

Run Training:

# Set CUDA visible devices (adjust based on your hardware and config)
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Launch the training script (e.g., test.sh or a custom script)
# Ensure test.sh points to the correct config and main script
bash recipe/spin/run_spin.sh

Configuration

The primary configuration is typically managed through a YAML file specified in the launch script (e.g., config/spin_trainer.yaml).
Key configuration sections:
- data: Paths to training/validation prompt files, batch sizes, sequence lengths.
- actor_rollout_ref: Paths to the base model (used for actor and initial reference), FSDP settings, optimization parameters (learning rate, scheduler).
- reward_model: Configuration for the reward model used for online preference labeling (path, batch size, etc.). Can be omitted if using a simpler reward function.
- algorithm: DPO-specific hyperparameters like dpo_beta, dpo_loss_type.
- trainer: Distributed training settings (nodes, GPUs per node), logging (WandB), checkpointing frequency, and ref_update_freq (set > 0 to enable periodic reference model updates from the actor).

Key Files

main_spin.py: Main entry point using Hydra to load the config and launch the SpinTrainer.
spin_trainer.py: Defines the SpinTrainer class, orchestrating the Online DPO training loop.
fsdp_workers.py: Implements Ray workers (Actor, Reference) potentially using FSDP.
dp_actor.py: Contains the actor class, including the DPO policy update logic.
core_algos.py: Includes helper functions for compute_online_dpo_loss and compute_onlineDPO_pref.
config/spin_trainer.yaml (or similar): Main Hydra configuration file for the recipe.
run_spin.sh (or similar): Example bash script for launching a training run.
README.md: This file.

Acknowledgement

We sincerely thank the contribution and guidance from the verl community and advisors, including (adapted from SPPO):