Recipe: Self-Play Fine-Tuning (SPIN)

verl provides a recipe inspired by the paper “Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models” (SPIN). SPIN is a language model finetuning algorithm that enables iterative self-improvement through a self-play mechanism inspired by game theory.

Core Idea: Models learn by playing against themselves, reducing reliance on external preference datasets or stronger teacher models:

  1. Synthetic Data Generation: The current model generates responses to the training prompts, so each iteration trains on data produced by the model from the previous iteration.

  2. Two-Player Game Setup: The game has two players, and a single LLM plays both roles.

  3. Iterative Training: The model progressively improves by refining its policy, with each iteration’s model becoming the opponent for the next iteration.

Paper Authors: Zixiang Chen*, Yihe Deng*, Huizhuo Yuan*, Kaixuan Ji, Quanquan Gu

[Webpage] [Huggingface] [Paper] [Original Implementation]

verl Implementation Authors: Chendong Wang, Chenyang Zhao


Our Online DPO Implementation

Our implementation adapts verl’s existing PPO infrastructure (based on verl v0.3.0.post1) for iterative online DPO, with the compute_online_dpo_loss function at its core. Key aspects include:

  • No Critic: Unlike PPO, we omit the value function critic.

  • Dynamic Reference Model: An explicit reference policy (ref_policy_wg) is used for DPO loss. This reference model’s weights can be periodically updated from the actor (ref_update_freq), providing a dynamic baseline.

  • Online Preference Generation: The compute_onlineDPO_pref function (in core_algos.py) dynamically creates chosen/rejected pairs based on a reward source (e.g., rule-based ranking for math problems).

  • DPO Loss Integration: We replace PPO’s policy loss with our compute_online_dpo_loss (in core_algos.py) within the actor update (dp_actor.py), directly optimizing the policy using the generated preferences.

  • Iterative Training Orchestration: The SpinTrainer (in spin_trainer.py) manages the entire self-play loop: generation, preference labeling, optional reference model updates, and policy updates, enabling continuous self-improvement aligned with SPIN’s principles.
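To make the DPO loss integration above more concrete, here is a minimal sketch of the standard sigmoid DPO loss for a single preference pair, written in plain Python. It is not the recipe's compute_online_dpo_loss: the function name dpo_loss_sketch, its arguments, and the default beta of 0.1 are illustrative assumptions, and the real implementation operates on batched tensors.

```python
import math


def dpo_loss_sketch(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical default, stands in for dpo_beta
) -> float:
    """Standard sigmoid DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response
    under the policy (actor) or the reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # computed in a numerically stable way for both signs of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree on both responses, the margin is zero and the loss equals log 2. The loss shrinks as the policy raises the chosen response relative to the rejected one, measured against the reference model.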


Algorithm

This recipe implements an online DPO algorithm adapted to the verl reinforcement learning framework, which provides an alternative to PPO for fine-tuning language models.

Online Loop: Instead of maximizing a scalar reward signal as in PPO, this approach directly optimizes the policy model to align with preference data generated online during training:

  1. Generation: The current model generates multiple responses for each prompt in a batch.

  2. Preference Labeling: A function evaluates the generated responses to determine which one is preferred (chosen) and which is dispreferred (rejected). This can be done with a reward function or with implicit ranking based on specific rules. (This recipe uses rule-based ranking on math problems.)

  3. Update: This preference tuple (prompt, chosen_response, rejected_response) is used to update the actor model using compute_online_dpo_loss, comparing against a reference model.
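The preference labeling step can be sketched as follows. This is a toy illustration under stated assumptions, not the recipe's compute_onlineDPO_pref: the helper names (last_number, rule_based_score, build_preference_tuple), the "last number in the response" rule, and the choice to skip prompts where all responses score the same are hypothetical.

```python
import re
from typing import List, Optional, Tuple


def last_number(text: str) -> Optional[str]:
    """Return the last number that appears in text, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None


def rule_based_score(response: str, ground_truth: str) -> float:
    """Toy rule: 1.0 if the last number in the response equals the answer."""
    return 1.0 if last_number(response) == ground_truth else 0.0


def build_preference_tuple(
    prompt: str, responses: List[str], ground_truth: str
) -> Optional[Tuple[str, str, str]]:
    """Return (prompt, chosen_response, rejected_response), or None on a tie."""
    scores = [rule_based_score(r, ground_truth) for r in responses]
    best = max(range(len(responses)), key=scores.__getitem__)
    worst = min(range(len(responses)), key=scores.__getitem__)
    if scores[best] == scores[worst]:
        # Assumption: without a score gap there is no preference signal.
        return None
    return prompt, responses[best], responses[worst]
```

A tuple produced this way has the shape described in step 3: the prompt, the chosen response, and the rejected response.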

Connection with SPIN: Instead of relying only on a fixed target data distribution, the online loop changes the target data distribution dynamically through the preference labeling in step 2 (in this recipe, rule-based ranking on math problems that selects the better response). This explores the direction mentioned in Section 7 of the SPIN paper about “dynamically changing target data distribution” to potentially elevate LLM performance beyond the ceiling set by fixed human-annotated data.


Reproduce the Experiment (Example Setup)

The following steps outline how to set up the environment and run the SPIN recipe, based on the provided test log using GSM8K and Qwen2.5-3B-Instruct.

  1. Setup Environment (Example using Docker):

    # Start a container with GPU access and shared memory
    docker run -it --name spin_test --gpus all \
        --shm-size=32g \
        --ipc=host \
        -v /path/to/host/.cache:/root/.cache \
        -e HF_TOKEN=<YOUR_HUGGINGFACE_TOKEN> \
        lmsysorg/sglang:latest \
        /bin/bash
    
    # Inside the container or on your host machine:
    # Ensure /tmp is writable
    mkdir -p /tmp
    chmod 1777 /tmp
    
    # Install Python 3.10 (if not present) and venv
    sudo apt update
    sudo apt install -y python3.10 python3.10-venv tmux
    python3 -m ensurepip --upgrade
    
    # Create and activate a virtual environment
    python3 -m venv ~/.python/spin_env
    source ~/.python/spin_env/bin/activate
    
    # Install uv (fast package installer)
    python3 -m pip install uv
    
  2. Install verl and Dependencies:

    # Clone the verl repository (the SPIN recipe lives under recipe/spin)
    cd ~
    git clone git@github.com:volcengine/verl.git && cd verl
    
    # Install flash-attn (handle potential build issues)
    python3 -m uv pip install wheel packaging
    python3 -m uv pip install flash-attn --no-build-isolation --no-deps
    
    # Install verl with sglang extras
    python3 -m uv pip install -e ".[sglang]"
    

    Note: If the flash-attn installation fails, retry the two flash-attn install commands above or consult the flash-attn documentation.

  3. Login & Download Data/Model:

    # Log in to Weights & Biases (optional, for logging)
    export WANDB_API_KEY=<YOUR_WANDB_API_KEY>
    # Or log in interactively instead:
    # wandb login
    
    # Download the GSM8K dataset
    python3 examples/data_preprocess/gsm8k.py --local_dir ~/data/gsm8k
    
    # Download the base model (Example: Qwen2.5-3B-Instruct)
    huggingface-cli download Qwen/Qwen2.5-3B-Instruct --local-dir $HOME/models/Qwen2.5-3B-Instruct
    
  4. Configure:

    • Modify the configuration file (e.g., config/spin_trainer.yaml or the one specified in the run script) with correct paths to your downloaded model, data, desired hyperparameters (dpo_beta, learning rate, etc.), and distributed training settings (nodes, GPUs per node).

    • Pay attention to actor_rollout_ref.model_path, data paths, reward_model config (if using one), and trainer.ref_update_freq.

  5. Run Training:

    # Set CUDA visible devices (adjust based on your hardware and config)
    export CUDA_VISIBLE_DEVICES=0,1,2,3
    
    # Launch the training script (e.g., run_spin.sh or a custom script)
    # Ensure the script points to the correct config and main script
    bash recipe/spin/run_spin.sh
    

Configuration

  • The primary configuration is typically managed through a YAML file specified in the launch script (e.g., config/spin_trainer.yaml).

  • Key configuration sections:

    • data: Paths to training/validation prompt files, batch sizes, sequence lengths.

    • actor_rollout_ref: Paths to the base model (used for actor and initial reference), FSDP settings, optimization parameters (learning rate, scheduler).

    • reward_model: Configuration for the reward model used for online preference labeling (path, batch size, etc.). Can be omitted if using a simpler reward function.

    • algorithm: DPO-specific hyperparameters like dpo_beta, dpo_loss_type.

    • trainer: Distributed training settings (nodes, GPUs per node), logging (WandB), checkpointing frequency, and ref_update_freq (set > 0 to enable periodic reference model updates from the actor).
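One way to read the ref_update_freq setting is as a simple step-based schedule. The sketch below assumes the reference model is synced from the actor every ref_update_freq training steps and never when the value is 0 or negative. The function name should_update_reference and the exact step at which the sync fires are assumptions, not taken from spin_trainer.py.

```python
def should_update_reference(step: int, ref_update_freq: int) -> bool:
    """Return True when the reference model would be refreshed from the actor.

    Assumption: steps are counted from 1 and a non-positive frequency
    disables reference updates entirely.
    """
    return ref_update_freq > 0 and step > 0 and step % ref_update_freq == 0


# Steps (out of the first six) at which a sync would happen with a frequency of 2.
sync_steps = [s for s in range(1, 7) if should_update_reference(s, 2)]
```

With a frequency of 0 the reference stays fixed at the initial model, which matches the description above of ref_update_freq needing to be greater than 0 to enable updates.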


Key Files

  • main_spin.py: Main entry point using Hydra to load the config and launch the SpinTrainer.

  • spin_trainer.py: Defines the SpinTrainer class, orchestrating the Online DPO training loop.

  • fsdp_workers.py: Implements Ray workers (Actor, Reference) potentially using FSDP.

  • dp_actor.py: Contains the actor class, including the DPO policy update logic.

  • core_algos.py: Includes the helper functions compute_online_dpo_loss and compute_onlineDPO_pref.

  • config/spin_trainer.yaml (or similar): Main Hydra configuration file for the recipe.

  • run_spin.sh (or similar): Example bash script for launching a training run.

  • README.md: This file.


Acknowledgement

We sincerely thank the contribution and guidance from the verl community and advisors, including (adapted from SPPO):