Multi-Modal Example Architecture

Introduction

verl now supports multi-modal training. You can use the FSDP training backend together with the vLLM or SGLang rollout engine to start a multi-modal RL task. Megatron support is also on the way.

Follow the steps below to quickly start a multi-modal RL task.

Step 1: Prepare the dataset

# the processed dataset will be saved to the $HOME/data/geo3k folder
python examples/data_preprocess/geo3k.py
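
To confirm the preprocessing worked, you can list the output folder and preview the data. This is an optional sketch, assuming the script writes Parquet splits (e.g. train.parquet) under $HOME/data/geo3k and that pandas with a Parquet engine such as pyarrow is installed:

# optional sanity check: list the output files and preview the train split
ls $HOME/data/geo3k
python3 -c "import pandas as pd; print(pd.read_parquet('$HOME/data/geo3k/train.parquet').head())"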

Step 2: Download the model

# download the model weights from Hugging Face by instantiating a transformers pipeline
python3 -c "import transformers; transformers.pipeline(model='Qwen/Qwen2.5-VL-7B-Instruct')"
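
If you prefer to fetch the weights without constructing an inference pipeline, huggingface_hub provides snapshot_download, which pulls the entire model repository into the local Hugging Face cache. A minimal alternative, assuming huggingface_hub is installed:

# alternative: download the weights into the local cache without loading the model
python3 -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='Qwen/Qwen2.5-VL-7B-Instruct')"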

Step 3: Perform GRPO training with the multi-modal model on the Geo3K dataset

# run the task
bash examples/grpo_trainer/run_qwen2_5_vl-7b.sh
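
Many verl example scripts forward extra command-line arguments to the Hydra-based trainer entry point, so configuration values can often be overridden at launch time. The sketch below is illustrative only; whether overrides are forwarded, and the exact config keys (trainer.n_gpus_per_node, actor_rollout_ref.rollout.name), depend on your verl version and the script's contents:

# hypothetical override example: set the GPU count and switch the rollout engine
# (only takes effect if the script forwards "$@" to the trainer)
bash examples/grpo_trainer/run_qwen2_5_vl-7b.sh \
    trainer.n_gpus_per_node=8 \
    actor_rollout_ref.rollout.name=sglang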