Ascend (NPU) Tutorial

Last updated: 06/05/2026.

Getting Started

Ascend Tutorial
Ascend Dockerfile Build Guidance
Ascend Install Guidance
Ascend Quickstart
Key Version Support and Dependencies
Environment Installation Steps

Feature Support

Ascend Backend Features Guide
NPU Advanced Features Guide

Model Support

NPU Model & Algorithms Support Status
Ascend ReTool Best Practice
Ascend SGLang Best Practice
Ascend vLLM Best Practice
DAPO Multi-Model Optimization Practice
NPU Qwen3-32B GSPO Optimization Practice
Qwen3.5 Megatron NPU Usage Guide

Developer Guide

Model Evaluation
Training Configuration Parameters and Metrics Explanation
Transfer to NPU guide
Precision Alignment
Precision Debugger (msprobe) in verl
Ascend Performance Analysis Guide
Performance Tuning Guide on Ascend
Profiling Collection Guidance
Profiling Data Collection Guide

FAQ & Contributing

NPU Frequently Asked Questions
Guide to Adding NPU CI


© Copyright 2024 ByteDance Seed Foundation MLSys Team.
