Verl LLM Best Practices (DAPO + Qwen3-235B)
Last updated: 11/03/2025.
Purpose
This guide uses DAPO training on Qwen3-235B as a concrete example. We unpack every parameter that appears in the optimization objective, map it to Verl configuration entries, and share field-tested recommendations so you can derive sensible settings for your own workloads.
Note
The guide only covers the subset of parameters required to reproduce the DAPO experiments discussed here. For the full list, refer to the
configcomponents in the Verl source tree: https://github.com/verl-project/verl/tree/main/verl/trainer/configPPO and GRPO introduce KL-constrained policies. We therefore include that setup in the explanations below. You can treat all configurations mentioned here as a DAPO pipeline augmented with a KL penalty.
Optimization Objectives
DAPO objective
GRPO objective
Notation Overview
- \((q,a)\sim D\)
\(D\) denotes the training dataset. For each sample, \(q\) is the input prompt (for math tasks, the question) and \(a\) is the target output—typically the final answer without intermediate reasoning steps.
- \(G\)
Group size. For every prompt we sample \(G\) independent responses.
- \(\theta\)
Actor model parameters.
- \(\pi\)
Sampling strategy that bundles the rollout backend (vLLM, sglang, etc.) and all generation hyperparameters. Because LLMs generate tokens autoregressively, rollout dominates runtime, so backend-specific tuning is critical.
- \(\pi_\theta\)
Actor policy obtained by instantiating \(\pi\) with parameters \(\theta\).
- \(\hat{A}_{i,t}\)
Advantage of the \(i\)-th sample within the group at timestep \(t\).
- \(R_i\)
Reward assigned to the \(i\)-th sample in the group.
- \(\mathbb{D}_{KL}\)
KL divergence between two policies.
- \(\beta\)
Coefficient that weights the KL term.
- \(\pi_{old}\)
Frozen “old” policy, updated after every \(\texttt{train_batch_size}\) samples.
- \(\pi_{ref}\)
Reference policy used to compute the KL divergence.
- \(o_i, |o_i|\)
\(o_i\) is the generated output sequence for the \(i\)-th prompt; \(|o_i|\) is its token length.
- \(\pi_\theta(o_{i,t} \mid q_i, o_{i,<t})\)
Probability of emitting token \(o_{i,t}\) at timestep \(t\) given prompt \(q_i\) and the previously generated prefix under parameters \(\theta\). In practice, the rollout engine first generates full responses, then concatenates prompts and outputs for each model; with attention masks we can compute all token probabilities in one pass.
- \(\varepsilon_{low}\) and \(\varepsilon_{high}\)
Lower and upper clipping bounds for importance sampling. DAPO adopts a clip-higher strategy, so the upper bound is different from the lower bound to prevent overly large policy updates.
Parameter Reference
- \((q,a)\sim D\)
data.train_files/data.val_files: Training and validation datasets. They must be stored as.parquet. Use the conversion scripts underexamples/data_preprocessand make sure yourdata_sourceimplements the matching reward function. You can also reuse the HuggingFace datasetBytedTsinghua-SIA/DAPO-Math-17k.data.prompt_key: Column name for prompts. Keep the defaultpromptunless you have a clearer schema.data.max_prompt_length: Upper bound on prompt length. Set it to cover the longest prompt in the corpus; when long-tail samples make it too large, lower the value and combine withdata.truncation.data.truncation: Policy for over-length inputs (truncate-left/right or raise).leftworks for most runs. If training logs show largeclip_ratioand poor metrics, increasedata.max_prompt_lengthor clean the data. Set toerrorwhen strict validation is required.
- \(G\)
actor_rollout_ref.rollout.n: Number of generations per prompt. Typical values: GRPO 64, DAPO 16.
- \(\theta\)
actor_rollout_ref.model.path: Path to the actor checkpoint in HuggingFace-compatible format.actor_rollout_ref.actor.megatron.use_mbridge: Selects the model-weight backend for the Megatron checkpoint manager. WithTrue(default), model weights are saved/loaded in HuggingFace format via mbridge andhf_modelinsave_contentsis deduplicated againstmodel. WithFalse, model weights go through Megatron’s nativedist_checkpointingandhf_modelinsave_contentsis rejected (useverl.model_mergerafter training instead). Optimizer / LR-scheduler / RNG states always go throughdist_checkpointingregardless of this flag. The legacy aliasactor_rollout_ref.actor.megatron.use_dist_checkpointing=Truestill works and is equivalent touse_mbridge=False. See Using Checkpoints to Support Fault Tolerance Training for the full save/load behaviour matrix.actor_rollout_ref.actor.megatron.vanilla_mbridge: If set to True, use mbridge, else use Megatron-Bridge https://github.com/NVIDIA-NeMo/Megatron-Bridge. Now it is True by default. and it will defaultly be set to False in the future(v0.8).
- \(\pi\)
actor_rollout_ref.rollout.name: Rollout backend. Verl currently supportsvllmandsglang—benchmark and tune according to your infrastructure.actor_rollout_ref.rollout.response_length/data.max_response_length: Maximum generated tokens (rollout setting takes precedence). Larger values improve quality but consume more memory and latency. Monitorclip_ratio; values above 0.1 often mean you are truncating too much.actor_rollout_ref.rollout.gpu_memory_utilization: Target GPU memory usage during rollout. Push it as high as possible without triggering OOM; with parameter/gradient/optimizer offload enabled, 0.8–0.9 is common.actor_rollout_ref.rollout.tensor_model_parallel_size: Tensor parallel degree for the inference engine. Ensure(memory_per_gpu * gpu_memory_utilization * TP) > 2 * model_parameters(bf16/fp16). Increase TP gradually to expand KV cache capacity while watching communication cost—especially once TP > 8.actor_rollout_ref.rollout.temperature/top_p/top_k: Sampling knobs for rollout. Keep enough randomness;temperature=1.0,top_p=1.0,top_k=-1are good defaults.actor_rollout_ref.rollout.val_kwargs.temperature/top_p/top_k/do_sample/n: Sampling options for validation. Settemperature > 0to prevent repetitive thinking chains. For small test sets (e.g., AIME24) raisen(64 is a common choice) to reduce variance. A practical starting point istemperature=1.0,top_p=0.7,top_k=-1,do_sample=True,n=1and then increasenas needed.+actor_rollout_ref.rollout.engine_kwargs.vllm.*/+actor_rollout_ref.rollout.engine_kwargs.sglang.*: Extra backend options injected via the+syntax. Consult backend docs for exact semantics. Some switches (for examplepipeline_parallel_size) may not be supported yet; when TP=32,enable_expert_parallel=Truecan even slow down DeepSeek-V3 rollout, so benchmark carefully.
- \(\pi_\theta\)
data.train_batch_size: Total batch size per training iteration. Each rollout producestrain_batch_size * nsamples. Larger values reduce the number of rollouts but increase off-policy drift.actor_rollout_ref.actor.ppo_mini_batch_size: Mini-batch size per optimization step. Tune it the same way you would for standard deep learning workloads.actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu: Samples processed per forward pass on one GPU group (a Megatron group contains TP * PP * CP GPUs). Keep it ≤ppo_mini_batch_sizeand as large as memory allows.actor_rollout_ref.actor.use_dynamic_bsz: Enable dynamic batch sizing to adapt to sequence length and improve throughput.actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu: Maximum tokens per GPU when computing log probabilities under dynamic batching. Set it to at least a multiple ofmax_prompt_length + max_response_lengthto prevent truncation.Megatron parallelism parameters (
pipeline_model_parallel_size/tensor_model_parallel_size/expert_model_parallel_size/expert_tensor_parallel_size/context_parallel_size): Balance PP/TP/EP/ETP/CP to match memory and network constraints. In bf16/fp16, each parameter consumes roughly2 / TPbytes; if you keep FP32 master weights or skip optimizer offload, reserve another 4–8 bytes for Adam. Activations scale withmicro_batch_size × sequence_length × hidden_sizeand can be mitigated with gradient checkpointing, dynamic batches, or offload. Prefer increasing TP first, add PP when necessary, extend sequence capacity with CP, align EP/ETP with TP for MoE models, and keep DP minimal on constrained clusters while combining with offload. Always align the setup with hardware topology and communication cost.actor_rollout_ref.model.use_fused_kernels: Enable Verl’s fused kernels for supported models to squeeze out additional performance.
- \(\hat{A}_{i,t}\)
algorithm.adv_estimator: Advantage estimator. Set togrpofor DAPO/GRPO.
- \(R_i\)
reward_model.reward_manager: Reward aggregation strategy. Usedapofor DAPO andnaivefor GRPO.
- \(D_{KL}\)
algorithm.use_kl_in_reward: Whether to add a KL term to the reward.Truefor PPO,Falsefor GRPO and DAPO.actor_rollout_ref.actor.use_kl_loss: Whether to include a KL loss term.Falsefor PPO,Truefor GRPO,Falsefor DAPO.
- \(\beta\)
actor_rollout_ref.actor.kl_loss_coef: Weight of the KL loss. Start around 0.001. Larger values curb reward hacking but reduce exploration.algorithm.kl_ctrl.kl_coef: KL coefficient applied within the reward. Adjust to match your tolerance for divergence.
- \(\pi_{old}\)
actor_rollout_ref.rollout.log_prob_use_dynamic_bsz: Enable dynamic batching when the old policy computes log-probabilities. Recommended.
- \(\pi_{ref}\)
actor_rollout_ref.ref.log_prob_use_dynamic_bsz: Enable dynamic batching for the reference policy. Recommended.Reference Megatron parallelism: Keep
pipeline_model_parallel_size,tensor_model_parallel_size,expert_model_parallel_size,expert_tensor_parallel_size, andcontext_parallel_sizein sync with the actor.actor_rollout_ref.ref.megatron.param_offload: Offload reference parameters to CPU when the actor does so. Even without gradients or optimizer states, parity helps with capacity planning.
- \(o_i\) / \(|o_i|\)
actor_rollout_ref.actor.loss_agg_mode: Loss aggregation mode. Token-leveltoken-meanmatches the recommendations from Dr.GRPO and DAPO; useseq-mean-token-meanto reproduce the original GRPO behavior.
- \(\pi_\theta(o_{i,t} \mid q_i,o_{i,<t})\)
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu/actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu: Batch size while computing token probabilities. Rollout engines generate outputs and then concatenate inputs for each model, so balance memory against throughput.
- \(\epsilon_{low}\) / \(\epsilon_{high}\)
actor_rollout_ref.actor.clip_ratio_low/actor_rollout_ref.actor.clip_ratio_high: Importance sampling clipping bounds. For DAPO, useclip_ratio_low=0.2andclip_ratio_high=0.28.
- vLLM inference optimizations
actor_rollout_ref.rollout.enable_chunked_prefill: Enables chunked prefill to boost GPU utilization (vLLM only). Tune together withmax_num_batched_tokens.actor_rollout_ref.rollout.max_num_batched_tokens: Maximum tokens per batch. A practical rule of thumb ismax(8192, max_prompt_length + max_response_length, max_model_len); see the vLLM docs for details.actor_rollout_ref.rollout.enforce_eager: Disables CUDA graphs. By default vLLM leverages CUDA graphs for speed at the cost of extra memory (not limited bygpu_memory_utilization); set this toTruewhen memory is tight.actor_rollout_ref.rollout.cudagraph_capture_sizes: Explicit capture batch sizes for CUDA graphs. Default isnull; on memory-constrained systems try[1, 2, 4, 8, 16, 32].
- Optimizer settings
actor_rollout_ref.actor.optim.lr: Learning rate. Start around1e-5or1e-6.actor_rollout_ref.actor.optim.lr_warmup_steps: Number of warmup steps (e.g., 10).actor_rollout_ref.actor.optim.weight_decay: Weight decay coefficient, typically 0.1.actor_rollout_ref.actor.optim.clip_grad: Gradient clipping threshold, commonly 1.+actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction: Portion of optimizer updates executed on CPU. Large models such as DeepSeek benefit from enabling it with value 1.+actor_rollout_ref.actor.optim.override_optimizer_config.overlap_cpu_optimizer_d2h_h2d/+...use_precision_aware_optimizer/+...optimizer_cpu_offload: Companion switches for hybrid optimizers. Turn them on alongside CPU offload.
- Megatron-related parameters
actor_rollout_ref.actor.megatron.param_offload/optimizer_offload/grad_offload: Offload parameters, optimizer states, and gradients to CPU when GPU memory is insufficient.+actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method/recompute_granularity/recompute_num_layers: Gradient checkpointing controls. Enable (e.g.,uniform,full,1) to trade computation for memory.+actor_rollout_ref.actor.megatron.override_transformer_config.moe_router_dtype/moe_shared_expert_overlap/moe_permute_fusion/moe_enable_deepep/moe_token_dispatcher_type: Recommended MoE knobs (sample values:fp32,False,True,True,flex) for stable performance.+actor_rollout_ref.actor.megatron.override_transformer_config.gradient_accumulation_fusion: Enables fused gradient accumulation for additional speedup.+actor_rollout_ref.actor.megatron.override_transformer_config.account_for_embedding_in_pipeline_split/account_for_loss_in_pipeline_split/num_layers_in_last_pipeline_stage: Pipeline-parallel adjustments when layer counts do not divide evenly. Treat embedding and loss as standalone stages and setnum_layers_in_last_pipeline_stage(0 or${LAST_LAYER}) when you need manual control.
- Trainer
trainer.logger: Logging backends. Use['console', 'wandb']or, on Volcano Engine ML Platform,['console', 'vemlp_wandb'].trainer.project_name/trainer.experiment_name: Hierarchical naming for projects and experiments so you can locate runs quickly.trainer.n_gpus_per_node/trainer.nnodes: Number of GPUs per node and total node count. Match your cluster allocation.trainer.test_freq/trainer.save_freq/trainer.total_epochs: Evaluation interval, checkpoint interval, and total epochs—configure for your SLA.trainer.log_val_generations: Number of validation samples stored in logs. Start with 10 and adjust as needed.trainer.val_before_train: Run validation before training begins when you require a baseline checkpoint.