Full Determinism for Reproducible RL Training

Authors: Haichuan Hu, Yongxiang Huang, Jiawei Zhang, Nguyen Long

Last updated: 06/16/2026.

Overview

By default, RL training in verl is not bitwise reproducible: identical configs run twice can produce different reward curves due to nondeterminism in GPU kernels, request scheduling, hash-based routing, and batch composition. The full determinism feature closes these gaps, enabling two identical runs to produce bitwise-aligned reward curves.

Useful for:

Debugging: reproduce a training failure exactly, step-by-step
Regression testing: verify that a code change has no silent effect on training outcomes
Research: ensure fair comparison when evaluating algorithmic changes

Quick Start

actor_rollout_ref:
  rollout:
    full_determinism: true
    seed: 42
  actor:
    fsdp_config:
      full_determinism: true
  ref:
    fsdp_config:
      full_determinism: true

reward:
  reward_model:
    enable: true
    rollout:
      full_determinism: true
      seed: 42

Or via Hydra overrides:

python -m verl.trainer.main_ppo \
  actor_rollout_ref.rollout.full_determinism=true \
  actor_rollout_ref.rollout.seed=42 \
  actor_rollout_ref.actor.fsdp_config.full_determinism=true \
  actor_rollout_ref.ref.fsdp_config.full_determinism=true \
  reward.reward_model.enable=true \
  reward.reward_model.rollout.full_determinism=true \
  reward.reward_model.rollout.seed=42 \
  [other config overrides...]

Important: PYTHONHASHSEED must be set before the Python interpreter starts. verl handles this automatically: it sets PYTHONHASHSEED from actor_rollout_ref.rollout.seed before ray.init() and propagates it to all Ray actors. Do NOT set it manually when launching through verl's entry point.
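The start-up requirement can be checked with a small standalone sketch. It launches fresh child interpreters with PYTHONHASHSEED placed in their environment and compares the hash() they compute for a string. The helper name child_hash is illustrative and is not part of verl.

```python
import os
import subprocess
import sys


def child_hash(text, hash_seed=None):
    """Return hash(text) as computed by a freshly started Python interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)
    if hash_seed is not None:
        # The variable has to be in the environment before the child starts.
        env["PYTHONHASHSEED"] = str(hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Two separate interpreters started with the same seed agree on the hash.
same_with_seed = child_hash("det-0", 42) == child_hash("det-0", 42)
print(same_with_seed)
```

Calling child_hash("det-0") twice without a seed will usually return two different values, because string hashing is randomized per process by default.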

Configuration Reference

Parameter	Default	Scope	Description
`actor_rollout_ref.rollout.full_determinism`	`false`	Rollout	Enables deterministic rollout generation
`actor_rollout_ref.rollout.seed`	`42`	Rollout	Base seed; each replica uses `replica_rank + seed`
`actor_rollout_ref.actor.fsdp_config.full_determinism`	`false`	Actor	Enables deterministic PyTorch ops for actor
`actor_rollout_ref.ref.fsdp_config.full_determinism`	`false`	Ref model	Enables deterministic PyTorch ops for reference model
`reward.reward_model.rollout.full_determinism`	`false`	Reward model	Enables deterministic RM inference (forces `max_num_seqs=1`)
`reward.reward_model.rollout.seed`	`42`	Reward model	Base seed for RM vLLM server

How It Works

Determinism is enforced at five layers. All must be enabled for full E2E reproducibility:

PyTorch-level

enable_full_determinism(seed) sets PYTHONHASHSEED, CUBLAS_WORKSPACE_CONFIG, FLASH_ATTENTION_DETERMINISTIC, seeds all RNGs, calls torch.use_deterministic_algorithms(True, warn_only=True), and disables cuDNN benchmarking. Applied in all training engine implementations.
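As a rough illustration of the steps listed above, the following sketch performs the same kind of setup. It is not verl's implementation: the function name sketch_enable_full_determinism and the concrete environment variable values are assumptions, and the NumPy and PyTorch parts are skipped when those packages are not installed.

```python
import os
import random


def sketch_enable_full_determinism(seed, warn_only=True):
    """Illustrative approximation of the steps in the text (not verl's code)."""
    # Affects only interpreters started later; the running process has
    # already fixed its hash seed at start-up.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Assumed value: one of the workspace settings PyTorch documents for
    # deterministic cuBLAS behavior.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # Assumed value for the flag named in the text.
    os.environ["FLASH_ATTENTION_DETERMINISTIC"] = "1"

    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass

    try:
        import torch
    except ImportError:
        return  # PyTorch is not installed; nothing more to configure.

    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
    # warn_only=True warns instead of raising when an op has no
    # deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=warn_only)
    torch.backends.cudnn.benchmark = False


sketch_enable_full_determinism(42)
first_draw = random.random()
sketch_enable_full_determinism(42)
second_draw = random.random()
```

Re-seeding with the same value reproduces the same draw, which is the property the training engines rely on at the start of a run.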

Environment propagation

main_ppo.run_ppo() sets three env vars before ray.init():

PYTHONHASHSEED: fixes Python's str/bytes hash() randomization, and with it any hash-dependent ordering (for example, set iteration)
VERL_FULL_DETERMINISM: signals subprocesses to apply determinism
VLLM_BATCH_INVARIANT: makes vLLM outputs independent of batch composition

These are forwarded to all Ray actors via PPO_RAY_RUNTIME_ENV.
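A minimal sketch of how such variables can be packaged for Ray follows. The helper build_determinism_runtime_env is hypothetical, and the value "1" for VERL_FULL_DETERMINISM is an assumption; the only Ray convention used is the runtime_env dictionary with an env_vars mapping.

```python
def build_determinism_runtime_env(seed, base_env_vars=None):
    """Return a Ray-style runtime_env dict carrying the three variables."""
    env_vars = dict(base_env_vars or {})
    env_vars.update(
        {
            "PYTHONHASHSEED": str(seed),
            "VERL_FULL_DETERMINISM": "1",  # assumed value
            "VLLM_BATCH_INVARIANT": "1",
        }
    )
    return {"env_vars": env_vars}


runtime_env = build_determinism_runtime_env(42, {"TOKENIZERS_PARALLELISM": "true"})
# With Ray installed, this dict would be passed as ray.init(runtime_env=runtime_env),
# so that every actor process starts with these variables already set.
print(runtime_env)
```

Passing the variables through runtime_env matters because worker processes are started by Ray, not by the driver's shell, so they do not otherwise inherit values set after the driver launched.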

vLLM batch invariance + per-request seed

VLLM_BATCH_INVARIANT=1 ensures vLLM outputs don’t depend on which other requests are batched together. Each generate() call injects SamplingParams.seed = replica_rank + config.seed to reset RNG per request.
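The effect of a per-request seed can be illustrated without vLLM. In the toy sketch below, toy_sample stands in for token sampling and is purely hypothetical; in vLLM the corresponding knob is the seed field of SamplingParams. Because each request builds its own RNG from the seed, requests served in between cannot change its result.

```python
import random


def request_seed(replica_rank, base_seed):
    """Seed rule from the text: replica_rank + config.seed."""
    return replica_rank + base_seed


def toy_sample(vocab, seed, n_tokens=4):
    """Stand-in for sampling: a fresh RNG per request, seeded explicitly."""
    rng = random.Random(seed)
    return [rng.choice(vocab) for _ in range(n_tokens)]


vocab = ["a", "b", "c", "d", "e"]
seed = request_seed(replica_rank=1, base_seed=42)

first = toy_sample(vocab, seed)
_ = toy_sample(["x", "y"], seed)  # an unrelated request served in between
again = toy_sample(vocab, seed)
print(first == again)
```

With a single shared RNG, the middle request would have advanced the generator state and the third call would differ from the first; per-request seeding removes that dependence on request history.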

Priority scheduling + deterministic routing

Without determinism, request order depends on arrival timing, and server selection depends on dict iteration order. When full_determinism=true:

Each sample gets a globally unique priority injected into the batch (non_tensor_batch["priority"]), so each rollout request is scheduled with a stable, distinct priority
SingleTurnAgentLoop uses request_id=f"det-{priority}" instead of a random UUID
GlobalRequestLoadBalancer tie-breaking uses hash(request_id) % len(candidates) — deterministic with frozen PYTHONHASHSEED
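The three mechanisms above can be sketched together as follows. This is an illustration under assumptions, not verl's code: the priority numbering scheme (global_step * batch_size + index) and all function names are hypothetical; only the request_id format and the hash-modulo tie-break come from the text.

```python
def assign_priorities(global_step, batch_size):
    """Assumed scheme: offset by step so priorities never repeat across steps."""
    start = global_step * batch_size
    return list(range(start, start + batch_size))


def deterministic_request_id(priority):
    """Request ID derived from the priority instead of a random UUID."""
    return f"det-{priority}"


def pick_server(request_id, candidates):
    """Tie-break among equally loaded servers using hash(request_id)."""
    # hash() of a str is stable across processes only when PYTHONHASHSEED
    # is fixed before interpreter start.
    return candidates[hash(request_id) % len(candidates)]


priorities = assign_priorities(global_step=3, batch_size=4)
request_ids = [deterministic_request_id(p) for p in priorities]
servers = ["server-0", "server-1", "server-2"]
routes = {rid: pick_server(rid, servers) for rid in request_ids}
print(priorities, request_ids)
```

Within one process the routing is always repeatable; the frozen PYTHONHASHSEED is what extends that guarantee across separate runs.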

Note: priority is a vLLM-only parameter. LLMServerClient.generate() automatically filters it for non-vLLM backends.

Reward model serialization

vLLM’s /classify endpoint (used by RM) does not support priority or batch invariance. When full_determinism=true, RewardModelManager forces max_num_seqs=1, serializing RM inference one request at a time.
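The override described above amounts to a small config rewrite. The sketch below assumes the RM engine options are held in a plain dict; the function name apply_rm_determinism is hypothetical, while max_num_seqs is the vLLM engine option named in the text.

```python
def apply_rm_determinism(engine_kwargs, full_determinism):
    """Return a copy of the RM engine kwargs, forcing one sequence at a time."""
    kwargs = dict(engine_kwargs)
    if full_determinism:
        # One request per step: batch composition can no longer vary.
        kwargs["max_num_seqs"] = 1
    return kwargs


user_kwargs = {"max_num_seqs": 256, "dtype": "bfloat16"}
det_kwargs = apply_rm_determinism(user_kwargs, full_determinism=True)
print(det_kwargs)
```

Any user-provided max_num_seqs is overwritten when determinism is on, which is also why reward model throughput falls sharply in this mode.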

Side Effects and Limitations

Performance: deterministic PyTorch kernels are slower, cuDNN benchmarking is disabled, and RM max_num_seqs=1 causes severe throughput loss. Typical E2E throughput drops 10–30% without RM; enabling RM determinism can reduce it significantly further.

Recommendation: Only enable for debugging, regression testing, or research. Leave disabled for production training.

Nondeterministic fallbacks: Some GPU ops have no deterministic implementation. torch.use_deterministic_algorithms(True, warn_only=True) warns when these are encountered.

Backend support:

Backend	Rollout	Reward model
vLLM	✅	✅ (serialized)
SGLang	❌	❌
TensorRT-LLM	❌	❌

PYTHONHASHSEED: Must be set before process start. verl handles this automatically; if you create Ray actors manually, you must propagate it to them via runtime_env.

Data parallelism: Each replica uses replica_rank + seed, so replicas sample differently from one another, but each replica's output is reproducible. Two runs with the same config therefore still produce bitwise-aligned results.

Multi-turn agent not supported: Full determinism only works for single-turn rollouts (single_turn_agent_loop). Multi-turn rollouts (tool_agent_loop) are not bitwise reproducible, for two reasons:

tool_agent_loop uses a random UUID per trajectory as request_id and does not pass priority to server_manager.generate(), so the deterministic routing and priority scheduling described above do not apply.
Even with those added, each turn is interleaved with external tool calls whose execution time varies across runs, so the order in which requests arrive at vLLM cannot be made deterministic. This is inherent to multi-turn agentic workloads.

For bitwise-reproducible rollouts, use single_turn_agent_loop. (The per-request sampling seed is still applied to multi-turn requests inside the vLLM server, but this alone is not sufficient for end-to-end reproducibility.)

Verifying Determinism

Rollout determinism (bitwise reproducible vLLM generation):

VLLM_DETERMINISM_DENSE_MODEL_PATH=${HOME}/models/Qwen/Qwen2.5-0.5B-Instruct \
VLLM_DETERMINISM_N_GPUS=2 \
pytest tests/workers/rollout/rollout_vllm/test_vllm_generation_determinism.py -v -s

E2E training (bitwise-aligned reward curves across two full PPO runs):

python tests/experimental/reward_loop/run_determinism_e2e_with_rm.py \
  --policy_model ~/models/Qwen/Qwen2.5-0.5B-Instruct \
  --rm_model ~/models/Skywork/Skywork-Reward-V2-Llama-3.2-1B \
  --train_files ~/data/gsm8k/train.parquet \
  --val_files ~/data/gsm8k/test.parquet \
  --n_gpus 2 --n_steps 5
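When comparing logged metrics from two runs by hand, "bitwise-aligned" is stricter than approximate equality. The sketch below shows one way to check it for values held as Python floats. The function names are illustrative and are not part of the test scripts above.

```python
import struct


def float_bits(x):
    """Raw IEEE-754 binary64 bit pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def first_bitwise_mismatch(run_a, run_b):
    """Return the first step where two curves differ bit-for-bit, else None."""
    if len(run_a) != len(run_b):
        return min(len(run_a), len(run_b))
    for step, (a, b) in enumerate(zip(run_a, run_b)):
        if float_bits(a) != float_bits(b):
            return step
    return None


rewards_run_1 = [0.25, 0.5, 0.3]
rewards_run_2 = [0.25, 0.5, 0.1 + 0.2]  # differs from 0.3 in the last bit
print(first_bitwise_mismatch(rewards_run_1, rewards_run_1))
print(first_bitwise_mismatch(rewards_run_1, rewards_run_2))
```

Comparing bit patterns instead of using == also distinguishes 0.0 from -0.0 and treats identical NaN payloads as equal, both of which matter when the goal is to confirm that two runs took exactly the same numerical path.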