Full Determinism for Reproducible RL Training
Authors: Haichuan Hu, Yongxiang Huang, Jiawei Zhang, Nguyen Long
Last updated: 06/16/2026.
Overview
By default, RL training in verl is not bitwise reproducible: identical configs run twice can produce different reward curves due to nondeterminism in GPU kernels, request scheduling, hash-based routing, and batch composition. The full determinism feature closes these gaps, enabling two identical runs to produce bitwise-aligned reward curves.
Useful for:
Debugging: reproduce a training failure exactly, step-by-step
Regression testing: verify that a code change has no silent effect on training outcomes
Research: ensure fair comparison when evaluating algorithmic changes
Quick Start
actor_rollout_ref:
  rollout:
    full_determinism: true
    seed: 42
  actor:
    fsdp_config:
      full_determinism: true
  ref:
    fsdp_config:
      full_determinism: true
reward:
  reward_model:
    enable: true
    rollout:
      full_determinism: true
      seed: 42
Or via Hydra overrides:
python -m verl.trainer.main_ppo \
    actor_rollout_ref.rollout.full_determinism=true \
    actor_rollout_ref.rollout.seed=42 \
    actor_rollout_ref.actor.fsdp_config.full_determinism=true \
    actor_rollout_ref.ref.fsdp_config.full_determinism=true \
    reward.reward_model.enable=true \
    reward.reward_model.rollout.full_determinism=true \
    reward.reward_model.rollout.seed=42 \
    [other config overrides...]
Important:
PYTHONHASHSEED must be set before the Python interpreter starts. verl handles this automatically: it sets PYTHONHASHSEED from rollout.seed before ray.init() and propagates it to all Ray actors. Do NOT set it manually.
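The following is a small, self-contained illustration of why the variable has to be in place before the interpreter starts. It is not verl code: the helper name string_hash_in_child is made up for this example, and it only relies on the standard library. Each call launches a fresh Python process with PYTHONHASHSEED in its environment and reports the hash of a fixed string.

```python
import os
import subprocess
import sys


def string_hash_in_child(hash_seed: str) -> str:
    """Start a fresh interpreter with PYTHONHASHSEED set and return hash('verl')."""
    env = dict(os.environ, PYTHONHASHSEED=hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('verl'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Two separate processes started with the same seed agree on the string hash.
    print(string_hash_in_child("42") == string_hash_in_child("42"))  # True
```

Assigning os.environ["PYTHONHASHSEED"] inside an already running interpreter does not change that interpreter's hash values; it only affects processes launched afterwards, which is why the value has to be propagated to workers at launch time.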
Configuration Reference
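| Parameter | Default | Scope | Description |
|---|---|---|---|
| actor_rollout_ref.rollout.full_determinism | | Rollout | Enables deterministic rollout generation |
| actor_rollout_ref.rollout.seed | | Rollout | Base seed; each replica uses replica_rank + seed |
| actor_rollout_ref.actor.fsdp_config.full_determinism | | Actor | Enables deterministic PyTorch ops for actor |
| actor_rollout_ref.ref.fsdp_config.full_determinism | | Ref model | Enables deterministic PyTorch ops for reference model |
| reward.reward_model.rollout.full_determinism | | Reward model | Enables deterministic RM inference (forces max_num_seqs=1) |
| reward.reward_model.rollout.seed | | Reward model | Base seed for RM vLLM server |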
How It Works
Determinism is enforced at five layers. All must be enabled for full E2E reproducibility:
PyTorch-level
enable_full_determinism(seed) sets PYTHONHASHSEED, CUBLAS_WORKSPACE_CONFIG, FLASH_ATTENTION_DETERMINISTIC, seeds all RNGs, calls torch.use_deterministic_algorithms(True, warn_only=True), and disables cuDNN benchmarking. Applied in all training engine implementations.
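As a rough sketch of the steps listed above, a helper of this kind might look as follows. This is not verl's implementation: the function name enable_full_determinism_sketch is made up, the value "1" for FLASH_ATTENTION_DETERMINISTIC is an assumption, and ":4096:8" is one of the CUBLAS_WORKSPACE_CONFIG values that PyTorch documents. NumPy and PyTorch are imported lazily so the snippet also runs where they are not installed.

```python
import os
import random


def enable_full_determinism_sketch(seed: int) -> None:
    """Illustrative approximation of the steps described above (not verl's code)."""
    # Environment variables; PYTHONHASHSEED set here only affects child processes.
    os.environ["PYTHONHASHSEED"] = str(seed)
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    os.environ["FLASH_ATTENTION_DETERMINISTIC"] = "1"  # assumed value

    # Seed the Python RNG, and NumPy's if it is available.
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass

    try:
        import torch
    except ImportError:
        return  # PyTorch not installed; nothing more to configure

    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    # warn_only=True: ops without a deterministic kernel warn instead of raising.
    torch.use_deterministic_algorithms(True, warn_only=True)
    torch.backends.cudnn.benchmark = False


if __name__ == "__main__":
    enable_full_determinism_sketch(42)
    first = random.random()
    enable_full_determinism_sketch(42)
    print(first == random.random())  # True: re-seeding replays the same stream
```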
Environment propagation
main_ppo.run_ppo() sets three env vars before ray.init():
PYTHONHASHSEED: freezes Python hash() and any hash-dependent ordering (for example, set iteration)
VERL_FULL_DETERMINISM: signals subprocesses to apply determinism
VLLM_BATCH_INVARIANT: makes vLLM outputs independent of batch composition
These are forwarded to all Ray actors via PPO_RAY_RUNTIME_ENV.
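The propagation step can be pictured as building a Ray runtime_env dictionary whose env_vars entry carries the three variables. The sketch below is an assumption about shape only: build_determinism_runtime_env is a hypothetical helper, the value "1" for VERL_FULL_DETERMINISM is assumed, and the real contents of PPO_RAY_RUNTIME_ENV are not shown here.

```python
def build_determinism_runtime_env(seed: int) -> dict:
    """Return a Ray-style runtime_env dict carrying the three variables."""
    return {
        "env_vars": {
            # Environment variable values must be strings.
            "PYTHONHASHSEED": str(seed),
            "VERL_FULL_DETERMINISM": "1",  # assumed value
            "VLLM_BATCH_INVARIANT": "1",
        }
    }


runtime_env = build_determinism_runtime_env(42)
# With Ray installed, this dict would be passed as ray.init(runtime_env=runtime_env)
# so that every worker process starts with the same variables.
print(runtime_env["env_vars"])
```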
vLLM batch invariance + per-request seed
VLLM_BATCH_INVARIANT=1 ensures vLLM outputs don’t depend on which other requests are batched together. Each generate() call injects SamplingParams.seed = replica_rank + config.seed to reset RNG per request.
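To see why a per-request seed matters, the toy example below contrasts a fresh RNG per request with a single shared RNG. It uses Python's random module as a stand-in for the sampler and does not call vLLM; all function names are hypothetical, and only the seed formula replica_rank + config.seed comes from the text.

```python
import random


def request_seed(replica_rank: int, base_seed: int) -> int:
    """Seed formula from the text: replica_rank + config.seed."""
    return replica_rank + base_seed


def sample_with_request_seed(seed: int, n: int = 4) -> list:
    """Toy sampler that resets its RNG for every request."""
    rng = random.Random(seed)
    return [rng.randrange(50_000) for _ in range(n)]


def sample_with_shared_rng(rng: random.Random, n: int = 4) -> list:
    """Toy sampler that keeps drawing from one long-lived RNG."""
    return [rng.randrange(50_000) for _ in range(n)]


seed = request_seed(replica_rank=1, base_seed=42)

# Per-request seeding: the result does not depend on what was served in between.
first = sample_with_request_seed(seed)
sample_with_request_seed(seed + 1)  # an unrelated request
second = sample_with_request_seed(seed)
print(first == second)  # True

# Shared RNG: each call continues the stream, so results depend on request history.
shared = random.Random(seed)
print(sample_with_shared_rng(shared), sample_with_shared_rng(shared))
```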
Priority scheduling + deterministic routing
Without determinism, request order depends on arrival timing, and server selection depends on dict iteration order. When full_determinism=true:
Each sample gets a globally unique priority injected into the batch (non_tensor_batch["priority"]), so each rollout request is scheduled with a stable, distinct priority
SingleTurnAgentLoop uses request_id=f"det-{priority}" instead of a random UUID
GlobalRequestLoadBalancer tie-breaking uses hash(request_id) % len(candidates), which is deterministic with a frozen PYTHONHASHSEED
Note:
priority is a vLLM-only parameter. LLMServerClient.generate() automatically filters it for non-vLLM backends.
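The three mechanisms in this layer can be sketched in a few lines. Everything below is illustrative: the function names are hypothetical, and the step * batch_size + index layout for priorities is an assumption, since the text only requires that priorities be globally unique. Only the request_id format and the tie-breaking expression are taken from the description above.

```python
def assign_priorities(global_step: int, batch_size: int) -> list:
    """Give every sample a distinct integer priority (assumed layout)."""
    base = global_step * batch_size
    return [base + i for i in range(batch_size)]


def deterministic_request_id(priority: int) -> str:
    """Derive the request id from the priority instead of a random UUID."""
    return f"det-{priority}"


def break_tie(request_id: str, candidates: list) -> str:
    """Pick among equally loaded servers.

    hash() of a str is salted per process, so this choice is only stable
    across runs when PYTHONHASHSEED is fixed before the interpreter starts.
    """
    return candidates[hash(request_id) % len(candidates)]


servers = ["server-0", "server-1", "server-2"]
for priority in assign_priorities(global_step=3, batch_size=4):
    rid = deterministic_request_id(priority)
    print(rid, "->", break_tie(rid, servers))
```

Python's % operator returns a non-negative result for a positive modulus, so a negative hash value still maps to a valid index.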
Reward model serialization
vLLM’s /classify endpoint (used by RM) does not support priority or batch invariance. When full_determinism=true, RewardModelManager forces max_num_seqs=1, serializing RM inference one request at a time.
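In configuration terms, the serialization amounts to overriding one engine setting when the flag is on. The helper below is hypothetical and does not reflect RewardModelManager's actual code; the default of 256 for max_num_seqs is an arbitrary placeholder.

```python
def rm_engine_kwargs(full_determinism: bool, seed: int, max_num_seqs: int = 256) -> dict:
    """Hypothetical helper: build engine kwargs for the reward model server."""
    kwargs = {"seed": seed, "max_num_seqs": max_num_seqs}
    if full_determinism:
        # One sequence at a time: no batching, so no batch-composition effects.
        kwargs["max_num_seqs"] = 1
    return kwargs


print(rm_engine_kwargs(full_determinism=True, seed=42))
print(rm_engine_kwargs(full_determinism=False, seed=42))
```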
Side Effects and Limitations
Performance: deterministic PyTorch kernels are slower, cuDNN benchmarking is disabled, and RM max_num_seqs=1 causes severe throughput loss. Typical E2E throughput drops 10–30% without RM; with RM determinism enabled, the drop can be significantly larger.
Recommendation: Only enable for debugging, regression testing, or research. Leave disabled for production training.
Nondeterministic fallbacks: Some GPU ops have no deterministic implementation. torch.use_deterministic_algorithms(True, warn_only=True) warns when these are encountered.
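A common reason an op is nondeterministic is that parallel threads accumulate partial results in an order that varies between runs, and floating-point addition is not associative. The plain-Python snippet below shows the effect with three numbers; it illustrates the arithmetic only and does not involve any GPU kernel.

```python
# The same three values summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6
```

A difference in the last bit like this is enough to break bitwise alignment once it feeds into sampling or subsequent training steps.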
Backend support:
| Backend | Rollout | Reward model |
|---|---|---|
| vLLM | ✅ | ✅ (serialized) |
| SGLang | ❌ | ❌ |
| TensorRT-LLM | ❌ | ❌ |
PYTHONHASHSEED: Must be set before process start. verl handles this automatically; manual Ray actor creation must propagate it via runtime_env.
Data parallelism: Each replica uses replica_rank + seed, producing different but internally reproducible outputs. Two runs with the same config produce bitwise-aligned results.
Multi-turn agent not supported: Full determinism only works for single-turn rollouts (single_turn_agent_loop). Multi-turn rollouts (tool_agent_loop) are not bitwise reproducible, for two reasons:
tool_agent_loop uses a random UUID per trajectory as request_id and does not pass priority to server_manager.generate(), so the deterministic routing and priority scheduling described above do not apply.
Even with those added, each turn is interleaved with external tool calls whose execution time varies across runs, so the order in which requests arrive at vLLM cannot be made deterministic. This is inherent to multi-turn agentic workloads.
For bitwise-reproducible rollouts, use single_turn_agent_loop. (The per-request sampling seed is still applied to multi-turn requests inside the vLLM server, but this alone is not sufficient for end-to-end reproducibility.)
Verifying Determinism
Rollout determinism (bitwise reproducible vLLM generation):
VLLM_DETERMINISM_DENSE_MODEL_PATH=${HOME}/models/Qwen/Qwen2.5-0.5B-Instruct \
VLLM_DETERMINISM_N_GPUS=2 \
pytest tests/workers/rollout/rollout_vllm/test_vllm_generation_determinism.py -v -s
E2E training (bitwise-aligned reward curves across two full PPO runs):
python tests/experimental/reward_loop/run_determinism_e2e_with_rm.py \
    --policy_model ~/models/Qwen/Qwen2.5-0.5B-Instruct \
    --rm_model ~/models/Skywork/Skywork-Reward-V2-Llama-3.2-1B \
    --train_files ~/data/gsm8k/train.parquet \
    --val_files ~/data/gsm8k/test.parquet \
    --n_gpus 2 --n_steps 5
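When comparing logged reward curves by hand, "bitwise-aligned" is stricter than approximate equality. The sketch below is one way to check it, assuming the per-step rewards of each run have been loaded as Python floats; the helper names and the sample values are made up, and this is not part of the test scripts above.

```python
import struct


def float_bits(x: float) -> bytes:
    """Return the IEEE 754 double-precision encoding of x."""
    return struct.pack("<d", x)


def first_bitwise_mismatch(run_a, run_b):
    """Return the index of the first step that differs bitwise, or None if aligned."""
    for step, (a, b) in enumerate(zip(run_a, run_b)):
        if float_bits(a) != float_bits(b):
            return step
    if len(run_a) != len(run_b):
        # One run logged more steps than the other.
        return min(len(run_a), len(run_b))
    return None


rewards_run_1 = [0.25, 0.5, 0.1 + 0.2]  # made-up values
rewards_run_2 = [0.25, 0.5, 0.3]

print(first_bitwise_mismatch(rewards_run_1, rewards_run_1))  # None
print(first_bitwise_mismatch(rewards_run_1, rewards_run_2))  # 2
```

Comparing encoded bytes rather than using == also distinguishes 0.0 from -0.0 and treats identical NaN payloads as equal, which matches the bitwise notion used in this document.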