Full Determinism for Reproducible RL Training

Authors: Haichuan Hu, Yongxiang Huang, Jiawei Zhang, Nguyen Long

Last updated: 06/16/2026.

Overview

By default, RL training in verl is not bitwise reproducible: identical configs run twice can produce different reward curves due to nondeterminism in GPU kernels, request scheduling, hash-based routing, and batch composition. The full determinism feature closes these gaps, enabling two identical runs to produce bitwise-aligned reward curves.

Useful for:

  • Debugging: reproduce a training failure exactly, step-by-step

  • Regression testing: verify that a code change has no silent effect on training outcomes

  • Research: ensure fair comparison when evaluating algorithmic changes

Quick Start

actor_rollout_ref:
  rollout:
    full_determinism: true
    seed: 42
  actor:
    fsdp_config:
      full_determinism: true
  ref:
    fsdp_config:
      full_determinism: true

reward:
  reward_model:
    enable: true
    rollout:
      full_determinism: true
      seed: 42

Or via Hydra overrides:

python -m verl.trainer.main_ppo \
  actor_rollout_ref.rollout.full_determinism=true \
  actor_rollout_ref.rollout.seed=42 \
  actor_rollout_ref.actor.fsdp_config.full_determinism=true \
  actor_rollout_ref.ref.fsdp_config.full_determinism=true \
  reward.reward_model.enable=true \
  reward.reward_model.rollout.full_determinism=true \
  reward.reward_model.rollout.seed=42 \
  [other config overrides...]

Important: PYTHONHASHSEED must be set before the Python interpreter starts. verl handles this automatically — it sets PYTHONHASHSEED from rollout.seed before ray.init() and propagates it to all Ray actors. Do NOT set it manually.
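The effect of PYTHONHASHSEED can be checked outside verl with a small sketch. It starts fresh interpreters with and without a fixed seed and compares the string hashes they report. The helper name child_hash is hypothetical and is not part of verl.

```python
import os
import subprocess
import sys


def child_hash(text, hash_seed=None):
    """Return hash(text) as computed by a freshly started Python interpreter."""
    env = dict(os.environ)
    env.pop("PYTHONHASHSEED", None)  # start from a clean state
    if hash_seed is not None:
        env["PYTHONHASHSEED"] = str(hash_seed)
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({text!r}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


# Two separate interpreters agree on the hash once the seed is fixed at startup.
first = child_hash("det-0", hash_seed=42)
second = child_hash("det-0", hash_seed=42)
print(first == second)  # True
```

Without a fixed seed, string hashes are randomized per process, which is why the variable has to be in the environment before each worker process starts.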

Configuration Reference

| Parameter | Default | Scope | Description |
|---|---|---|---|
| actor_rollout_ref.rollout.full_determinism | false | Rollout | Enables deterministic rollout generation |
| actor_rollout_ref.rollout.seed | 42 | Rollout | Base seed; each replica uses replica_rank + seed |
| actor_rollout_ref.actor.fsdp_config.full_determinism | false | Actor | Enables deterministic PyTorch ops for actor |
| actor_rollout_ref.ref.fsdp_config.full_determinism | false | Ref model | Enables deterministic PyTorch ops for reference model |
| reward.reward_model.rollout.full_determinism | false | Reward model | Enables deterministic RM inference (forces max_num_seqs=1) |
| reward.reward_model.rollout.seed | 42 | Reward model | Base seed for RM vLLM server |

How It Works

Determinism is enforced at five layers. All must be enabled for full E2E reproducibility:

PyTorch-level

enable_full_determinism(seed) sets PYTHONHASHSEED, CUBLAS_WORKSPACE_CONFIG, FLASH_ATTENTION_DETERMINISTIC, seeds all RNGs, calls torch.use_deterministic_algorithms(True, warn_only=True), and disables cuDNN benchmarking. Applied in all training engine implementations.
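A rough sketch of what such a helper might look like is given below. It is not verl's actual implementation: the function name, the environment variable values (":4096:8" and "1"), and the optional-import handling are assumptions made for illustration. The PyTorch calls themselves are standard API.

```python
import os
import random


def sketch_enable_full_determinism(seed: int) -> None:
    """Illustrative only; the values below are assumptions, not verl's."""
    # Only takes effect in child processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # One of the workspace settings PyTorch documents for deterministic cuBLAS.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    # Assumed value; the source only names the variable.
    os.environ["FLASH_ATTENTION_DETERMINISTIC"] = "1"

    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is optional in this sketch

    try:
        import torch
    except ImportError:
        return  # nothing more to do without PyTorch

    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is absent
    # Warn, rather than raise, on ops without a deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)
    # Disable the autotuner, which may select different kernels per run.
    torch.backends.cudnn.benchmark = False


sketch_enable_full_determinism(42)
```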

Environment propagation

main_ppo.run_ppo() sets three env vars before ray.init():

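  • PYTHONHASHSEED: fixes the values returned by Python's hash() for str and bytes, and therefore any ordering or routing decision derived from them (such as set iteration order)

  • VERL_FULL_DETERMINISM: signals subprocesses to apply determinism

  • VLLM_BATCH_INVARIANT: makes vLLM outputs independent of batch composition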

These are forwarded to all Ray actors via PPO_RAY_RUNTIME_ENV.

vLLM batch invariance + per-request seed

VLLM_BATCH_INVARIANT=1 ensures vLLM outputs don’t depend on which other requests are batched together. Each generate() call injects SamplingParams.seed = replica_rank + config.seed to reset RNG per request.
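The per-request seeding idea can be illustrated without vLLM. In the sketch below, Python's random.Random stands in for the sampler, and both function names are hypothetical. The point is that an RNG reset from replica_rank + seed at the start of each request produces output that does not depend on what was sampled before it.

```python
import random


def request_seed(replica_rank: int, base_seed: int) -> int:
    """Seed rule described above: replica_rank + seed."""
    return replica_rank + base_seed


def sample_tokens(seed: int, n: int = 5, vocab_size: int = 1000) -> list[int]:
    """Stand-in for sampling: a fresh RNG is created for every request."""
    rng = random.Random(seed)
    return [rng.randrange(vocab_size) for _ in range(n)]


# Same replica and seed: identical samples, regardless of earlier requests.
run_a = sample_tokens(request_seed(0, 42))
_ = sample_tokens(request_seed(1, 42))  # unrelated request in between
run_b = sample_tokens(request_seed(0, 42))
print(run_a == run_b)  # True
```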

Priority scheduling + deterministic routing

Without determinism, request order depends on arrival timing, and server selection depends on dict iteration order. When full_determinism=true:

  • Each sample gets a globally unique priority injected into the batch (non_tensor_batch["priority"]), so each rollout request is scheduled with a stable, distinct priority

  • SingleTurnAgentLoop uses request_id=f"det-{priority}" instead of random UUID

  • GlobalRequestLoadBalancer tie-breaking uses hash(request_id) % len(candidates) — deterministic with frozen PYTHONHASHSEED

Note: priority is a vLLM-only parameter. LLMServerClient.generate() automatically filters it for non-vLLM backends.
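The request ID and tie-breaking rules above can be sketched as follows. This is a simplified illustration, not the GlobalRequestLoadBalancer code: the function names, the load dictionary, and the choice to sort candidates by name are assumptions.

```python
def deterministic_request_id(priority: int) -> str:
    """Stable request ID derived from the sample's unique priority."""
    return f"det-{priority}"


def pick_server(request_id: str, load_by_server: dict[str, int]) -> str:
    """Pick the least-loaded server; break ties with hash(request_id)."""
    min_load = min(load_by_server.values())
    # Sorting gives the candidate list a stable order (an assumption here).
    candidates = sorted(s for s, load in load_by_server.items() if load == min_load)
    # hash() of a str is only stable across processes if PYTHONHASHSEED is fixed.
    return candidates[hash(request_id) % len(candidates)]


loads = {"server-0": 3, "server-1": 1, "server-2": 1}
request_id = deterministic_request_id(7)
chosen = pick_server(request_id, loads)
print(request_id)  # det-7
print(chosen in {"server-1", "server-2"})  # True
```

With a random UUID as request_id, the tie-break would change on every run even with a frozen hash seed, which is why the ID is derived from the priority instead.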

Reward model serialization

vLLM’s /classify endpoint (used by RM) does not support priority or batch invariance. When full_determinism=true, RewardModelManager forces max_num_seqs=1, serializing RM inference one request at a time.

Side Effects and Limitations

Performance: deterministic PyTorch kernels are slower, cuDNN benchmarking is disabled, and RM max_num_seqs=1 causes severe throughput loss. Typical E2E throughput drops 10–30% without RM; enabling RM determinism can reduce it significantly further.

Recommendation: Only enable for debugging, regression testing, or research. Leave disabled for production training.

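Nondeterministic fallbacks: Some GPU ops have no deterministic implementation. Because torch.use_deterministic_algorithms is called with warn_only=True, these ops emit a warning instead of raising an error, so training continues but results from those ops may differ between runs.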
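One way to surface such fallbacks during a debugging run is to record warnings around a training step and keep the ones that mention determinism. The sketch below uses only the standard warnings module. The helper name and the substring match on "deterministic" are assumptions, and the warning raised in the example is synthetic rather than one produced by PyTorch.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def collect_determinism_warnings():
    """Yield a list that is filled with matching warning messages on exit."""
    found = []
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not suppress repeated warnings
        yield found
        found.extend(
            str(w.message) for w in caught if "deterministic" in str(w.message)
        )


with collect_determinism_warnings() as found:
    # Synthetic stand-in for a warning emitted during a training step.
    warnings.warn("some_op does not have a deterministic implementation", UserWarning)
    warnings.warn("unrelated warning", UserWarning)

print(len(found))  # 1
```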

Backend support:

| Backend | Rollout | Reward model |
|---|---|---|
| vLLM | ✅ | ✅ (serialized) |
| SGLang | | |
| TensorRT-LLM | | |

PYTHONHASHSEED: Must be set before process start. verl handles this automatically; manual Ray actor creation must propagate it via runtime_env.

Data parallelism: Each replica uses replica_rank + seed, so outputs differ across replicas but are reproducible for a given replica. Two runs with the same config therefore produce bitwise-aligned results.

Multi-turn agent not supported: Full determinism only works for single-turn rollouts (single_turn_agent_loop). Multi-turn rollouts (tool_agent_loop) are not bitwise reproducible, for two reasons:

  • tool_agent_loop uses a random UUID per trajectory as request_id and does not pass priority to server_manager.generate(), so the deterministic routing and priority scheduling described above do not apply.

  • Even with those added, each turn is interleaved with external tool calls whose execution time varies across runs, so the order in which requests arrive at vLLM cannot be made deterministic. This is inherent to multi-turn agentic workloads.

For bitwise-reproducible rollouts, use single_turn_agent_loop. (The per-request sampling seed is still applied to multi-turn requests inside the vLLM server, but this alone is not sufficient for end-to-end reproducibility.)

Verifying Determinism

Rollout determinism (bitwise reproducible vLLM generation):

VLLM_DETERMINISM_DENSE_MODEL_PATH=${HOME}/models/Qwen/Qwen2.5-0.5B-Instruct \
VLLM_DETERMINISM_N_GPUS=2 \
pytest tests/workers/rollout/rollout_vllm/test_vllm_generation_determinism.py -v -s

E2E training (bitwise-aligned reward curves across two full PPO runs):

python tests/experimental/reward_loop/run_determinism_e2e_with_rm.py \
  --policy_model ~/models/Qwen/Qwen2.5-0.5B-Instruct \
  --rm_model ~/models/Skywork/Skywork-Reward-V2-Llama-3.2-1B \
  --train_files ~/data/gsm8k/train.parquet \
  --val_files ~/data/gsm8k/test.parquet \
  --n_gpus 2 --n_steps 5
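
When comparing logged rewards from two runs by hand, a bitwise check is stricter than a tolerance-based one. The sketch below compares two lists of per-step rewards by their IEEE 754 byte representation. The function names are hypothetical, and how the reward values are loaded from logs is left out.

```python
import struct


def float_bits(x: float) -> bytes:
    """Return the 8-byte little-endian IEEE 754 encoding of x."""
    return struct.pack("<d", x)


def first_divergence(run_a: list[float], run_b: list[float]):
    """Return the first step where the curves differ bitwise, or None."""
    for step, (a, b) in enumerate(zip(run_a, run_b)):
        if float_bits(a) != float_bits(b):
            return step
    if len(run_a) != len(run_b):
        return min(len(run_a), len(run_b))  # one run has extra steps
    return None


print(first_divergence([0.1, 0.25], [0.1, 0.25]))      # None
print(first_divergence([0.1, 0.3], [0.1, 0.1 + 0.2]))  # 1
```

The second call reports step 1 because 0.1 + 0.2 and 0.3 differ in their last bits, which is the kind of drift that a tolerance-based comparison would hide.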