Trainer Interface
Trainers drive the training loop. Adding new trainer classes to support new training paradigms is encouraged.
Core APIs
- class verl.trainer.ppo.ray_trainer.RayPPOTrainer(config, tokenizer, role_worker_mapping: dict[verl.trainer.ppo.ray_trainer.Role, typing.Type[verl.single_controller.base.worker.Worker]], resource_pool_manager: verl.trainer.ppo.ray_trainer.ResourcePoolManager, ray_worker_group_cls: verl.single_controller.ray.base.RayWorkerGroup = <class 'verl.single_controller.ray.base.RayWorkerGroup'>, processor=None, reward_fn=None, val_reward_fn=None, train_dataset: torch.utils.data.dataset.Dataset | None = None, val_dataset: torch.utils.data.dataset.Dataset | None = None, collate_fn=None, train_sampler: torch.utils.data.sampler.Sampler | None = None, device_name='cuda')
- __init__(config, tokenizer, role_worker_mapping: dict[verl.trainer.ppo.ray_trainer.Role, typing.Type[verl.single_controller.base.worker.Worker]], resource_pool_manager: verl.trainer.ppo.ray_trainer.ResourcePoolManager, ray_worker_group_cls: verl.single_controller.ray.base.RayWorkerGroup = <class 'verl.single_controller.ray.base.RayWorkerGroup'>, processor=None, reward_fn=None, val_reward_fn=None, train_dataset: torch.utils.data.dataset.Dataset | None = None, val_dataset: torch.utils.data.dataset.Dataset | None = None, collate_fn=None, train_sampler: torch.utils.data.sampler.Sampler | None = None, device_name='cuda')
Initialize distributed PPO trainer with Ray backend. Note that this trainer runs on the driver process on a single CPU/GPU node.
- Parameters:
config – Configuration object containing training parameters.
tokenizer – Tokenizer used for encoding and decoding text.
role_worker_mapping (dict[Role, WorkerType]) – Mapping from roles to worker classes.
resource_pool_manager (ResourcePoolManager) – Manager for Ray resource pools.
ray_worker_group_cls (RayWorkerGroup, optional) – Class for Ray worker groups. Defaults to RayWorkerGroup.
processor – Optional data processor, used for multimodal data.
reward_fn – Function for computing rewards during training.
val_reward_fn – Function for computing rewards during validation.
train_dataset (Optional[Dataset], optional) – Training dataset. Defaults to None.
val_dataset (Optional[Dataset], optional) – Validation dataset. Defaults to None.
collate_fn – Function to collate data samples into batches.
train_sampler (Optional[Sampler], optional) – Sampler for the training dataset. Defaults to None.
device_name (str, optional) – Device name for training (e.g., "cuda", "cpu"). Defaults to "cuda".
- fit()
The training loop of PPO. The driver process only needs to call the compute functions of the worker groups through RPC to construct the PPO dataflow. The lightweight advantage computation is done on the driver process.
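The division of labour described for fit() can be illustrated with a toy, dependency-free sketch. ToyWorkerGroup, its methods, and toy_fit are hypothetical stand-ins, not verl classes or functions. The real trainer dispatches its calls to Ray worker groups, and the reward-minus-value advantage below is only a placeholder for a proper advantage estimator.

```python
class ToyWorkerGroup:
    """Hypothetical stand-in for a remote worker group (not a verl class)."""

    def __init__(self):
        self.updates = 0

    def generate(self, prompts):
        # Pretend rollout: each "response" is the reversed prompt.
        return [p[::-1] for p in prompts]

    def compute_values(self, responses):
        # Pretend critic: a constant baseline for every response.
        return [0.5 for _ in responses]

    def update(self, advantages):
        # Pretend optimizer step: count the call and report a metric.
        self.updates += 1
        return {"mean_adv": sum(advantages) / len(advantages)}


def toy_fit(workers, batches, reward_fn):
    """Driver loop: heavy steps are delegated, advantages are computed locally."""
    metrics = []
    for prompts in batches:
        responses = workers.generate(prompts)        # delegated to workers
        values = workers.compute_values(responses)   # delegated to workers
        rewards = [reward_fn(r) for r in responses]
        # Lightweight step kept on the driver (placeholder estimator).
        advantages = [r - v for r, v in zip(rewards, values)]
        metrics.append(workers.update(advantages))   # delegated to workers
    return metrics


workers = ToyWorkerGroup()
history = toy_fit(
    workers,
    batches=[["ab", "abc"], ["a"]],
    reward_fn=lambda response: float(len(response)),
)
```

The driver never touches model weights in this sketch; it only sequences the calls and performs the cheap arithmetic in between.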
- init_workers()ο
Initialize distributed training workers using Ray backend.
Creates:
1. Ray resource pools from configuration
2. Worker groups for each role (actor, critic, etc.)
Utils for tokenization.
- verl.utils.tokenizer.hf_tokenizer(name_or_path, correct_pad_token=True, correct_gemma2=True, **kwargs)
Create a Hugging Face pretrained tokenizer that correctly handles EOS and pad tokens.
- Parameters:
name_or_path (str) – The name or path of the tokenizer.
correct_pad_token (bool) – Whether to correct the pad token id.
correct_gemma2 (bool) – Whether to correct the gemma2 tokenizer.
- Returns:
The pretrained tokenizer.
- Return type:
transformers.PreTrainedTokenizer
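A common form of pad-token correction is to fall back to the EOS token when a tokenizer defines no pad token. The sketch below shows that idea on a stand-in object. fix_pad_token is a hypothetical helper, and treating this as what correct_pad_token does is an assumption rather than a description of the library's code.

```python
from types import SimpleNamespace


def fix_pad_token(tokenizer):
    """Fall back to the EOS token when no pad token is defined (assumed behaviour)."""
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# Stand-in for a tokenizer that ships without a pad token.
toy_tokenizer = SimpleNamespace(
    pad_token=None, pad_token_id=None, eos_token="</s>", eos_token_id=2
)
fix_pad_token(toy_tokenizer)
```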
Core functions to implement PPO algorithms. The functions implemented in this file should be used by trainers with different distributed strategies to implement PPO-like algorithms.
- verl.trainer.ppo.core_algos.agg_loss(loss_mat: Tensor, loss_mask: Tensor, loss_agg_mode: str)
Aggregate the loss matrix into a scalar.
- Parameters:
loss_mat – (torch.Tensor) shape: (bs, response_length)
loss_mask – (torch.Tensor) shape: (bs, response_length)
loss_agg_mode – (str) method used to aggregate the loss matrix into a scalar.
- Returns:
The aggregated loss, a scalar torch.Tensor.
- Return type:
torch.Tensor
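The difference between aggregation modes is easiest to see on a small example. The following is a minimal pure-Python sketch using nested lists in place of tensors. The mode name "token-mean" appears in this page; "seq-mean-token-sum" and "seq-mean-token-mean" are assumed names used for illustration, and agg_loss_sketch is not the library function. It also assumes every sequence has at least one unmasked token.

```python
def agg_loss_sketch(loss_mat, loss_mask, loss_agg_mode):
    """Reduce a (bs, response_length) loss matrix to a scalar, skipping masked tokens."""
    # Masked sum of the losses in each sequence.
    seq_sums = [
        sum(l * m for l, m in zip(row, mask_row))
        for row, mask_row in zip(loss_mat, loss_mask)
    ]
    # Number of unmasked tokens in each sequence.
    seq_counts = [sum(mask_row) for mask_row in loss_mask]

    if loss_agg_mode == "token-mean":
        # Every unmasked token in the batch has equal weight.
        return sum(seq_sums) / sum(seq_counts)
    if loss_agg_mode == "seq-mean-token-sum":
        # Sum within each sequence, then average over sequences.
        return sum(seq_sums) / len(seq_sums)
    if loss_agg_mode == "seq-mean-token-mean":
        # Average within each sequence, then average over sequences.
        seq_means = [s / c for s, c in zip(seq_sums, seq_counts)]
        return sum(seq_means) / len(seq_means)
    raise ValueError(f"unknown loss_agg_mode: {loss_agg_mode}")


loss_mat = [[1.0, 2.0, 3.0], [4.0, 9.0, 9.0]]
loss_mask = [[1, 1, 1], [1, 0, 0]]  # the two 9.0 entries are padding

token_mean = agg_loss_sketch(loss_mat, loss_mask, "token-mean")
seq_sum = agg_loss_sketch(loss_mat, loss_mask, "seq-mean-token-sum")
seq_mean = agg_loss_sketch(loss_mat, loss_mask, "seq-mean-token-mean")
```

With token-level averaging, long responses contribute more tokens and therefore more weight; the sequence-level modes weight each response equally.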
- verl.trainer.ppo.core_algos.compute_policy_loss(old_log_prob, log_prob, advantages, response_mask, cliprange=None, cliprange_low=None, cliprange_high=None, clip_ratio_c=3.0, loss_agg_mode: str = 'token-mean')
Compute the clipped policy objective and related metrics for PPO.
Adapted from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1122
- Parameters:
old_log_prob (torch.Tensor) – Log-probabilities of actions under the old policy, shape (batch_size, response_length).
log_prob (torch.Tensor) – Log-probabilities of actions under the current policy, shape (batch_size, response_length).
advantages (torch.Tensor) – Advantage estimates for each action, shape (batch_size, response_length).
response_mask (torch.Tensor) – Mask indicating which tokens to include in the loss, shape (batch_size, response_length).
cliprange (float, optional) – Clipping parameter ε for standard PPO. See https://arxiv.org/abs/1707.06347. Defaults to None (must be provided).
cliprange_low (float, optional) – Lower clip range for dual-clip PPO. Defaults to same as cliprange.
cliprange_high (float, optional) – Upper clip range for dual-clip PPO. Defaults to same as cliprange.
clip_ratio_c (float, optional) – Lower bound of the ratio for dual-clip PPO. See https://arxiv.org/pdf/1912.09729. Defaults to 3.0.
loss_agg_mode (str, optional) – Aggregation mode for agg_loss. Defaults to "token-mean".
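How the clipping parameters interact can be sketched for a single token. The function below is a scalar, pure-Python illustration of a dual-clip PPO loss as described in the two papers linked above. clipped_pg_loss is a hypothetical name, the 0.2 defaults are chosen for illustration only, and the real function operates on masked tensors and also reports metrics.

```python
import math


def clipped_pg_loss(old_log_prob, log_prob, advantage,
                    cliprange_low=0.2, cliprange_high=0.2, clip_ratio_c=3.0):
    """Per-token dual-clip PPO loss (illustrative scalar version)."""
    # Importance ratio between the current and the old policy.
    ratio = math.exp(log_prob - old_log_prob)

    # Standard PPO: take the more pessimistic of the unclipped and clipped terms.
    unclipped = -advantage * ratio
    clipped_ratio = min(max(ratio, 1.0 - cliprange_low), 1.0 + cliprange_high)
    clipped = -advantage * clipped_ratio
    loss = max(unclipped, clipped)

    # Dual clip: for negative advantages, cap the loss at -advantage * clip_ratio_c
    # so that a very large ratio cannot dominate the update.
    if advantage < 0:
        loss = min(loss, -advantage * clip_ratio_c)
    return loss


on_policy = clipped_pg_loss(0.0, 0.0, advantage=1.0)            # ratio = 1
clipped_up = clipped_pg_loss(0.0, math.log(2.0), advantage=1.0)  # ratio = 2
dual_clip = clipped_pg_loss(0.0, math.log(5.0), advantage=-1.0)  # ratio = 5
```

In the second call the ratio is clipped to 1 + cliprange_high, and in the third the dual-clip bound replaces a loss of about 5 with clip_ratio_c times the advantage magnitude.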
- verl.trainer.ppo.core_algos.kl_penalty(logprob: FloatTensor, ref_logprob: FloatTensor, kl_penalty) -> FloatTensor
Compute KL divergence given logprob and ref_logprob. Copied from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1104. For a fuller description, see http://joschu.net/blog/kl-approx.html
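- Parameters:
logprob – Log-probabilities under the current policy.
ref_logprob – Log-probabilities under the reference policy.
kl_penalty – Name of the KL estimator to apply.
- Returns:
The estimated KL divergence, as a FloatTensor.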
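The blog post linked above describes three sample-based KL estimators, usually called k1, k2 and k3. A scalar sketch of them follows, assuming the token was sampled from the current policy. kl_estimate and the estimator names are taken from the blog post for illustration and are not necessarily the option strings that kl_penalty accepts.

```python
import math


def kl_estimate(logprob, ref_logprob, estimator="k1"):
    """Per-token estimate of KL(current || reference) for a sampled token."""
    # log r, where r = p_ref(x) / p_current(x) and x is drawn from the current policy.
    log_ratio = ref_logprob - logprob
    if estimator == "k1":
        # Unbiased but high variance; individual samples can be negative.
        return -log_ratio
    if estimator == "k2":
        # Biased, lower variance, always non-negative.
        return 0.5 * log_ratio ** 2
    if estimator == "k3":
        # Unbiased and always non-negative: (r - 1) - log r.
        return math.exp(log_ratio) - 1.0 - log_ratio
    raise ValueError(f"unknown estimator: {estimator}")


k1 = kl_estimate(-1.0, -1.5, "k1")
k2 = kl_estimate(-1.0, -1.5, "k2")
k3 = kl_estimate(-1.0, -1.5, "k3")
```

All three estimators return zero when the two log-probabilities are equal.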
- verl.trainer.ppo.reward.compute_reward(data: DataProto, reward_fn)
Compute reward for a batch of data.
- Parameters:
data – DataProto object containing the input data.
reward_fn – Reward function to compute the reward.
- Returns:
Tuple of reward tensor and extra info dictionary.
- verl.trainer.ppo.reward.load_reward_manager(config, tokenizer, num_examine, **reward_kwargs)
Load and initialize a reward manager based on the configuration.
- Parameters:
config – PPO trainer configuration object containing reward_model fields.
tokenizer – Tokenizer object used for processing text.
num_examine – Number of samples to examine.
**reward_kwargs – Additional keyword arguments for the reward manager.
- Returns:
An instance of the specified reward manager class.
- class verl.workers.reward_manager.NaiveRewardManager(tokenizer, num_examine, compute_score=None, reward_fn_key='data_source')
The reward manager.
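The compute_score argument lets a caller supply a custom scoring function. The sketch below shows what such a function might look like. The signature (data_source, solution_str, ground_truth, extra_info) is an assumption that is not documented on this page, and the "toy_math" data source is hypothetical.

```python
def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Hypothetical scorer: exact match for "toy_math", substring match otherwise."""
    answer = solution_str.strip()
    if data_source == "toy_math":
        # Strict comparison for tasks with a single canonical answer.
        return 1.0 if answer == str(ground_truth) else 0.0
    # Lenient fallback: reward any response that contains the ground truth.
    return 1.0 if str(ground_truth) in answer else 0.0


exact = compute_score("toy_math", " 42 ", 42)
miss = compute_score("toy_math", "The answer is 42", 42)
lenient = compute_score("toy_qa", "The answer is 42", 42)
```

Dispatching on the data source is one way a single scorer can serve several datasets, which is what the reward_fn_key='data_source' default suggests.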
- class verl.workers.reward_manager.DAPORewardManager(tokenizer, num_examine, compute_score=None, reward_fn_key='data_source', max_resp_len=None, overlong_buffer_cfg=None)
The reward manager.