Trainer Interface

Trainers drive the training loop. You are encouraged to introduce a new trainer class when a new training paradigm calls for one.

Core APIs

Utils for tokenization.

verl.utils.tokenizer.hf_tokenizer(name_or_path, correct_pad_token=True, correct_gemma2=True, **kwargs)

Create a Hugging Face pretrained tokenizer that correctly handles EOS and pad tokens.

Parameters:
  • name_or_path (str) – The name of the tokenizer, or a path to it.

  • correct_pad_token (bool) – Whether to correct the pad token id.

  • correct_gemma2 (bool) – Whether to correct the gemma2 tokenizer.

Returns:

The pretrained tokenizer.

Return type:

transformers.PreTrainedTokenizer
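The kind of pad-token correction described above can be sketched as follows. This is a minimal illustration, not the verl implementation: it assumes that "correcting" the pad token means falling back to the EOS token when no pad token is defined, and `ToyTokenizer` and `fill_missing_pad_token` are hypothetical names used only here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTokenizer:
    """Hypothetical stand-in exposing only the fields this sketch touches."""
    eos_token: Optional[str] = "</s>"
    eos_token_id: Optional[int] = 2
    pad_token: Optional[str] = None
    pad_token_id: Optional[int] = None


def fill_missing_pad_token(tokenizer):
    """Fall back to the EOS token when no pad token is set (assumed behavior)."""
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# A tokenizer without a pad token inherits the EOS token.
tok = fill_missing_pad_token(ToyTokenizer())

# A tokenizer that already defines a pad token is left unchanged.
kept = fill_missing_pad_token(ToyTokenizer(pad_token="<pad>", pad_token_id=0))
```

Batched training needs a pad token to bring sequences to a common length, which is why a tokenizer that ships without one has to be patched before use.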

Core functions for implementing PPO algorithms. Trainers that use different distributed strategies should call the functions in this module to implement PPO.

verl.trainer.ppo.core_algos.agg_loss(loss_mat: Tensor, loss_mask: Tensor, loss_agg_mode: str)

Aggregate the loss matrix into a scalar.

Parameters:
  • loss_mat (torch.Tensor) – Per-token loss matrix, shape (bs, response_length).

  • loss_mask (torch.Tensor) – Mask selecting the tokens that count toward the loss, shape (bs, response_length).

  • loss_agg_mode (str) – Method used to aggregate the loss matrix into a scalar, for example “token-mean”.

Returns:

The aggregated loss, as a scalar tensor.

Return type:

torch.Tensor
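To make the aggregation concrete, here is a minimal sketch that uses plain Python lists in place of torch tensors. `masked_token_mean` shows one reading of the “token-mean” mode: average over every unmasked token in the batch. `masked_sequence_mean` is a hypothetical alternative, included only to show how the choice of mode changes the result. Both helpers are illustrative names rather than part of verl, and both assume at least one unmasked token per row.

```python
def masked_token_mean(loss_mat, loss_mask):
    """Mean of the loss over all unmasked tokens in the batch."""
    total = sum(
        loss * mask
        for loss_row, mask_row in zip(loss_mat, loss_mask)
        for loss, mask in zip(loss_row, mask_row)
    )
    count = sum(mask for mask_row in loss_mask for mask in mask_row)
    return total / count


def masked_sequence_mean(loss_mat, loss_mask):
    """Hypothetical alternative: average within each sequence, then across sequences."""
    per_sequence = [
        sum(loss * mask for loss, mask in zip(loss_row, mask_row)) / sum(mask_row)
        for loss_row, mask_row in zip(loss_mat, loss_mask)
    ]
    return sum(per_sequence) / len(per_sequence)


loss_mat = [[1.0, 2.0, 3.0], [4.0, 0.0, 0.0]]
loss_mask = [[1, 1, 1], [1, 0, 0]]

token_mean = masked_token_mean(loss_mat, loss_mask)        # (1 + 2 + 3 + 4) / 4 tokens
sequence_mean = masked_sequence_mean(loss_mat, loss_mask)  # (2.0 + 4.0) / 2 sequences
```

The token-level mean weights long responses more heavily, because they contribute more tokens. The sequence-level mean gives every response the same weight regardless of length.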

verl.trainer.ppo.core_algos.compute_policy_loss(old_log_prob, log_prob, advantages, response_mask, cliprange=None, cliprange_low=None, cliprange_high=None, clip_ratio_c=3.0, loss_agg_mode: str = 'token-mean')

Compute the clipped policy objective and related metrics for PPO.

Adapted from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1122

Parameters:
  • old_log_prob (torch.Tensor) – Log-probabilities of actions under the old policy, shape (batch_size, response_length).

  • log_prob (torch.Tensor) – Log-probabilities of actions under the current policy, shape (batch_size, response_length).

  • advantages (torch.Tensor) – Advantage estimates for each action, shape (batch_size, response_length).

  • response_mask (torch.Tensor) – Mask indicating which tokens to include in the loss, shape (batch_size, response_length).

  • cliprange (float, optional) – Clipping parameter ε for standard PPO. See https://arxiv.org/abs/1707.06347. Defaults to None (must be provided).

  • cliprange_low (float, optional) – Lower clip range for dual-clip PPO. Defaults to same as cliprange.

  • cliprange_high (float, optional) – Upper clip range for dual-clip PPO. Defaults to same as cliprange.

  • clip_ratio_c (float, optional) – Lower bound of the ratio for dual-clip PPO. See https://arxiv.org/pdf/1912.09729. Defaults to 3.0.

  • loss_agg_mode (str, optional) – Aggregation mode for agg_loss. Defaults to “token-mean”.
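The clipping parameters above interact as in the following per-token sketch. It is written in plain Python rather than torch and is an illustration of the clipped and dual-clip objectives from the two cited papers, not a copy of the verl code. `clipped_token_loss` is a hypothetical name, and the 0.2 clip-range defaults are assumptions chosen for the example.

```python
import math


def clipped_token_loss(old_log_prob, log_prob, advantage,
                       cliprange_low=0.2, cliprange_high=0.2, clip_ratio_c=3.0):
    """Per-token PPO loss with asymmetric clipping and a dual-clip cap."""
    # Probability ratio between the current and the old policy.
    ratio = math.exp(log_prob - old_log_prob)
    clipped_ratio = min(max(ratio, 1.0 - cliprange_low), 1.0 + cliprange_high)

    # Standard PPO: take the more pessimistic (larger) of the two losses.
    loss = max(-advantage * ratio, -advantage * clipped_ratio)

    # Dual-clip: for negative advantages, cap the loss at -advantage * c.
    if advantage < 0:
        loss = min(loss, -advantage * clip_ratio_c)
    return loss


unchanged = clipped_token_loss(0.0, 0.0, 1.0)                 # ratio 1, nothing clipped
upper_clipped = clipped_token_loss(0.0, math.log(2.0), 1.0)   # ratio 2, clipped to 1.2
dual_clipped = clipped_token_loss(0.0, math.log(10.0), -1.0)  # ratio 10, capped by c = 3
```

In a batched setting, the per-token losses would then be reduced to a scalar by agg_loss using response_mask and loss_agg_mode.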

verl.trainer.ppo.core_algos.kl_penalty(logprob: FloatTensor, ref_logprob: FloatTensor, kl_penalty) → FloatTensor

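Compute the KL divergence given logprob and ref_logprob. Copied from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1104 (see http://joschu.net/blog/kl-approx.html for more background).

Parameters:
  • logprob (FloatTensor) – Log-probabilities to compare against the reference.

  • ref_logprob (FloatTensor) – Log-probabilities under the reference.

  • kl_penalty – Selects the KL estimator to use.

Returns:

The KL estimate.

Return type:

FloatTensor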
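The blog post referenced above describes three per-sample estimators of the KL divergence, usually called k1, k2 and k3. The sketch below computes all three for a single token in plain Python. `kl_estimates` and its dictionary keys are hypothetical names for illustration; which estimators kl_penalty actually exposes, and under what names, is not stated here.

```python
import math


def kl_estimates(logprob, ref_logprob):
    """Per-token KL estimators for a token sampled from the current policy."""
    log_ratio = logprob - ref_logprob  # log(pi(x) / pi_ref(x))

    k1 = log_ratio                                # unbiased, can be negative
    k2 = 0.5 * log_ratio ** 2                     # biased, low variance
    k3 = math.exp(-log_ratio) - 1.0 + log_ratio   # unbiased, never negative
    return {"k1": k1, "k2": k2, "k3": k3}


same = kl_estimates(-1.0, -1.0)     # identical log-probs: every estimate is 0
higher = kl_estimates(-1.0, -1.5)   # current policy assigns more probability
lower = kl_estimates(-2.0, -1.0)    # current policy assigns less probability
```

k1 is the plain log-ratio and can go negative on individual samples. k3 adds a correction term that keeps every sample non-negative while staying unbiased in expectation.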

verl.trainer.ppo.reward.compute_reward(data: DataProto, reward_fn)

Compute reward for a batch of data.

Parameters:
  • data (DataProto) – DataProto object containing the input data.

  • reward_fn – Reward function to compute the reward.

Returns:

Tuple of reward tensor and extra info dictionary.
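The return contract above (a reward plus an extra-info dictionary) can be illustrated with a small stand-in. This sketch does not use DataProto or torch, and it is not the verl implementation. The name `compute_reward_sketch`, the `"reward"` and `"extra_info"` keys, and the rule that a bare result gets an empty dictionary are all assumptions made for the example.

```python
def compute_reward_sketch(batch, reward_fn):
    """Normalize a reward function's result into a (rewards, extra_info) pair."""
    result = reward_fn(batch)
    if isinstance(result, dict):
        # Assumed convention: rewards and extra info live under fixed keys.
        return result["reward"], result.get("extra_info", {})
    # A bare result carries no extra info.
    return result, {}


def length_reward(batch):
    """Hypothetical reward function: score each response by its length."""
    return [float(len(response)) for response in batch]


def length_reward_with_info(batch):
    """Hypothetical reward function that also reports extra information."""
    scores = length_reward(batch)
    return {"reward": scores, "extra_info": {"max_length": max(scores)}}


batch = ["ok", "fine"]
plain_rewards, plain_info = compute_reward_sketch(batch, length_reward)
rich_rewards, rich_info = compute_reward_sketch(batch, length_reward_with_info)
```

Returning a two-element tuple in both cases lets the caller log extra information when it exists without branching on the type of reward function.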