Trainer Interface
Trainers drive the training loop. Adding a new trainer class is encouraged when a new training paradigm is introduced.
Core APIs
Utils for tokenization.
- verl.utils.tokenizer.hf_tokenizer(name_or_path, correct_pad_token=True, correct_gemma2=True, **kwargs)
Create a Hugging Face pretrained tokenizer that correctly handles the EOS and pad tokens.
- Parameters:
name_or_path (str) – The name or path of the tokenizer.
correct_pad_token (bool) – Whether to correct the pad token id.
correct_gemma2 (bool) – Whether to correct the gemma2 tokenizer.
- Returns:
The pretrained tokenizer.
- Return type:
transformers.PreTrainedTokenizer
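As a rough illustration of what "correcting" the pad token can mean, the sketch below reuses the EOS token whenever no pad token is set. This is an assumption about the kind of fix applied, not a copy of the library code: `fill_missing_pad_token` is a hypothetical helper, and the stand-in object only mimics the attribute names of a Hugging Face tokenizer.

```python
from types import SimpleNamespace


def fill_missing_pad_token(tokenizer):
    """Hypothetical helper: fall back to EOS when no pad token is set."""
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer


# Stand-in object (assumption) with the attribute names a tokenizer exposes.
tok = SimpleNamespace(eos_token="</s>", eos_token_id=2,
                      pad_token=None, pad_token_id=None)
tok = fill_missing_pad_token(tok)
```

Without a pad token, batched padding has no id to fill with, which is why a fallback of this kind is commonly applied before training.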
Core functions to implement PPO algorithms. The functions implemented in this file should be used by trainers with different distributed strategies to implement PPO.
- verl.trainer.ppo.core_algos.agg_loss(loss_mat: Tensor, loss_mask: Tensor, loss_agg_mode: str)
Aggregate the loss matrix into a scalar.
- Parameters:
loss_mat (torch.Tensor) – Loss matrix, shape (bs, response_length).
loss_mask (torch.Tensor) – Mask over the loss matrix, shape (bs, response_length).
loss_agg_mode (str) – Method used to aggregate the loss matrix into a scalar, for example "token-mean".
- Returns:
The aggregated loss, a scalar torch.Tensor.
- Return type:
torch.Tensor
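The aggregation can be pictured with a minimal sketch of one mode. It assumes that "token-mean" means the average of the loss over all unmasked tokens in the batch; `token_mean_sketch` is a hypothetical name, and the other aggregation modes are not shown.

```python
import torch


def token_mean_sketch(loss_mat: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
    """Average the loss over every unmasked token in the batch (assumed semantics)."""
    mask = loss_mask.to(loss_mat.dtype)
    # Guard against an all-zero mask to avoid dividing by zero.
    return (loss_mat * mask).sum() / mask.sum().clamp(min=1.0)


loss_mat = torch.tensor([[1.0, 2.0, 3.0],
                         [4.0, 5.0, 6.0]])
loss_mask = torch.tensor([[1, 1, 0],
                          [1, 0, 0]])
agg = token_mean_sketch(loss_mat, loss_mask)  # scalar tensor
```

Under this reading, long responses contribute more tokens and therefore more weight than short ones, which is the main difference from a per-sequence average.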
- verl.trainer.ppo.core_algos.compute_policy_loss(old_log_prob, log_prob, advantages, response_mask, cliprange=None, cliprange_low=None, cliprange_high=None, clip_ratio_c=3.0, loss_agg_mode: str = 'token-mean')
Compute the clipped policy objective and related metrics for PPO.
Adapted from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1122
- Parameters:
old_log_prob (torch.Tensor) – Log-probabilities of actions under the old policy, shape (batch_size, response_length).
log_prob (torch.Tensor) – Log-probabilities of actions under the current policy, shape (batch_size, response_length).
advantages (torch.Tensor) – Advantage estimates for each action, shape (batch_size, response_length).
response_mask (torch.Tensor) – Mask indicating which tokens to include in the loss, shape (batch_size, response_length).
cliprange (float, optional) – Clipping parameter ε for standard PPO. See https://arxiv.org/abs/1707.06347. Defaults to None (must be provided).
cliprange_low (float, optional) – Lower clip range for dual-clip PPO. Defaults to same as cliprange.
cliprange_high (float, optional) – Upper clip range for dual-clip PPO. Defaults to same as cliprange.
clip_ratio_c (float, optional) – Lower bound of the ratio for dual-clip PPO. See https://arxiv.org/pdf/1912.09729. Defaults to 3.0.
loss_agg_mode (str, optional) – Aggregation mode for agg_loss. Defaults to “token-mean”.
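The following is a minimal sketch of a dual-clip PPO objective built from the parameters above, not the library implementation. It assumes a single symmetric `cliprange`, assumes token-mean aggregation, and returns only the loss without the related metrics; `dual_clip_policy_loss_sketch` is a hypothetical name.

```python
import torch


def dual_clip_policy_loss_sketch(old_log_prob, log_prob, advantages, response_mask,
                                 cliprange=0.2, clip_ratio_c=3.0):
    # Probability ratio between the current and the old policy.
    ratio = torch.exp(log_prob - old_log_prob)

    # Standard PPO: take the more pessimistic of the unclipped and clipped terms.
    unclipped = -advantages * ratio
    clipped = -advantages * torch.clamp(ratio, 1.0 - cliprange, 1.0 + cliprange)
    ppo_loss = torch.maximum(unclipped, clipped)

    # Dual clip: for negative advantages, cap the loss at -A * c.
    capped = torch.minimum(ppo_loss, -advantages * clip_ratio_c)
    per_token = torch.where(advantages < 0, capped, ppo_loss)

    # Token-mean aggregation over unmasked tokens (assumed).
    mask = response_mask.to(per_token.dtype)
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)


old_lp = torch.zeros(1, 3)
new_lp = torch.log(torch.tensor([[1.0, 2.0, 5.0]]))  # ratios 1, 2, 5
adv = torch.tensor([[1.0, 1.0, -1.0]])
mask = torch.ones(1, 3)
loss = dual_clip_policy_loss_sketch(old_lp, new_lp, adv, mask)
```

In this example the second token is limited by the upper clip (ratio 2 is clamped to 1.2), and the third token, which has a negative advantage and a large ratio, is limited by `clip_ratio_c`.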
- verl.trainer.ppo.core_algos.kl_penalty(logprob: FloatTensor, ref_logprob: FloatTensor, kl_penalty) FloatTensor
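Compute KL divergence given logprob and ref_logprob. Copied from https://github.com/huggingface/trl/blob/main/trl/trainer/ppo_trainer.py#L1104. For more details, see http://joschu.net/blog/kl-approx.html
- Parameters:
logprob (FloatTensor) – Log-probabilities under the current policy.
ref_logprob (FloatTensor) – Log-probabilities under the reference policy.
- Returns:
The KL divergence estimate.
- Return type:
FloatTensor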
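The blog post linked above describes three per-sample estimators of the KL divergence, usually called k1, k2 and k3. The sketch below implements them from log-probabilities. The `estimator` argument and its values are hypothetical labels chosen for this illustration; they are not the accepted values of the `kl_penalty` argument, which this page does not list.

```python
import torch


def kl_estimate_sketch(logprob: torch.Tensor, ref_logprob: torch.Tensor,
                       estimator: str = "k1") -> torch.Tensor:
    """Per-token KL(policy || reference) estimates for tokens sampled from the policy."""
    log_ratio = ref_logprob - logprob  # log(p_ref / p_policy)
    if estimator == "k1":
        # Unbiased, but individual values can be negative.
        return -log_ratio
    if estimator == "k2":
        # Biased, low variance, always non-negative.
        return 0.5 * log_ratio ** 2
    if estimator == "k3":
        # Unbiased and always non-negative.
        return torch.exp(log_ratio) - 1.0 - log_ratio
    raise ValueError(f"unknown estimator: {estimator}")


logprob = torch.log(torch.tensor([0.5, 0.25]))
ref_logprob = torch.log(torch.tensor([0.25, 0.25]))
k1 = kl_estimate_sketch(logprob, ref_logprob, "k1")
k3 = kl_estimate_sketch(logprob, ref_logprob, "k3")
```

All three estimators return zero where the two policies assign the same probability, as in the second token of the example.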