Implement Reward Function for Dataset
For each dataset, we need to implement a reward function or utilize a reward model to compute the rewards for the generated responses. We have already pre-implemented some reward functions in the reward_score directory. You can also use customized reward functions.
Currently, we support reward functions for the GSM8k and MATH datasets. For RLHF datasets (e.g., full_hh_rlhf) and code generation (e.g., APPS), we utilize a reward model and SandBox (to be open-sourced soon) for evaluation, respectively.
RewardManager
In the entry point of the PPO post-training script, main_ppo.py, we implement a RewardManager that utilizes the pre-implemented reward functions to compute the score for each response.
In the RewardManager, we implement a __call__ function to compute the score for each response. All the reward functions are executed by compute_score_fn. The input is a DataProto, which includes:
- input_ids, attention_mask: input_ids and attention_mask after applying chat_template, including both prompt and response
- responses: response tokens
- ground_truth: the ground-truth string of the current prompt, stored in non_tensor_batch in the DataProto; it should be preprocessed into the parquet files
- data_source: the dataset name of the current prompt, stored in non_tensor_batch in the DataProto; it should be preprocessed into the parquet files
After the responses are detokenized, the response strings and the ground-truth strings are passed to compute_score_fn to compute the score for each response, as sketched below.
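The following is a minimal sketch of this pattern, not the actual RewardManager in main_ppo.py: the class name SimpleRewardManager is hypothetical, the field layout follows the DataProto description above, and details such as padding handling via attention_mask and per-dataset dispatch are omitted.

```python
import torch


class SimpleRewardManager:
    """Simplified sketch of a reward manager (hypothetical, not verl's actual class)."""

    def __init__(self, tokenizer, compute_score_fn):
        self.tokenizer = tokenizer
        self.compute_score_fn = compute_score_fn

    def __call__(self, data):
        # `data` stands in for a DataProto: `data.batch` holds tensors
        # (input_ids, attention_mask, responses) and `data.non_tensor_batch`
        # holds per-sample strings (ground_truth, data_source).
        responses = data.batch["responses"]  # shape: (batch_size, response_len)
        reward_tensor = torch.zeros_like(responses, dtype=torch.float32)

        for i in range(responses.shape[0]):
            # Detokenize the response tokens into a string.
            response_str = self.tokenizer.decode(responses[i], skip_special_tokens=True)
            ground_truth = data.non_tensor_batch["ground_truth"][i]
            data_source = data.non_tensor_batch["data_source"][i]

            # Dispatch to the dataset-specific reward function.
            score = self.compute_score_fn(data_source, response_str, ground_truth)

            # Place the scalar score at the final response position (sparse reward);
            # a real implementation would also account for padding via attention_mask.
            reward_tensor[i, -1] = score

        return reward_tensor
```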
Reward Functions
Pre-implemented
We have already pre-implemented some reward functions in the reward_score directory.
In the GSM8k example, we force the response to output the final answer after four hash marks (####), then use string matching to compare it with the ground truth. A completely correct answer scores 1 point; a correct format with a wrong answer scores 0.1 points; an incorrect format scores 0 points.
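A hedged sketch of this scoring rule, assuming the final answer is written as a plain number after "####"; the exact parsing in the reward_score implementation may differ:

```python
import re


def gsm8k_reward_sketch(solution_str, ground_truth, format_score=0.1, full_score=1.0):
    """Illustrative GSM8k-style scorer: 1.0 if correct, 0.1 if only the format is right, else 0.0."""
    # Find numbers written after "####", e.g. "#### 42", and keep the last one.
    matches = re.findall(r"#### *(-?[0-9][0-9,\.]*)", solution_str)
    if not matches:
        return 0.0  # no recognizable "#### <answer>" format
    answer = matches[-1].replace(",", "").rstrip(".")
    return full_score if answer == ground_truth else format_score
```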
In the MATH example, we follow the implementation in the lm-evaluation-harness repository.
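As an illustration only, the core idea of that style of evaluation is to extract the final \boxed{...} answer and compare it with the ground truth; the real implementation performs much more normalization (fractions, units, LaTeX spacing, etc.). The helper names below are hypothetical:

```python
def extract_boxed_answer(solution_str):
    """Hypothetical helper: return the content of the last \\boxed{...} in the solution."""
    idx = solution_str.rfind("\\boxed{")
    if idx == -1:
        return None
    start = idx + len("\\boxed{")
    depth = 0
    # Walk forward to find the brace that closes \boxed{.
    for pos in range(start, len(solution_str)):
        ch = solution_str[pos]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                return solution_str[start:pos]
            depth -= 1
    return None


def math_reward_sketch(solution_str, ground_truth):
    """Illustrative MATH-style scorer: 1.0 if the extracted answer matches the ground truth string."""
    answer = extract_boxed_answer(solution_str)
    if answer is None:
        return 0.0
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0
```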
Customized
You can implement customized reward functions in a separate file and specify them using custom_reward_function.path and custom_reward_function.name. For how to set them, please refer to the Config Explanation.
The parameters of your reward function should be data_source, solution_str, ground_truth, and extra_info.
For example:
```python
def my_reward_fn(data_source, solution_str, ground_truth, extra_info=None):
    return len(solution_str) / 100
```
If you are testing only a single customized reward function, you can simply name it compute_score and leave custom_reward_function.name unset.
To run multiple tests with different customized reward functions, you can modify both custom_reward_function.path and custom_reward_function.name for each trial.
For instance, you might create a single my_reward.py file and implement multiple reward functions within it. This way, for different trials you only need to adjust custom_reward_function.name, making it more convenient to conduct multiple tests within scripts, as sketched below.
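A hypothetical my_reward.py illustrating this pattern (both function names and scoring rules are placeholders):

```python
# my_reward.py -- several reward functions kept in one file so that trials
# only need to switch custom_reward_function.name.

def length_reward(data_source, solution_str, ground_truth, extra_info=None):
    """Toy reward: favor longer responses."""
    return len(solution_str) / 100


def exact_match_reward(data_source, solution_str, ground_truth, extra_info=None):
    """Toy reward: 1.0 if the ground-truth string appears in the response, else 0.0."""
    return 1.0 if ground_truth in solution_str else 0.0
```

One trial would point custom_reward_function.path at my_reward.py and set custom_reward_function.name to length_reward; the next trial would only change custom_reward_function.name to exact_match_reward.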