Algorithm Baselines

Math-related datasets

The results below assume the GSM8k/MATH datasets have been preprocessed via:

python3 examples/data_preprocess/*.py

Refer to the table below to reproduce RL training from different pre-trained checkpoints. Scores are reported on the GSM8k dataset unless otherwise specified. More comprehensive benchmark results are available in the recipe folder.

Hardware	Model	Method	Test score	Details
NVIDIA GPU	google/gemma-2-2b-it	hf checkpoint	23.9	Huggingface
NVIDIA GPU	google/gemma-2-2b-it	SFT	52.06	command and logs
NVIDIA GPU	google/gemma-2-2b-it	SFT + PPO	64.02	command and logs, wandb
NVIDIA GPU	Qwen/Qwen2.5-0.5B-Instruct	hf checkpoint	36.4	Qwen blog
NVIDIA GPU	Qwen/Qwen2.5-0.5B-Instruct	PPO	56.7	command and log
NVIDIA GPU	Qwen/Qwen2.5-0.5B-Instruct	PRIME	58.7	script, wandb
NVIDIA GPU	deepseek-ai/deepseek-llm-7b-chat	PPO (Megatron)	69.5 [1]	log, wandb
NVIDIA GPU	Qwen/Qwen2-7B-Instruct	GRPO	89	script
NVIDIA GPU	Qwen/Qwen2-7B-Instruct	GRPO (FSDP2)	89.8	log
NVIDIA GPU	Qwen/Qwen2-7B-Instruct	GRPO (Megatron)	89.6	log
NVIDIA GPU	Qwen/Qwen2.5-7B-Instruct	ReMax	97	script, wandb
NVIDIA GPU	Qwen/Qwen2.5-7B-Instruct	SPPO	65.6 (MATH)	SPPO script
NVIDIA GPU	Mixtral-8x22B-Instruct-v0.1	Instruct model	83.7	Qwen Blog
NVIDIA GPU	Mixtral-8x22B-Instruct-v0.1	RLOO (Megatron)	92.3	wandb
NVIDIA GPU	Qwen/Qwen2.5-7B-Instruct	SPIN	92	script
AMD MI300	deepseek-ai/deepseek-llm-7b-chat	PPO	70.5 [1]	log
AMD MI300	deepseek-ai/deepseek-llm-7b-chat	GRPO	71.4 [1]	log

Coding-related datasets

Results below are on LeetCode unless otherwise specified.

Hardware	Model	Method	Test score	Details
NVIDIA GPU	PRIME-RL/Eurus-2-7B-SFT	PRIME	36.1	script, swanlab

Notes

[1] During evaluation, we only extracted answers that follow the "####" format. More flexible answer extraction, a longer response length, and better prompt engineering may lead to a higher score.
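
To illustrate what strict "####" extraction means in practice, here is a minimal sketch. It is not the evaluation code behind the scores above. The function names `extract_answer` and `is_correct`, the regular expression, and the comma/period normalization are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed pattern: "####", optional whitespace, then a number that may
# carry a leading minus sign, thousands separators, or a decimal point.
_ANSWER_RE = re.compile(r"####\s*(-?[0-9][0-9.,]*)")


def extract_answer(response: str) -> Optional[str]:
    """Return the last '####'-prefixed number in the response, or None."""
    matches = _ANSWER_RE.findall(response)
    if not matches:
        # No '####' marker followed by a number: strict extraction gives up.
        return None
    # Normalize: drop thousands separators and any trailing sentence period.
    return matches[-1].replace(",", "").rstrip(".")


def is_correct(response: str, ground_truth: str) -> bool:
    """Exact string match between the extracted answer and the reference."""
    return extract_answer(response) == ground_truth
```

Under this sketch, a response such as "The answer is 42" yields `None` and is scored as incorrect even when the number is right, which is the kind of miss that more flexible extraction could recover.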

[2] Since verl 0.3.x (2025-05-30), the default value of actor_rollout_ref.actor.entropy_coeff is 0.0, which differs from previous versions.