Using Checkpoints to Support Fault Tolerance Training
Training errors or machine failures can occur at any point during a long RLHF run, so it is recommended to enable checkpointing to minimize the loss of progress.
The API interface has already been listed in the Config Explanation, so we will not repeat it here. However, there are some technical details we would like to clarify.
Note

The checkpoint.contents field has no effect on FSDP checkpoints except for hf_model; the other three fields are bound together for saving and loading. We recommend including all of model, optimizer and extra.
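For reference, the field might look like the following in ppo_trainer.yaml (a sketch; the exact nesting of the checkpoint block depends on your verl version):

checkpoint:
  contents: ['model', 'optimizer', 'extra']  # add 'hf_model' to also export a HuggingFace copy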
Checkpoint Saving Directory Structure
Commonly, we use the default_local_dir declared in ppo_trainer.yaml or ppo_megatron_trainer.yml as the prefix when saving checkpoints, which is checkpoints/${trainer.project_name}/${trainer.experiment_name}.
So the FSDP checkpoint structure looks like this:
checkpoints/${trainer.project_name}/${trainer.experiment_name}
├── global_steps_${i}
│   ├── actor
│   │   ├── model_world_size_{self.world_size}_rank_{self.rank}.pt
│   │   ├── optim_world_size_{self.world_size}_rank_{self.rank}.pt
│   │   └── extra_state_world_size_{self.world_size}_rank_{self.rank}.pt
│   ├── actor_huggingface
│   ├── critic
│   │   ├── model_world_size_{self.world_size}_rank_{self.rank}.pt
│   │   ├── optim_world_size_{self.world_size}_rank_{self.rank}.pt
│   │   └── extra_state_world_size_{self.world_size}_rank_{self.rank}.pt
│   └── critic_huggingface
└── latest_checkpointed_iteration.txt
All model shards, optimizer states and extra states are stored together, in a sharded and distributed way.
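Each shard is a regular .pt file and can be inspected with torch.load. Below is a minimal sketch, assuming a hypothetical run with world size 8 (the exact contents of a shard depend on the verl and PyTorch versions):

import torch

ckpt_dir = "checkpoints/my_project/my_experiment/global_steps_100/actor"  # hypothetical path
shard = torch.load(
    f"{ckpt_dir}/model_world_size_8_rank_0.pt",
    map_location="cpu",    # inspect without allocating GPU memory
    weights_only=False,    # shards may contain non-tensor state
)
print(list(shard.keys())[:5])  # peek at the first few entries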
The current Megatron checkpoint structure is:
checkpoints/${trainer.project_name}/${trainer.experiment_name}
├── global_steps_${i}
│   ├── actor
│   │   ├── huggingface    # tokenizer saved by default; HuggingFace model saved if ``hf_model`` is in checkpoint.contents
│   │   ├── model          # sharded model, named the same way as in Megatron
│   │   │   ├── mp_rank_xx_yyy    # xx is tp_rank in 2 digits, yyy is pp_rank in 3 digits
│   │   │   │   └── model_states.pt
│   │   │   └── mp_rank_xx_xxx
│   │   ├── optim
│   │   │   └── distrib_optim_pp{a}_tp{b}_cp{c}_dp{d}.pt
│   │   └── rng_states
│   └── critic
│       ├── huggingface
│       ├── model
│       ├── optim
│       └── rng_states
└── latest_checkpointed_iteration.txt
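Both layouts end with latest_checkpointed_iteration.txt, which records the most recently saved step so resume logic can locate the newest global_steps_${i} directory. A minimal sketch of resolving it, assuming the file holds only the step number as plain text (paths are hypothetical):

import os

ckpt_root = "checkpoints/my_project/my_experiment"  # hypothetical prefix
with open(os.path.join(ckpt_root, "latest_checkpointed_iteration.txt")) as f:
    latest_step = int(f.read().strip())  # assumes the file stores just the step number
latest_ckpt = os.path.join(ckpt_root, f"global_steps_{latest_step}")
print(latest_ckpt)  # e.g. checkpoints/my_project/my_experiment/global_steps_100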
Convert FSDP and Megatron Checkpoints to HuggingFace Format Model
We provide a tool to convert FSDP and Megatron checkpoints into a HuggingFace format model. The tool is located at scripts/model_merger.py.
The script supports two main sub-commands: merge (to convert and save checkpoints) and test (to validate merged checkpoints against a reference model). The arguments for the merge sub-command are as follows:
usage: model_merger.py merge [-h] --backend {fsdp,megatron} --local_dir LOCAL_DIR [--hf_model_path HF_MODEL_PATH]
[--tie-word-embedding] [--is-value-model] [--target_dir TARGET_DIR]
[--hf_upload_path HF_UPLOAD_PATH] [--private]
options:
-h, --help show this help message and exit
--backend {fsdp,megatron}
The backend of the model
--local_dir LOCAL_DIR
Path to the saved model checkpoints
--hf_model_path HF_MODEL_PATH
(Deprecated) Path to the original Hugging Face model for config.
--tie-word-embedding Whether to tie word embedding weights (currently only Megatron supported)
--is-value-model Whether the model is a value model (currently only Megatron supported)
--target_dir TARGET_DIR
Directory to save the merged huggingface model
--hf_upload_path HF_UPLOAD_PATH
Hugging Face repository ID to upload the model
--private Whether to upload the model to a private Hugging Face repository
Example usage for merging Megatron checkpoints:
python scripts/model_merger.py merge \
--backend megatron \
--tie-word-embedding \
--local_dir checkpoints/verl_megatron_gsm8k_examples/qwen2_5_0b5_megatron_saveload/global_step_1/actor \
--target_dir /path/to/merged_hf_model
Example usage for merging FSDP checkpoints:
python scripts/model_merger.py merge \
--backend fsdp \
--local_dir checkpoints/verl_fsdp_gsm8k_examples/qwen2_5_0b5_fsdp_saveload/global_step_1/actor \
--target_dir /path/to/merged_hf_model
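Once merged, the output directory can be loaded like any other HuggingFace model. A quick check (this assumes the tokenizer files were saved alongside the weights; otherwise load the tokenizer from the original model path):

from transformers import AutoModelForCausalLM, AutoTokenizer

merged_dir = "/path/to/merged_hf_model"  # the --target_dir from the merge step
model = AutoModelForCausalLM.from_pretrained(merged_dir)
tokenizer = AutoTokenizer.from_pretrained(merged_dir)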
Megatron Merger details
The current implementation of the decoder layers uses nn.ModuleList to store the layers, so the model layers on every PP rank and VPP rank start their indices from 0. There are three ways to correct this behavior:

1. Modify the decoder layer's state_dict and add an offset to each layer's index, which means rewriting the nn.ModuleList implementation.
2. Modify the layer indices when saving the checkpoint and restore them when loading it.
3. Let the checkpoint merger do this work, calculating the actual offset from the state_dict alone, which is a little complex.

The current implementation uses solution 2.
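To illustrate solution 2 with a simplified sketch (not verl's actual code; it assumes Megatron-style keys such as decoder.layers.<i>.<param>), the save path shifts each local layer index by the rank's global offset and the load path applies the inverse shift:

def shift_layer_indices(state_dict, offset):
    # Rename keys like "decoder.layers.<local_idx>.<param>" by adding `offset`
    # to the layer index; keys without a numeric layer index pass through unchanged.
    shifted = {}
    for key, value in state_dict.items():
        parts = key.split(".")
        if "layers" in parts:
            idx = parts.index("layers") + 1
            if idx < len(parts) and parts[idx].isdigit():
                parts[idx] = str(int(parts[idx]) + offset)
        shifted[".".join(parts)] = value
    return shifted

# Saving: a PP rank whose first local layer is global layer 12 uses offset=12;
# loading the same shard applies offset=-12 to recover local indices.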
HuggingFace to Megatron DistCheckpoint details
If your model is quite large, we recommend using the Megatron dist-checkpoint to load it. Megatron dist-checkpoint supports loading under different kinds of model parallelism and is much faster than the original checkpoint loading.

To convert an original HuggingFace model to a Megatron dist-checkpoint, you can use the scripts/converter_hf_to_mcore.py script. Large MoE models are temporarily supported through CPU initialization, which is a little slower, while we work on a better solution for large models.
Example command to convert the model is as follows:
python scripts/converter_hf_to_mcore.py \
--hf_model_path Qwen/Qwen1.5-MoE-A2.7B-Chat \
--output_path /mnt/disk/Qwen/Qwen1.5-MoE-A2.7B-Chat \
    --use_cpu_initialization # Only works for MoE models
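For dense (non-MoE) models, simply omit the flag; for example (the model and output paths below are placeholders):

python scripts/converter_hf_to_mcore.py \
    --hf_model_path Qwen/Qwen2.5-0.5B-Instruct \
    --output_path /mnt/disk/Qwen/Qwen2.5-0.5B-Instruct-mcore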
Original Checkpoint Utils
Original Checkpoint Utils refers to the original checkpoint implementation in verl/models/[model]/megatron/checkpoint_utils.
We now only need [model]_loader.py from the original checkpoint utils, since we no longer store hf_model on every save (storing it is not recommended for large model training; save only sharded models if you can).
Note
Note that [model]_loader only supports environments where the storage cluster can be reached from every compute node, because it uses sharded loading to minimize the checkpoint loading overhead: every rank loads its own shard from a state_dict that all ranks can access. There is also no need to broadcast among DP ranks, since the saved state_dict is produced by DP rank 0 only.
For users who can only place the HuggingFace model on one device, we keep the original, costly implementation in [model]_loader_deprecated. In that implementation, rank 0 broadcasts all weights to each TP and PP rank, and then DP rank 0 broadcasts to all DP ranks. This carries a risk of OOM.
To use the deprecated loader, change the import path of load_state_dict_to_megatron_llama.
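For a Llama model, for example, this amounts to switching the loader import (a sketch; the exact module path may differ across verl versions):

# Default sharded loader:
# from verl.models.llama.megatron.checkpoint_utils.llama_loader import load_state_dict_to_megatron_llama

# Deprecated broadcast-based loader:
from verl.models.llama.megatron.checkpoint_utils.llama_loader_deprecated import load_state_dict_to_megatron_llama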