Add models with the Megatron-LM backend
Model
If you use the latest verl, GPTModel is directly supported for the Megatron backend, so you can add custom models in much the same way you would when pretraining them with Megatron. The steps are:
1. Find model_initializer.py.
2. If your model is configurable by a TransformerLayerSpec, you can use GPTModel directly. Otherwise, please implement a new ModelLayerSpec and ModelLayer there.
3. Use the right LayerSpec, TransformerConfig, and HuggingfaceConfig as arguments to initialize the GPTModel.
4. Finally, return the model (a minimal sketch of such an initializer follows these steps).
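As a reference, here is a minimal sketch of such an initializer, assuming megatron.core with Transformer Engine is installed; the function name initialize_gpt_model and the hf_config argument are illustrative, not verl's actual interface:

```python
# A minimal sketch of a GPTModel initializer, assuming megatron.core with
# Transformer Engine is installed. The function name `initialize_gpt_model`
# and the `hf_config` argument are illustrative, not verl's actual interface.
from megatron.core.models.gpt import GPTModel
from megatron.core.models.gpt.gpt_layer_specs import (
    get_gpt_layer_with_transformer_engine_spec,
)
from megatron.core.transformer.transformer_config import TransformerConfig


def initialize_gpt_model(tf_config: TransformerConfig, hf_config,
                         pre_process: bool = True, post_process: bool = True):
    # The layer spec is the main extension point: if the stock GPT layer does
    # not fit your architecture, plug in your own ModuleSpec here instead.
    layer_spec = get_gpt_layer_with_transformer_engine_spec()

    model = GPTModel(
        config=tf_config,
        transformer_layer_spec=layer_spec,
        vocab_size=hf_config.vocab_size,
        max_sequence_length=hf_config.max_position_embeddings,
        pre_process=pre_process,    # first pipeline stage owns the embeddings
        post_process=post_process,  # last pipeline stage owns the LM head
        share_embeddings_and_output_weights=getattr(
            hf_config, "tie_word_embeddings", False
        ),
        position_embedding_type="rope",  # Llama-style rotary embeddings
    )
    return model
```

The layer spec is where a custom architecture hooks in: models that the stock GPT decoder layer cannot express supply their own spec and layer implementation instead.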
Add models with an old version of verl
The most challenging aspect of using the Megatron-LM backend is implementing the models for training. Currently, we implement a Llama model that supports data parallelism, tensor parallelism, pipeline parallelism (including vPP), and sequence parallelism. We also implement remove padding (sequence packing) for the Llama model, which can be found in modeling_llama_megatron.py.
To support other models, users are required to implement:

1. A model similar to modeling_llama_megatron.py that satisfies the parallelism requirements of Megatron-LM. Then register your model in registry.py (a registration sketch follows this list).
2. Checkpoint utils that can load a full checkpoint (e.g., a HuggingFace checkpoint) into the partitioned models at runtime. Then register your loader to weight_loader_registry in weight_loader_registry.py.
3. A weight loader that synchronizes the weights from the Megatron model to the rollout (vLLM) model. Note that both the actor model and the rollout model are partitioned at runtime, so it is advisable to match the parameter names in the actor model implementation; otherwise, you may need an additional name mapping and even a weight transformation. The weight loader implementation is in megatron_weight_loaders.py.
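To make the registration pattern concrete, here is a minimal sketch assuming dict-based registries keyed by the HuggingFace architecture name; every name defined below (MODEL_REGISTRY, WEIGHT_LOADER_REGISTRY, register_model, register_weight_loader, load_hf_state_dict_to_megatron) is illustrative, not verl's actual interface:

```python
# A minimal sketch of both registration steps, assuming dict-based registries
# keyed by the HuggingFace architecture name. Every name defined here is
# illustrative, not verl's actual interface.
from typing import Callable, Dict

import torch
import torch.nn as nn

# registry.py: HuggingFace architecture name -> Megatron-parallel model class.
MODEL_REGISTRY: Dict[str, type] = {}

def register_model(arch: str, cls: type) -> None:
    MODEL_REGISTRY[arch] = cls

# weight_loader_registry.py: architecture name -> full-checkpoint loader.
WEIGHT_LOADER_REGISTRY: Dict[str, Callable] = {}

def register_weight_loader(arch: str, fn: Callable) -> None:
    WEIGHT_LOADER_REGISTRY[arch] = fn

def load_hf_state_dict_to_megatron(
    hf_state_dict: Dict[str, torch.Tensor], model: nn.Module
) -> None:
    """Copy tensors from a full (unpartitioned) HuggingFace checkpoint into a
    partitioned Megatron model. A real loader must also slice each tensor
    along its tensor-parallel dimension and skip layers that belong to other
    pipeline stages; this sketch only handles name-matched full copies."""
    own_params = dict(model.named_parameters())
    for name, full_tensor in hf_state_dict.items():
        if name in own_params:  # assumes HF-compatible parameter names
            with torch.no_grad():
                own_params[name].copy_(full_tensor)

register_weight_loader("LlamaForCausalLM", load_hf_state_dict_to_megatron)
```

Keeping the Megatron parameter names aligned with the HuggingFace/vLLM ones, as item 3 suggests, lets the rollout-side loader in megatron_weight_loaders.py iterate over named parameters directly instead of maintaining a separate mapping table.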