Megatron Lite backend
Last updated: 06/17/2026.
Megatron Lite (mlite) is Megatron’s experimental, agent-friendly training
path for work that needs to move quickly. It is optimized for fast iteration,
small reviewable changes, and agentic development: model/runtime code can be
changed without touching unrelated Megatron subsystems, and new experiments can
live in their own source checkout instead of being copied into the verl tree.
The verl integration intentionally keeps the backend glue outside this
repository. The mlite checkout provides megatron.lite and the
verl_mlite launcher/config package used by the example scripts here. Put
custom extensions in your own code path, add that path through MLITE_ROOT or
PYTHONPATH, and keep verl focused on orchestration. See the upstream
Megatron Lite path at
NVIDIA/Megatron-LM experimental/lite.
For the dist_opt optimizer path, Megatron Lite is intended to preserve
Megatron-Core behavior rather than trade correctness for flexibility. In
deterministic runs, the mlite path has been validated against the
Megatron-Core distributed optimizer path with bitwise-aligned loss and gradient
norms, and its step time / throughput are also aligned with the Core path.
Install the backend
Clone Megatron-LM’s upstream dev branch and install its Megatron Lite verl
integration:
git clone -b dev https://github.com/NVIDIA/Megatron-LM.git
pip install -e Megatron-LM/experimental/lite/examples/verl
Alternatively, keep the checkout outside the Python environment and set
MLITE_ROOT when running a launcher. The scripts add both
$MLITE_ROOT/experimental/lite and
$MLITE_ROOT/experimental/lite/examples/verl to PYTHONPATH.
Run an example
The DeepSeek-V4 examples use the mlite engine for training and vLLM for
rollout where applicable:
MODEL_PATH=/path/to/deepseek-v4 \
MLITE_ROOT=/path/to/mlite \
OPTIMIZER=fsdp2 \
bash examples/sft/gsm8k/run_deepseek_v4_megatron_lite.sh
MODEL_PATH=/path/to/deepseek-v4 \
MLITE_ROOT=/path/to/mlite \
OPTIMIZER=fsdp2 \
bash examples/grpo_trainer/run_deepseek_v4_megatron_lite.sh
OPTIMIZER accepts dist_opt for the vanilla Megatron distributed
optimizer and fsdp2 for the Megatron Lite FSDP2 wrapper. The DeepSeek-V4
launchers default to a 128-GPU mesh with PP4, EP8, CP4, full activation
recompute, and fsdp2.
Further reading
For a practical discussion of long-sequence MoE RL tuning with Megatron Lite, including memory, recompute, communication overlap, and FSDP2 trade-offs, see Making Long-Context MoE RL Training Easier to Tune.
DeepSeek-V4 DSA note
DeepSeek-V4 uses fused DSA kernels on Hopper and Blackwell GPUs. In addition to
the normal verl runtime, the critical DSA-only dependencies are
nvidia-cutlass-dsl==4.5.2 and nvidia-cudnn-frontend. The
nvidia-cudnn-frontend 1.24.1 release is sufficient for Blackwell, while
Hopper still needs a develop-branch build with IndexerForwardSm90 support.