Megatron Lite backend

Last updated: 06/17/2026.

Megatron Lite (mlite) is Megatron’s experimental, agent-friendly training path for work that needs to move quickly. It is optimized for fast iteration, small reviewable changes, and agentic development: model/runtime code can be changed without touching unrelated Megatron subsystems, and new experiments can live in their own source checkout instead of being copied into the verl tree.

The verl integration intentionally keeps the backend glue outside this repository. The mlite checkout provides megatron.lite and the verl_mlite launcher/config package used by the example scripts here. Put custom extensions in your own code path, add that path through MLITE_ROOT or PYTHONPATH, and keep verl focused on orchestration. The upstream Megatron Lite code lives under experimental/lite in NVIDIA/Megatron-LM.

For the dist_opt optimizer path, Megatron Lite is intended to preserve Megatron-Core behavior rather than trade correctness for flexibility. In deterministic runs, the mlite path has been validated against the Megatron-Core distributed optimizer path: loss and gradient norms are bitwise-aligned, and step time and throughput match the Core path.

Install the backend

Clone Megatron-LM’s upstream dev branch and install its Megatron Lite verl integration:

git clone -b dev https://github.com/NVIDIA/Megatron-LM.git
pip install -e Megatron-LM/experimental/lite/examples/verl

Alternatively, keep the checkout outside the Python environment and set MLITE_ROOT to the root of the Megatron-LM checkout when running a launcher. The scripts then add both $MLITE_ROOT/experimental/lite and $MLITE_ROOT/experimental/lite/examples/verl to PYTHONPATH.
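To run code outside the launchers, the same path setup can be reproduced by hand. This is a minimal sketch that assumes the launchers simply prepend the two directories named above; /path/to/mlite is a placeholder, and the MLITE_PATHS variable is introduced here only for illustration.

```shell
# Placeholder checkout location; replace with your own path.
MLITE_ROOT="${MLITE_ROOT:-/path/to/mlite}"

# The two directories the launchers are described as adding (illustrative variable).
MLITE_PATHS="$MLITE_ROOT/experimental/lite:$MLITE_ROOT/experimental/lite/examples/verl"

# Prepend them, keeping any existing PYTHONPATH entries after a separator.
export PYTHONPATH="$MLITE_PATHS${PYTHONPATH:+:$PYTHONPATH}"

echo "$PYTHONPATH"
```

Prepending rather than appending lets the checkout take precedence over any other installed copy of the same packages, which is usually what you want when iterating on a local checkout.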

Run an example

The DeepSeek-V4 examples use the mlite engine for training and vLLM for rollout where applicable:

# SFT on GSM8K
MODEL_PATH=/path/to/deepseek-v4 \
MLITE_ROOT=/path/to/mlite \
OPTIMIZER=fsdp2 \
bash examples/sft/gsm8k/run_deepseek_v4_megatron_lite.sh

# GRPO
MODEL_PATH=/path/to/deepseek-v4 \
MLITE_ROOT=/path/to/mlite \
OPTIMIZER=fsdp2 \
bash examples/grpo_trainer/run_deepseek_v4_megatron_lite.sh

OPTIMIZER accepts dist_opt for the vanilla Megatron distributed optimizer and fsdp2 for the Megatron Lite FSDP2 wrapper. The DeepSeek-V4 launchers default to a 128-GPU mesh with pipeline parallelism 4 (PP4), expert parallelism 8 (EP8), context parallelism 4 (CP4), full activation recompute, and the fsdp2 optimizer.
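Because only two OPTIMIZER values are documented, a typo is cheap to catch before a large job is queued. The following is a hypothetical pre-flight check, not part of the shipped launchers; the function name check_optimizer is an assumption made for illustration.

```shell
# Hypothetical helper: accept only the two documented OPTIMIZER values.
check_optimizer() {
  case "$1" in
    dist_opt|fsdp2) return 0 ;;
    *)
      echo "unsupported OPTIMIZER: '$1' (expected dist_opt or fsdp2)" >&2
      return 1
      ;;
  esac
}

# Fall back to fsdp2, mirroring the documented launcher default.
OPTIMIZER="${OPTIMIZER:-fsdp2}"

if check_optimizer "$OPTIMIZER"; then
  echo "OPTIMIZER=$OPTIMIZER"
fi
```

A wrapper script could call check_optimizer first and invoke one of the example launchers only when it returns success.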

Further reading

For a practical discussion of long-sequence MoE RL tuning with Megatron Lite, including memory, recompute, communication overlap, and FSDP2 trade-offs, see Making Long-Context MoE RL Training Easier to Tune.

DeepSeek-V4 DSA note

DeepSeek-V4 uses fused DSA kernels on Hopper and Blackwell GPUs. On top of the normal verl runtime, the DSA path requires two additional dependencies: nvidia-cutlass-dsl==4.5.2 and nvidia-cudnn-frontend. The nvidia-cudnn-frontend 1.24.1 release is sufficient for Blackwell, while Hopper still needs a develop-branch build with IndexerForwardSm90 support.
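Before launching, it can help to confirm which versions of the two DSA packages are present in the active environment. This is a minimal sketch that assumes python3 and pip are available; the function name report_dsa_deps is hypothetical, and the check only reports versions without verifying Hopper IndexerForwardSm90 support.

```shell
# Hypothetical helper: print the installed version of each DSA dependency.
report_dsa_deps() {
  for pkg in nvidia-cutlass-dsl nvidia-cudnn-frontend; do
    # pip show prints a "Version: X" line for installed packages; empty otherwise.
    version="$(python3 -m pip show "$pkg" 2>/dev/null | sed -n 's/^Version: //p')" || true
    echo "$pkg: ${version:-not installed}"
  done
}

report_dsa_deps

# To install the pinned release named above (left commented out on purpose):
# python3 -m pip install "nvidia-cutlass-dsl==4.5.2" nvidia-cudnn-frontend
```

Compare the reported versions against the requirements above; on Hopper, a released nvidia-cudnn-frontend wheel may still lack the required kernel.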