Megatron Lite backend
=====================

Last updated: 06/17/2026.

Megatron Lite (``mlite``) is Megatron's experimental, agent-friendly training path for work that needs to move quickly. It is optimized for fast iteration, small reviewable changes, and agentic development: model/runtime code can be changed without touching unrelated Megatron subsystems, and new experiments can live in their own source checkout instead of being copied into the verl tree.

The verl integration intentionally keeps the backend glue outside this repository. The ``mlite`` checkout provides ``megatron.lite`` and the ``verl_mlite`` launcher/config package used by the example scripts here. Put custom extensions in your own code path, add that path through ``MLITE_ROOT`` or ``PYTHONPATH``, and keep verl focused on orchestration.

See the upstream Megatron Lite path at `NVIDIA/Megatron-LM experimental/lite `_.

For the ``dist_opt`` optimizer path, Megatron Lite is intended to preserve Megatron-Core behavior rather than trade correctness for flexibility. In deterministic runs, the ``mlite`` path has been validated against the Megatron-Core distributed optimizer path with bitwise-aligned loss and gradient norms, and its step time / throughput are also aligned with the Core path.

Install the backend
-------------------

Clone Megatron-LM's upstream ``dev`` branch and install its Megatron Lite verl integration:

.. code-block:: bash

    git clone -b dev https://github.com/NVIDIA/Megatron-LM.git
    pip install -e Megatron-LM/experimental/lite/examples/verl

Alternatively, keep the checkout outside the Python environment and set ``MLITE_ROOT`` when running a launcher. The scripts add both ``$MLITE_ROOT/experimental/lite`` and ``$MLITE_ROOT/experimental/lite/examples/verl`` to ``PYTHONPATH``.

Run an example
--------------

The DeepSeek-V4 examples use the ``mlite`` engine for training and vLLM for rollout where applicable:

.. code-block:: bash

    MODEL_PATH=/path/to/deepseek-v4 \
    MLITE_ROOT=/path/to/mlite \
    OPTIMIZER=fsdp2 \
    bash examples/sft/gsm8k/run_deepseek_v4_megatron_lite.sh

.. code-block:: bash

    MODEL_PATH=/path/to/deepseek-v4 \
    MLITE_ROOT=/path/to/mlite \
    OPTIMIZER=fsdp2 \
    bash examples/grpo_trainer/run_deepseek_v4_megatron_lite.sh

``OPTIMIZER`` accepts ``dist_opt`` for the vanilla Megatron distributed optimizer and ``fsdp2`` for the Megatron Lite FSDP2 wrapper. The DeepSeek-V4 launchers default to a 128-GPU mesh with PP4, EP8, CP4, full activation recompute, and ``fsdp2``.

Further reading
---------------

For a practical discussion of long-sequence MoE RL tuning with Megatron Lite, including memory, recompute, communication overlap, and FSDP2 trade-offs, see `Making Long-Context MoE RL Training Easier to Tune `_.

DeepSeek-V4 DSA note
--------------------

DeepSeek-V4 uses fused DSA kernels on Hopper and Blackwell GPUs. In addition to the normal verl runtime, the critical DSA-only dependencies are ``nvidia-cutlass-dsl==4.5.2`` and ``nvidia-cudnn-frontend``. The ``nvidia-cudnn-frontend`` 1.24.1 release is sufficient for Blackwell, while Hopper still needs a develop-branch build with ``IndexerForwardSm90`` support.
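
Before launching a DeepSeek-V4 run, it may help to confirm which versions of the two DSA-only dependencies are installed. The following is a minimal sketch, not part of the ``mlite`` launchers: ``pkg_version`` is a hypothetical helper name introduced here, and it relies only on ``pip show``. A version number alone cannot confirm ``IndexerForwardSm90`` support, so on Hopper the develop-branch build still has to be verified separately.

.. code-block:: bash

    # Hypothetical helper: print the installed version of a pip package,
    # or "missing" if pip cannot find it.
    pkg_version() {
        local v
        v=$(python3 -m pip show "$1" 2>/dev/null | awk '/^Version:/ {print $2}') || true
        echo "${v:-missing}"
    }

    # Compare the reported versions against the requirements in the note above.
    echo "nvidia-cutlass-dsl:    $(pkg_version nvidia-cutlass-dsl) (pinned: 4.5.2)"
    echo "nvidia-cudnn-frontend: $(pkg_version nvidia-cudnn-frontend) (1.24.1 is sufficient for Blackwell)"

If either line reports ``missing``, install that package before starting the launcher.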