LoongForge provides a training framework for large-scale transformer models that handle language, vision-language, vision-language-action, and diffusion tasks. Developed by Baidu's Baige AI infrastructure platform, it forms part of the open-source Loong family, which includes LoongFlow. The framework addresses inefficiencies in training pipelines for multimodal and embodied models by building on Megatron-LM with targeted enhancements. It covers pre-training, continued pre-training, and supervised fine-tuning (SFT), while emphasizing scalability across hardware like NVIDIA GPUs and Kunlun XPUs.
The framework targets the challenges of diverse model architectures and heterogeneous hardware. Standard frameworks often struggle with memory management, parallelism across model components, and migration between GPU types. LoongForge introduces optimizations to cut training costs and speed up development; the repository stands at 153 GitHub stars at the time of writing. Its initial release came in April 2024 (announcements list 2026/04, likely a projection or a typo).
Core features
LoongForge stands out through specific optimizations tailored to multimodal training.
- Flexible Composition: Users configure vision-language models (VLMs) by mixing ViT vision encoders and LLM components. This abstraction simplifies adding new multimodal variants without deep code changes.
- Heterogeneous Parallelism: Different model parts, like the vision encoder and LLM, get independent settings for tensor parallel size, data parallel size, and recomputation layers. This balances throughput and memory (see the configuration sketch after this list).
- Decoupled Encoder-Decoder Training: Vision encoders train separately from LLMs, avoiding pipeline stalls where ViT computation slows LLM forward passes.
- DP Load Balancing: A data redistribution algorithm handles imbalances from data packing, boosting efficiency in multi-node setups (a minimal version of the idea is sketched after this list).
- MoE A2A Optimization: For mixture-of-experts (MoE) models, it overlaps All2All communication, activation offloading, and computation, using less memory than base Megatron-LM (the overlap pattern is sketched after this list).
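To make the composition and heterogeneous parallelism points concrete, the sketch below shows what independent per-component settings might look like. The dataclass and field names are hypothetical illustrations rather than LoongForge's actual configuration schema; the real interface is in the docs.

```python
# Hypothetical sketch of per-component parallelism settings for a VLM.
# Field names and structure are illustrative, not LoongForge's real schema.
from dataclasses import dataclass

@dataclass
class ComponentParallelConfig:
    tensor_parallel_size: int    # TP degree for this component
    data_parallel_size: int      # DP degree for this component
    recompute_num_layers: int    # layers to recompute (activation checkpointing)

# A small ViT encoder can skip tensor parallelism and recomputation entirely,
# while the much larger LLM gets TP=8 and aggressive recomputation.
vit_config = ComponentParallelConfig(tensor_parallel_size=1,
                                     data_parallel_size=32,
                                     recompute_num_layers=0)
llm_config = ComponentParallelConfig(tensor_parallel_size=8,
                                     data_parallel_size=4,
                                     recompute_num_layers=16)
```

The point is only that the encoder and the LLM can be tuned separately for throughput and memory instead of sharing one layout.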
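The DP load balancing bullet boils down to redistributing packed samples so every data-parallel rank sees roughly the same amount of work. A minimal greedy version of that idea is shown below; it is a generic illustration, not the algorithm LoongForge actually ships.

```python
# Greedy redistribution of packed sequences across data-parallel ranks.
# Generic illustration of DP load balancing, not LoongForge's algorithm.
import heapq

def balance_across_ranks(sample_lengths, num_ranks):
    """Assign sample indices to ranks so per-rank token counts stay even."""
    # Min-heap of (current_load, rank); hand the next-largest sample
    # to whichever rank is currently lightest.
    heap = [(0, rank) for rank in range(num_ranks)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    for idx in sorted(range(len(sample_lengths)), key=lambda i: -sample_lengths[i]):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + sample_lengths[idx], rank))
    return assignment

# Eight packed sequences of uneven length spread across two ranks.
print(balance_across_ranks([4096, 512, 2048, 1024, 3072, 256, 1536, 768], 2))
```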
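The MoE A2A item uses a standard overlap trick: launch the token all-to-all asynchronously and do independent computation while it is in flight. The sketch below shows that general pattern with plain PyTorch collectives; how LoongForge schedules the overlap and the activation offloading is not spelled out in the docs.

```python
# General pattern for overlapping MoE all-to-all with independent computation.
# Illustrative only; this is the textbook technique, not LoongForge's internals.
import torch
import torch.distributed as dist

def dispatch_with_overlap(tokens_for_experts: torch.Tensor, independent_work):
    output = torch.empty_like(tokens_for_experts)
    # Start the all-to-all without blocking the current stream of work.
    handle = dist.all_to_all_single(output, tokens_for_experts, async_op=True)
    # Run computation that does not depend on the exchanged tokens
    # (for example, a shared-expert branch) while communication is in flight.
    local_result = independent_work()
    handle.wait()  # ensure the exchange finished before consuming `output`
    return output, local_result
```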
Additional tools include custom fused operators such as FusedDSA, which combines the flashmla and indexer forward ops with custom backward passes for diffusion model training; TileLang-based versions are open-sourced. Adaptive FP8 precision enables end-to-end FP8 training for LLMs and VLMs, deciding per operator whether to use FP8 based on GEMM shape to get the best performance. Checkpoint conversion supports flexible formats, though the available docs leave the details incomplete.
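The per-operator FP8 decision can be pictured as a gate on GEMM shape: small or skinny matrix multiplies rarely repay the conversion overhead, so they stay in higher precision. The thresholds and rule below are made up for illustration; LoongForge's actual heuristic is not documented in the material referenced here.

```python
# Illustrative shape-based gate for per-operator FP8; not LoongForge's heuristic.
def use_fp8_for_gemm(m: int, n: int, k: int,
                     min_dim: int = 512, min_flops: float = 1e9) -> bool:
    """Enable FP8 only when the GEMM is large enough for the precision switch
    and quantization overhead to pay off; otherwise keep BF16."""
    if min(m, n, k) < min_dim:           # skinny GEMMs: overhead dominates
        return False
    return 2.0 * m * n * k >= min_flops  # large GEMMs: FP8 tensor cores win

# A 4096x4096x4096 projection qualifies; a 128-row GEMM does not.
print(use_fp8_for_gemm(4096, 4096, 4096))  # True
print(use_fp8_for_gemm(128, 4096, 4096))   # False
```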
These features natively cover LLMs, VLMs, VLAs, and diffusion models, with high-performance support for NVIDIA GPUs and Kunlun XPUs.
Getting it running
As a Python project under the baidu-baige organization on GitHub, LoongForge requires cloning the repository to start. Access the source at https://github.com/baidu-baige/LoongForge.
Detailed setup lives in the documentation at https://loongforge.readthedocs.io/en/latest/index.html. Expect dependencies on Megatron-LM and libraries for parallelism, such as those for tensor and data parallel training. Hardware setup needs NVIDIA GPUs or Kunlun XPUs, with configurations adjusted per component for optimal scaling.
Basic steps, following the project's usual patterns:
- Clone: `git clone https://github.com/baidu-baige/LoongForge.git`
- Install environment: Use Python 3.x with pip for Megatron-LM and related packages (exact requirements in docs).
- Configure: Edit YAML or config files for model composition, parallelism sizes, and hardware targets.
- Launch training: Run scripts for pre-training or SFT, specifying stages like decoupled encoder tasks.
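As a rough picture of that last step, launching a Megatron-derived framework usually comes down to a torchrun invocation of a training script with parallelism flags. In the sketch below, the entry-point name and the `--vit-*` flag are hypothetical placeholders; `--tensor-model-parallel-size` and `--pipeline-model-parallel-size` are standard Megatron-LM arguments that a framework built on it would typically inherit.

```python
# Hedged sketch of a launch wrapper. "pretrain_vlm.py" and the --vit-* flag are
# hypothetical; the two model-parallel flags come from upstream Megatron-LM.
import subprocess

cmd = [
    "torchrun", "--nproc_per_node", "8",
    "pretrain_vlm.py",                        # hypothetical entry point
    "--tensor-model-parallel-size", "4",      # standard Megatron-LM flag
    "--pipeline-model-parallel-size", "2",    # standard Megatron-LM flag
    "--vit-tensor-model-parallel-size", "1",  # hypothetical per-component flag
]
subprocess.run(cmd, check=True)
```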
A WeChat community group is available for support (linked from https://github.com/baidu-baige/LoongForge/issues/34). The framework is released under standard open-source terms (check the LICENSE file).
Who this is for
Researchers and engineers training transformer-based models at scale benefit most. If your work involves LLMs, VLMs, VLAs, or diffusion models on clusters with NVIDIA or Kunlun hardware, LoongForge fits. It suits teams needing to pre-train from scratch, continue pre-training, or fine-tune with SFT, especially where multimodal components cause bottlenecks.
Real-world use appears in Baidu's Baige platform for large-scale AI infrastructure. Examples include assembling custom VLMs by swapping ViT encoders, or scaling MoE models with overlapped communication to reach lower memory footprints. Heterogeneous hardware support helps migrate jobs between GPU clusters without reworking the training setup. Smaller teams might use it for proof-of-concept multimodal training before production.
If you run diverse hardware or pack data unevenly across nodes, the load balancing and decoupled training prevent common scaling issues.
Comparisons to alternatives
LoongForge extends Megatron-LM, so it inherits core parallelism but adds multimodal-specific fixes like heterogeneous configs and MoE overlaps. Upstream Megatron-LM lacks native VLM composition and Kunlun XPU support, making LoongForge better for Baidu ecosystems or mixed hardware.
Other frameworks like DeepSpeed handle LLMs well but trail in vision-action modalities. For pure LLMs, Hugging Face Transformers or FairScale offer simpler setups, though they scale less aggressively for VLAs. Axolotl or Llama-Factory focus on fine-tuning with less emphasis on pre-training efficiency. LoongForge's FP8 adaptations and fused ops give it an edge in throughput over vanilla Megatron for diffusion tasks, but it demands more config tuning.
It's heavier than lightweight fine-tuners because of its full pipeline support, and its 153 stars reflect early adoption compared with Megatron-LM's thousands. No Docker images are noted, unlike some rivals.
| Aspect | LoongForge | Megatron-LM | DeepSpeed |
|---|---|---|---|
| Modalities | LLM, VLM, VLA, Diffusion | Mostly LLM | LLM-focused |
| Hardware | NVIDIA + Kunlun XPU | NVIDIA | NVIDIA + others |
| Key optimizations | Heterogeneous parallelism, adaptive FP8 | Core parallelism | ZeRO stages |
| Ease for Multimodal | High (composition) | Low | Medium |
Practical considerations
LoongForge excels where Megatron-LM falls short on multimodal scaling, but it's not for casual fine-tuning or single-GPU hobbyists: expect to need cluster resources and familiarity with the configuration. Check the docs for full checkpoint details and operator builds. Source: https://github.com/baidu-baige/LoongForge. Join via WeChat for updates.