nanoVLM: The Super Simple Vision-Language Model You Can Train Yourself

Ever wished you could train your own AI that understands both images and text? Meet nanoVLM – the easiest way to build and train a Vision-Language Model using plain PyTorch code that won’t make your head spin.
What’s nanoVLM All About?
Think of nanoVLM as the friendly beginner’s toolkit for vision-language models. The entire codebase is only about 750 lines (plus a bit of boilerplate for logging). The code is crystal clear and broken down into simple parts, which fit together as sketched just after this list:
- Vision Backbone (~150 lines)
- Language Decoder (~250 lines)
- Modality Projection (~50 lines)
- The VLM itself (~100 lines)
- Training loop (~200 lines)
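To make that breakdown concrete, here is a rough, illustrative sketch of how the pieces compose in a forward pass: the vision backbone turns an image into patch embeddings, the modality projection maps those into the language model’s embedding space, and the decoder generates text conditioned on both. The class and method names below (including the `embed` helper) are invented for the sketch and won’t match the repo’s actual code.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Illustrative sketch only: how a vision backbone, a modality
    projection, and a language decoder compose into a VLM."""

    def __init__(self, vision_backbone, modality_projection, language_decoder):
        super().__init__()
        self.vision_backbone = vision_backbone          # image -> patch embeddings
        self.modality_projection = modality_projection  # vision dim -> LM embedding dim
        self.language_decoder = language_decoder        # autoregressive text decoder

    def forward(self, image, input_ids):
        patch_embeds = self.vision_backbone(image)             # (B, N_patches, D_vis)
        image_tokens = self.modality_projection(patch_embeds)  # (B, N_patches, D_lm)
        text_embeds = self.language_decoder.embed(input_ids)   # (B, T, D_lm), assumed helper
        # Prepend the projected image tokens to the text embeddings and decode.
        inputs = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_decoder(inputs)                   # logits over the vocabulary
```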
Just like Andrej Karpathy’s popular nanoGPT, this project aims to make cutting-edge AI more accessible. It’s not trying to break records – it’s trying to break barriers for newcomers.
What Can It Actually Do?
Using pre-trained parts (the SigLIP-B/16-224-85M vision encoder and the HuggingFaceTB/SmolLM2-135M language model), you get a 222M-parameter model – roughly 85M for the vision side, 135M for the language side, plus a small modality projection. Train it for about 6 hours on a single H100 GPU with 1.7M training samples, and it scores 35.3% accuracy on the MMStar benchmark.
This makes it perfect for playing around with VLMs without needing massive resources or complicated setups.
Getting Started
You can either clone the repository and set up locally, or use Google Colab for a no-setup experience.
Setting Up Your Environment
First, grab the code:
```bash
git clone https://github.com/huggingface/nanoVLM.git
cd nanoVLM
```
Using uv (recommended):
```bash
uv init --bare
uv sync --python 3.12
source .venv/bin/activate
uv add torch numpy torchvision pillow datasets huggingface-hub transformers wandb
```
Or with pip:
```bash
pip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb
```
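Either way, before you kick off training it’s worth a quick sanity check that PyTorch was installed with CUDA support and can actually see your GPU:

```python
import torch

# Quick sanity check that the install worked and a GPU is visible.
print(torch.__version__)
print(torch.cuda.is_available())  # should print True on a machine with a CUDA GPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```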
Training Your Model
Starting the training is super easy:
```bash
wandb login --relogin   # log in to Weights & Biases so training metrics get tracked
python train.py
```
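If you want a mental model of what those ~200 lines of training loop are doing, at its core it’s a very ordinary PyTorch loop: batches of image/token pairs go through the model, a next-token cross-entropy loss is computed, and the optimizer takes a step. The sketch below is illustrative only, not the repo’s actual code; `model`, `dataloader`, and the batch keys are assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative training step, not the repo's actual code.
# `model` and `dataloader` (yielding image/token batches) are assumed to exist.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:
    images = batch["image"].to(device)         # (B, 3, H, W)
    input_ids = batch["input_ids"].to(device)  # (B, T): prompt + answer tokens
    labels = batch["labels"].to(device)        # (B, T): answers, prompt/pad set to -100

    logits = model(images, input_ids)          # (B, T, vocab_size)
    # Next-token objective: the prediction at position t is scored against token t+1.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```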
Generating Results
Once trained, try it out:
```bash
python generate.py
```
For example, when shown a picture of a cat and asked “What is this?”, the model generates responses like:
- “This is a cat sitting on the floor.”
- “The picture contains a white and brown cat sitting on the floor.”
Using Pre-trained Models
Loading a pre-trained model from Hugging Face Hub:
```python
from models.vision_language_model import VisionLanguageModel

model = VisionLanguageModel.from_pretrained("lusxvr/nanoVLM-222M")
```
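From there you can run the loaded model on your own image in a few lines. The snippet below is only a rough sketch of that flow: the tokenizer choice, the image preprocessing, the file name, and the exact `generate()` signature are all assumptions that may differ from the repo, so check `generate.py` for the real API.

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoTokenizer

from models.vision_language_model import VisionLanguageModel

# Rough sketch only: preprocessing and the generate() call are assumptions;
# see generate.py in the repo for the exact API.
model = VisionLanguageModel.from_pretrained("lusxvr/nanoVLM-222M").eval()
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # SigLIP-B/16-224 operates on 224x224 images
    transforms.ToTensor(),
])
image = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)  # hypothetical file
prompt_ids = tokenizer("What is this?", return_tensors="pt").input_ids

with torch.no_grad():
    output_ids = model.generate(prompt_ids, image, max_new_tokens=30)  # assumed signature
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```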
Sharing Your Work
After training, share your model on Hugging Face Hub:
```python
model.push_to_hub("my-awesome-nanovlm-model")
```
Or save it locally:
```python
model.save_pretrained("path/to/local/model")
```
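A locally saved model can then be reloaded from that directory. This assumes `from_pretrained` accepts the same local path that `save_pretrained` wrote to, which is the usual convention for Hub-style model classes:

```python
from models.vision_language_model import VisionLanguageModel

# Assumption: from_pretrained accepts the directory written by save_pretrained.
model = VisionLanguageModel.from_pretrained("path/to/local/model")
```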
Ready to jump in? Whether you’re a student, researcher, or just AI-curious, nanoVLM gives you a simple starting point for understanding how machines can see and talk about images. Give it a try and see what your model can learn to recognize!