nanoVLM: The Super Simple Vision-Language Model You Can Train Yourself

Ever wished you could train your own AI that understands both images and text? Meet nanoVLM – the easiest way to build and train a Vision-Language Model using plain PyTorch code that won’t make your head spin.
What’s nanoVLM All About?
Think of nanoVLM as the friendly beginner’s toolkit for vision-language models. The entire codebase is only about 750 lines (plus a bit of boilerplate for logging). The code is crystal clear and broken down into simple parts, which fit together as sketched just after this list:
- Vision Backbone (~150 lines)
- Language Decoder (~250 lines)
- Modality Projection (~50 lines)
- The VLM itself (~100 lines)
- Training loop (~200 lines)
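To make that breakdown concrete, here is a rough, illustrative sketch of how the pieces compose in a forward pass: the vision backbone turns an image into patch embeddings, the modality projection maps those into the language model’s embedding space, and the decoder generates text conditioned on both. The class and method names below (including the `embed` helper) are invented for the sketch and won’t match the repo’s actual code.

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    """Illustrative sketch only: how a vision backbone, a modality
    projection, and a language decoder compose into a VLM."""

    def __init__(self, vision_backbone, modality_projection, language_decoder):
        super().__init__()
        self.vision_backbone = vision_backbone          # image -> patch embeddings
        self.modality_projection = modality_projection  # vision dim -> LM embedding dim
        self.language_decoder = language_decoder        # autoregressive text decoder

    def forward(self, image, input_ids):
        patch_embeds = self.vision_backbone(image)             # (B, N_patches, D_vis)
        image_tokens = self.modality_projection(patch_embeds)  # (B, N_patches, D_lm)
        text_embeds = self.language_decoder.embed(input_ids)   # (B, T, D_lm), assumed helper
        # Prepend the projected image tokens to the text embeddings and decode.
        inputs = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_decoder(inputs)                   # logits over the vocabulary
```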
Just like Andrej Karpathy’s popular nanoGPT, this project aims to make cutting-edge AI more accessible. It’s not trying to break records – it’s trying to break barriers for newcomers.
What Can It Actually Do?
Using pre-trained parts (the SigLIP-B/16-224-85M vision encoder and the HuggingFaceTB/SmolLM2-135M language model), you get a 222M-parameter model – roughly 85M for the vision side, 135M for the language side, plus a small modality projection. Train it for about 6 hours on a single H100 GPU with 1.7M training samples, and it scores 35.3% accuracy on the MMStar benchmark.
This makes it perfect for playing around with VLMs without needing massive resources or complicated setups.
Getting Started
You can either clone the repository and set up locally, or use Google Colab for a no-setup experience.
Setting Up Your Environment
First, grab the code:
```bash
git clone https://github.com/huggingface/nanoVLM.git
cd nanoVLM
```
Using uv (recommended):
```bash
uv init --bare
uv sync --python 3.12
source .venv/bin/activate
uv add torch numpy torchvision pillow datasets huggingface-hub transformers wandb
```
Or with pip:
```bash
pip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb
```
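Either way, before you kick off training it’s worth a quick sanity check that PyTorch was installed with CUDA support and can actually see your GPU:

```python
import torch

# Quick sanity check that the install worked and a GPU is visible.
print(torch.__version__)
print(torch.cuda.is_available())  # should print True on a machine with a CUDA GPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```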
Training Your Model
Starting the training is super easy:
```bash
wandb login --relogin   # log in to Weights & Biases so training metrics get tracked
python train.py
```
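If you want a mental model of what those ~200 lines of training loop are doing, at its core it’s a very ordinary PyTorch loop: batches of image/token pairs go through the model, a next-token cross-entropy loss is computed, and the optimizer takes a step. The sketch below is illustrative only, not the repo’s actual code; `model`, `dataloader`, and the batch keys are assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative training step, not the repo's actual code.
# `model` and `dataloader` (yielding image/token batches) are assumed to exist.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for batch in dataloader:
    images = batch["image"].to(device)         # (B, 3, H, W)
    input_ids = batch["input_ids"].to(device)  # (B, T): prompt + answer tokens
    labels = batch["labels"].to(device)        # (B, T): answers, prompt/pad set to -100

    logits = model(images, input_ids)          # (B, T, vocab_size)
    # Next-token objective: the prediction at position t is scored against token t+1.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```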
Generating Results
Once trained, try it out:
```bash
python generate.py
```
For example, when shown a picture of a cat and asked “What is this?”, the model generates responses like:
- “This is a cat sitting on the floor.”
- “The picture contains a white and brown cat sitting on the floor.”
Using Pre-trained Models
Loading a pre-trained model from Hugging Face Hub:
```python
from models.vision_language_model import VisionLanguageModel

model = VisionLanguageModel.from_pretrained("lusxvr/nanoVLM-222M")
```
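From there you can run the loaded model on your own image in a few lines. The snippet below is only a rough sketch of that flow: the tokenizer choice, the image preprocessing, the file name, and the exact `generate()` signature are all assumptions that may differ from the repo, so check `generate.py` for the real API.

```python
import torch
from PIL import Image
from torchvision import transforms
from transformers import AutoTokenizer

from models.vision_language_model import VisionLanguageModel

# Rough sketch only: preprocessing and the generate() call are assumptions;
# see generate.py in the repo for the exact API.
model = VisionLanguageModel.from_pretrained("lusxvr/nanoVLM-222M").eval()
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # SigLIP-B/16-224 operates on 224x224 images
    transforms.ToTensor(),
])
image = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)  # hypothetical file
prompt_ids = tokenizer("What is this?", return_tensors="pt").input_ids

with torch.no_grad():
    output_ids = model.generate(prompt_ids, image, max_new_tokens=30)  # assumed signature
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```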
Sharing Your Work
After training, share your model on Hugging Face Hub:
```python
model.push_to_hub("my-awesome-nanovlm-model")
```
Or save it locally:
```python
model.save_pretrained("path/to/local/model")
```
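A locally saved model can then be reloaded from that directory. This assumes `from_pretrained` accepts the same local path that `save_pretrained` wrote to, which is the usual convention for Hub-style model classes:

```python
from models.vision_language_model import VisionLanguageModel

# Assumption: from_pretrained accepts the directory written by save_pretrained.
model = VisionLanguageModel.from_pretrained("path/to/local/model")
```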
Ready to jump in? Whether you’re a student, researcher, or just AI-curious, nanoVLM gives you a simple starting point for understanding how machines can see and talk about images. Give it a try and see what your model can learn to recognize!