
nanoVLM: The Super Simple Vision-Language Model You Can Train Yourself

Ever wished you could train your own AI that understands both images and text? Meet nanoVLM – the easiest way to build and train a Vision-Language Model using plain PyTorch code that won’t make your head spin.


What’s nanoVLM All About?

Think of nanoVLM as the friendly beginner’s toolkit for vision-language models. The entire codebase is just around 750 lines (plus some extra for logging). The code is crystal clear and broken down into simple parts (a quick sketch of how they fit together follows the list):

  • Vision Backbone (~150 lines)
  • Language Decoder (~250 lines)
  • Modality Projection (~50 lines)
  • The VLM itself (~100 lines)
  • Training loop (~200 lines)
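
Conceptually, the glue between those parts is simple: the vision backbone turns an image into a sequence of patch features, the modality projection maps those into the language model’s embedding space, and the decoder treats them as a prefix before the text tokens. Here’s a self-contained sketch of that wiring in plain PyTorch; the class, stub modules, and dimensions are all illustrative stand-ins, not nanoVLM’s actual code:

    import torch
    import torch.nn as nn

    # Illustrative dimensions only (roughly SigLIP-B features -> SmolLM2-135M width).
    VISION_DIM, TEXT_DIM, VOCAB = 768, 576, 49152

    class TinyVLM(nn.Module):
        """Conceptual sketch of the nanoVLM wiring, not the repo's actual
        classes: patch features from the vision backbone are projected into
        the decoder's embedding space and prepended to the text tokens."""

        def __init__(self, vision_backbone: nn.Module, decoder: nn.Module):
            super().__init__()
            self.vision_backbone = vision_backbone      # stands in for SigLIP
            self.decoder = decoder                      # stands in for SmolLM2
            self.embed = nn.Embedding(VOCAB, TEXT_DIM)  # decoder token embeddings
            # The ~50-line "modality projection" is, at heart, a learned map
            # from vision feature space into the text embedding space.
            self.projection = nn.Linear(VISION_DIM, TEXT_DIM)

        def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor):
            feats = self.vision_backbone(pixel_values)        # (B, N_img, VISION_DIM)
            img_tokens = self.projection(feats)               # (B, N_img, TEXT_DIM)
            txt_tokens = self.embed(input_ids)                # (B, N_txt, TEXT_DIM)
            seq = torch.cat([img_tokens, txt_tokens], dim=1)  # image tokens first
            return self.decoder(seq)                          # next-token logits

    # Smoke test with stand-in modules: 196 patch features + 8 text tokens.
    vlm = TinyVLM(nn.Identity(), nn.Linear(TEXT_DIM, VOCAB))
    logits = vlm(torch.randn(1, 196, VISION_DIM), torch.randint(0, VOCAB, (1, 8)))
    print(logits.shape)  # torch.Size([1, 204, 49152])

The nice part is that the decoder never needs to know the prefix tokens came from an image; once projected, they’re just embeddings like any other.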

Just like Andrej Karpathy’s popular nanoGPT, this project aims to make cutting-edge AI more accessible. It’s not trying to break records – it’s trying to break barriers for newcomers.

What Can It Actually Do?

Using pre-trained parts (SigLIP-B/16-224-85M and HuggingFaceTB/SmolLM2-135M), you get a 222M parameter model. Train it for about 6 hours on a single H100 GPU with 1.7M training samples, and it scores 35.3% accuracy on the MMStar benchmark.

This makes it perfect for playing around with VLMs without needing massive resources or complicated setups.

Getting Started

You can either clone the repository and set up locally, or use Google Colab for a no-setup experience.

Setting Up Your Environment

First, grab the code:
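
Assuming the official huggingface/nanoVLM repository on GitHub:

    git clone https://github.com/huggingface/nanoVLM.git
    cd nanoVLM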

Using uv (recommended):
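
A sketch of the uv workflow; the package list below is approximate, so check the repo’s README for the current one:

    uv init --bare
    uv sync --python 3.12
    source .venv/bin/activate
    uv add torch numpy torchvision pillow datasets huggingface-hub transformers wandb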

Or with pip:
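
Same dependencies via pip, into whatever environment you prefer (again, treat the list as approximate):

    pip install torch numpy torchvision pillow datasets huggingface-hub transformers wandb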

Training Your Model

Starting the training is super easy:
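
The repository ships a train.py entry point, so with the defaults it’s a one-liner (hyperparameters live in the repo’s config module rather than on the command line):

    python train.py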

Generating Results

Once trained, try it out:
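
The repo includes a generate.py script; run with no arguments it uses a built-in test image and prompt, and its arguments let you point it at your own picture (check the script itself for the exact flags):

    python generate.py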

For example, when shown a picture of a cat and asked “What is this?”, the model generates responses like:

  • “This is a cat sitting on the floor.”
  • “The picture contains a white and brown cat sitting on the floor.”

Using Pre-trained Models

Loading a pre-trained model from Hugging Face Hub:
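
A minimal sketch, assuming the repo’s VisionLanguageModel class and the published lusxvr/nanoVLM-222M checkpoint:

    from models.vision_language_model import VisionLanguageModel

    # Downloads the checkpoint from the Hub on first use, then loads it
    model = VisionLanguageModel.from_pretrained("lusxvr/nanoVLM-222M")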

Sharing Your Work

After training, share your model on Hugging Face Hub:
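
The model class follows the usual Hugging Face pattern here; “your-username” below is a placeholder for your own Hub account:

    # Uploads the weights and config to the Hub under your account
    model.push_to_hub("your-username/nanoVLM-222M")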

Or save it locally:
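
Same pattern with a directory of your choosing (the path below is a placeholder):

    # Writes the weights and config to a local directory
    model.save_pretrained("path/to/nanoVLM-222M")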


Ready to jump in? Whether you’re a student, researcher, or just AI-curious, nanoVLM gives you a simple starting point for understanding how machines can see and talk about images. Give it a try and see what your model can learn to recognize!
