Multimodal models (LLaVA, GPT-4V-class architectures, Flamingo derivatives, CLIP variants) process some combination of text, images, video, and audio. Training them is more GPU-intensive than training unimodal LLMs because each data type is encoded through its own tower and the results are fused through cross-attention or projection layers. The data pipeline alone is 3-5x more complex.

Here's what the GPU requirements actually look like, from small-scale fine-tuning to pre-training.

Why Multimodal Needs More GPU

A text-only 7B LLM processes tokenized sequences. Straightforward. A multimodal model of equivalent capability does all of that plus:

  • Encodes images through a vision transformer (ViT-L is 300M params, consuming 4-8GB VRAM for the encoder alone)
  • Processes variable-resolution images (higher res = more patches = more memory)
  • Maintains a cross-modal projection layer that bridges visual features to the LLM's embedding space
  • Handles mixed-length batches where some examples have images and others don't

The memory overhead versus a text-only model of the same LLM backbone: roughly 40-60% more VRAM during training.
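The link between resolution and memory in the list above can be made concrete with a little arithmetic. The sketch below assumes a ViT with 14-pixel patches (as in ViT-L/14) and square inputs; the function name is illustrative, not taken from any library.

```python
def vit_patch_tokens(image_size, patch_size=14):
    """Number of patch tokens a ViT produces for a square image.

    Assumes the image side is an exact multiple of the patch size,
    which is how ViT inputs are normally resized.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# Doubling the side length quadruples the number of visual tokens
# that the projection layer hands to the LLM.
for resolution in (224, 336, 448, 672):
    tokens = vit_patch_tokens(resolution)
    print(f"{resolution}x{resolution}: {tokens} visual tokens")
```

At 336 pixels this gives 576 visual tokens per image, all of which occupy sequence positions in the LLM, so activation memory grows with resolution even when the text prompt stays the same length.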

GPU Requirements by Task

Fine-tuning a multimodal model (e.g., LLaVA 1.5 on custom data):

Model Size       | Minimum GPU     | Recommended  | Cost on io.net
LLaVA 7B (LoRA)  | RTX 4090 (24GB) | A100 40GB    | $0.18-$1.20/hr
LLaVA 13B (LoRA) | A100 40GB       | A100 80GB    | $1.20-$1.49/hr
LLaVA 34B (LoRA) | A100 80GB       | 2x A100 80GB | $1.49-$2.98/hr

Fine-tuning LLaVA 7B on 50K image-text pairs with LoRA: approximately 6 hours on a single A100 40GB, costing about $7.20 on io.net.

Pre-training a multimodal model from scratch:

This is where things get expensive. Even a relatively small multimodal pre-training run requires:

Scale                        | GPUs          | Duration   | Cost on io.net
Small (1B vision + 7B LLM)   | 8x A100 80GB  | 3-7 days   | $860-$2,000
Medium (1B vision + 13B LLM) | 16x A100 80GB | 7-14 days  | $4,000-$8,000
Large (2B vision + 70B LLM)  | 64x H100 SXM  | 14-30 days | $47,000-$101,000
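The cost figures above follow from GPU count, hourly rate, and wall-clock time. Here is a minimal sketch of that arithmetic. The A100 rates are the per-GPU prices implied by the fine-tuning table; the H100 rate is an assumption back-calculated from the Large row, not a quoted price.

```python
A100_40GB_RATE = 1.20  # USD per GPU-hour, from the fine-tuning table
A100_80GB_RATE = 1.49  # USD per GPU-hour, from the fine-tuning table
H100_SXM_RATE = 2.19   # ASSUMPTION: inferred from the Large row's totals


def run_cost(num_gpus, hours, rate_per_gpu_hour):
    """Total cost of a run billed per GPU-hour."""
    return num_gpus * hours * rate_per_gpu_hour


# (label, GPU count, shortest run in days, longest run in days, rate)
scenarios = [
    ("Small", 8, 3, 7, A100_80GB_RATE),
    ("Medium", 16, 7, 14, A100_80GB_RATE),
    ("Large", 64, 14, 30, H100_SXM_RATE),
]

for label, gpus, min_days, max_days, rate in scenarios:
    low = run_cost(gpus, min_days * 24, rate)
    high = run_cost(gpus, max_days * 24, rate)
    print(f"{label}: ${low:,.0f} to ${high:,.0f}")

# The LoRA fine-tuning example: one A100 40GB for 6 hours.
print(f"LLaVA 7B LoRA: ${run_cost(1, 6, A100_40GB_RATE):.2f}")
```

Under these rates the results land within rounding of the table. The calculation ignores storage, data egress, and failed or restarted runs, so treat it as a floor rather than a budget.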

For context, LLaVA-NeXT, from the LLaVA research team, was reportedly trained on 32x A100 GPUs in roughly a day. Qwen-VL used 32x A100s. These are research-lab-scale runs that are increasingly achievable on io.net at a fraction of hyperscaler pricing.

Data Pipeline Considerations

Multimodal training is almost always bottlenecked by the data pipeline, not GPU compute. Images are large, decoding is CPU-intensive, and augmentations add latency.

Storage: Image-text datasets are big. A 10M image-text pair dataset (like a subset of LAION) is 5-20TB. Use NVMe-backed persistent volumes on io.net with pre-processed WebDataset shards for maximum throughput.
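A WebDataset shard is an ordinary tar file in which files sharing a basename (for example 000001.jpg and 000001.txt) form one sample. The sketch below writes such shards using only the Python standard library, to show the layout. The function names, shard naming pattern, and placeholder image bytes are illustrative assumptions; a real pipeline would more likely use the webdataset package's own writer.

```python
import io
import os
import tarfile
import tempfile


def _add_member(tar, name, payload):
    """Add an in-memory bytes payload to an open tar archive."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def write_shards(out_dir, samples, samples_per_shard):
    """Write (key, image_bytes, caption) triples into numbered tar shards."""
    paths = []
    starts = range(0, len(samples), samples_per_shard)
    for shard_idx, start in enumerate(starts):
        path = os.path.join(out_dir, f"shard-{shard_idx:06d}.tar")
        with tarfile.open(path, "w") as tar:
            for key, image_bytes, caption in samples[start:start + samples_per_shard]:
                # Same basename, different extensions: one sample.
                _add_member(tar, f"{key}.jpg", image_bytes)
                _add_member(tar, f"{key}.txt", caption.encode("utf-8"))
        paths.append(path)
    return paths


# Demo with placeholder bytes standing in for real, pre-resized JPEG data.
demo_samples = [(f"{i:06d}", b"placeholder-jpeg", f"caption {i}") for i in range(5)]
shard_paths = write_shards(tempfile.mkdtemp(), demo_samples, samples_per_shard=2)
print(shard_paths)
```

Storing already-resized images in sequentially readable shards avoids millions of small random reads, which is usually where network or disk throughput collapses first.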

Preprocessing: Decode and resize images on CPU workers, not on the GPU. Use torchvision.transforms inside a PyTorch DataLoader with num_workers set to 16 or more, or switch to NVIDIA DALI for GPU-accelerated preprocessing if the CPU becomes the bottleneck.
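One rough way to pick a num_workers value is to compare how fast a GPU consumes samples with how fast one CPU worker can decode them. The sketch below does that arithmetic. Both throughput figures and the 25% headroom are hypothetical placeholders; measure your own before relying on the result.

```python
import math


def workers_per_gpu(gpu_samples_per_sec, worker_samples_per_sec, headroom=1.25):
    """CPU dataloader workers needed to keep one GPU process fed.

    headroom > 1 leaves slack for slow samples and I/O stalls.
    """
    if worker_samples_per_sec <= 0:
        raise ValueError("worker_samples_per_sec must be positive")
    demand = gpu_samples_per_sec * headroom
    return math.ceil(demand / worker_samples_per_sec)


# Hypothetical measurements: the training step consumes 400 images/s
# per GPU, and one CPU worker decodes and resizes about 30 images/s.
print(workers_per_gpu(gpu_samples_per_sec=400, worker_samples_per_sec=30))
```

If the answer exceeds the CPU cores available per GPU on the node, that is a signal to pre-resize images offline or move decoding to the GPU.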

Mixed-type batching: Not all samples in a batch have images. Collation functions need to handle variable-length sequences with optional image tensors. This is a common source of bugs — test your dataloader independently before committing to a long training run.
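A collation function for mixed batches can be sketched as follows. This version is deliberately framework-agnostic and uses plain Python lists so the logic is easy to test in isolation; in a PyTorch pipeline the lists would become tensors. The field names, the padding id, and the use of -1 to mark text-only examples are all assumptions made for illustration.

```python
PAD_TOKEN_ID = 0  # hypothetical padding id; use your tokenizer's real value


def collate_mixed(batch):
    """Pad token sequences and gather the images that are present.

    Each example is a dict with 'input_ids' (list of ints) and an
    optional 'image'. image_index[i] is the position of example i's
    image in the returned 'images' list, or -1 if it has none.
    """
    if not batch:
        raise ValueError("batch must contain at least one example")
    max_len = max(len(example["input_ids"]) for example in batch)
    input_ids, attention_mask = [], []
    images, image_index = [], []
    for example in batch:
        ids = list(example["input_ids"])
        pad = max_len - len(ids)
        input_ids.append(ids + [PAD_TOKEN_ID] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
        image = example.get("image")
        if image is None:
            image_index.append(-1)
        else:
            image_index.append(len(images))
            images.append(image)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "images": images,
        "image_index": image_index,
    }


collated = collate_mixed([
    {"input_ids": [5, 6, 7], "image": "img-a"},
    {"input_ids": [8]},  # text-only example
    {"input_ids": [9, 10], "image": "img-b"},
])
print(collated["image_index"])
```

Running a function like this over a few hundred real batches, including batches with no images at all, is a cheap way to catch shape errors before a multi-day run starts.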

Architecture Patterns

Frozen vision encoder + trainable projection (cheapest):
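Freeze the ViT, train only the projection layer and LLM. Used by LLaVA 1.5, which does this in its instruction-tuning stage after a first stage that trains the projection alone. Fast convergence and a lower GPU requirement than end-to-end training, because no gradients or optimizer state are kept for the vision encoder. Good for domain-specific fine-tuning where the visual understanding doesn't need to change (e.g., medical images using a pre-trained BiomedCLIP encoder).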

Full end-to-end training (most expensive):
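Train vision encoder, projection, and LLM jointly. This is the approach associated with frontier systems such as GPT-4V and Gemini, although their training details are not fully public. Requires significantly more data and compute but produces the strongest cross-modal understanding. Only justified at 10M+ training examples.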

Adapter-based multimodal (efficient middle ground):
Freeze both vision encoder and LLM, train only adapters (LoRA on the LLM, small adapter on the vision encoder). Very efficient — fits on a single A100 40GB for 13B-scale models.
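To compare the three patterns, it helps to count which components carry gradients and optimizer state. The sketch below uses the common mixed-precision Adam estimate of about 16 bytes per trainable parameter (fp16 weight and gradient plus fp32 master weight and two moments) and 2 bytes per frozen parameter. The component sizes are illustrative assumptions for a 7B-class model, and activations are excluded entirely.

```python
# Parameter counts in billions. ILLUSTRATIVE ASSUMPTIONS, not measurements.
COMPONENTS = {
    "vision_encoder": 0.3,  # ViT-L-sized encoder
    "projector": 0.02,      # small MLP bridging vision features to the LLM
    "llm": 7.0,
    "adapters": 0.05,       # hypothetical LoRA + vision adapter size
}

BYTES_TRAINABLE = 16  # fp16 weight + fp16 grad + fp32 master + Adam m, v
BYTES_FROZEN = 2      # fp16 weight only

# Which components each pattern trains, as described in the text.
PATTERNS = {
    "frozen_vision": {"projector", "llm"},
    "end_to_end": {"vision_encoder", "projector", "llm"},
    "adapter_based": {"adapters"},
}


def training_state_gb(trainable):
    """Weights + gradients + optimizer state in GB (no activations)."""
    total = 0.0
    for name, billions in COMPONENTS.items():
        if name in trainable:
            total += billions * BYTES_TRAINABLE
        elif name != "adapters":  # adapters only exist when trained
            total += billions * BYTES_FROZEN
    return total


for pattern, trainable in PATTERNS.items():
    print(f"{pattern}: ~{training_state_gb(trainable):.1f} GB of training state")
```

Any pattern that trains the full LLM needs its state sharded across GPUs (for example with DeepSpeed ZeRO), while the adapter-based state fits comfortably on one card. Activation memory, which depends on batch size, sequence length, and image resolution, comes on top of these figures.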

Framework Options

  • LLaVA codebase: Reference implementation, well-documented, supports DeepSpeed for multi-GPU training
  • Hugging Face Transformers: Native support for LLaVA, Idefics2, PaliGemma with standard Trainer
  • OpenFlamingo: Open-source Flamingo implementation, designed for multi-GPU training from the start
  • NVIDIA NeMo Multimodal: Enterprise-grade, optimized for multi-node H100 clusters
  • xtuner: Lightweight fine-tuning toolkit with strong multimodal support

Train multimodal models on io.net, from LoRA fine-tuning on a single GPU to 64-GPU pre-training clusters.