Multimodal models (LLaVA, GPT-4V-class architectures, Flamingo derivatives, CLIP variants) process some combination of text, images, video, and audio. Training them is more GPU-intensive than training unimodal LLMs because each modality is encoded through its own tower, and the towers are then fused through cross-attention or projection layers. The data pipeline alone can be 3-5x more complex than a text-only one.
Here's what the GPU requirements actually look like, from small-scale fine-tuning to pre-training.
Why Multimodal Needs More GPU
A text-only 7B LLM processes tokenized sequences. Straightforward. A multimodal model of equivalent capability does all of that plus:
- Encodes images through a vision transformer (ViT-L is roughly 300M params; the half-precision weights are only about 0.6GB, but budget 4-8GB of VRAM for the encoder alone once activations are counted)
- Processes variable-resolution images (higher res = more patches = more memory)
- Maintains a cross-modal projection layer that bridges visual features to the LLM's embedding space
- Handles mixed-length batches where some examples have images and others don't
The memory overhead versus a text-only model of the same LLM backbone: roughly 40-60% more VRAM during training.
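To make the "higher res = more patches" point concrete, here is a minimal sketch that counts the visual tokens a plain ViT emits per image. It assumes square 14-pixel patches (as in the ViT-L/14 encoders used by CLIP) and no tiling or token pooling; `visual_tokens` is a hypothetical helper, not part of any library.

```python
def visual_tokens(height: int, width: int, patch_size: int = 14) -> int:
    """Number of patch tokens a plain ViT produces for one image (no class token)."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch_size) * (width // patch_size)


for side in (224, 336, 448, 672):
    n = visual_tokens(side, side)
    # Self-attention cost grows with the square of the sequence length.
    print(f"{side}x{side}: {n} tokens, {n * n:,} attention pairs per head")
```

At 336x336 this gives 576 tokens per image, all of which enter the LLM's sequence in a projection-style architecture. Doubling the side length quadruples the token count.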
GPU Requirements by Task
Fine-tuning a multimodal model (e.g., LLaVA 1.5 on custom data):
| Model Size | Minimum GPU | Recommended | Cost on io.net |
|---|---|---|---|
| LLaVA 7B (LoRA) | RTX 4090 (24GB) | A100 40GB | $0.18-$1.20/hr |
| LLaVA 13B (LoRA) | A100 40GB | A100 80GB | $1.20-$1.49/hr |
| LLaVA 34B (LoRA) | A100 80GB | 2x A100 80GB | $1.49-$2.98/hr |
Fine-tuning LLaVA 7B on 50K image-text pairs with LoRA: approximately 6 hours on a single A100 40GB, or about $7.20 at $1.20/hr on io.net.
Pre-training a multimodal model from scratch:
This is where things get expensive. Even a relatively small multimodal pre-training run requires:
| Scale | GPUs | Duration | Cost on io.net |
|---|---|---|---|
| Small (1B vision + 7B LLM) | 8x A100 80GB | 3-7 days | $860-$2,000 |
| Medium (1B vision + 13B LLM) | 16x A100 80GB | 7-14 days | $4,000-$8,000 |
| Large (2B vision + 70B LLM) | 64x H100 SXM | 14-30 days | $47,000-$101,000 |
For context, LLaVA-NeXT (from the LLaVA research team, not Meta) was reported by its authors to train on 32x A100 80GB in roughly a day. Qwen-VL used 32x A100s. These are research-lab-scale runs that are increasingly achievable on io.net at a fraction of hyperscaler pricing.
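The cost figures in this section are simple products of GPU count, wall-clock hours, and hourly rate. A minimal sketch of that arithmetic follows, using the $1.49/hr A100 80GB and $1.20/hr A100 40GB rates from the fine-tuning table; `cluster_cost` is a hypothetical helper, and real bills may also include storage and networking.

```python
def cluster_cost(num_gpus: int, hours: float, usd_per_gpu_hour: float) -> float:
    """Total rental cost in USD: GPUs x wall-clock hours x hourly rate per GPU."""
    return num_gpus * hours * usd_per_gpu_hour


A100_80GB_RATE = 1.49  # USD per GPU-hour, taken from the fine-tuning table

# Small pre-training row: 8x A100 80GB for 3 to 7 days.
low = cluster_cost(8, 3 * 24, A100_80GB_RATE)
high = cluster_cost(8, 7 * 24, A100_80GB_RATE)
print(f"small run: ${low:,.0f} to ${high:,.0f}")

# LoRA fine-tune: one A100 40GB for 6 hours at $1.20/hr.
print(f"fine-tune: ${cluster_cost(1, 6, 1.20):.2f}")
```

Plugging in your own duration estimate is the quickest way to sanity-check a quote before reserving a cluster.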
Data Pipeline Considerations
Multimodal training is almost always bottlenecked by the data pipeline, not GPU compute. Images are large, decoding is CPU-intensive, and augmentations add latency.
Storage: Image-text datasets are big. A 10M image-text pair dataset (like a subset of LAION) is 5-20TB. Use NVMe-backed persistent volumes on io.net with pre-processed WebDataset shards for maximum throughput.
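A WebDataset shard is an ordinary tar file in which files sharing a basename form one sample. As a rough sketch of the pre-processing step, the standard library is enough to write shards in that layout. The function name, shard naming scheme, and `samples_per_shard` default below are assumptions for illustration; the `webdataset` package ships its own writers that you would more likely use in practice.

```python
import io
import tarfile
from pathlib import Path


def write_shards(samples, out_dir, samples_per_shard=1000):
    """Pack (image_bytes, caption) pairs into WebDataset-style tar shards.

    Files that share a basename ("000000042.jpg", "000000042.txt") form one sample.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths, tar = [], None
    for idx, (image_bytes, caption) in enumerate(samples):
        if idx % samples_per_shard == 0:
            # Start a new shard every `samples_per_shard` samples.
            if tar is not None:
                tar.close()
            path = out_dir / f"shard-{idx // samples_per_shard:06d}.tar"
            tar = tarfile.open(path, "w")
            paths.append(path)
        for ext, payload in (("jpg", image_bytes), ("txt", caption.encode("utf-8"))):
            info = tarfile.TarInfo(name=f"{idx:09d}.{ext}")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    if tar is not None:
        tar.close()
    return paths
```

Sequential reads of large tar files are what make this layout fast on NVMe: the loader streams each shard front to back instead of opening millions of small files.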
Preprocessing: Decode and resize images on CPU workers, not on the GPU. Run torchvision.transforms inside a PyTorch DataLoader with num_workers set to 16 or more, and switch to NVIDIA DALI for GPU-accelerated preprocessing only if the CPU becomes the bottleneck.
Mixed-type batching: Not all samples in a batch have images. Collation functions need to handle variable-length sequences with optional image tensors. This is a common source of bugs — test your dataloader independently before committing to a long training run.
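One way to handle mixed batches is to pad the text as usual and carry the images as a separate, shorter list with an index back to their batch rows. The sketch below is framework-agnostic and uses plain Python lists; the field names (`input_ids`, `image`, `image_rows`) are hypothetical, and in PyTorch you would pass a function like this as `collate_fn` and convert the lists to tensors.

```python
def collate_mixed(samples, pad_id=0):
    """Collate samples shaped like {"input_ids": [int, ...], "image": object or None}.

    Returns padded token IDs, an attention mask, the images that are present,
    and for each image the batch row it belongs to.
    """
    max_len = max(len(s["input_ids"]) for s in samples)
    input_ids, attention_mask, images, image_rows = [], [], [], []
    for row, sample in enumerate(samples):
        ids = list(sample["input_ids"])
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * pad)            # right-pad to the longest sequence
        attention_mask.append([1] * len(ids) + [0] * pad)
        if sample.get("image") is not None:
            images.append(sample["image"])
            image_rows.append(row)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "images": images,          # only the images that exist, in batch order
        "image_rows": image_rows,  # batch row index for each entry in "images"
    }
```

Because it has no framework dependency, a function like this is easy to unit-test on text-only, image-only, and mixed batches before a long run starts.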
Architecture Patterns
Frozen vision encoder + trainable projection (cheapest):
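Freeze the ViT and train only the projection layer and the LLM. LLaVA 1.5 uses this pattern: its first stage trains the projector alone, then the LLM is unfrozen for instruction tuning. It converges quickly and needs less GPU than end-to-end training because the vision encoder carries no gradients or optimizer state. Good for domain-specific fine-tuning where the visual understanding doesn't need to change (e.g., medical images using a pre-trained BiomedCLIP encoder).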
Full end-to-end training (most expensive):
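Train vision encoder, projection, and LLM jointly. This is the approach generally associated with frontier models such as GPT-4V and Gemini, although their training details are not fully public. It requires significantly more data and compute but produces the strongest cross-modal understanding. As a rule of thumb, it is only justified at 10M+ training examples.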
Adapter-based multimodal (efficient middle ground):
Freeze both vision encoder and LLM, train only adapters (LoRA on the LLM, small adapter on the vision encoder). Very efficient — fits on a single A100 40GB for 13B-scale models.
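A back-of-the-envelope way to compare the three patterns is to count what sits in GPU memory before any activations. The sketch below assumes mixed-precision Adam (about 16 bytes per trainable parameter: fp16 weights, fp16 gradients, an fp32 master copy, and two fp32 Adam moments) and 2 bytes per frozen parameter. The component sizes are illustrative, the adapter size is hypothetical, and the adapter pattern is assumed to leave the projector frozen.

```python
BYTES_FROZEN = 2      # half-precision weights only
BYTES_TRAINABLE = 16  # fp16 weights + fp16 grads + fp32 master copy + two Adam moments


def static_memory_gb(trainable_params: float, frozen_params: float) -> float:
    """Weights, gradients and optimizer state in GB; activations are NOT included."""
    return (trainable_params * BYTES_TRAINABLE + frozen_params * BYTES_FROZEN) / 1e9


VISION, PROJECTOR, LLM = 0.3e9, 0.02e9, 7e9  # illustrative sizes
ADAPTERS = 0.05e9                            # hypothetical total adapter size

patterns = {
    "frozen vision, train projector + LLM": (PROJECTOR + LLM, VISION),
    "full end-to-end": (VISION + PROJECTOR + LLM, 0.0),
    "adapters only": (ADAPTERS, VISION + PROJECTOR + LLM),
}
for name, (trainable, frozen) in patterns.items():
    print(f"{name}: about {static_memory_gb(trainable, frozen):.0f} GB")
```

Under these assumptions, both patterns that train the full LLM carry more than 100 GB of static state, which is why they are typically sharded across GPUs with something like DeepSpeed. The adapter pattern stays close to the size of the half-precision weights.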
Framework Options
- LLaVA codebase: Reference implementation, well-documented, supports DeepSpeed for multi-GPU training
- Hugging Face Transformers: Native support for LLaVA, Idefics2, and PaliGemma with the standard Trainer API
- OpenFlamingo: Open-source Flamingo implementation, designed for multi-GPU training from the start
- NVIDIA NeMo Multimodal: Enterprise-grade, optimized for multi-node H100 clusters
- xtuner: Lightweight fine-tuning toolkit with strong multimodal support
Train multimodal models on io.net, from LoRA fine-tuning on a single GPU to 64-GPU pre-training clusters.
