Fractional GPUs let you rent a portion of a GPU — say, 8GB of a 24GB RTX 4090 or 20GB of an 80GB A100 — instead of paying for the whole card. It's the GPU equivalent of shared hosting versus a dedicated server. You get access to a slice of the compute and memory at a fraction of the full price.

Not every cloud provider offers this, and the devil is in the details. Some fractional GPU implementations are excellent for lightweight workloads. Others introduce performance variability that makes them unsuitable for production. Here's when fractional GPUs make sense and when you should just rent the whole card.

How GPU Sharing Works

There are three common mechanisms:

1. NVIDIA MPS (Multi-Process Service)
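Multiple CUDA processes share a single GPU through a common MPS server, so kernels from different processes can run concurrently instead of taking turns. Each client's share of compute and memory can be capped, but these are software limits: nothing is partitioned in hardware.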
- Isolation: Moderate. One noisy neighbor can slow everyone down.
- Overhead: 5-10% for scheduling.
- Best for: Development environments, low-throughput inference.
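
MPS exposes per-client limits through environment variables set before a CUDA process starts. The following is a minimal Python sketch of how such an environment might be built. `CUDA_MPS_ACTIVE_THREAD_PERCENTAGE` and `CUDA_MPS_PINNED_DEVICE_MEM_LIMIT` are NVIDIA's variable names, while the helper `mps_client_env` and its parameters are hypothetical, and the sketch assumes an MPS control daemon is already running on the host.

```python
import os
from typing import Dict, Optional


def mps_client_env(
    thread_percentage: int,
    device: int = 0,
    mem_limit: Optional[str] = None,
    base_env: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Build the environment for one CUDA process that will run under MPS.

    Hypothetical helper: it only prepares variables, it does not start MPS.
    """
    if not 1 <= thread_percentage <= 100:
        raise ValueError("thread_percentage must be between 1 and 100")
    env = dict(os.environ if base_env is None else base_env)
    # Upper bound on the portion of the GPU's SMs this client may occupy.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(thread_percentage)
    if mem_limit is not None:
        # Value format is "<device>=<limit>", for example "0=8G".
        env["CUDA_MPS_PINNED_DEVICE_MEM_LIMIT"] = f"{device}={mem_limit}"
    return env


if __name__ == "__main__":
    env = mps_client_env(33, device=0, mem_limit="8G", base_env={})
    for key, value in sorted(env.items()):
        print(f"{key}={value}")
```

The returned dictionary can be passed as `env=` to `subprocess.Popen` when launching the workload. These caps bound what a client may use; they do not give it the hardware isolation described for MIG.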

2. NVIDIA MIG (Multi-Instance GPU), data-center GPUs such as the A100 and H100
Hardware-level partitioning. An A100 80GB can split into up to 7 independent instances (each with 10GB VRAM), 3 instances (each with 20GB), or 2 instances (each with 40GB). Each instance has its own dedicated compute and memory, which gives true isolation.
- Isolation: Excellent. Hardware-enforced, no noisy neighbors.
- Overhead: Near zero — hardware partitioning.
- Best for: Multi-tenant production inference, shared development clusters.
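
To make the slice sizes concrete, here is a small Python sketch that picks the smallest A100 80GB MIG profile able to hold a given workload. The profile names and nominal sizes follow NVIDIA's A100 80GB profile naming (only a subset is listed); the function `smallest_mig_profile` and the 10% headroom default are illustrative assumptions, not part of any NVIDIA or io.net API.

```python
from typing import Optional

# Subset of A100 80GB MIG profiles: (name, nominal memory in GB, max instances per GPU).
A100_80GB_PROFILES = [
    ("1g.10gb", 10, 7),
    ("2g.20gb", 20, 3),
    ("3g.40gb", 40, 2),
    ("7g.80gb", 80, 1),
]


def smallest_mig_profile(required_gb: float, headroom: float = 0.10) -> Optional[str]:
    """Return the smallest listed profile that covers the requirement plus headroom."""
    if required_gb <= 0:
        raise ValueError("required_gb must be positive")
    needed = required_gb * (1.0 + headroom)
    for name, memory_gb, _max_instances in A100_80GB_PROFILES:
        if memory_gb >= needed:
            return name
    return None  # does not fit on a single A100 80GB


if __name__ == "__main__":
    for gb in (3.5, 9.5, 18, 60):
        print(f"{gb} GB needed -> {smallest_mig_profile(gb)}")
```

The headroom matters near the boundaries: a model that needs 9.5GB is pushed up to the 20GB profile rather than squeezed into a 10GB slice.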

3. Time-sharing / Virtual GPUs
Software-based scheduling that gives each workload a time slice on the GPU. Like running multiple programs on a single CPU core, but for GPUs.
- Isolation: Weak. Performance is unpredictable.
- Overhead: 15-30% from context switching.
- Best for: Only when no other option exists.
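
As a rough way to compare the three mechanisms, the sketch below turns the overhead figures quoted above into a per-tenant share of compute. It is a back-of-the-envelope model, not a benchmark: the overhead values are midpoints of this article's ranges, and the even split between tenants and the function name `effective_share` are simplifying assumptions.

```python
# Midpoints of the overhead ranges quoted in this article (fraction of GPU time lost).
OVERHEAD = {
    "mps": 0.075,           # 5-10%
    "mig": 0.0,             # near zero
    "time_slicing": 0.225,  # 15-30%
}


def effective_share(mechanism: str, tenants: int) -> float:
    """Fraction of one full GPU's compute each tenant gets under an even split."""
    if tenants < 1:
        raise ValueError("tenants must be >= 1")
    if mechanism not in OVERHEAD:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return (1.0 - OVERHEAD[mechanism]) / tenants


if __name__ == "__main__":
    for mechanism in OVERHEAD:
        print(f"{mechanism}: {effective_share(mechanism, 3):.3f} of a GPU per tenant")
```

This only captures average throughput. It says nothing about the latency jitter that weak isolation introduces, which is often the bigger problem.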

When Fractional GPUs Save Money

The economics work out when your workload doesn't need a full GPU:

Small model inference (< 4GB VRAM needed):
Running a small embedding model, Whisper tiny/base, or a lightweight classifier. These models use a fraction of even an RTX 4090's memory. Renting a full card means paying for 20GB of idle VRAM.

| Approach | Monthly cost | Utilization |
| --- | --- | --- |
| Full RTX 4090 (24/7) | $129.60 | 15% |
| 1/3 fractional RTX 4090 | ~$45-50 | 45% |
| Shared A100 via MIG (10GB slice) | ~$35-40 | 70% |
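
The arithmetic behind this comparison can be sketched in a few lines of Python. The $0.18/hr rate and the utilization figures come from this article; the 30-day month, the use of range midpoints for the fractional prices, and the function names are assumptions made for illustration.

```python
HOURS_PER_MONTH = 24 * 30  # the 24/7 figure above corresponds to a 30-day month


def monthly_cost(hourly_rate: float, hours: float = HOURS_PER_MONTH) -> float:
    """Cost of keeping an instance running for the given number of hours."""
    return round(hourly_rate * hours, 2)


def cost_per_fully_used_month(monthly: float, utilization: float) -> float:
    """Monthly cost divided by utilization: what you pay per month of real work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return round(monthly / utilization, 2)


if __name__ == "__main__":
    options = {
        "Full RTX 4090": (monthly_cost(0.18), 0.15),
        "1/3 fractional RTX 4090": (47.50, 0.45),  # midpoint of ~$45-50
        "A100 MIG 10GB slice": (37.50, 0.70),      # midpoint of ~$35-40
    }
    for name, (monthly, utilization) in options.items():
        effective = cost_per_fully_used_month(monthly, utilization)
        print(f"{name}: ${monthly:.2f}/month, ${effective:.2f} per fully used month")
```

Dividing by utilization shows why the cheaper slice wins here: the full card costs more and most of that spend buys idle capacity.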

Development and experimentation:
When you're writing code, running small test batches, and iterating. You don't need sustained GPU compute — just enough to validate your code works on GPU before scaling up.

Low-traffic inference APIs:
Serving 10-50 requests per minute. A full GPU would sit idle 90% of the time.
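
A quick estimate shows where the idle time comes from. The Python sketch below assumes requests are processed one at a time and that each one keeps the GPU busy for a fixed duration; the 0.12-second figure is a made-up example value, not a measurement from this article.

```python
def gpu_utilization(requests_per_minute: float, gpu_seconds_per_request: float) -> float:
    """Fraction of each minute the GPU is busy, assuming requests run one at a time."""
    if requests_per_minute < 0 or gpu_seconds_per_request < 0:
        raise ValueError("inputs must be non-negative")
    busy_seconds = requests_per_minute * gpu_seconds_per_request
    return min(busy_seconds / 60.0, 1.0)


if __name__ == "__main__":
    for rpm in (10, 50):
        utilization = gpu_utilization(rpm, gpu_seconds_per_request=0.12)
        print(f"{rpm} req/min -> GPU busy {utilization:.0%} of the time")
```

Measure your own per-request GPU time before relying on this. Batching and longer generations can change the picture considerably.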

When You Should Use the Whole GPU

Training of any kind. Even LoRA fine-tuning benefits from full GPU access. Training saturates GPU compute and memory, so sharing makes every training step slower, and the extra wall-clock time costs more than you save on the hourly rate.

Production inference at scale. If your GPU utilization is above 50%, fractional GPUs add overhead and variability without saving money. Just use the full card.

Latency-sensitive workloads. Fractional GPUs introduce latency jitter from co-tenant scheduling. If your SLA requires consistent sub-100ms inference, avoid sharing.

Models that use most of the VRAM. If your model occupies 16GB+ on a 24GB card, there's nothing meaningful to share. The overhead of fractional scheduling just slows you down.

The io.net Approach

On io.net, the economics of full GPU rental are already so competitive that fractional GPUs are less necessary than on hyperscalers. An RTX 4090 at $0.18/hr is already cheap enough that renting the full card for a lightweight workload costs only $4.32/day — often less than fractional pricing on other platforms.

That said, for teams running multiple lightweight models or shared development environments, io.net supports GPU sharing through container-based isolation. You can run multiple containers on a single GPU. Note that Docker has no flag that caps VRAM per container (--shm-size only sizes /dev/shm in host RAM), so each workload has to limit its own GPU memory in the framework it uses:

# Run two containers on the same GPU (device 0).
# Docker does not enforce a VRAM limit here; cap GPU memory inside each app.
docker run -d --gpus '"device=0"' model-a:latest
docker run -d --gpus '"device=0"' model-b:latest
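
Because Docker does not enforce a per-container VRAM limit, the cap has to come from inside each workload. As one possible approach, assuming the containers run PyTorch (the article does not say which framework is used), each process can restrict its own allocator at startup. The helper names `vram_fraction` and `cap_process_vram` are hypothetical; `torch.cuda.set_per_process_memory_fraction` is the PyTorch call doing the work.

```python
def vram_fraction(limit_gb: float, total_gb: float) -> float:
    """Convert an absolute VRAM budget into a fraction of the card's memory."""
    if limit_gb <= 0 or total_gb <= 0:
        raise ValueError("limit_gb and total_gb must be positive")
    return min(limit_gb / total_gb, 1.0)


def cap_process_vram(limit_gb: float, device: int = 0) -> float:
    """Cap this process's PyTorch allocations on `device`; return the fraction applied.

    Needs PyTorch and a visible CUDA device, so call it inside the container.
    """
    import torch  # imported lazily so vram_fraction works without PyTorch installed

    total_gb = torch.cuda.get_device_properties(device).total_memory / 1024**3
    fraction = vram_fraction(limit_gb, total_gb)
    torch.cuda.set_per_process_memory_fraction(fraction, device)
    return fraction


if __name__ == "__main__":
    # Example: a 12GB budget on a 24GB card is half of the device memory.
    print(vram_fraction(12, 24))
```

This is a cooperative limit: it constrains PyTorch's caching allocator in a well-behaved process, but it is not enforced isolation, and the CUDA context itself uses some memory outside the cap.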

For A100 clusters, MIG partitioning is available on enterprise plans, providing hardware-isolated GPU slices for multi-tenant deployments.

Decision Framework

| Your situation | Recommendation |
| --- | --- |
| Small model, low traffic | Fractional/shared: save 50-65% |
| Development/testing | Fractional: no need for full power |
| Training (any model) | Full GPU: always |
| Production inference, >50% utilization | Full GPU: overhead not worth it |
| Multiple small models, one GPU | Container sharing: practical middle ground |
| Multi-tenant enterprise | MIG on A100/H100: hardware isolation |
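
The framework above can also be written as a small rule function, which is handy if you want to apply it across many services. This Python sketch is one possible encoding: the 50% threshold and the recommendations come from this article, while the function name, parameter names, and the order in which the rules are checked are assumptions.

```python
def recommend(
    workload: str,
    utilization: float = 0.0,
    multi_tenant: bool = False,
    models_on_gpu: int = 1,
) -> str:
    """Map a workload description to one of the article's recommendations.

    workload: "training", "inference", or "development".
    utilization: expected GPU utilization as a fraction between 0 and 1.
    """
    if workload == "training":
        return "full GPU"
    if workload == "inference" and utilization > 0.5:
        return "full GPU"
    if multi_tenant:
        return "MIG on A100/H100"
    if models_on_gpu > 1:
        return "container sharing"
    return "fractional/shared"


if __name__ == "__main__":
    print(recommend("training"))
    print(recommend("inference", utilization=0.7))
    print(recommend("inference", utilization=0.1, models_on_gpu=3))
    print(recommend("development"))
```

Training and high utilization are checked first because the article treats them as hard reasons to rent the whole card, whatever else is true of the deployment.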

Right-size your GPU spend on io.net, from $0.18/hr for a full RTX 4090, with container sharing for lightweight workloads.