FAQ: Can I pre-load datasets on io.net GPUs?

Yes. io.net supports pre-loading datasets through persistent storage volumes, S3/GCS mounting, and pre-built Docker images with embedded datasets. Create a persistent volume, upload your dataset once, and mount it across multiple GPU instances. This eliminates repeated downloads and reduces training startup time from minutes to seconds.

Methods to Pre-Load Datasets

1. Persistent Volumes (Recommended):

# Create persistent storage
io storage create --name training-data --size 500GB --region us-west

# Upload dataset
io upload large-dataset.tar.gz training-data:/datasets/

# Mount on GPU instance
io deploy --image pytorch/pytorch:latest \
  --gpu A100 \
  --mount training-data:/data \
  --name training-job

# Dataset available at /data/datasets/ inside container
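
The upload step above places a .tar.gz archive on the volume, so a job may need to unpack it once before training can read individual files. A minimal sketch of that step, assuming the archive is visible at /data/datasets/large-dataset.tar.gz and that the volume is writable from the container (both are assumptions; ensure_extracted and the .extracted marker file are hypothetical names, not part of any io.net tooling):

```python
import tarfile
from pathlib import Path


def ensure_extracted(archive: Path, target_dir: Path) -> bool:
    """Unpack `archive` into `target_dir` once; return True if extraction ran."""
    marker = target_dir / ".extracted"  # hypothetical marker written after success
    if marker.exists():
        return False  # an earlier job already unpacked the dataset

    target_dir.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python version supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target_dir, **kwargs)
    marker.touch()
    return True


# Example call inside the container (paths are assumptions):
# ensure_extracted(Path("/data/datasets/large-dataset.tar.gz"),
#                  Path("/data/datasets/large-dataset"))
```

Because the marker lives on the volume, later jobs that mount the same volume skip the extraction. This sketch does not guard against two jobs extracting at the same time.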

2. S3/GCS Direct Mount:

# Mount S3 bucket (no upfront copy; data is read over the network on demand)
io deploy --image pytorch/pytorch:latest \
  --gpu A100 \
  --mount s3://my-bucket/datasets:/data \
  --env AWS_ACCESS_KEY_ID=$AWS_KEY \
  --env AWS_SECRET_ACCESS_KEY=$AWS_SECRET \
  --name training-with-s3

# Data streams directly from S3 during training

3. Pre-Built Image with Dataset:

FROM pytorch/pytorch:latest

# Download dataset during image build
RUN mkdir -p /workspace/data && \
    wget https://example.com/dataset.tar.gz && \
    tar xzf dataset.tar.gz -C /workspace/data && \
    rm dataset.tar.gz

# Dataset embedded in image (ready as soon as the image is pulled)

Performance Comparison

Method	Startup Time	Cost	Best For
Download each time	5-15 min	GPU idle time wasted	Small datasets (<10GB)
Persistent volume	<10 sec	Storage: $0.10/GB/month	Medium-large (10GB-1TB)
S3 mount	<5 sec	S3 costs only	Very large (>1TB), streaming
Embedded in image	0 sec after image pull	Image storage	Small, static datasets
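
The size thresholds in the table can be written as a small helper. This is only a sketch of the table's "Best For" column, assuming 1 TB = 1000 GB; suggest_method and the returned labels are hypothetical names chosen for illustration:

```python
def suggest_method(size_gb: float, static: bool = False) -> str:
    """Map a dataset size in GB to the method the comparison table suggests."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    if size_gb > 1000:
        return "s3-mount"            # very large (>1TB), streaming
    if size_gb >= 10:
        return "persistent-volume"   # medium-large (10GB-1TB)
    # Small datasets: embed in the image if they never change, else download.
    return "embedded-image" if static else "download-each-time"


print(suggest_method(150))  # ImageNet-sized dataset
```

Real choices also depend on how often jobs run and where the data already lives, so treat the thresholds as a starting point.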

Example: ImageNet Pre-Loading

Dataset: ImageNet (150GB)

Traditional approach (slow):

# Download every time (150GB at 200 MB/s is about 12.5 min of transfer, 15+ min with extraction)
io deploy --image training:latest --gpu A100 \
  --command "wget https://imagenet.org/data.tar && tar xf data.tar && python train.py"

# Cost: $1.10/hr × 0.25hr = $0.28 wasted per job

Persistent volume approach (fast):

# One-time setup
io storage create --name imagenet --size 200GB
io upload imagenet.tar.gz imagenet:/
# If the upload does not unpack archives, extract it once on the volume so /data/imagenet exists

# Subsequent jobs (instant)
io deploy --image training:latest --gpu A100 \
  --mount imagenet:/data \
  --command "python train.py --data /data/imagenet"

# Cost: 200GB × $0.10/GB/month = $20/month storage, near-zero wait time
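
The arithmetic behind this example can be checked with a short script. It is a sketch that uses only figures quoted in this FAQ (150GB dataset, 200 MB/s, $1.10/hr, 0.25 hr of idle time, a 200GB volume, and the table's $0.10/GB/month storage rate) and assumes 1 GB = 1000 MB; the function names are hypothetical:

```python
import math

# Figures taken from the example and the comparison table.
DATASET_GB = 150
DOWNLOAD_MB_PER_S = 200
GPU_USD_PER_HR = 1.10
IDLE_HR_PER_JOB = 0.25            # download plus extraction, as in the example
VOLUME_GB = 200
STORAGE_USD_PER_GB_MONTH = 0.10


def download_minutes(size_gb: float, mb_per_s: float) -> float:
    """Pure transfer time, assuming 1 GB = 1000 MB and sustained throughput."""
    return size_gb * 1000 / mb_per_s / 60


def break_even_jobs(volume_usd_month: float, idle_usd_job: float) -> int:
    """Jobs per month at which the volume costs no more than the idle GPU time."""
    return math.ceil(volume_usd_month / idle_usd_job)


transfer_min = download_minutes(DATASET_GB, DOWNLOAD_MB_PER_S)
idle_usd = GPU_USD_PER_HR * IDLE_HR_PER_JOB
volume_usd = VOLUME_GB * STORAGE_USD_PER_GB_MONTH

print(f"transfer time: {transfer_min:.1f} min")
print(f"idle GPU cost per job: ${idle_usd:.3f}")
print(f"volume cost per month: ${volume_usd:.2f}")
print(f"break-even: {break_even_jobs(volume_usd, idle_usd)} jobs/month")
```

With these numbers the volume pays for itself on cost alone at roughly 73 jobs per month. Below that, its main benefit is the shorter startup time rather than a lower bill.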

Large Dataset Strategies

Streaming from S3:

# stream_dataset.py
import boto3
import torch
from torch.utils.data import IterableDataset


def process(raw_bytes):
    # Placeholder (an assumption, not defined in the original):
    # replace with real decoding of the object bytes into tensors.
    return raw_bytes


class S3StreamDataset(IterableDataset):
    def __init__(self, bucket, prefix):
        self.bucket = bucket
        self.prefix = prefix

    def __iter__(self):
        # Create the client here so each DataLoader worker process gets its own
        s3 = boto3.client('s3')
        # Stream objects directly from S3 (no local storage needed)
        paginator = s3.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=self.bucket, Prefix=self.prefix):
            # 'Contents' is absent when a page has no matching objects
            for obj in page.get('Contents', []):
                data = s3.get_object(Bucket=self.bucket, Key=obj['Key'])
                yield process(data['Body'].read())

# Use in training
dataset = S3StreamDataset("my-bucket", "datasets/")
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
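
One caveat with the streaming dataset above: when a DataLoader runs with num_workers greater than 0, every worker iterates its own copy of an IterableDataset, so each S3 object would be yielded once per worker unless the keys are split. A minimal sketch of a round-robin split, in plain Python so it runs without PyTorch; shard_keys is a hypothetical helper name:

```python
from typing import List, Sequence


def shard_keys(keys: Sequence[str], worker_id: int, num_workers: int) -> List[str]:
    """Return the keys assigned to one worker using a round-robin split."""
    if num_workers < 1 or not 0 <= worker_id < num_workers:
        raise ValueError("need num_workers >= 1 and 0 <= worker_id < num_workers")
    # Worker i takes keys i, i + num_workers, i + 2 * num_workers, ...
    return list(keys[worker_id::num_workers])


keys = [f"datasets/part-{i:03d}.bin" for i in range(10)]  # example key names
for wid in range(3):
    print(wid, shard_keys(keys, wid, 3))
```

Inside `__iter__`, the worker id and worker count can be read from `torch.utils.data.get_worker_info()`, which returns None in the main process and an object with `id` and `num_workers` attributes inside a worker. Each worker would then fetch only the keys that shard_keys assigns to it.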

Pre-Processing Datasets

# Pre-process dataset on cheap GPU, save to volume
io deploy --image preprocessing:latest \
  --gpu RTX4090 \
  --mount raw-data:/input \
  --mount processed-data:/output \
  --command "python preprocess.py --input /input --output /output"

# Training uses preprocessed data (faster, cheaper)
io deploy --image training:latest \
  --gpu H100 --count 8 \
  --mount processed-data:/data \
  --command "python train.py"
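
The commands above call a preprocess.py script that this FAQ does not show. A hedged sketch of what such a script could look like, assuming it takes the same --input and --output flags and processes one file at a time; transform is a hypothetical placeholder for the real work (resizing, tokenizing, and so on):

```python
import argparse
from pathlib import Path


def transform(raw: bytes) -> bytes:
    """Hypothetical placeholder: replace with the actual preprocessing step."""
    return raw


def preprocess(input_dir: Path, output_dir: Path) -> int:
    """Process every file under input_dir into output_dir; return files written."""
    written = 0
    for src in sorted(input_dir.rglob("*")):
        if not src.is_file():
            continue
        dst = output_dir / src.relative_to(input_dir)
        if dst.exists():
            continue  # already processed by an earlier run, so the job can resume
        dst.parent.mkdir(parents=True, exist_ok=True)
        tmp = dst.with_name(dst.name + ".tmp")
        tmp.write_bytes(transform(src.read_bytes()))
        tmp.replace(dst)  # rename so a crash never leaves a partial output file
        written += 1
    return written


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Preprocess a dataset volume")
    parser.add_argument("--input", type=Path, required=True)
    parser.add_argument("--output", type=Path, required=True)
    args = parser.parse_args(argv)
    count = preprocess(args.input, args.output)
    print(f"wrote {count} files to {args.output}")
    return count
```

Saved as preprocess.py with a call to main() at the bottom, it matches the invocation shown above. Skipping files that already exist in the output volume lets an interrupted preprocessing job resume without redoing finished work.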

Pre-load datasets on io.net with persistent volumes and instant mounting.