Yes. io.net supports pre-loading datasets through persistent storage volumes, S3/GCS mounting, and pre-built Docker images with embedded datasets. Create a persistent volume, upload your dataset once, and mount it across multiple GPU instances. This eliminates repeated downloads and cuts training startup time from minutes to seconds.
Methods to Pre-Load Datasets
1. Persistent Volumes (Recommended):
# Create persistent storage
io storage create --name training-data --size 500GB --region us-west
# Upload dataset
io upload large-dataset.tar.gz training-data:/datasets/
# Mount on GPU instance
io deploy --image pytorch/pytorch:latest \
--gpu A100 \
--mount training-data:/data \
--name training-job
# Dataset available at /data/datasets/ inside container
2. S3/GCS Direct Mount:
# Mount S3 bucket (no upfront copy; objects are fetched from S3 when read)
io deploy --image pytorch/pytorch:latest \
--gpu A100 \
--mount s3://my-bucket/datasets:/data \
--env AWS_ACCESS_KEY_ID=$AWS_KEY \
--env AWS_SECRET_ACCESS_KEY=$AWS_SECRET \
--name training-with-s3
# Data streams directly from S3 during training
3. Pre-Built Image with Dataset:
FROM pytorch/pytorch:latest
# Download dataset during image build (create the target directory first, since tar -C needs it to exist)
RUN mkdir -p /workspace/data && \
    wget https://example.com/dataset.tar.gz && \
    tar xzf dataset.tar.gz -C /workspace/data && \
    rm dataset.tar.gz
# Dataset embedded in image (ready instantly)
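For the persistent-volume method, the upload step above places an archive (`large-dataset.tar.gz`) on the volume, so something still has to unpack it before training can read individual files. The following is a minimal sketch of a one-time extraction helper, assuming the mounted volume is writable from the container. The function name `ensure_extracted` and the `.extracted` marker file are hypothetical and not part of any io.net tooling.

```python
import tarfile
from pathlib import Path


def ensure_extracted(archive: Path, target_dir: Path) -> bool:
    """Extract `archive` into `target_dir` unless it was already done.

    Returns True if extraction ran, False if it was skipped.
    """
    marker = target_dir / ".extracted"  # hypothetical marker file name
    if marker.exists():
        return False
    target_dir.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz, or none)
    with tarfile.open(archive, "r:*") as tar:
        tar.extractall(target_dir)
    # Written last, so an interrupted extraction is retried on the next run
    marker.touch()
    return True
```

A training entrypoint could call `ensure_extracted(Path("/data/datasets/large-dataset.tar.gz"), Path("/data/datasets/large-dataset"))` before building its data loader. The first job pays the extraction cost and later jobs that mount the same volume skip it.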
Performance Comparison
| Method | Startup Time | Cost | Best For |
|---|---|---|---|
| Download each time | 5-15 min | GPU idle time wasted | Small datasets (<10GB) |
| Persistent volume | <10 sec | Storage: $0.10/GB/month | Medium-large (10GB-1TB) |
| S3 mount | <5 sec | S3 storage, request, and transfer fees | Very large (>1TB), streaming |
| Embedded in image | 0 sec once the image is pulled | Image storage | Small, static datasets |
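The size thresholds in the table can be written down as a small selection helper. This is only a sketch of the table's "Best For" column. The function name, the return labels, and the `static` flag are illustrative assumptions, and a real choice would also weigh cost and access pattern.

```python
def recommend_method(size_gb: float, static: bool = False) -> str:
    """Map a dataset size to a loading method, following the table's size bands."""
    if size_gb > 1000:  # very large (>1TB): stream from object storage
        return "s3_mount"
    if size_gb >= 10:  # medium-large (10GB-1TB): persistent volume
        return "persistent_volume"
    # small (<10GB): embed in the image if it never changes, else download
    return "embedded_image" if static else "download_each_time"
```

For example, `recommend_method(150)` falls in the persistent-volume band, which matches the ImageNet case discussed next.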
Example: ImageNet Pre-Loading
Dataset: ImageNet (150GB)
Traditional approach (slow):
# Download every time (15+ minutes at 200 MB/s)
io deploy --image training:latest --gpu A100 \
--command "wget https://imagenet.org/data.tar && tar xf data.tar && python train.py"
# Cost: $1.10/hr × 0.25hr = $0.28 wasted per job
Persistent volume approach (fast):
# One-time setup
io storage create --name imagenet --size 200GB
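io upload imagenet.tar.gz imagenet:/
# Note: train.py below expects an extracted tree at /data/imagenet, so unpack the archive on the volume if the upload does not do so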
# Subsequent jobs (instant)
io deploy --image training:latest --gpu A100 \
--mount imagenet:/data \
--command "python train.py --data /data/imagenet"
# Cost: about $20/month storage (200GB at $0.10/GB/month), no download wait per job
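Whether the volume pays for itself depends on how many jobs run per month. The sketch below works through that arithmetic using only figures quoted in this article: $1.10/hr for the GPU, 0.25 hr of download wait per job, a 200GB volume, and the $0.10/GB/month storage rate from the comparison table. The function names are illustrative, and actual pricing may differ.

```python
import math


def wasted_cost_per_job(gpu_rate_per_hr: float, wait_hr: float) -> float:
    """GPU spend while the job only downloads and unpacks data."""
    return gpu_rate_per_hr * wait_hr


def monthly_storage_cost(volume_gb: float, rate_per_gb_month: float) -> float:
    """Flat monthly cost of keeping the persistent volume."""
    return volume_gb * rate_per_gb_month


def break_even_jobs(gpu_rate_per_hr: float, wait_hr: float,
                    volume_gb: float, rate_per_gb_month: float) -> int:
    """Smallest number of jobs per month at which the volume is cheaper."""
    per_job = wasted_cost_per_job(gpu_rate_per_hr, wait_hr)
    storage = monthly_storage_cost(volume_gb, rate_per_gb_month)
    return math.ceil(storage / per_job)


jobs = break_even_jobs(gpu_rate_per_hr=1.10, wait_hr=0.25,
                       volume_gb=200, rate_per_gb_month=0.10)
print(f"Volume breaks even at about {jobs} jobs per month")
```

With these inputs the break-even point is about 73 single-GPU jobs per month. Multi-GPU jobs reach it sooner, because every GPU in the job sits idle during the download.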
Large Dataset Strategies
Streaming from S3:
# stream_dataset.py
import boto3
import torch
from torch.utils.data import IterableDataset


def process(raw_bytes):
    # Placeholder: decode the raw object bytes into a training sample
    return raw_bytes


class S3StreamDataset(IterableDataset):
    def __init__(self, bucket, prefix):
        self.s3 = boto3.client('s3')
        self.bucket = bucket
        self.prefix = prefix

    def __iter__(self):
        # Stream data directly from S3 (no local storage needed)
        paginator = self.s3.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=self.bucket, Prefix=self.prefix):
            # 'Contents' is absent when a page has no matching objects
            for obj in page.get('Contents', []):
                data = self.s3.get_object(Bucket=self.bucket, Key=obj['Key'])
                yield process(data['Body'].read())


# Use in training
dataset = S3StreamDataset("my-bucket", "datasets/")
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32)
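One caveat with iterable-style datasets: when a PyTorch `DataLoader` runs with `num_workers` greater than 0, each worker process gets its own copy of the dataset, so every worker would list and yield the same S3 objects unless the keys are split between them. The sketch below shows one way to split them, using a hypothetical helper `shard_keys` that is kept free of torch imports so it can be run in isolation.

```python
from typing import Iterable, Iterator


def shard_keys(keys: Iterable[str], worker_id: int,
               num_workers: int) -> Iterator[str]:
    """Yield every num_workers-th key, starting at offset worker_id."""
    if num_workers < 1 or not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must satisfy 0 <= worker_id < num_workers")
    for index, key in enumerate(keys):
        if index % num_workers == worker_id:
            yield key


# Inside __iter__, the worker identity would come from PyTorch:
#   info = torch.utils.data.get_worker_info()   # None in the main process
#   worker_id, num_workers = (info.id, info.num_workers) if info else (0, 1)
```

Round-robin assignment keeps the shards disjoint and roughly equal in object count. It does not balance by object size, so shards can still differ in bytes.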
Pre-Processing Datasets
# Pre-process dataset on cheap GPU, save to volume
io deploy --image preprocessing:latest \
--gpu RTX4090 \
--mount raw-data:/input \
--mount processed-data:/output \
--command "python preprocess.py --input /input --output /output"
# Training uses preprocessed data (faster, cheaper)
io deploy --image training:latest \
--gpu H100 --count 8 \
--mount processed-data:/data \
--command "python train.py"
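The commands above call a `preprocess.py` that the article does not show. As a rough sketch of what such a script might look like, the code below mirrors the input tree into the output directory and skips files that were already written, so a restarted job resumes where it stopped. The `preprocess_tree` helper and the identity transform are placeholders, not io.net or PyTorch functionality.

```python
import argparse
from pathlib import Path
from typing import Callable


def preprocess_tree(input_dir: Path, output_dir: Path,
                    transform: Callable[[bytes], bytes]) -> int:
    """Apply `transform` to each file under input_dir, mirroring the layout.

    Files that already exist in output_dir are skipped.
    Returns the number of files written.
    """
    written = 0
    for src in sorted(input_dir.rglob("*")):
        if not src.is_file():
            continue
        dst = output_dir / src.relative_to(input_dir)
        if dst.exists():
            continue  # produced by an earlier run
        dst.parent.mkdir(parents=True, exist_ok=True)
        dst.write_bytes(transform(src.read_bytes()))
        written += 1
    return written


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Pre-process a dataset tree")
    parser.add_argument("--input", type=Path, required=True)
    parser.add_argument("--output", type=Path, required=True)
    args = parser.parse_args(argv)
    # Placeholder transform: copies bytes unchanged; replace with real work
    return preprocess_tree(args.input, args.output, lambda raw: raw)
```

When saved as a script, `main()` would be called under an `if __name__ == "__main__":` guard, which makes it match the `--input /input --output /output` invocation shown above.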
Pre-load datasets on io.net with persistent volumes and instant mounting.
