Pre-trained LLMs are general-purpose tools. They can summarize documents, answer questions, and generate code. But they don't know your proprietary data, your company's terminology, your domain's edge cases, or the specific output format your application requires. Fine-tuning is how you bridge that gap — taking an open-source model and training it directly on your data until it behaves like a specialist.
The barrier has always been compute. Fine-tuning requires GPU hardware with enough VRAM to hold a model, its optimizer states, and gradients simultaneously. A full fine-tune of a 70B parameter model needs eight H100 GPUs running for days. On AWS, a single run can cost $3,000-5,000. For most teams, that makes experimentation prohibitive — one shot, and if the hyperparameters are wrong or the dataset has quality issues, the money is gone.
That cost equation has changed. Parameter-efficient methods like LoRA and QLoRA have cut GPU requirements by 10-100x. And decentralized GPU networks like io.net offer the same NVIDIA hardware — H100s, A100s, RTX 4090s — at 50-70% below hyperscaler pricing, with clusters that deploy in under two minutes.
This guide covers everything you need: fine-tuning methods explained, GPU requirements by model size, a complete step-by-step tutorial for fine-tuning Llama 3 8B with LoRA on io.net, distributed training with Ray clusters, real cost comparisons across providers, and best practices that prevent wasted compute.
Fine-Tuning Methods Explained
The fine-tuning method you choose determines your GPU requirements, training time, cost, and output quality. Understanding the tradeoffs before provisioning hardware saves both money and frustration.
Full Fine-Tuning
Full fine-tuning updates every parameter in the model. For a 7B parameter model, all 7 billion weights are modified during training. This gives maximum control over model behavior and typically produces the highest quality results.
The cost is steep. During training, the GPU must hold the model weights (14GB for 7B in fp16), optimizer states (2-3x the model size for AdamW), gradients (equal to model size), and activations. A 7B model therefore requires roughly 60-80GB of total GPU memory for full fine-tuning. That leaves no headroom on a single 80GB card, so plan for at least two A100 80GB GPUs. For 70B models, the requirement jumps to 8x H100 80GB or more.
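The memory breakdown above can be turned into a quick estimate. This is a minimal sketch using only the multipliers stated in this section (fp16 weights, gradients equal to the weights, optimizer states at 2-3x the weights); the function name is illustrative, and activations are left out because they depend on batch size and sequence length.

```python
def full_finetune_memory_gb(params_billions, bytes_per_param=2,
                            optimizer_multiplier=(2.0, 3.0)):
    """Return (low, high) GB for weights + gradients + optimizer states.

    Activations are excluded; they depend on batch size and sequence length.
    """
    weights_gb = params_billions * bytes_per_param  # 1e9 params * bytes / 1e9
    gradients_gb = weights_gb                       # same size as the weights
    low = weights_gb + gradients_gb + optimizer_multiplier[0] * weights_gb
    high = weights_gb + gradients_gb + optimizer_multiplier[1] * weights_gb
    return low, high


for size in (7, 70):
    low, high = full_finetune_memory_gb(size)
    print(f"{size}B full fine-tune, before activations: {low:.0f}-{high:.0f} GB")
```

For 7B this gives 56-70GB before activations, which is how the 60-80GB total arises once activations are added.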
Best for: Teams with large, high-quality datasets (50K+ examples), production models serving millions of users, and situations where maximum performance justifies the compute investment.
LoRA (Low-Rank Adaptation)
LoRA is the most impactful fine-tuning innovation in recent years. Instead of updating all parameters, it freezes the base model and injects small trainable matrices into each transformer layer. These adapter weights typically represent only 0.1-1% of total model parameters.
The insight behind LoRA is that the weight updates during fine-tuning have low intrinsic rank. You can approximate those updates with two small matrices (a rank decomposition) without meaningful quality loss:
Original weight matrix W: 4096 x 4096 = 16.7M parameters
LoRA decomposition: A (4096 x 16) + B (16 x 4096) = 131K parameters
Reduction: 99.2% fewer trainable parameters
In practice, LoRA achieves 90-98% of full fine-tuning quality at 10-100x lower cost. A 7B model fine-tuned with LoRA requires only 16-24GB of GPU memory — a single RTX 4090 or A100 handles it comfortably.
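The parameter arithmetic in the decomposition above is easy to check. A minimal sketch, assuming a single square 4096 x 4096 projection and rank 16 as in the example; the helper name is illustrative.

```python
def lora_param_counts(d_in, d_out, rank):
    """Compare a full weight matrix with its rank-`rank` adapter pair."""
    full = d_in * d_out              # parameters in the frozen matrix W
    adapter = rank * (d_in + d_out)  # A is (d_in x rank), B is (rank x d_out)
    return full, adapter


full, adapter = lora_param_counts(4096, 4096, 16)
reduction = 1 - adapter / full
print(f"{full:,} -> {adapter:,} trainable ({reduction:.1%} fewer)")
```

Doubling the rank doubles the adapter size, so even rank 64 stays a small fraction of the original matrix.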
Best for: Most fine-tuning tasks. This is the default recommendation unless you have a specific reason to use full fine-tuning. Works well with datasets of 1K-50K examples.
QLoRA (Quantized LoRA)
QLoRA combines LoRA with 4-bit quantization of the base model. The frozen weights are compressed to 4-bit NormalFloat precision, while the LoRA adapters train in 16-bit. This cuts the memory needed for the base model weights by roughly 4x compared to standard LoRA in fp16.
A 7B model that needs 14GB in fp16 occupies only about 4GB in 4-bit quantization. This means you can fine-tune a 7B model on a consumer GPU with 12GB VRAM, or a 70B model on a single A100 80GB — tasks that would otherwise require multi-GPU setups.
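The weight footprint follows directly from bits per parameter. A rough sketch, assuming raw weight storage only; quantization constants, adapters, and activations add to these figures, which is why the practical number for a 7B model is closer to 4GB than 3.5GB.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Raw storage for model weights, ignoring quantization metadata."""
    return params_billions * bits_per_param / 8


for size in (7, 70):
    fp16 = weight_memory_gb(size, 16)
    four_bit = weight_memory_gb(size, 4)
    print(f"{size}B: fp16 {fp16:.1f} GB, 4-bit {four_bit:.1f} GB")
```

A 70B model at 4 bits comes to about 35GB of weights, which is what leaves room for adapters and activations on a single 80GB card.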
The quality tradeoff versus standard LoRA is minimal, typically within 1-2% on benchmarks.
Best for: When GPU VRAM is your constraint. QLoRA is the most cost-effective way to fine-tune large models (30B-70B) without multi-GPU clusters.
Instruction Tuning vs. Domain Adaptation
These are objectives, not methods. You can achieve either one through full fine-tuning, LoRA, or QLoRA.
Instruction tuning teaches the model to follow instructions in a specific format. You train on prompt-response pairs where the input is a user instruction and the output is the desired response. This is how base models become chat models — Llama 3 Base becomes Llama 3 Instruct through instruction tuning.
Domain adaptation teaches the model the vocabulary, concepts, and reasoning patterns of a specific field — legal, medical, financial, or your company's internal knowledge. The training data is domain-specific text, and the goal is fluency in your field rather than changing how the model follows instructions.
Many production fine-tunes combine both: first adapt the model to your domain's knowledge, then instruction-tune it for your specific task format and output style.
GPU Requirements by Method
Choosing the right GPU prevents two expensive mistakes: overprovisioning (paying for hardware you don't need) and underprovisioning (running out of memory mid-training and losing hours of compute). This table maps method, model size, and GPU requirements.
| Method | Model Size | Minimum GPU | Recommended GPU | Estimated Time |
|---|---|---|---|---|
| Full fine-tune | 7B | 2x A100 80GB | 4x A100 80GB | ~24 hours |
| Full fine-tune | 13B | 4x A100 80GB | 4x H100 80GB | ~36 hours |
| Full fine-tune | 70B | 8x H100 80GB | 16x H100 80GB | ~1 week |
| LoRA | 7B | 1x RTX 4090 (24GB) | 1x A100 80GB | ~4 hours |
| LoRA | 13B | 1x A100 40GB | 1x A100 80GB | ~6 hours |
| LoRA | 70B | 2x A100 80GB | 4x A100 80GB | ~18 hours |
| QLoRA | 7B | 1x RTX 4090 (24GB) | 1x RTX 4090 (24GB) | ~4 hours |
| QLoRA | 13B | 1x RTX 4090 (24GB) | 1x A100 40GB | ~6 hours |
| QLoRA | 70B | 1x A100 80GB | 1x A100 80GB | ~8 hours |
How to read this table: "Minimum GPU" is the smallest configuration that completes the job without out-of-memory errors. "Recommended GPU" adds headroom for larger batch sizes, faster convergence, and less risk of OOM failures. Times assume a dataset of 10K-50K examples with standard hyperparameters.
What this means for cost on io.net:
- QLoRA 7B on an RTX 4090 ($0.40-0.80/hr): $1.60-3.20 total
- LoRA 7B on an A100 80GB ($1.20-2.00/hr): $4.80-8.00 total
- QLoRA 70B on an A100 80GB ($1.20-2.00/hr): $9.60-16.00 total
- Full fine-tune 70B on 8x H100 ($2.10-3.50/hr each): $2,822-4,704 total
The QLoRA path makes fine-tuning accessible at any budget. A complete 7B fine-tune for under $5 is not a promotional claim — it's straightforward math.
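The totals above are hours multiplied by the hourly rate and the GPU count. A small sketch that reproduces them, using the rates and durations quoted in this section; the function and scenario names are illustrative.

```python
def run_cost(hours, rate_low, rate_high, gpus=1):
    """Return the (low, high) total cost in dollars for one training run."""
    return hours * gpus * rate_low, hours * gpus * rate_high


scenarios = {
    "QLoRA 7B, 1x RTX 4090": run_cost(4, 0.40, 0.80),
    "LoRA 7B, 1x A100 80GB": run_cost(4, 1.20, 2.00),
    "QLoRA 70B, 1x A100 80GB": run_cost(8, 1.20, 2.00),
    "Full 70B, 8x H100, 1 week": run_cost(7 * 24, 2.10, 3.50, gpus=8),
}
for name, (low, high) in scenarios.items():
    print(f"{name}: ${low:,.2f}-${high:,.2f}")
```

Swapping in your own rate and expected duration gives a budget estimate before you provision anything.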
Step-by-Step: Fine-Tune Llama 3 8B with LoRA on io.net
This tutorial walks through fine-tuning Meta's Llama 3 8B model using LoRA on an A100 80GB GPU deployed through io.net. By the end, you'll have a custom model adapted to your dataset, merged, and ready for deployment.
Step 1: Deploy an A100 on io.cloud
- Log in to cloud.io.net.
- Navigate to Deploy and select GPU Cloud.
- Choose A100 80GB from the GPU catalog ($1.20-2.00/hr). For a budget option, an RTX 4090 ($0.40-0.80/hr) works with QLoRA.
- Select a container image with CUDA 12.x and PyTorch pre-installed, or start from a bare Ubuntu image.
- Click Deploy. The cluster will be ready in under 2 minutes.
- Connect via SSH or the web terminal.
[IMAGE: io.net GPU deployment interface showing A100 80GB selection and cluster configuration]
Step 2: Install Dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.44.0 \
peft==0.12.0 \
datasets==2.20.0 \
accelerate==0.33.0 \
bitsandbytes==0.43.0 \
trl==0.9.0 \
wandb
pip install flash-attn --no-build-isolation
| Package | Purpose |
|---|---|
| transformers | Model loading, tokenization, Trainer API |
| peft | LoRA/QLoRA adapter implementation |
| bitsandbytes | 4-bit quantization for QLoRA |
| trl | SFTTrainer for supervised fine-tuning |
| datasets | Dataset loading and preprocessing |
| accelerate | Multi-GPU and mixed-precision orchestration |
| flash-attn | FlashAttention 2 for faster, memory-efficient training |
| wandb | Experiment tracking (optional but recommended) |
Verify your GPU is detected:
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
Expected output (the reported VRAM figure will differ somewhat from the nominal 80 GB):
CUDA available: True
GPU: NVIDIA A100-SXM4-80GB
VRAM: 80.0 GB
Step 3: Prepare Your Dataset
Fine-tuning datasets need a consistent structure. The Alpaca format is one of the most widely used: each example has an instruction, optional input, and output.
from datasets import Dataset

# Example: domain-specific dataset in Alpaca format
data = [
    {
        "instruction": "Explain how io.net GPU pricing works.",
        "input": "",
        "output": "io.net uses a decentralized marketplace model where GPU providers compete on price. On-demand A100 80GB instances typically cost $1.20-2.00/hr, which is 60-70% less than equivalent instances on AWS or GCP. There are no egress fees, no reserved instance commitments, and billing is per-minute."
    },
    {
        "instruction": "What GPU should I use for fine-tuning a 7B parameter model?",
        "input": "I have a budget of $20 and a dataset of 10,000 examples.",
        "output": "Use QLoRA on a single RTX 4090 (24GB VRAM). Training will take approximately 3-4 hours. On io.net, RTX 4090 instances cost $0.40-0.80/hr, so your total compute cost will be $1.60-3.20. You could run multiple experimental iterations with different hyperparameters well within your $20 budget."
    },
    # ... add 1,000-10,000 more examples
]
dataset = Dataset.from_list(data)

# Format into the prompt template. The text inside the triple-quoted strings
# stays flush left so no stray indentation ends up in the prompts.
def format_prompt(example):
    if example["input"]:
        return {"text": f"""### Instruction:
{example["instruction"]}

### Input:
{example["input"]}

### Response:
{example["output"]}"""}
    else:
        return {"text": f"""### Instruction:
{example["instruction"]}

### Response:
{example["output"]}"""}

dataset = dataset.map(format_prompt)

# Train/validation split
dataset = dataset.train_test_split(test_size=0.1, seed=42)
print(f"Train: {len(dataset['train'])} | Validation: {len(dataset['test'])}")
Data quality is the single biggest determinant of fine-tuning success. Before training:
- Manually review at least 100 examples for accuracy, formatting, and consistency
- Remove duplicates — they cause memorization instead of generalization
- Balance your dataset across categories so the model doesn't overfit to one topic
- Match the output format exactly to what you want in production
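Deduplication from the checklist above can be automated before training. A minimal sketch, assuming Alpaca-style dictionaries and treating examples as duplicates when their fields match after lowercasing and whitespace normalization; the helper name and sample data are hypothetical, and near-duplicate detection (paraphrases) would need a different technique.

```python
import hashlib


def dedupe_examples(examples):
    """Drop examples whose normalized instruction/input/output repeat."""
    seen, unique = set(), []
    for ex in examples:
        # Normalize case and whitespace so trivial variants count as duplicates.
        key_text = "\x1f".join(
            " ".join(ex.get(field, "").lower().split())
            for field in ("instruction", "input", "output")
        )
        digest = hashlib.sha256(key_text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(ex)
    return unique


examples = [
    {"instruction": "Define LoRA.", "input": "", "output": "Low-Rank Adaptation."},
    {"instruction": "define  lora.", "input": "", "output": "low-rank adaptation."},
    {"instruction": "Define QLoRA.", "input": "", "output": "Quantized LoRA."},
]
print(len(dedupe_examples(examples)))  # 2
```

The first occurrence of each example is kept, so the original ordering of the dataset is preserved.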
Step 4: Configure LoRA
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,                          # Rank of update matrices (8-64)
    lora_alpha=32,                 # Scaling factor (common: alpha = 2 * r)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    lora_dropout=0.05,             # Regularization (0.05-0.1)
    bias="none",                   # Don't train bias parameters
    task_type=TaskType.CAUSAL_LM,  # Causal language modeling
)
Parameter guidance:
- r (rank): Controls adapter capacity. r=16 is the standard starting point. Increase to 32-64 for complex tasks or if quality is insufficient. Lower to 8 for simple format changes.
- lora_alpha: Scaling factor for adapter influence. The ratio alpha / r determines how strongly adapters affect the output. Setting alpha = 2 * r is a reliable default.
- target_modules: Which layers receive LoRA adapters. Targeting all attention and MLP projections (as shown) gives the best results for Llama-architecture models. Targeting only attention layers is faster but slightly lower quality.
- lora_dropout: 0.05 works for datasets with 2K+ examples. Increase to 0.1 for smaller datasets. Set to 0 for very large datasets (50K+).
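To see how rank and target modules translate into adapter size, the trainable parameter count can be worked out by hand. This sketch assumes the Llama 3 8B shapes (32 layers, hidden size 4096, MLP size 14336, and 1024-wide key/value projections from grouped-query attention); those dimensions are not stated in this guide, so verify them against the model's config before relying on the numbers.

```python
# Assumed Llama 3 8B shapes; check the model's config.json to confirm.
LAYERS, HIDDEN, MLP, KV = 32, 4096, 14336, 1024

# (input dim, output dim) of each projection that can receive an adapter
MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN), "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV), "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP), "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_trainable_params(rank, target_modules):
    """Each adapted projection adds rank * (d_in + d_out) parameters per layer."""
    per_layer = sum(rank * sum(MODULE_SHAPES[m]) for m in target_modules)
    return LAYERS * per_layer


all_modules = list(MODULE_SHAPES)
attention_only = ["q_proj", "k_proj", "v_proj", "o_proj"]
for rank in (8, 16, 64):
    print(f"r={rank}: all modules {lora_trainable_params(rank, all_modules):,}, "
          f"attention only {lora_trainable_params(rank, attention_only):,}")
print("scaling factor alpha / r:", 32 / 16)
```

With r=16 on all seven projections this yields 41,943,040 trainable parameters, the same figure shown in the print_trainable_parameters comment in Step 5. Restricting adapters to the attention projections cuts that to roughly a third.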
Step 5: Run Training
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

# ── Model ────────────────────────────────────────────────────────────
model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# 4-bit quantization config (QLoRA). For standard LoRA, remove this block,
# the quantization_config argument, and the prepare_model_for_kbit_training call.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.52%

# ── Training Configuration ───────────────────────────────────────────
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # Effective batch size: 16
    learning_rate=2e-4,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    eval_strategy="steps",
    eval_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    bf16=True,
    gradient_checkpointing=True,     # ~40% VRAM savings, ~20% speed cost
    optim="paged_adamw_32bit",       # Memory-efficient optimizer
    max_grad_norm=0.3,
    save_total_limit=3,
    report_to="wandb",               # Remove if not tracking experiments
)

# ── Trainer ──────────────────────────────────────────────────────────
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    args=training_args,
    dataset_text_field="text",
    max_seq_length=2048,
    packing=True,                    # Pack short examples for efficiency
)

# ── Train ────────────────────────────────────────────────────────────
trainer.train()

# Save adapter weights
trainer.save_model("./final-adapter")
tokenizer.save_pretrained("./final-adapter")
print("Training complete. Adapter saved to ./final-adapter")
What to expect: Training the 8B model with LoRA on 10K examples takes 2-4 hours on a single A100 80GB. Monitor training loss; it should decrease steadily. If it plateaus early, your learning rate may be too low. If it oscillates, it's too high. Validation loss should track training loss; if it diverges upward, you're overfitting.
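It helps to know roughly how many optimizer steps a run will take, since the save, eval, and warmup settings are all expressed in steps. The sketch below is an approximation under stated assumptions: it ignores packing (which reduces the number of training sequences), and the Trainer's own rounding can differ by a few steps. The function name is illustrative.

```python
import math


def training_steps(num_examples, per_device_batch, grad_accum, epochs,
                   num_gpus=1, warmup_ratio=0.03):
    """Approximate optimizer-step counts for a run, ignoring packing."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return effective_batch, steps_per_epoch, total_steps, warmup_steps


# 10K examples with a 10% validation holdout leaves 9,000 for training.
print(training_steps(9_000, per_device_batch=4, grad_accum=4, epochs=3))
# (16, 563, 1689, 51)
```

With checkpoints and evaluations every 100 steps, a run of this size produces about 16 evaluation points, enough to see the shape of the validation curve.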
Step 6: Merge Weights and Export
After training, merge the LoRA adapter back into the base model to produce a standalone model that can be deployed without the PEFT library.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load base model in full precision for merging
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load and merge adapter
model = PeftModel.from_pretrained(base_model, "./final-adapter")
merged_model = model.merge_and_unload()

# Save merged model
merged_model.save_pretrained("./llama3-8b-finetuned")
tokenizer = AutoTokenizer.from_pretrained("./final-adapter")
tokenizer.save_pretrained("./llama3-8b-finetuned")
print("Merged model saved to ./llama3-8b-finetuned")
Quick validation test:
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="./llama3-8b-finetuned",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
result = pipe(
    "### Instruction:\nExplain how GPU cloud pricing works.\n\n### Response:\n",
    max_new_tokens=256, temperature=0.7, do_sample=True,
)
print(result[0]["generated_text"])
Step 7: Deploy to Production
Option A: Serve with vLLM on io.net
vLLM is the fastest open-source inference engine for LLMs. Deploy it directly on your io.net GPU instance for production-grade serving with an OpenAI-compatible API.
pip install vllm
python -m vllm.entrypoints.openai.api_server \
--model ./llama3-8b-finetuned \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 4096 \
--dtype bfloat16
Query the endpoint:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "./llama3-8b-finetuned",
"prompt": "### Instruction:\nSummarize Q1 performance.\n\n### Response:\n",
"max_tokens": 256,
"temperature": 0.7
}'
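The same request can be sent from application code. A minimal sketch using only the Python standard library, assuming the server from the previous command is reachable at localhost:8000 and returns the OpenAI-style completions format (the generated text under choices[0].text); the helper names are illustrative.

```python
import json
import urllib.request


def build_prompt(instruction):
    """Recreate the instruction-only prompt format used in the curl example."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


def complete(instruction, base_url="http://localhost:8000",
             model="./llama3-8b-finetuned", max_tokens=256, temperature=0.7):
    """Send one completion request and return the generated text."""
    payload = {
        "model": model,
        "prompt": build_prompt(instruction),
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, calling complete("Summarize Q1 performance.") sends the same request as the curl command. Keeping the prompt builder in one place helps ensure inference prompts match the template the model was trained on.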
Option B: Upload to Hugging Face Hub
huggingface-cli upload your-org/llama3-8b-finetuned ./llama3-8b-finetuned
Once on the Hub, the model can be pulled into any inference framework — vLLM, TGI, Ollama — or deployed through io.net's managed inference platform.

Advanced: Distributed Fine-Tuning with Ray Clusters
For larger models (30B+) or when you need faster iteration cycles, distributed training across multiple GPUs is essential. io.net supports Ray Clusters for distributed workloads, enabling multi-GPU training with minimal code changes.
Multi-GPU LoRA with DeepSpeed ZeRO Stage 3
DeepSpeed ZeRO Stage 3 partitions model parameters, gradients, and optimizer states across GPUs. Combined with LoRA, this lets you fine-tune models that wouldn't fit on any single GPU while achieving near-linear scaling.
Create ds_config.json:
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "offload_param": { "device": "none" },
    "offload_optimizer": { "device": "none" },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
Launch multi-GPU training:
accelerate launch --num_processes 4 \
--use_deepspeed \
--deepspeed_config_file ds_config.json \
train.py
Ray Cluster Setup on io.net
io.net's Ray Cluster integration handles distributed training orchestration. Deploy a cluster through the io.cloud interface:
- Select Ray Cluster as your deployment type in io.cloud.
- Choose your GPU configuration (e.g., 4x H100 80GB or 8x A100 80GB).
- The cluster deploys in under 2 minutes with Ray, CUDA, and PyTorch pre-configured.
Submit a distributed training job:
import ray
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

ray.init(address="auto")

def train_func():
    """Executed on each worker GPU."""
    import torch
    from transformers import AutoModelForCausalLM, TrainingArguments
    from peft import LoraConfig, get_peft_model
    from trl import SFTTrainer

    # Same configuration as single-GPU example
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B",
        torch_dtype=torch.bfloat16,
    )
    lora_config = LoraConfig(
        r=16, lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
    )
    model = get_peft_model(model, lora_config)
    training_args = TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        bf16=True,
        deepspeed="./ds_config.json",
    )
    # Ray handles data sharding and gradient synchronization
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        # ... remaining arguments (datasets, tokenizer) as in the single-GPU example
    )
    trainer.train()

scaling_config = ScalingConfig(
    num_workers=4,                    # Number of GPU workers
    use_gpu=True,
    resources_per_worker={"GPU": 1},
)
trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=scaling_config,
)
result = trainer.fit()
Scaling impact: A LoRA fine-tune of a 70B model that takes roughly 18 hours in the configuration from the GPU requirements table drops to roughly 5-6 hours on a 4-GPU Ray cluster. With 8 GPUs, it completes in under 3 hours. io.net's DePIN architecture means these multi-GPU clusters are available across 130+ countries, with no waitlists and no capacity reservations.
Cost Breakdown: Fine-Tuning on io.net vs. Alternatives
The same training job costs dramatically different amounts depending on your GPU cloud provider. Here is a direct comparison across four common scenarios, using published pricing as of early 2026.
Scenario 1: QLoRA 7B Model (1x RTX 4090, 4 hours)
| Provider | GPU Rate | Total Cost | vs. io.net |
|---|---|---|---|
| io.net | $0.40-0.80/hr | $1.60-3.20 | -- |
| RunPod | $0.44/hr | $1.76 | comparable |
| Vast.ai | $0.25/hr | $1.00 | -47% |
The RTX 4090 is not available on any hyperscaler. For this budget tier, io.net and RunPod are comparable, while Vast.ai offers lower spot rates at the cost of less consistent availability and fewer cluster-level features.
Scenario 2: LoRA 7B Model (1x A100 80GB, 4 hours)
| Provider | GPU Rate | Total Cost | vs. io.net |
|---|---|---|---|
| io.net | $1.20-2.00/hr | $4.80-8.00 | -- |
| Lambda | $1.29/hr | $5.16 | comparable |
| RunPod | $1.64/hr | $6.56 | +23% |
| CoreWeave | $2.06/hr | $8.24 | +42% |
| AWS (p4d) | $5.12/hr* | $20.48 | +200% |
| GCP (a2) | $3.67/hr | $14.68 | +130% |
*AWS p4d instances include 8 A100s, so the per-GPU rate is derived from the instance price. You pay for all 8 GPUs even if you need only 1, which makes the actual bill for a single-GPU job roughly 8x the per-GPU total shown.
Scenario 3: QLoRA 70B Model (1x A100 80GB, 8 hours)
| Provider | GPU Rate | 8-Hour Cost | vs. io.net |
|---|---|---|---|
| io.net | $1.20-2.00/hr | $9.60-16.00 | -- |
| Lambda | $1.29/hr | $10.32 | comparable |
| RunPod | $1.64/hr | $13.12 | +23% |
| AWS (p4d) | $5.12/hr* | $40.96 | +200% |
Scenario 4: Full Fine-Tune 70B Model (8x H100, 7 days)
| Provider | GPU Rate (each) | Total Cost | vs. io.net |
|---|---|---|---|
| io.net | $2.10-3.50/hr | $2,822-4,704 | -- |
| Lambda | $2.49/hr | $3,346 | +6% |
| RunPod | $2.69/hr | $3,614 | +15% |
| CoreWeave | $2.99/hr | $4,019 | +22% |
| AWS (p5) | $6.88/hr | $9,249 | +150% |
At the 70B full fine-tuning scale, provider choice determines whether a training run costs $3,000 or $9,000. A team running 5 iterative training runs to tune hyperparameters and dataset composition would spend $46,000+ on AWS versus $14,000-23,500 on io.net — a savings of $23,000-32,000 per experiment cycle.
Why io.net Is Structurally Cheaper
io.net's pricing advantage is not promotional. It is structural. As a decentralized GPU network (DePIN) with 320,000+ GPUs across 130+ countries, io.net aggregates supply from data centers, enterprises with idle capacity, and GPU mining operations. Marketplace competition among providers drives pricing toward marginal cost.
Additional cost advantages:
- No egress fees. Downloading your fine-tuned model costs $0. On AWS, exporting a 16GB model incurs $1.44 in egress charges.
- Per-minute billing. If training finishes in 4 hours and 12 minutes, you pay for 4:12. Not 5 hours.
- No reserved commitments. Scale to 8 GPUs for one job, then back to zero. No contracts, no minimums.
- Sub-2-minute deployment. Clusters deploy through Kubernetes, Containers, VMs, or Bare Metal in under 2 minutes.
Best Practices for Fine-Tuning LLMs
GPU compute is only part of the equation. These practices determine whether your fine-tuning run produces a useful model or wastes your budget.
Data Quality Over Quantity
The single biggest factor in fine-tuning quality is your training data. 1,000 carefully curated examples consistently outperform 50,000 noisy ones.
- Manually review 100+ examples before starting training. If the data has formatting issues, factual errors, or inconsistent quality, the model will learn those patterns.
- Deduplicate aggressively. Duplicate examples cause memorization, not generalization.
- Balance your dataset across categories. If 80% of examples cover one topic, the model will be heavily biased toward it.
- Match your production format exactly. If you want JSON output, every training example should have JSON output.
Learning Rate Selection
The learning rate is the most impactful hyperparameter. Start with these defaults and adjust based on loss curves:
- LoRA/QLoRA: 2e-4. If loss doesn't decrease after 50 steps, increase to 5e-4. If it oscillates, decrease to 1e-4.
- Full fine-tuning: 2e-5 (10x lower). Full fine-tuning is more sensitive because all parameters update simultaneously.
- Always use cosine scheduling with 3-5% warmup to prevent early instability and support smooth convergence.
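The shape of a cosine schedule with warmup is easier to reason about with numbers. This is a standalone sketch of the common formulation (linear warmup from zero to the peak rate, then cosine decay to zero); it is an illustration, not the exact scheduler code used by the training library.

```python
import math


def cosine_lr(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Learning rate at `step` under linear warmup plus cosine decay."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup from 0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # decay to 0


total = 1000
for step in (0, 15, 30, 515, 1000):
    print(step, f"{cosine_lr(step, total):.2e}")
```

With 1,000 total steps and 3% warmup, the rate climbs to 2e-4 by step 30, falls to half of that at the midpoint of the decay (step 515), and reaches zero at the final step.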
Evaluation Sets and Early Stopping
Never train blind. Hold out 10% of your data as a validation set and monitor validation loss during training.
# In TrainingArguments:
eval_strategy="steps",
eval_steps=100,
load_best_model_at_end=True,
metric_for_best_model="eval_loss",
greater_is_better=False,
save_total_limit=3,
If validation loss increases for 3+ consecutive evaluations while training loss keeps falling, the model is overfitting. Stop training and use the checkpoint with the lowest validation loss.
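The stopping rule above is simple to express in code. A minimal sketch that checks only the validation side (it does not look at training loss), with hypothetical eval_loss values; the transformers library also ships an EarlyStoppingCallback with an early_stopping_patience parameter, which stops when the tracked metric fails to improve rather than strictly when it rises.

```python
def should_stop(val_losses, patience=3):
    """True once validation loss has risen `patience` evaluations in a row."""
    if len(val_losses) <= patience:
        return False
    recent = val_losses[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


history = [1.92, 1.41, 1.18, 1.10, 1.12, 1.15, 1.19]  # hypothetical eval_loss values
best_index = min(range(len(history)), key=history.__getitem__)
print(should_stop(history))  # True: three consecutive increases
print(best_index)            # 3: the evaluation with the lowest loss
```

The best index identifies which checkpoint to keep, which is the same selection load_best_model_at_end performs automatically.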
Checkpoint Frequently
GPU instances can fail. Network connections drop. Save checkpoints every 100-200 steps so you can resume without starting over.
# In TrainingArguments:
save_strategy="steps",
save_steps=100,
# When restarting a run, resume from the latest checkpoint in output_dir:
trainer.train(resume_from_checkpoint=True)
On io.net, attach persistent storage to your instance before training. Checkpoints survive across sessions and instance restarts.
Frequently Asked Questions
How much VRAM do I need to fine-tune a 7B parameter model?
It depends on the method. With QLoRA (4-bit quantization + LoRA), you need approximately 10-14GB total — a single RTX 4090 (24GB) handles this with headroom. With standard LoRA in fp16, you need 16-24GB. Full fine-tuning requires 60-80GB, meaning at least 2x A100 80GB GPUs. QLoRA is the recommended starting point because it offers the best cost-to-quality ratio.
Is LoRA fine-tuning as good as full fine-tuning?
For most use cases, yes. Research and practical benchmarks show LoRA achieves 90-98% of full fine-tuning quality. The gap narrows with higher rank values (r=32-64) and when targeting more transformer layers. Full fine-tuning has a meaningful advantage only with very large datasets (100K+ examples) and when you need to fundamentally alter the model's behavior rather than specialize it.
How long does it take to fine-tune Llama 3 8B?
On a single A100 80GB with LoRA and 10K examples: 2-4 hours. With QLoRA on an RTX 4090: 3-5 hours (slightly slower due to quantization overhead). With a 4-GPU distributed setup: under 1 hour. Training time scales roughly linearly with dataset size — 50K examples takes approximately 5x longer than 10K.
Can I fine-tune without writing code?
Platforms like Hugging Face AutoTrain, Together AI, and OpenAI's fine-tuning API offer simplified interfaces. However, they provide limited control over hyperparameters, data preprocessing, and model architecture. For production-quality fine-tuning where you need to iterate on configuration, writing code gives you the necessary control. The code in this guide is production-ready and can be adapted to any dataset with minimal changes.
What is the minimum dataset size for effective fine-tuning?
For instruction tuning (teaching the model a specific task format), 500-1,000 high-quality examples can produce useful results. For domain adaptation (teaching new knowledge), 5,000-50,000 examples is typical. Quality always trumps quantity — 500 carefully curated, expert-written examples will outperform 10,000 noisy or machine-generated ones.
How do I evaluate whether my fine-tuned model is good enough?
Define evaluation criteria before training, not after. Create a test set of 50-100 examples representing real production use cases. Score the model's outputs on accuracy, format compliance, and relevance. Compare against the base model and, if possible, against GPT-4 or Claude outputs on the same inputs. If your fine-tuned 8B model matches 90%+ of GPT-4 quality on your specific task, you've succeeded — and you'll serve it at a fraction of the cost.
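Format compliance is the easiest of these criteria to score automatically. A small sketch, assuming the production format is a JSON object with a required key; the function name, key, and sample outputs are hypothetical, and accuracy and relevance still need human or model-based review.

```python
import json


def format_compliance(outputs, required_keys=()):
    """Fraction of outputs that parse as JSON objects containing required_keys."""
    if not outputs:
        return 0.0
    passed = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON at all
        if isinstance(parsed, dict) and all(k in parsed for k in required_keys):
            passed += 1
    return passed / len(outputs)


outputs = ['{"summary": "Revenue grew."}', "Revenue grew.", '{"note": "n/a"}']
print(f"{format_compliance(outputs, required_keys=('summary',)):.0%}")  # 33%
```

Running the same check on the base model's outputs gives a baseline, so you can see how much of the improvement comes from fine-tuning.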
Conclusion
Fine-tuning is the bridge between a general-purpose LLM and a model that works for your specific use case. The method you choose — LoRA for most tasks, QLoRA when VRAM-constrained, full fine-tuning for maximum quality — determines your GPU requirements and cost. But the most important variable is data quality, regardless of compute budget.
The tooling has matured to the point where a single developer with a clear dataset can fine-tune a 7B model in a few hours for under $10 in compute. For larger models, distributed training with Ray clusters scales near-linearly across GPUs.
io.net makes the compute side of this equation accessible. A100 80GB GPUs start at $1.20/hr. H100 SXM GPUs start at $2.10/hr. RTX 4090s start at $0.40/hr. Clusters deploy in under 2 minutes across thousands of GPUs in 130+ countries. There are no egress fees, no reserved commitments, and no minimum spend. Per-minute billing means you pay only for the compute you actually use.
The tutorial above is a complete, runnable workflow. Point it at your data, deploy a GPU on io.net, and start training.