Meta's Llama 3 family has become a default choice among open-weight large language models for production AI applications. But running Llama 3 -- especially the 70B and 405B parameter variants -- demands serious GPU horsepower. Traditional cloud providers can charge $12 or more per hour for a single H100, and availability is often gated behind waitlists and long-term contracts.
This guide walks you through three ways to deploy Llama 3 on a GPU cloud using io.net, a decentralized compute platform with 300,000+ GPUs across 130+ countries. You will learn how to pick the right deployment path for your workload, select the correct GPU configuration for each model size, and get an inference endpoint running in minutes -- at up to 70% lower cost than centralized providers.
Whether you need a quick API call for prototyping or a multi-node cluster serving the full 405B model in production, this guide covers every option with working code examples.
What Is Llama 3 and Why Deploy It on a GPU Cloud?
Llama 3 is Meta's family of open-weight large language models, released under Meta's community license, which permits commercial use subject to its terms. The family spans three primary sizes:
| Model | Parameters | Context Length | Typical Use Cases |
|---|---|---|---|
| Llama 3 8B | 8 billion | 8,192 tokens | Chatbots, classification, lightweight tasks |
| Llama 3 70B | 70 billion | 8,192 tokens | Complex reasoning, code generation, enterprise apps |
| Llama 3.1 405B | 405 billion | 128K tokens | Research, multilingual, state-of-the-art open-source performance |
Running these models locally is impractical for most teams. The 8B model needs at least 16 GB of VRAM. The 70B model requires 140+ GB. The 405B model demands a cluster of eight or more enterprise GPUs with fast interconnects. Cloud GPU infrastructure solves this: you rent exactly the hardware you need, pay by the hour, and skip a six-figure capital outlay on server hardware.
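The VRAM figures above follow from simple arithmetic: weight memory is roughly the parameter count multiplied by the bytes stored per parameter. The sketch below illustrates that calculation; the function name `weight_memory_gb` is illustrative, and the estimate covers model weights only, not the KV cache or activations.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # (billions of parameters) x (bytes per parameter) = gigabytes
    return params_billions * bytes_per_param


for name, params in [("Llama 3 8B", 8), ("Llama 3 70B", 70), ("Llama 3.1 405B", 405)]:
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: FP16 ~{fp16:.0f} GB, INT4 ~{int4:.0f} GB")
```

Real deployments need headroom on top of these numbers for the KV cache, activations, and quantization metadata, so treat the result as a lower bound.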
Why io.net for Llama 3 Deployment
io.net offers several advantages when you need to deploy Llama 3 on a GPU cloud:
- No waitlists. Access H100, A100, and RTX 4090 GPUs on demand. No approval gates.
- Cost savings. H100 instances start at $2.19/hr versus $12.29/hr on AWS -- a 70%+ reduction.
- Fast provisioning. Clusters deploy in under 2 minutes.
- Multiple deployment modes. Choose from a managed inference API (io.intelligence), containers, Ray clusters, or bare metal.
- No lock-in. No long-term contracts. Scale up or down hourly. Pay with credit card or Solana wallet.
- Global reach. 130+ countries, so you can place compute close to your users.
Prerequisites
Before deploying, gather three things:
1. An io.net Account
Sign up at io.net and complete verification. io.net supports payment via credit card or Solana wallet. There is no approval process or waitlist -- you can deploy immediately after signup. The io.intelligence API includes free daily limits per model, so you can start without entering payment details.
2. A Hugging Face Account (Options 2 and 3 Only)
If you plan to self-host Llama 3 rather than using the managed API, you need access to the model weights. Visit huggingface.co/meta-llama, accept Meta's license agreement, and request access. Approval is typically granted within a few hours.
Generate a Hugging Face access token at huggingface.co/settings/tokens. You will use this token to download weights during deployment.
3. GPU Selection by Model Size
Choosing the right GPU configuration is critical. Under-provisioning means the model will not load. Over-provisioning means paying for hardware you do not need.
| Model | Precision | VRAM Required | Recommended GPU on io.net | Estimated Cost |
|---|---|---|---|---|
| Llama 3 8B | FP16 | ~16 GB | 1x RTX 4090 (24 GB) | $0.40-$0.80/hr |
| Llama 3 8B | INT4 (AWQ) | ~6 GB | 1x RTX 4090 (24 GB) | $0.40-$0.80/hr |
| Llama 3 70B | FP16 | ~140 GB | 2x A100 80 GB | $2.40-$4.00/hr |
| Llama 3 70B | INT4 (AWQ) | ~36 GB | 1x A100 80 GB | $1.20-$2.00/hr |
| Llama 3.1 405B | FP16 | ~810 GB | 16x H100 80 GB (two 8-GPU nodes) | ~$36-$56/hr |
| Llama 3.1 405B | FP8 | ~405 GB | 8x H100 80 GB | ~$18-$28/hr |
Tip: For development and testing, start with quantized models on smaller GPUs. Move to full-precision on larger hardware only when you need maximum quality in production.
Option 1: Use the io.intelligence API (Fastest -- Zero Setup)
If you need Llama 3 inference right now and do not want to manage any infrastructure, io.intelligence is the fastest path. io.net hosts 25+ open-source models -- including Llama 3.1, Llama 3.2, and Llama 3.3 -- behind a fully OpenAI-compatible API. You can go from zero to working inference in under two minutes.
When to Use This Option
- Prototyping and validating prompts before committing to self-hosted infrastructure
- Low-to-moderate volume production inference
- No custom or fine-tuned weights needed
- You want a drop-in replacement for the OpenAI API
Step 1: Get Your API Key
- Navigate to io.net/intelligence.
- Sign in with your io.net account.
- Generate a new API key from Settings > API Keys.
Step 2: Make Your First API Call
The io.intelligence API is fully OpenAI-compatible. If you have existing code that calls the OpenAI API, you only need to change two lines: the base URL and the API key.
Using cURL:
curl https://api.intelligence.io.solutions/api/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $IOINTELLIGENCE_API_KEY" \
-d '{
"model": "meta-llama/Llama-3.3-70B-Instruct",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Explain the difference between GPU and CPU for AI workloads in 3 sentences."
}
],
"temperature": 0.7,
"max_tokens": 256
}'
Using the OpenAI Python SDK (drop-in replacement):
from openai import OpenAI
client = OpenAI(
base_url="https://api.intelligence.io.solutions/api/v1",
api_key="your-io-intelligence-api-key",
)
completion = client.chat.completions.create(
model="meta-llama/Llama-3.3-70B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is decentralized GPU computing?"},
],
temperature=0.7,
max_tokens=512,
)
print(completion.choices[0].message.content)
Using raw HTTP requests:
import requests
API_KEY = "your-io-intelligence-api-key"
BASE_URL = "https://api.intelligence.io.solutions/api/v1"
response = requests.post(
f"{BASE_URL}/chat/completions",
headers={
"Content-Type": "application/json",
"Authorization": f"Bearer {API_KEY}",
},
json={
"model": "meta-llama/Llama-3.3-70B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain tensor parallelism in three sentences."},
],
"temperature": 0.7,
"max_tokens": 256,
},
timeout=60,
)
response.raise_for_status()  # surface HTTP 4xx/5xx errors before parsing the body
result = response.json()
print(result["choices"][0]["message"]["content"])
Step 3: Enable Streaming for Chat Applications
For real-time UIs, enable streaming to receive tokens as they are generated:
from openai import OpenAI
client = OpenAI(
base_url="https://api.intelligence.io.solutions/api/v1",
api_key="your-io-intelligence-api-key",
)
stream = client.chat.completions.create(
model="meta-llama/Llama-3.3-70B-Instruct",
messages=[
{"role": "user", "content": "Write a Python function to calculate Fibonacci numbers."}
],
stream=True,
max_tokens=1024,
)
for chunk in stream:
    # Some chunks carry no choices or no text delta; skip those.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Available Llama Models on io.intelligence
io.intelligence hosts multiple Llama variants, including Llama 3.1, 3.2, and 3.3 models. Query the models endpoint for the current list:
curl https://api.intelligence.io.solutions/api/v1/models \
-H "Authorization: Bearer $IOINTELLIGENCE_API_KEY"
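Once you have the JSON response, picking out the Llama variants is a short filter. This sketch assumes the endpoint returns the OpenAI-style list shape (`{"data": [{"id": ...}, ...]}`); the `sample` payload is made up for illustration and is not actual API output.

```python
def llama_model_ids(models_response: dict) -> list[str]:
    """Return the sorted model IDs whose name contains 'llama' (case-insensitive)."""
    ids = [model["id"] for model in models_response.get("data", [])]
    return sorted(model_id for model_id in ids if "llama" in model_id.lower())


# Hypothetical payload in the assumed OpenAI-compatible shape.
sample = {
    "object": "list",
    "data": [
        {"id": "meta-llama/Llama-3.3-70B-Instruct"},
        {"id": "example-org/other-model"},
    ],
}
print(llama_model_ids(sample))
```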
The free tier includes daily usage limits that vary by model -- sufficient for development and testing. No credit card is required to get started.
Framework compatibility: Because io.intelligence exposes an OpenAI-compatible API, any framework that supports the OpenAI format works out of the box. This includes LangChain, LlamaIndex, Haystack, Semantic Kernel, and others. Change the base URL and API key -- everything else stays the same.
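Since only the base URL and API key differ between providers, it can help to keep both in the environment rather than in code. The sketch below is one way to do that. `IOINTELLIGENCE_API_KEY` matches the variable used in the cURL example, while `LLM_BASE_URL` and the helper name `openai_client_kwargs` are hypothetical names chosen for illustration.

```python
import os

IO_INTELLIGENCE_BASE_URL = "https://api.intelligence.io.solutions/api/v1"


def openai_client_kwargs(env=None) -> dict:
    """Build the two settings an OpenAI-compatible client needs."""
    env = os.environ if env is None else env
    api_key = env.get("IOINTELLIGENCE_API_KEY")
    if not api_key:
        raise RuntimeError("Set IOINTELLIGENCE_API_KEY before creating the client")
    return {
        # LLM_BASE_URL is a hypothetical override, e.g. for a self-hosted server.
        "base_url": env.get("LLM_BASE_URL", IO_INTELLIGENCE_BASE_URL),
        "api_key": api_key,
    }


# Demo with an explicit mapping instead of the real environment.
print(openai_client_kwargs({"IOINTELLIGENCE_API_KEY": "demo-key"})["base_url"])
```

With the `openai` package installed, the result can be unpacked directly: `client = OpenAI(**openai_client_kwargs())`.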
Option 2: Deploy with vLLM on io.cloud (Full Control)
When you need full control over your Llama 3 deployment -- custom fine-tuned weights, specific quantization settings, dedicated GPU resources, or guaranteed throughput -- deploy your own vLLM inference server on io.cloud.
vLLM is a widely used open-source inference engine for serving large language models. It delivers up to 24x higher throughput than naive Hugging Face Transformers serving, thanks to PagedAttention, continuous batching, and optimized CUDA kernels.
When to Use This Option
- Serving custom or fine-tuned Llama 3 weights (LoRA, merged checkpoints)
- Dedicated GPU resources with predictable latency
- High-throughput production workloads where per-token API pricing is uneconomical
- Data privacy requirements -- prompts and responses never leave your container
Step 1: Select Your GPU and Deploy a Container on io.cloud
- Log in to io.net/cloud.
- Click Deploy and select Container (CaaS) as the deployment type.
- Select GPU hardware:
  - For Llama 3 8B: Choose 1x NVIDIA RTX 4090
  - For Llama 3 70B (quantized): Choose 1x NVIDIA A100 80 GB
  - For Llama 3 70B (FP16): Choose 2x NVIDIA A100 80 GB
- Choose a region. Select the geographic region closest to your users for the lowest latency. io.net supports 130+ countries.
- Set connectivity tier. Choose "High Speed" or "Ultra High Speed" (1600 MB/s) for faster model weight downloads.
- Configure the container. Expose port 8000 (vLLM's default). Use a CUDA-enabled base image (e.g., nvidia/cuda:12.4.0-devel-ubuntu22.04) or a pre-built PyTorch image.
- Set duration. Choose hourly, daily, or weekly billing.
- Review and deploy. Confirm the summary (GPU type, quantity, duration, total cost) and click Deploy Cluster.
Your container will be ready in under 2 minutes.
Step 2: Connect to Your Container
Once the cluster is deployed, access connection details through your io.net dashboard under Monitor & Manage Clusters. io.net provides SSH access and endpoint information for your running container.
Note: Check your io.net dashboard for the exact SSH command or web terminal URL specific to your deployment.
Step 3: Install vLLM and Download the Model
Connect to your container and run:
# Install vLLM
pip install vllm
# Verify GPU is detected
python3 -c "import torch; print(f'GPUs: {torch.cuda.device_count()}, Device: {torch.cuda.get_device_name(0)}')"
# Authenticate with Hugging Face (required for gated Llama models)
pip install huggingface_hub
huggingface-cli login
# Paste your Hugging Face token when prompted
vLLM downloads model weights automatically from Hugging Face on first run. For faster cold starts, pre-download:
# Optional: Pre-download the model weights
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
--local-dir /models/llama-3-8b
Alternatively, use the official vLLM Docker image, which comes pre-configured:
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
--env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
vllm/vllm-openai:latest \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--max-model-len 8192 \
--dtype auto
Step 4: Start the vLLM Server
For Llama 3 8B on a single RTX 4090:
python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--dtype auto
For Llama 3 70B (AWQ quantized) on a single A100 80 GB:
python3 -m vllm.entrypoints.openai.api_server \
--model casperhansen/llama-3-70b-instruct-awq \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 4096 \
--gpu-memory-utilization 0.92 \
--quantization awq \
--dtype float16
For Llama 3 70B (FP16) on 2x A100 80 GB with tensor parallelism:
python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.90 \
--dtype auto
Once you see Uvicorn running on http://0.0.0.0:8000, your inference server is live. The first run downloads model weights (10-30 minutes depending on model size and network speed). Subsequent restarts use cached weights.
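Because the first start can take a long time, a deployment script may want to block until the server answers instead of watching the logs. The following is a rough sketch that polls the OpenAI-compatible `/v1/models` route. The helper names, timeout, and polling interval are illustrative choices, not values from io.net or vLLM.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(probe, timeout_s=1800.0, interval_s=10.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def http_probe(url: str):
    """Return a probe that reports True when `url` answers with HTTP 200."""
    def probe() -> bool:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            # Connection refused or timed out: the server is not up yet.
            return False
    return probe


# Example (replace the placeholder host before running):
# wait_until_ready(http_probe("http://YOUR_CONTAINER_IP:8000/v1/models"))
```

Passing the probe, sleep, and clock as parameters keeps the polling loop easy to test without a running server.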
Step 5: Test Your Inference Endpoint
curl http://YOUR_CONTAINER_IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"messages": [
{"role": "user", "content": "What are the benefits of decentralized GPU clouds?"}
],
"temperature": 0.7,
"max_tokens": 256
}'
Or from Python using the same OpenAI SDK pattern:
from openai import OpenAI
client = OpenAI(
base_url="http://YOUR_CONTAINER_IP:8000/v1",
api_key="not-needed" # vLLM does not require auth by default
)
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3-8B-Instruct",
messages=[
{"role": "user", "content": "What are the benefits of decentralized GPU clouds?"}
],
max_tokens=256,
)
print(response.choices[0].message.content)
Replace YOUR_CONTAINER_IP with the public IP or hostname shown in your io.net dashboard.

Option 3: Deploy with a Ray Cluster (Distributed / Large Models)
For the Llama 3.1 405B model -- or any configuration where a single node does not have enough VRAM -- you need distributed inference across multiple GPUs. io.net's Ray cluster deployment handles the orchestration automatically.
When to Use This Option
- Serving Llama 3.1 405B (requires 8+ H100 GPUs)
- Running Llama 3 70B at full FP16 precision across 2+ GPUs
- Multi-node tensor parallelism for maximum throughput
- Training or fine-tuning workloads
- Complex ML pipelines with multiple stages
Step 1: Create a Ray Cluster on io.cloud
- Log in to io.net/cloud.
- Click Deploy and select Ray from the cluster type menu.
- Select cluster type:
  - Inference -- for production-ready, low-latency serving.
  - Train -- for fine-tuning or training workloads.
  - General -- for prototyping and end-to-end experimentation.
- Name your cluster using the pencil icon.
- Configure security. Select E2E Encrypted for secure data transfer between GPUs.
- Select location. For distributed inference, choose a single region to minimize inter-node latency.
- Set connectivity tier. Choose Ultra High Speed (1600 MB/s) -- fast interconnects are critical for multi-GPU tensor parallelism.
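- Select GPU hardware. Choose NVIDIA H100 80 GB. The 405B weights take ~810 GB in FP16/BF16, which exceeds the 640 GB available on 8 GPUs, so select 16 GPUs (two 8-GPU nodes) for full precision. For FP8 (~405 GB of weights), select 8 GPUs.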
- Set duration and deploy. Review total cost, confirm, and click Deploy Cluster.
io.net pre-selects reliable, security-vetted master nodes and automatically injects the required environment variables (MASTER_ADDR, MASTER_PORT, RANK) for distributed communication.
Step 2: Configure vLLM with Tensor Parallelism
Once your Ray cluster is running, connect via the dashboard and install vLLM:
pip install "vllm[ray]"
Start the inference server with tensor parallelism across all available GPUs:
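For Llama 3.1 405B (BF16) on 16x H100 (two 8-GPU nodes):
# ~810 GB of BF16 weights exceeds the 640 GB on 8x H100 80 GB, so shard
# within each node (tensor parallel) and across the two nodes (pipeline parallel).
python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-405B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2 \
--max-model-len 8192 \
--gpu-memory-utilization 0.92 \
--dtype bfloat16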
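For Llama 3.1 405B (FP8 quantized) on 8x H100:
# ~405 GB of FP8 weights exceeds the 320 GB on 4x H100 80 GB, so use 8 GPUs.
# vLLM normally reads the quantization scheme from the checkpoint config,
# so no --quantization flag is passed here.
python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 8 \
--max-model-len 8192 \
--dtype auto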
vLLM uses Ray for distributed orchestration and NCCL for inter-GPU communication automatically. No manual Ray cluster configuration is required.
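When choosing a value for --tensor-parallel-size, note that vLLM generally expects the model's attention-head count to divide evenly by the tensor-parallel size. The sketch below lists the sizes that satisfy that constraint. The head counts are taken from the published Llama 3 model configs, and the helper name `valid_tp_sizes` is illustrative.

```python
# Attention-head counts from the published Llama 3 / 3.1 model configs.
ATTENTION_HEADS = {"8B": 32, "70B": 64, "405B": 128}


def valid_tp_sizes(num_heads: int, gpus_available: int) -> list[int]:
    """Tensor-parallel sizes (up to gpus_available) that split the heads evenly."""
    return [tp for tp in range(1, gpus_available + 1) if num_heads % tp == 0]


print(valid_tp_sizes(ATTENTION_HEADS["70B"], 8))
```

In practice this means powers of two (1, 2, 4, 8) are the safe choices for all three model sizes.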
Step 3: Test the Distributed Endpoint
curl http://<cluster-endpoint>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
"messages": [
{"role": "user", "content": "Summarize the key innovations in transformer architecture since 2017."}
],
"max_tokens": 512,
"temperature": 0.7
}'
Replace <cluster-endpoint> with the public IP or hostname from your io.net dashboard.
Performance Tuning Tips
Once your Llama 3 deployment is running, these optimizations can improve throughput by 2-5x and reduce per-request costs.
Quantization: Fit Larger Models on Smaller GPUs
Quantization reduces model precision from FP16 to INT8 or INT4, cutting memory requirements by 50-75% with minimal quality degradation.
| Quantization | Memory Savings | Quality Impact | Best For |
|---|---|---|---|
| FP16 (default) | Baseline | Full quality | Production benchmarks |
| FP8 | ~50% | Negligible | H100 production workloads |
| AWQ (INT4) | ~75% | Minor for most tasks | Fitting 70B on a single A100 |
| GPTQ (INT4) | ~75% | Minor for most tasks | Cost-optimized inference |
Deploy a quantized model with vLLM:
python3 -m vllm.entrypoints.openai.api_server \
--model casperhansen/llama-3-70b-instruct-awq \
--quantization awq \
--max-model-len 4096 \
--gpu-memory-utilization 0.92
KV Cache Optimization
vLLM's PagedAttention manages KV cache memory efficiently by default. Fine-tune with these flags:
- --gpu-memory-utilization 0.90 -- Controls how much GPU memory vLLM reserves. Increase to 0.95 for maximum model capacity; decrease if you encounter OOM errors during bursts.
- --max-model-len -- Set to the maximum sequence length you actually need. Shorter context windows reduce KV cache memory, allowing more concurrent requests.
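To see how these two settings interact, it helps to estimate the KV cache footprint per token. The sketch below uses the standard formula (keys plus values, per layer, per KV head) with the Llama 3 8B architecture (32 layers, 8 KV heads, head dimension 128). The function names are illustrative, and the budget calculation ignores activations and runtime overhead, so actual capacity reported by vLLM will be lower.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies (factor 2 = keys and values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(gpu_gb: float, utilization: float,
                      weights_gb: float, per_token_bytes: int) -> int:
    """Rough upper bound on tokens that fit in the memory left after weights."""
    budget_bytes = (gpu_gb * utilization - weights_gb) * 1e9
    return max(0, int(budget_bytes // per_token_bytes))


# Llama 3 8B in FP16 on a 24 GB GPU with --gpu-memory-utilization 0.90.
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)  # bytes per token
print(max_cached_tokens(24, 0.90, 16, per_token))
```

Dividing the token budget by --max-model-len gives a rough ceiling on how many full-length sequences can be cached at once, which is why a shorter context allows more concurrency.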
Continuous Batching
vLLM enables continuous batching by default. Unlike static batching, it does not wait for the slowest request to finish before starting new ones. This alone can improve throughput by 2-5x over naive serving.
For high-throughput production workloads, tune these parameters:
--max-num-seqs 256 # Maximum concurrent sequences
--max-num-batched-tokens 32768 # Maximum tokens per batch iteration
--enable-prefix-caching # Cache shared system prompts
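The difference between static and continuous batching can be illustrated with a toy scheduler. This is a simplified model, not vLLM's actual scheduler: each request needs a fixed number of decode steps, a batch has a fixed number of slots, and prefill cost is ignored. The function names are illustrative.

```python
import heapq


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes, then the next starts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A slot is refilled as soon as its request finishes."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    slots = [0] * batch_size  # step at which each slot becomes free
    heapq.heapify(slots)
    finish = 0
    for length in lengths:
        start = heapq.heappop(slots)  # earliest free slot
        end = start + length
        finish = max(finish, end)
        heapq.heappush(slots, end)
    return finish


requests = [100, 10, 100, 10]  # decode steps per request
print(static_batching_steps(requests, batch_size=2))
print(continuous_batching_steps(requests, batch_size=2))
```

The gap grows when request lengths are highly uneven, since static batching leaves slots idle while waiting for the longest request in each batch.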
Speculative Decoding
For 70B and 405B models, speculative decoding uses a smaller "draft" model to propose tokens, which the larger model verifies in parallel:
python3 -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-70B-Instruct \
--speculative-model meta-llama/Meta-Llama-3-8B-Instruct \
--num-speculative-tokens 5 \
--tensor-parallel-size 2
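This can improve generation speed by 1.5-2x for certain workloads without affecting output quality. The flag names for speculative decoding have changed across vLLM releases (newer versions take a single --speculative-config option), so check the help output of your installed version.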
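How much speculative decoding helps depends on how often the draft model's tokens are accepted. Under the simplifying assumption that each draft token is accepted independently with the same probability, the expected number of tokens produced per verification pass has a closed form. The sketch below computes it; the acceptance rate of 0.8 is a made-up example, not a measured value.

```python
def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int) -> float:
    """Expected tokens emitted per target-model pass, assuming i.i.d. acceptance.

    With acceptance rate a and k draft tokens: (1 - a**(k + 1)) / (1 - a).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    k = num_speculative_tokens
    if acceptance_rate == 1.0:
        return float(k + 1)  # every draft token accepted, plus one bonus token
    return (1 - acceptance_rate ** (k + 1)) / (1 - acceptance_rate)


# Hypothetical acceptance rate with --num-speculative-tokens 5.
print(round(expected_tokens_per_step(0.8, 5), 3))
```

The wall-clock speedup is smaller than this figure because running the draft model also takes time.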
Cost Comparison: io.net vs. Traditional Cloud Providers
One of the primary reasons to deploy Llama 3 on io.net's GPU cloud is cost. Here is a side-by-side comparison for common configurations:
| Configuration | io.net Cost/hr | AWS Cost/hr | Monthly Savings (24/7) |
|---|---|---|---|
| Llama 3 8B on 1x RTX 4090 | ~$0.50 | ~$1.50* | ~$720/mo |
| Llama 3 70B on 1x A100 80 GB | ~$1.50 | ~$5.12 | ~$2,606/mo |
| Llama 3 70B on 2x A100 80 GB | ~$3.00 | ~$10.24 | ~$5,213/mo |
| Llama 3.1 405B (FP8) on 8x H100 | ~$20.00 | ~$98.32 | ~$56,390/mo |
* AWS does not offer standalone RTX 4090 instances; comparable pricing estimated from similar-tier offerings.
Note: Prices are approximate and vary by region, availability, and billing tier. Check io.net/cloud for current pricing. io.net requires no long-term commitments -- scale to zero when idle.
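The monthly savings column is the hourly price difference multiplied by 720 hours (24 hours x 30 days). A short sketch of that arithmetic, using the approximate hourly rates from the table (the function name is illustrative):

```python
HOURS_PER_MONTH = 24 * 30  # 720 hours of continuous use


def monthly_savings(io_net_per_hr: float, other_per_hr: float,
                    hours: int = HOURS_PER_MONTH) -> float:
    """Savings from running 24/7 at the lower hourly rate."""
    return (other_per_hr - io_net_per_hr) * hours


# Approximate hourly rates (io.net, AWS) taken from the table above.
rates = {
    "8B on 1x RTX 4090": (0.50, 1.50),
    "70B on 1x A100 80 GB": (1.50, 5.12),
    "70B on 2x A100 80 GB": (3.00, 10.24),
    "405B on 8x H100": (20.00, 98.32),
}
for config, (io_rate, aws_rate) in rates.items():
    print(f"{config}: ~${monthly_savings(io_rate, aws_rate):,.0f}/mo")
```

If your workload runs only part of the day, pass the actual number of billed hours instead of the 720-hour default.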
API vs. Self-Hosted Break-Even
Using the io.intelligence managed API is simpler but costs more at high volumes. The break-even depends on request volume:
- Under 1M tokens/day -- managed API is typically more cost-effective (no idle GPU time).
- 1-10M tokens/day -- self-hosted on a single GPU starts to win.
- Over 10M tokens/day -- self-hosted is significantly cheaper; consider multi-GPU setups.
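To locate the break-even point for your own workload, compare the daily cost of a dedicated GPU with the per-token API price. The sketch below shows the calculation. The API price of $4.00 per million tokens is a hypothetical placeholder, not a published io.intelligence rate, and the model assumes the GPU runs 24/7 and can actually serve the resulting volume.

```python
def break_even_tokens_per_day(gpu_cost_per_hr: float,
                              api_price_per_million_tokens: float) -> float:
    """Daily token volume at which a 24/7 GPU costs the same as the API."""
    daily_gpu_cost = gpu_cost_per_hr * 24
    return daily_gpu_cost / api_price_per_million_tokens * 1_000_000


# $1.50/hr GPU versus a HYPOTHETICAL API price of $4.00 per million tokens.
tokens = break_even_tokens_per_day(1.50, 4.00)
print(f"Break-even at ~{tokens / 1e6:.1f}M tokens/day")
```

Above the break-even volume self-hosting is cheaper per token; below it, the managed API avoids paying for idle GPU time.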
Frequently Asked Questions
How long does it take to deploy Llama 3 on io.net?
Using io.intelligence (Option 1), you can start making API calls within 2 minutes of creating your account. For self-hosted deployments (Options 2 and 3), the GPU cluster provisions in under 2 minutes, but downloading model weights adds 10-30 minutes on first deployment depending on model size. The 8B model downloads quickly; the 405B model is approximately 800 GB and takes longer.
Can I use my own fine-tuned Llama 3 model on io.net?
Yes. Options 2 and 3 give you full control over which model weights you load. Upload your fine-tuned checkpoint to Hugging Face (or any accessible storage), then point vLLM's --model flag at your custom model path. The io.intelligence API (Option 1) only supports the standard pre-trained variants hosted by io.net.
What is the difference between io.intelligence and io.cloud?
io.intelligence is a managed inference API -- io.net hosts and serves the model, and you call it via an OpenAI-compatible HTTP endpoint. You pay per usage and never touch a GPU. io.cloud gives you raw GPU compute -- containers, Ray clusters, VMs, or bare metal -- where you deploy and manage your own software stack. Use io.intelligence for simplicity and fast iteration; use io.cloud for full control, custom models, and cost optimization at scale.
Can I serve Llama 3 70B on a single GPU?
Yes, with quantization. An AWQ or GPTQ quantized (INT4) version of Llama 3 70B requires approximately 36 GB of VRAM, which fits on a single A100 80 GB GPU. Full FP16 precision requires approximately 140 GB of VRAM, so you would need 2x A100 80 GB or 2x H100. For most production applications, the quality difference between AWQ-quantized and full-precision inference is minimal.
Is io.net compatible with LangChain, LlamaIndex, and other AI frameworks?
Yes. io.intelligence exposes a fully OpenAI-compatible API, so any framework that supports the OpenAI API format works without modification. This includes LangChain, LlamaIndex, Haystack, Semantic Kernel, AutoGen, and CrewAI. Simply configure the base URL to https://api.intelligence.io.solutions/api/v1 and provide your io.net API key.
Do I need to worry about GPU availability?
io.net aggregates 300,000+ GPUs across 130+ countries, so availability is generally better than with centralized providers. There are no waitlists or approval gates. Specific GPU types (like H100) may still have variable availability depending on region and demand. Selecting multiple regions during deployment increases your chances of instant provisioning.
Conclusion
Deploying Llama 3 on a GPU cloud no longer requires deep infrastructure expertise or a massive budget. With io.net, you have three clear paths matched to different stages of development:
- io.intelligence API -- Start here. Two lines of code, zero infrastructure, free daily limits. Ideal for prototyping and moderate-volume production inference.
- vLLM on io.cloud -- Move here when you need custom models, guaranteed throughput, or data privacy. A100 80 GB GPUs at $1.20-$2.00/hr make self-hosting financially viable even for smaller teams.
- Ray clusters on io.cloud -- Scale here for 405B models or high-throughput production systems requiring tensor parallelism across multiple nodes.
All three options run on io.net's global network of 300,000+ verified GPUs at up to 70% lower cost than AWS or GCP, with no waitlists and no long-term contracts.
Ready to deploy Llama 3? Sign up for io.net and launch your first cluster in under two minutes. Or start even faster with io.intelligence -- make your first Llama 3 API call within the free daily limits.