io.net provides real-time GPU monitoring through a web dashboard, CLI commands, and Prometheus/Grafana integration. Track GPU utilization, memory usage, temperature, power consumption, and cost per instance with per-second granularity. All metrics are available via REST API for custom monitoring solutions and alerting.
Monitoring is available as soon as an instance starts, with no additional configuration. The dashboard displays live metrics for all running instances, while the CLI offers terminal-based monitoring and scripting support.
Web Dashboard Monitoring
Access: https://cloud.io.net/dashboard
Available Metrics:
- GPU utilization (0-100%)
- VRAM usage (current / total GB)
- GPU temperature (°C)
- Power consumption (W / max W)
- PCIe bandwidth utilization
- Cost accumulation (real-time)
- Instance uptime
Features:
- Real-time updates (1-second refresh)
- Historical graphs (1 hour, 24 hours, 7 days)
- Multi-instance comparison view
- Export metrics to CSV
- Set custom alerts (utilization thresholds, cost limits)
CLI Monitoring
Basic GPU Stats:
# View current GPU utilization
io gpu-stats my-instance
# Output:
# GPU | Util | Memory | Temp | Power | Cost/hr
# ---- | ---- | ----------- | ---- | ------ | -------
# 0 | 92% | 38GB / 80GB | 72°C | 320W | $1.10
# 1 | 89% | 42GB / 80GB | 70°C | 310W | $1.10
Real-Time Monitoring (nvidia-smi):
# SSH into instance and run nvidia-smi
io exec my-instance -- nvidia-smi --loop=1
# Continuous GPU monitoring every 1 second
# Shows: utilization, memory, temperature, processes
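For scripting, nvidia-smi's CSV query mode is usually easier to parse than the default table. The sketch below assumes you have captured the output of `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw --format=csv,noheader,nounits` (for example through `io exec`). The sample text and the `parse_gpu_csv` helper are illustrative, not part of io.net's tooling.

```python
import csv
import io

# Field order must match the --query-gpu list passed to nvidia-smi.
FIELDS = ["index", "util_percent", "mem_used_mib", "mem_total_mib", "temp_c", "power_w"]

def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        values = [float(v) for v in record]
        rows.append(dict(zip(FIELDS, values)))
    return rows

# Illustrative sample, not captured from a real instance.
SAMPLE = """0, 92, 39117, 81920, 72, 320.15
1, 89, 43008, 81920, 70, 310.40
"""

gpus = parse_gpu_csv(SAMPLE)
for gpu in gpus:
    mem_pct = 100 * gpu["mem_used_mib"] / gpu["mem_total_mib"]
    print(f"GPU {int(gpu['index'])}: {gpu['util_percent']:.0f}% util, {mem_pct:.1f}% memory")
```

Some GPUs report `[N/A]` for fields such as power draw. This sketch does not handle that case and would raise a `ValueError` on such a value.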
Detailed Metrics:
# Get JSON metrics for scripting
io metrics my-instance --format json
# Output:
{
  "gpu_utilization_percent": 92,
  "memory_used_gb": 38.2,
  "memory_total_gb": 80.0,
  "temperature_c": 72,
  "power_watts": 320,
  "power_limit_watts": 400,
  "cost_per_hour": 1.10,
  "uptime_seconds": 3620
}
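A script consuming this JSON will often want derived values rather than raw fields. The following is a minimal sketch that uses the sample payload above. The `summarize` helper is hypothetical, and the cost calculation assumes billing is linear in uptime at the hourly rate, which the source does not confirm.

```python
import json

# Sample payload copied from the `io metrics --format json` output above.
RAW = """
{
  "gpu_utilization_percent": 92,
  "memory_used_gb": 38.2,
  "memory_total_gb": 80.0,
  "temperature_c": 72,
  "power_watts": 320,
  "power_limit_watts": 400,
  "cost_per_hour": 1.10,
  "uptime_seconds": 3620
}
"""

def summarize(metrics):
    """Derive percentages and an estimated accumulated cost from the raw fields."""
    mem_pct = 100 * metrics["memory_used_gb"] / metrics["memory_total_gb"]
    power_pct = 100 * metrics["power_watts"] / metrics["power_limit_watts"]
    # Assumption: cost accrues linearly with uptime at the hourly rate.
    cost_so_far = metrics["cost_per_hour"] * metrics["uptime_seconds"] / 3600
    return {
        "memory_percent": mem_pct,
        "power_percent": power_pct,
        "cost_so_far": cost_so_far,
    }

summary = summarize(json.loads(RAW))
print(f"Memory in use: {summary['memory_percent']:.1f}%")
print(f"Power vs. limit: {summary['power_percent']:.0f}%")
print(f"Estimated cost so far: ${summary['cost_so_far']:.2f}")
```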
Prometheus Integration
Metrics Endpoint:
# io.net exposes Prometheus-compatible metrics
curl https://metrics.ionet.cloud/instance/my-instance/metrics
# Sample output:
gpu_utilization_percent{instance="my-instance",gpu="0"} 92
gpu_memory_used_bytes{instance="my-instance",gpu="0"} 40960000000
gpu_temperature_celsius{instance="my-instance",gpu="0"} 72
gpu_power_watts{instance="my-instance",gpu="0"} 320
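If you want to read these values without running a Prometheus server, the text format is simple enough to parse directly. This is a minimal sketch that handles only lines shaped like the sample above (a metric name, a label set, and a numeric value). The `parse_metrics` helper is hypothetical, and a full exposition-format parser from a Prometheus client library would be the more robust choice.

```python
import re

# Matches lines like: name{label="value",...} 123
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from Prometheus text output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        labels = dict(LABEL_RE.findall(match["labels"]))
        samples.append((match["name"], labels, float(match["value"])))
    return samples

# Sample output copied from the endpoint example above.
SAMPLE = '''gpu_utilization_percent{instance="my-instance",gpu="0"} 92
gpu_memory_used_bytes{instance="my-instance",gpu="0"} 40960000000
gpu_temperature_celsius{instance="my-instance",gpu="0"} 72
gpu_power_watts{instance="my-instance",gpu="0"} 320
'''

samples = parse_metrics(SAMPLE)
for name, labels, value in samples:
    print(f"{name} gpu={labels['gpu']} -> {value}")
```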
Prometheus Configuration:
# prometheus.yml
scrape_configs:
  - job_name: 'ionet-gpus'
    scheme: https  # the endpoint above is served over HTTPS; Prometheus defaults to http
    static_configs:
      - targets: ['metrics.ionet.cloud']
    metrics_path: '/instance/my-instance/metrics'
    scrape_interval: 10s
Grafana Dashboard
Pre-Built Dashboard:
# Import io.net Grafana dashboard
# Dashboard ID: 18426 (Grafana.com)
# Or deploy Grafana with io.net dashboard
io deploy --image grafana/grafana:latest \
  --port 3000 \
  --env GF_INSTALL_PLUGINS=ionet-datasource \
  --name monitoring
# Access: https://xxx.ionet.cloud:3000
# Default credentials: admin / admin
Dashboard Panels:
1. GPU Utilization (multi-instance)
2. Memory Usage Timeline
3. Temperature Heatmap
4. Power Consumption
5. Cost Tracker (per instance, total)
6. Process List (what's running on each GPU)
Custom Monitoring Script
Python Script:
# monitor.py
import os
import time

import requests

# Read the API token from the environment (IONET_API_TOKEN is a placeholder name)
API_TOKEN = os.environ["IONET_API_TOKEN"]

def get_metrics(instance_name):
    """Fetch the latest metrics for one instance."""
    url = f"https://api.io.net/v1/instances/{instance_name}/metrics"
    headers = {"Authorization": f"Bearer {API_TOKEN}"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()

# Continuous monitoring: poll every 5 seconds
while True:
    metrics = get_metrics("my-instance")
    print(f"GPU Utilization: {metrics['gpu_utilization_percent']}%")
    print(f"VRAM: {metrics['memory_used_gb']:.1f}GB / {metrics['memory_total_gb']}GB")
    print(f"Temp: {metrics['temperature_c']}°C")
    print(f"Cost: ${metrics['cost_accumulated']:.2f}")
    print("---")
    time.sleep(5)
Alerting
Set Utilization Alert:
# Alert if GPU utilization < 50% for 10 minutes (underutilization)
io alert create \
  --instance my-instance \
  --metric gpu_utilization_percent \
  --condition "< 50" \
  --duration 10m \
  --action email --recipient YOUR_EMAIL_ADDRESS
# Alert if cost exceeds budget
io alert create \
  --instance my-instance \
  --metric cost_accumulated \
  --condition "> 100" \
  --action webhook --url https://slack.com/api/webhook/xxx
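To make the `--condition` plus `--duration` semantics concrete, here is a client-side sketch of a sustained-threshold check: it fires only once the value has stayed below the limit for the full duration. This illustrates the idea and is not io.net's implementation. The `SustainedThreshold` class and the simulated readings are hypothetical.

```python
class SustainedThreshold:
    """Fires once a value has stayed below `limit` for at least `duration_s` seconds."""

    def __init__(self, limit, duration_s):
        self.limit = limit
        self.duration_s = duration_s
        self.below_since = None  # timestamp of the first sample in the current low run

    def update(self, timestamp, value):
        """Record one sample; return True if the alert condition now holds."""
        if value < self.limit:
            if self.below_since is None:
                self.below_since = timestamp
            return timestamp - self.below_since >= self.duration_s
        # Any sample at or above the limit resets the timer.
        self.below_since = None
        return False

# Simulated readings: (seconds, utilization %). One busy sample, then a long idle run.
alert = SustainedThreshold(limit=50, duration_s=600)
readings = [(0, 80)] + [(t, 30) for t in range(60, 781, 60)]
fired_at = next((t for t, v in readings if alert.update(t, v)), None)
print(f"Alert fired at t={fired_at}s")
```

Resetting the timer on any sample at or above the limit means brief dips in utilization do not trigger the alert.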
Slack Integration:
# Send GPU metrics to Slack
io alert create \
  --instance my-instance \
  --metric temperature_c \
  --condition "> 80" \
  --action slack \
  --webhook-url https://hooks.slack.com/services/YOUR/WEBHOOK/URL
Cost Tracking
Real-Time Cost:
# View current cost accumulation
io cost my-instance
# Output:
# Instance: my-instance
# GPU: A100 (80GB)
# Uptime: 6h 23m
# Rate: $1.10/hour
# Cost so far: $7.02
# Projected 24h cost: $26.40
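The figures in this output follow from simple arithmetic: cost is the hourly rate multiplied by uptime in hours. The sketch below reproduces them under the assumption that billing is linear with no minimum charge or rounding step. The helper names are illustrative.

```python
import re

# Hours represented by each unit suffix used in the CLI's uptime strings.
UNIT_HOURS = {"d": 24.0, "h": 1.0, "m": 1.0 / 60.0}

def uptime_to_hours(text):
    """Convert strings such as '6h 23m' or '3d 2h' to a number of hours."""
    return sum(int(n) * UNIT_HOURS[u] for n, u in re.findall(r"(\d+)([dhm])", text))

def cost_so_far(rate_per_hour, uptime):
    """Accumulated cost, assuming linear billing at the hourly rate."""
    return rate_per_hour * uptime_to_hours(uptime)

def projected_cost(rate_per_hour, hours=24):
    """Cost of running for `hours` at the hourly rate."""
    return rate_per_hour * hours

print(f"Cost so far: ${cost_so_far(1.10, '6h 23m'):.2f}")
print(f"Projected 24h cost: ${projected_cost(1.10):.2f}")
```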
Cost Breakdown (Multiple Instances):
# All instances
io cost --all
# Output:
# Instance | GPU Type | Uptime | Rate/hr | Cost
# ----------------- | -------- | ------ | ------- | ------
# training-job | 8x A100 | 12h | $8.80 | $105.60
# inference-api | RTX 4090 | 3d 2h | $0.18 | $13.32
# dev-notebook | A100 | 45m | $1.10 | $0.83
# ----------------- | -------- | ------ | ------- | ------
# Total $119.75
Budget Alerts:
# Auto-stop instance when budget exceeded
io alert create \
  --instance training-job \
  --metric cost_accumulated \
  --condition "> 200" \
  --action stop-instance
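Before relying on an auto-stop threshold, it helps to estimate how much runtime the budget still allows. This sketch uses the training-job figures from the `io cost --all` table ($105.60 spent at $8.80/hour) against the $200 limit, and assumes the rate stays constant. `hours_until_budget` is a hypothetical helper.

```python
def hours_until_budget(budget, spent, rate_per_hour):
    """Hours of runtime left before accumulated cost reaches the budget."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    # Clamp at zero when the budget has already been exceeded.
    return max(0.0, (budget - spent) / rate_per_hour)

remaining = hours_until_budget(budget=200.0, spent=105.60, rate_per_hour=8.80)
print(f"About {remaining:.1f} hours until the $200 budget is reached")
```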
Process Monitoring
See what's running on GPU:
# List GPU processes
io exec my-instance -- nvidia-smi pmon
# Output:
# gpu pid type sm mem enc dec command
# 0 1234 C 95% 45% 0 0 python train.py
# 1 1234 C 92% 48% 0 0 python train.py
Kill specific process:
# Kill process using GPU
io exec my-instance -- kill -9 1234
Performance Profiling
NVIDIA Nsight Systems:
# Profile GPU workload
io exec my-instance -- nsys profile --trace cuda,nvtx python train.py
# Download profile for analysis
io download my-instance:/workspace/report.nsys-rep ./
# Open in NVIDIA Nsight Systems (desktop app)
PyTorch Profiler:
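# profile_training.py
import torch.profiler

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('/workspace/logs'),
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    # The schedule only advances when prof.step() is called, so call it once per
    # training iteration. train_loader and train_step stand in for your own code.
    for batch in train_loader:
        train_step(batch)
        prof.step()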
# View in TensorBoard
# io.net dashboard includes TensorBoard integration
Multi-GPU Monitoring
Cluster View:
# Monitor all GPUs in cluster
io cluster-metrics training-cluster
# Output:
# Node | GPU | Util | Memory | Temp | Status
# ------------- | --- | ---- | --------- | ---- | ------
# worker-0 | 0 | 98% | 78GB/80GB | 76°C | Healthy
# worker-0 | 1 | 97% | 79GB/80GB | 75°C | Healthy
# worker-1 | 0 | 96% | 77GB/80GB | 74°C | Healthy
# worker-1 | 1 | 98% | 79GB/80GB | 77°C | Healthy
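With more than a handful of GPUs, a per-node summary is easier to scan than the raw table. The sketch below aggregates the sample rows above into mean utilization, peak memory fraction, and peak temperature per node. The row layout and the `summarize_by_node` helper are assumptions for illustration, not an io.net API.

```python
from collections import defaultdict
from statistics import mean

# Rows copied from the sample cluster view:
# (node, gpu index, utilization %, memory used GB, memory total GB, temperature °C)
ROWS = [
    ("worker-0", 0, 98, 78, 80, 76),
    ("worker-0", 1, 97, 79, 80, 75),
    ("worker-1", 0, 96, 77, 80, 74),
    ("worker-1", 1, 98, 79, 80, 77),
]

def summarize_by_node(rows):
    """Group GPU rows by node and compute simple aggregates for each node."""
    by_node = defaultdict(list)
    for node, _gpu, util, used, total, temp in rows:
        by_node[node].append((util, used / total, temp))
    return {
        node: {
            "mean_util": mean(u for u, _, _ in samples),
            "max_mem_fraction": max(m for _, m, _ in samples),
            "max_temp": max(t for _, _, t in samples),
        }
        for node, samples in by_node.items()
    }

summary = summarize_by_node(ROWS)
for node, stats in summary.items():
    print(
        f"{node}: mean util {stats['mean_util']:.1f}%, "
        f"peak memory {stats['max_mem_fraction']:.0%}, peak temp {stats['max_temp']}°C"
    )
```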
Best Practices
1. Set utilization alerts:
# Alert if GPU idle (wasting money)
io alert create --metric gpu_utilization_percent \
  --condition "< 20" --duration 15m --action email
2. Monitor training progress:
# Log metrics to io.net dashboard
import io_sdk

metrics = io_sdk.MetricsLogger()
for epoch in range(num_epochs):
    loss = train_epoch()
    metrics.log("training_loss", loss, step=epoch)
# Visible in io.net dashboard
3. Track cost vs. budget:
# Daily cost report
io cost --all --period today --format email --send-to YOUR_EMAIL_ADDRESS
Monitor your GPUs on io.net with real-time dashboards, Prometheus integration, and cost tracking.
