FAQ: How do I monitor GPU usage on io.net?

io.net provides real-time GPU monitoring through a web dashboard, CLI commands, and Prometheus/Grafana integration. Track GPU utilization, memory usage, temperature, power consumption, and cost per instance with per-second granularity. All metrics are available via REST API for custom monitoring solutions and alerting.

Monitoring is available immediately, with no additional configuration required. The dashboard displays live metrics for all running instances, while the CLI offers terminal-based monitoring and scripting support.

Web Dashboard Monitoring

Access: https://cloud.io.net/dashboard

Available Metrics:
- GPU utilization (0-100%)
- VRAM usage (current / total GB)
- GPU temperature (°C)
- Power consumption (W / max W)
- PCIe bandwidth utilization
- Cost accumulation (real-time)
- Instance uptime

Features:
- Real-time updates (1-second refresh)
- Historical graphs (1 hour, 24 hours, 7 days)
- Multi-instance comparison view
- Export metrics to CSV
- Set custom alerts (utilization thresholds, cost limits)

CLI Monitoring

Basic GPU Stats:

# View current GPU utilization
io gpu-stats my-instance

# Output:
# GPU  | Util | Memory      | Temp | Power  | Cost/hr
# ---- | ---- | ----------- | ---- | ------ | -------
# 0    | 92%  | 38GB / 80GB | 72°C | 320W   | $1.10
# 1    | 89%  | 42GB / 80GB | 70°C | 310W   | $1.10

Real-Time Monitoring (nvidia-smi):

# Run nvidia-smi on the instance, refreshing every 1 second
# Shows: utilization, memory, temperature, processes
io exec my-instance -- nvidia-smi --loop=1

Detailed Metrics:

# Get JSON metrics for scripting
io metrics my-instance --format json

# Output:
{
  "gpu_utilization_percent": 92,
  "memory_used_gb": 38.2,
  "memory_total_gb": 80.0,
  "temperature_c": 72,
  "power_watts": 320,
  "power_limit_watts": 400,
  "cost_per_hour": 1.10,
  "uptime_seconds": 3620
}
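
The JSON fields above lend themselves to a few derived figures. The following is a minimal sketch, assuming the output has exactly the fields shown in the sample. The function name summarize_metrics is illustrative and not part of any io.net tooling.

```python
import json


def summarize_metrics(raw_json: str) -> dict:
    """Derive a few ratios from the JSON fields shown in the sample output."""
    m = json.loads(raw_json)
    return {
        # Share of VRAM in use, as a percentage of the total
        "memory_used_percent": round(100 * m["memory_used_gb"] / m["memory_total_gb"], 1),
        # Share of the configured power limit currently drawn
        "power_used_percent": round(100 * m["power_watts"] / m["power_limit_watts"], 1),
        # Uptime converted from seconds to hours
        "uptime_hours": round(m["uptime_seconds"] / 3600, 2),
    }


# Sample copied from the output above
SAMPLE = """{
  "gpu_utilization_percent": 92,
  "memory_used_gb": 38.2,
  "memory_total_gb": 80.0,
  "temperature_c": 72,
  "power_watts": 320,
  "power_limit_watts": 400,
  "cost_per_hour": 1.10,
  "uptime_seconds": 3620
}"""

print(summarize_metrics(SAMPLE))
```

In a script, the same function could be fed the captured output of the CLI command instead of the hard-coded sample.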

Prometheus Integration

Metrics Endpoint:

# io.net exposes Prometheus-compatible metrics
curl https://metrics.ionet.cloud/instance/my-instance/metrics

# Sample output:
gpu_utilization_percent{instance="my-instance",gpu="0"} 92
gpu_memory_used_bytes{instance="my-instance",gpu="0"} 40960000000
gpu_temperature_celsius{instance="my-instance",gpu="0"} 72
gpu_power_watts{instance="my-instance",gpu="0"} 320
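
For quick checks without a Prometheus server, the sample lines above can be parsed directly. This is a simplified sketch, not a full parser for the Prometheus text format: it assumes label values contain no escaped quotes or closing braces, and the name parse_metrics is hypothetical.

```python
import re

# Matches lines such as: metric_name{label="value",...} 92
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # skip anything that does not look like a sample
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Sample copied from the output above
SAMPLE = '''
gpu_utilization_percent{instance="my-instance",gpu="0"} 92
gpu_memory_used_bytes{instance="my-instance",gpu="0"} 40960000000
gpu_temperature_celsius{instance="my-instance",gpu="0"} 72
gpu_power_watts{instance="my-instance",gpu="0"} 320
'''

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels["gpu"], value)
```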

Prometheus Configuration:

# prometheus.yml
scrape_configs:
  - job_name: 'ionet-gpus'
    scheme: https  # the endpoint above is served over HTTPS; Prometheus defaults to http
    metrics_path: '/instance/my-instance/metrics'
    scrape_interval: 10s
    static_configs:
      - targets: ['metrics.ionet.cloud']

Grafana Dashboard

Pre-Built Dashboard:

# Import io.net Grafana dashboard
# Dashboard ID: 18426 (Grafana.com)

# Or deploy Grafana with io.net dashboard
io deploy --image grafana/grafana:latest \
  --port 3000 \
  --env GF_INSTALL_PLUGINS=ionet-datasource \
  --name monitoring

# Access: https://xxx.ionet.cloud:3000
# Default credentials: admin / admin (change the password on first login)

Dashboard Panels:
1. GPU Utilization (multi-instance)
2. Memory Usage Timeline
3. Temperature Heatmap
4. Power Consumption
5. Cost Tracker (per instance, total)
6. Process List (what's running on each GPU)

Custom Monitoring Script

Python Script:

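# monitor.py
import os
import time

import requests

# Read the API token from the environment instead of hard-coding it.
# The variable name IONET_API_TOKEN is an example; use whatever name you prefer.
API_TOKEN = os.environ["IONET_API_TOKEN"]

def get_metrics(instance_name):
    url = f"https://api.io.net/v1/instances/{instance_name}/metrics"
    headers = {"Authorization": f"Bearer {API_TOKEN}"}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()

# Continuous monitoring
while True:
    metrics = get_metrics("my-instance")

    # cost_accumulated is not part of the sample JSON shown earlier,
    # so derive it from the hourly rate and the uptime
    cost = metrics['cost_per_hour'] * metrics['uptime_seconds'] / 3600

    print(f"GPU Utilization: {metrics['gpu_utilization_percent']}%")
    print(f"VRAM: {metrics['memory_used_gb']:.1f}GB / {metrics['memory_total_gb']}GB")
    print(f"Temp: {metrics['temperature_c']}°C")
    print(f"Cost: ${cost:.2f}")
    print("---")

    time.sleep(5)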

Alerting

Set Utilization Alert:

# Alert if GPU utilization < 50% for 10 minutes (underutilization)
io alert create \
  --instance my-instance \
  --metric gpu_utilization_percent \
  --condition "< 50" \
  --duration 10m \
  --action email --recipient you@example.com

# Alert if cost exceeds budget
io alert create \
  --instance my-instance \
  --metric cost_accumulated \
  --condition "> 100" \
  --action webhook --url https://slack.com/api/webhook/xxx

Slack Integration:

# Post to Slack when GPU temperature exceeds 80°C
io alert create \
  --instance my-instance \
  --metric temperature_c \
  --condition "> 80" \
  --action slack \
  --webhook-url https://hooks.slack.com/services/YOUR/WEBHOOK/URL
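
The same kind of rule can be checked client-side against polled metrics. The sketch below shows one possible reading of a "below threshold for a duration" condition. How io.net evaluates --duration on its side is not described here, so the logic and the name below_for_duration are assumptions.

```python
def below_for_duration(samples, threshold, duration_s):
    """Return True if the latest run of samples stays below `threshold`
    for at least `duration_s` seconds.

    `samples` is a list of (timestamp_seconds, value) pairs in ascending
    time order, for example one gpu_utilization_percent reading per poll.
    """
    latest = None
    run_start = None
    # Walk backwards from the newest sample until the condition breaks
    for timestamp, value in reversed(samples):
        if latest is None:
            latest = timestamp
        if value >= threshold:
            break
        run_start = timestamp
    if run_start is None:
        return False  # no samples, or the newest sample is not below the threshold
    return latest - run_start >= duration_s


# One reading per minute for 10 minutes, all at 30% utilization
idle = [(minute * 60, 30) for minute in range(11)]
print(below_for_duration(idle, threshold=50, duration_s=600))
```

A single reading at or above the threshold resets the window, which avoids alerting on a GPU that is only briefly idle between batches.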

Cost Tracking

Real-Time Cost:

# View current cost accumulation
io cost my-instance

# Output:
# Instance: my-instance
# GPU: A100 (80GB)
# Uptime: 6h 23m
# Rate: $1.10/hour
# Cost so far: $7.02
# Projected 24h cost: $26.40
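
The figures in this output follow from rate × uptime. Here is a small worked sketch using Decimal to avoid floating-point rounding surprises. Rounding half-up to the cent is an assumption that happens to match the numbers shown, and the function names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def accumulated_cost(rate_per_hour: str, uptime_minutes: int) -> Decimal:
    """Cost so far = hourly rate x uptime in hours, rounded to the cent."""
    cost = Decimal(rate_per_hour) * Decimal(uptime_minutes) / Decimal(60)
    return cost.quantize(CENT, rounding=ROUND_HALF_UP)


def projected_cost(rate_per_hour: str, hours: int = 24) -> Decimal:
    """Projected cost if the instance keeps running for `hours` hours."""
    return (Decimal(rate_per_hour) * hours).quantize(CENT, rounding=ROUND_HALF_UP)


# 6h 23m = 383 minutes at $1.10/hour
print(accumulated_cost("1.10", 6 * 60 + 23))  # 7.02
print(projected_cost("1.10"))                 # 26.40
```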

Cost Breakdown (Multiple Instances):

# All instances
io cost --all

# Output:
# Instance          | GPU Type | Uptime | Rate/hr | Cost
# ----------------- | -------- | ------ | ------- | ------
# training-job      | 8x A100  | 12h    | $8.80   | $105.60
# inference-api     | RTX 4090 | 3d 2h  | $0.18   | $13.32
# dev-notebook      | A100     | 45m    | $1.10   | $0.83
# ----------------- | -------- | ------ | ------- | ------
# Total                                             $119.75
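
The table total can be reproduced from the Uptime and Rate/hr columns. This sketch assumes uptime strings use only the d, h and m suffixes seen above and that each row is rounded half-up to the cent before summing. The helper names are hypothetical.

```python
import re
from decimal import Decimal, ROUND_HALF_UP

_MINUTES_PER_UNIT = {"d": 24 * 60, "h": 60, "m": 1}


def uptime_to_minutes(uptime: str) -> int:
    """Convert strings such as '3d 2h' or '45m' to whole minutes."""
    return sum(int(n) * _MINUTES_PER_UNIT[unit]
               for n, unit in re.findall(r"(\d+)([dhm])", uptime))


def instance_cost(rate_per_hour: str, uptime: str) -> Decimal:
    """Hourly rate x uptime, rounded to the cent."""
    cost = Decimal(rate_per_hour) * uptime_to_minutes(uptime) / 60
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# (instance, uptime, rate per hour) copied from the table above
ROWS = [
    ("training-job", "12h", "8.80"),
    ("inference-api", "3d 2h", "0.18"),
    ("dev-notebook", "45m", "1.10"),
]

total = sum((instance_cost(rate, uptime) for _, uptime, rate in ROWS), Decimal("0"))
print(total)  # 119.75
```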

Budget Alerts:

# Auto-stop instance when budget exceeded
io alert create \
  --instance training-job \
  --metric cost_accumulated \
  --condition "> 200" \
  --action stop-instance

Process Monitoring

See what's running on GPU:

# List GPU processes (-c 1 prints one sample and exits; omit it to keep streaming)
io exec my-instance -- nvidia-smi pmon -c 1

# Output:
# gpu   pid   type   sm   mem   enc   dec   command
#   0  1234      C   95%   45%     0     0   python train.py
#   1  1234      C   92%   48%     0     0   python train.py

Kill specific process:

# Stop the process using the GPU (sends SIGTERM; use kill -9 only if it does not exit)
io exec my-instance -- kill 1234

Performance Profiling

NVIDIA Nsight Systems:

# Profile GPU workload and write the report to /workspace/report.nsys-rep
io exec my-instance -- nsys profile --trace=cuda,nvtx --output=/workspace/report python train.py

# Download profile for analysis
io download my-instance:/workspace/report.nsys-rep ./
# Open in NVIDIA Nsight Systems (desktop app)

PyTorch Profiler:

# profile_training.py
import torch.profiler

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA
    ],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('/workspace/logs'),
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
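    # The schedule only advances when prof.step() is called once per training step.
    # train_loader and train_step are placeholders for your own training code.
    for batch in train_loader:
        train_step(batch)
        prof.step()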

# View in TensorBoard
# io.net dashboard includes TensorBoard integration

Multi-GPU Monitoring

Cluster View:

# Monitor all GPUs in cluster
io cluster-metrics training-cluster

# Output:
# Node          | GPU | Util | Memory    | Temp | Status
# ------------- | --- | ---- | --------- | ---- | ------
# worker-0      | 0   | 98%  | 78GB/80GB | 76°C | Healthy
# worker-0      | 1   | 97%  | 79GB/80GB | 75°C | Healthy
# worker-1      | 0   | 96%  | 77GB/80GB | 74°C | Healthy
# worker-1      | 1   | 98%  | 79GB/80GB | 77°C | Healthy
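
For scripted checks, the table can be reduced to a per-node summary. This is a rough sketch that assumes the column layout shown above. If the command offers a machine-readable output format, that would be a sturdier input than parsing the table. The function names are illustrative.

```python
def parse_cluster_table(text):
    """Parse the pipe-separated table into one dict per GPU row."""
    rows = []
    for line in text.splitlines():
        # Drop the leading "# " used in the listing, then split into columns
        cells = [cell.strip() for cell in line.lstrip("# ").split("|")]
        if len(cells) != 6:
            continue  # blank or unrelated line
        if cells[0] == "Node" or set(cells[0]) == {"-"}:
            continue  # header or separator row
        node, gpu, util, _memory, temp, status = cells
        rows.append({
            "node": node,
            "gpu": int(gpu),
            "util_percent": int(util.rstrip("%")),
            "temp_c": int(temp.rstrip("°C")),
            "status": status,
        })
    return rows


def mean_util_by_node(rows):
    """Average GPU utilization per node."""
    by_node = {}
    for row in rows:
        by_node.setdefault(row["node"], []).append(row["util_percent"])
    return {node: sum(values) / len(values) for node, values in by_node.items()}


# Sample copied from the output above
SAMPLE = """
# Node          | GPU | Util | Memory    | Temp | Status
# ------------- | --- | ---- | --------- | ---- | ------
# worker-0      | 0   | 98%  | 78GB/80GB | 76°C | Healthy
# worker-0      | 1   | 97%  | 79GB/80GB | 75°C | Healthy
# worker-1      | 0   | 96%  | 77GB/80GB | 74°C | Healthy
# worker-1      | 1   | 98%  | 79GB/80GB | 77°C | Healthy
"""

print(mean_util_by_node(parse_cluster_table(SAMPLE)))
```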

Best Practices

1. Set utilization alerts:

# Alert if GPU idle (wasting money)
io alert create --metric gpu_utilization_percent \
  --condition "< 20" --duration 15m --action email

2. Monitor training progress:

# Log metrics to io.net dashboard
# (num_epochs and train_epoch() come from your own training code)
import io_sdk

metrics = io_sdk.MetricsLogger()
for epoch in range(num_epochs):
    loss = train_epoch()
    metrics.log("training_loss", loss, step=epoch)
    # Visible in io.net dashboard

3. Track cost vs. budget:

# Daily cost report
io cost --all --period today --format email --send-to you@example.com

Monitor your GPUs on io.net with real-time dashboards, Prometheus integration, and cost tracking.