Yes. io.net is optimized for AI agent inference workloads that need low-latency, auto-scaling GPU infrastructure. You can deploy agent frameworks such as LangChain, AutoGPT, or CrewAI on a single GPU (RTX 4090 at $0.18/hr) or on multi-GPU clusters that scale horizontally with request volume. io.net handles inference serving, memory management, and load balancing, which makes it suitable for production AI agent applications.
io.net's pay-per-use pricing suits agent workloads: scale from 1 to 100+ GPUs as agent traffic grows, with new capacity provisioned in under 2 minutes. Typical latency is 50-100ms for 7B-13B models, and each GPU supports 20-50 concurrent agent conversations.
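As a rough sizing aid, the per-GPU concurrency figure above can be turned into a GPU count and an hourly cost. This is a minimal sketch, not an io.net tool: the function names are hypothetical, and it assumes load spreads evenly across GPUs and that the conservative end of the quoted range (20 conversations per GPU) applies.

```python
import math


def gpus_needed(concurrent_conversations: int, per_gpu_capacity: int) -> int:
    """Smallest GPU count that covers the given number of concurrent conversations."""
    if concurrent_conversations <= 0:
        return 0
    return math.ceil(concurrent_conversations / per_gpu_capacity)


def hourly_cost(gpu_count: int, price_per_gpu_hour: float) -> float:
    """Total hourly cost in USD, rounded to cents."""
    return round(gpu_count * price_per_gpu_hour, 2)


# Assumption: 120 concurrent conversations at 20 conversations per RTX 4090.
gpus = gpus_needed(concurrent_conversations=120, per_gpu_capacity=20)
print(gpus, hourly_cost(gpus, 0.18))
```

Rounding up matters here: 21 conversations at 20 per GPU already need a second GPU.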
AI Agent Inference Architecture
Single Agent (Basic):
  User Request → LLM Inference (RTX 4090) → Response
  Cost: $0.18/hour
  Concurrency: 20-30 conversations
  Latency: 80-120ms per turn
Multi-Agent System (Advanced):
  User Request → Orchestrator Agent
    ├→ Researcher Agent (A100) → Web search + summarization
    ├→ Code Agent (A100) → Code generation + execution
    └→ Writer Agent (A100) → Final output synthesis
  Cost: $3.30/hour (3x A100)
  Concurrency: 100+ conversations
  Latency: 200-400ms per multi-agent workflow
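The multi-agent flow above can be sketched as an orchestrator that fans a request out to worker agents and then hands their results to the writer. This is an illustration only: `call_agent` is a hypothetical stand-in for a request to an agent's LLM endpoint, and running the researcher and code agents concurrently is an assumption, since the diagram does not specify ordering.

```python
import asyncio


async def call_agent(name: str, prompt: str) -> str:
    """Stand-in for a network call to one agent's LLM endpoint (hypothetical)."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"[{name}] {prompt}"


async def orchestrate(user_request: str) -> str:
    # Assumed: the researcher and code agents are independent, so run them concurrently.
    research, code = await asyncio.gather(
        call_agent("researcher", user_request),
        call_agent("coder", user_request),
    )
    # The writer agent synthesizes the final output from both intermediate results.
    return await call_agent("writer", f"{research} | {code}")


if __name__ == "__main__":
    print(asyncio.run(orchestrate("compare GPU clouds")))
```

With concurrent fan-out, workflow latency is roughly the slowest worker plus the writer step rather than the sum of all three agents.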
Quick Start: Deploy AI Agent
Example: LangChain Agent with Llama 3
# agent.py
import torch
from langchain.agents import initialize_agent, Tool
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load the model in half precision and place it on the available GPU
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Create a text-generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=0.7,
)
llm = HuggingFacePipeline(pipeline=pipe)


def search_web(query: str) -> str:
    # Placeholder: replace with your own search implementation
    raise NotImplementedError("plug in a search backend here")


# Define agent tools
tools = [
    Tool(
        name="Calculator",
        # Caution: eval() executes arbitrary Python. Restrict or replace it
        # before exposing this tool to untrusted input.
        func=lambda x: eval(x),
        description="Useful for math calculations",
    ),
    Tool(
        name="Search",
        func=search_web,
        description="Search the web for information",
    ),
]

# Initialize agent
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent="zero-shot-react-description",
    verbose=True,
)

# Run agent
response = agent.run("What is 25 * 4, and what's the weather in San Francisco?")
print(response)
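The Calculator tool above passes model-generated text to `eval()`, which will run any Python expression the model produces. One possible alternative, sketched here with only the standard library, parses the expression and allows nothing but basic arithmetic. The name `safe_calculate` is hypothetical, and exponentiation is deliberately left out to avoid very expensive inputs.

```python
import ast
import operator

# Only these arithmetic operators are allowed.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expression: str) -> str:
    """Evaluate a plain arithmetic expression without calling eval()."""

    def _eval(node: ast.AST):
        if isinstance(node, ast.Constant):
            # Accept int/float literals only (bool is a subclass of int, so exclude it).
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
        elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    tree = ast.parse(expression.strip(), mode="eval")
    return str(_eval(tree.body))


print(safe_calculate("25 * 4"))
```

It can be dropped into the tool definition as `func=safe_calculate`; anything other than numbers and the listed operators (function calls, names, attribute access) raises `ValueError`.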
Deploy on io.net:
io deploy --image langchain:latest \
--gpu RTX4090 \
--port 8000 \
--command "python agent.py" \
--name langchain-agent
# Cost: $0.18/hour
# Throughput: 25-30 requests/minute
Production AI Agent Deployment
API Server with vLLM (Recommended):
# serve_agent.py
from fastapi import FastAPI
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain.tools import BaseTool
from langchain_community.llms import VLLM

app = FastAPI()

# Load the model through LangChain's vLLM wrapper for high throughput.
# (A raw vllm.LLM object is not a LangChain LLM and cannot drive an agent.)
llm = VLLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=1,
    temperature=0.7,
    top_p=0.95,
    max_new_tokens=512,
    vllm_kwargs={"max_model_len": 8192},
)


class WebSearchTool(BaseTool):
    name: str = "web_search"
    description: str = "Search the web for information"

    def _run(self, query: str) -> str:
        # Placeholder: replace with your own search implementation
        raise NotImplementedError("plug in a search backend here")


class CalculatorTool(BaseTool):
    name: str = "calculator"
    description: str = "Perform mathematical calculations"

    def _run(self, expression: str) -> str:
        # Caution: eval() executes arbitrary Python; sanitize input first
        return str(eval(expression))


# Initialize agent. create_react_agent requires a ReAct prompt; this one is
# pulled from the LangChain Hub and needs network access at startup.
tools = [WebSearchTool(), CalculatorTool()]
prompt = hub.pull("hwchase17/react")
agent = create_react_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, handle_parsing_errors=True)


@app.post("/agent")
async def run_agent(request: dict):
    query = request["query"]
    result = await agent_executor.ainvoke({"input": query})
    return {"response": result["output"]}


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
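To exercise the `/agent` endpoint above, a client only needs to POST a JSON body with a `query` field and read the `response` field back. The following is a minimal standard-library sketch; the helper names and the `http://localhost:8000` base URL are assumptions for local testing, not part of io.net's tooling.

```python
import json
import urllib.request


def build_agent_request(base_url: str, query: str) -> urllib.request.Request:
    """Build the POST request for the /agent endpoint without sending it."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/agent",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_agent(base_url: str, query: str, timeout: float = 60.0) -> str:
    """Send the query and return the agent's answer text."""
    request = build_agent_request(base_url, query)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    # Assumes serve_agent.py is already running locally on port 8000.
    print(ask_agent("http://localhost:8000", "What is 25 * 4?"))
```

Agent runs can take several tool-calling rounds, so the timeout is set well above a single inference call.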
Deploy with auto-scaling:
io deploy --image agent-api:latest \
--gpu A100 \
--replicas 2 --autoscale min=1,max=10 \
--port 8000 \
--name production-agent
# Auto-scales based on request queue depth
# Cost: $1.10-11.00/hour (dynamic based on load)
Multi-Agent Orchestration
CrewAI Example:
# crew_agents.py
from crewai import Agent, Task, Crew
from langchain_community.llms import VLLMOpenAI

# Connect to io.net vLLM endpoints
llm = VLLMOpenAI(
    openai_api_key="not-needed",
    openai_api_base="https://xxx.ionet.cloud/v1",
    model_name="meta-llama/Meta-Llama-3-8B-Instruct",
)

# Define agents
researcher = Agent(
    role="Researcher",
    goal="Research and gather information on given topics",
    backstory="Expert at finding and synthesizing information",
    llm=llm,
    verbose=True,
)
writer = Agent(
    role="Writer",
    goal="Write comprehensive articles based on research",
    backstory="Professional content writer with SEO expertise",
    llm=llm,
    verbose=True,
)
editor = Agent(
    role="Editor",
    goal="Review and improve written content",
    backstory="Detail-oriented editor focused on quality",
    llm=llm,
    verbose=True,
)

# Define tasks
research_task = Task(
    description="Research the topic: {topic}",
    agent=researcher,
    expected_output="Comprehensive research summary",
)
write_task = Task(
    description="Write an article based on research",
    agent=writer,
    expected_output="2000-word article",
)
edit_task = Task(
    description="Edit and polish the article",
    agent=editor,
    expected_output="Final polished article",
)

# Create crew
crew = Crew(
    agents=[researcher, writer, editor],
    tasks=[research_task, write_task, edit_task],
    verbose=True,
)

# Run multi-agent workflow
result = crew.kickoff(inputs={"topic": "Decentralized GPU Computing"})
print(result)
Deploy multi-agent system:
# Deploy 3 separate LLM endpoints (one per agent)
io deploy --image vllm/vllm-openai:latest \
--gpu A100 --env MODEL=meta-llama/Meta-Llama-3-8B-Instruct \
--replicas 3 --name agent-llm-pool
# Deploy orchestrator
io deploy --image crewai:latest \
--command "python crew_agents.py" \
--env LLM_ENDPOINT=https://xxx.ionet.cloud \
--name crew-orchestrator
# Total cost: ~$3.30/hour (3x A100)
# Handles 50+ concurrent multi-agent workflows
Memory Management for Agents
Vector Database Integration:
# agent_with_memory.py
from langchain.agents import initialize_agent
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.memory import VectorStoreRetrieverMemory
from langchain.vectorstores import Chroma

# Initialize embeddings (GPU-accelerated)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cuda"},
)

# Create vector store for agent memory
vectorstore = Chroma(
    collection_name="agent_memory",
    embedding_function=embeddings,
    persist_directory="/data/chroma",
)

# Memory backed by the vector store: retrieves the 5 most similar past exchanges
memory = VectorStoreRetrieverMemory(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    memory_key="chat_history",  # key the conversational agent prompt expects
)

# Agent with long-term memory (`tools` and `llm` are defined as in agent.py)
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent="conversational-react-description",
    memory=memory,
    verbose=True,
)

# Agent remembers context across conversations
response1 = agent.run("My name is Alex and I work on AI research")
# Later conversation:
response2 = agent.run("What do you know about me?")
# Agent recalls: "You're Alex, working on AI research"
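To make the retrieval step concrete, here is a toy version of what a vector-store memory does: store past statements, then return the `k` most similar ones for a new query. This is a simplified sketch, not how Chroma or LangChain are implemented. The word-count "embedding" and the `ToyMemory` class are illustrative assumptions; a real deployment would use the sentence-transformer embeddings shown above.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' standing in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyMemory:
    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def save(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def recall(self, query: str, k: int = 5) -> list[str]:
        """Return up to k stored texts, most similar to the query first."""
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


memory = ToyMemory()
memory.save("My name is Alex and I work on AI research")
memory.save("The deployment runs on two A100 GPUs")
print(memory.recall("What is my name?", k=1))
```

The retrieved texts are what gets injected into the agent's prompt, which is why a later question can be answered from an earlier conversation.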
Performance Optimization
Batched Inference for Multiple Agents:
# batch_agents.py
import asyncio

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=8192,
)
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)


async def process_agent_requests(requests):
    # Collect the prompts from all pending agent queries (for example, 10 at a time)
    prompts = [r["prompt"] for r in requests]
    # vLLM batches the whole list internally; run the blocking call in a
    # worker thread so the event loop stays responsive
    outputs = await asyncio.to_thread(llm.generate, prompts, sampling_params)
    return [{"response": o.outputs[0].text} for o in outputs]

# Throughput improvement:
# Sequential: 25 requests/minute
# Batched: 80 requests/minute (3.2x faster)
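Batching only helps if concurrent requests actually arrive at the model together. One way to arrange that in an API server is a small micro-batcher that collects requests for a few milliseconds and submits them as one list. The sketch below is an assumption about how such a layer could look, not part of vLLM or io.net: `MicroBatcher`, `max_batch`, and `max_wait_s` are hypothetical names, and `fake_generate` stands in for the real batched model call.

```python
import asyncio
from typing import Callable


class MicroBatcher:
    """Collects concurrent requests and passes them to a batch function together."""

    def __init__(self, batch_fn: Callable[[list[str]], list[str]],
                 max_batch: int = 10, max_wait_s: float = 0.02) -> None:
        self._batch_fn = batch_fn
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, prompt: str) -> str:
        """Queue one prompt and wait for its result."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((prompt, future))
        return await future

    async def run(self) -> None:
        """Worker loop: build a batch, call the batch function, resolve the futures."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # block until the first request arrives
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            # A real model call blocks; it would be moved to a worker thread.
            results = self._batch_fn([prompt for prompt, _ in batch])
            for (_, future), result in zip(batch, results):
                future.set_result(result)


async def demo() -> tuple[list[str], list[int]]:
    batch_sizes: list[int] = []

    def fake_generate(prompts: list[str]) -> list[str]:
        batch_sizes.append(len(prompts))
        return [p.upper() for p in prompts]

    batcher = MicroBatcher(fake_generate, max_batch=4)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(f"q{i}") for i in range(4)))
    worker.cancel()
    return list(results), batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The wait window trades a little added latency per request for larger batches; it would need tuning against the latency targets of the deployment.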
GPU Selection by Agent Complexity:
| Agent Type | Model Size | Recommended GPU | Cost/Hour | Latency |
|---|---|---|---|---|
| Simple Q&A | 7B | RTX 4090 | $0.18 | 50-80ms |
| RAG Agent | 8B-13B | RTX 4090 or A100 | $0.18-1.10 | 80-120ms |
| Code Agent | 13B-34B | A100 | $1.10 | 150-250ms |
| Multi-Agent | 8B-70B | 2-4x A100 | $2.20-4.40 | 200-500ms |
| Enterprise | 70B+ | H100 | $1.49-2.20 | 100-200ms |
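The table above can also be expressed as a lookup, which is handy for estimating a monthly budget per agent type. This is an illustrative sketch: the dictionary keys and function name are hypothetical, the prices are copied from the table, and 730 hours per month assumes the deployment runs continuously.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuOption:
    gpu: str
    min_cost_per_hour: float
    max_cost_per_hour: float


# Values transcribed from the GPU selection table.
RECOMMENDATIONS = {
    "simple_qa": GpuOption("RTX 4090", 0.18, 0.18),
    "rag": GpuOption("RTX 4090 or A100", 0.18, 1.10),
    "code": GpuOption("A100", 1.10, 1.10),
    "multi_agent": GpuOption("2-4x A100", 2.20, 4.40),
    "enterprise": GpuOption("H100", 1.49, 2.20),
}


def monthly_cost_range(agent_type: str, hours: float = 730) -> tuple[float, float]:
    """Return the (low, high) monthly cost in USD for always-on operation."""
    option = RECOMMENDATIONS[agent_type]
    return (
        round(option.min_cost_per_hour * hours, 2),
        round(option.max_cost_per_hour * hours, 2),
    )


for agent_type, option in RECOMMENDATIONS.items():
    low, high = monthly_cost_range(agent_type)
    print(f"{agent_type}: {option.gpu}, ${low:,.2f}-${high:,.2f}/month")
```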
Auto-Scaling Configuration
# autoscale_config.yaml
deployment:
  name: ai-agent-api
  gpu: A100
  min_replicas: 2
  max_replicas: 20
scaling_metrics:
  - metric: request_queue_depth
    threshold: 50
    scale_up_increment: 2
    scale_down_delay: 5m
  - metric: gpu_utilization
    threshold: 80
    scale_up_increment: 1
  - metric: response_time_p95
    threshold: 500ms
    scale_up_increment: 2
health_check:
  endpoint: /health
  interval: 30s
  timeout: 5s
Deploy with auto-scaling:
io deploy --config autoscale_config.yaml
# Scaling behavior:
# - Traffic spike: 10 req/s → 100 req/s
# - System auto-scales: 2 → 8 replicas in 3 minutes
# - Traffic drops: 100 → 20 req/s
# - System scales down: 8 → 3 replicas after 5-minute stabilization
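The scale-up rules in the configuration above can be sketched as a small decision function. This only illustrates threshold-based scaling and is not io.net's scheduler: how several breached rules combine is not stated in the configuration, so the sketch assumes the largest increment wins, and it leaves scale-down (with its 5-minute delay) out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingRule:
    metric: str
    threshold: float
    scale_up_increment: int


# Thresholds and increments copied from autoscale_config.yaml.
RULES = [
    ScalingRule("request_queue_depth", 50, 2),
    ScalingRule("gpu_utilization", 80, 1),
    ScalingRule("response_time_p95_ms", 500, 2),
]


def desired_replicas(current: int, metrics: dict[str, float],
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Add the largest increment among breached rules, clamped to the replica bounds."""
    increments = [
        rule.scale_up_increment
        for rule in RULES
        if metrics.get(rule.metric, 0) > rule.threshold
    ]
    target = current + max(increments, default=0)
    return max(min_replicas, min(max_replicas, target))


# Queue depth of 75 breaches the 50-request threshold, so 2 replicas become 4.
print(desired_replicas(2, {"request_queue_depth": 75, "gpu_utilization": 60}))
```

Evaluated once per interval, repeated breaches step the replica count up until the metrics fall back under their thresholds or `max_replicas` is reached.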
Cost Analysis: AI Agent Hosting
Scenario: Customer support chatbot (AI agent with RAG)
Requirements:
- 1,000 conversations/day
- Avg 10 turns per conversation
- Peak: 50 concurrent conversations
io.net Setup:
GPU: 2x RTX 4090 (baseline) + auto-scale to 6x (peak)
Average GPUs in use: 3x RTX 4090
Cost: $0.18/hour × 3 GPUs × 730 hours = $394/month
AWS Setup (equivalent):
2x g5.xlarge (A10G) + auto-scale to 6x
Average instances in use: 3x g5.xlarge
Cost: $1.21/hour × 3 instances × 730 hours = $2,650/month
Savings: $2,256/month (85%)
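The arithmetic behind this comparison is easy to reproduce and adapt to other prices or GPU counts. The sketch below uses the hourly rates and the 3-GPU average quoted above; the function name is hypothetical.

```python
HOURS_PER_MONTH = 730


def monthly_cost(price_per_gpu_hour: float, avg_gpus: float,
                 hours: float = HOURS_PER_MONTH) -> float:
    """Monthly cost in USD for an average number of GPUs running all month."""
    return price_per_gpu_hour * avg_gpus * hours


ionet = monthly_cost(0.18, 3)  # 3x RTX 4090 on average
aws = monthly_cost(1.21, 3)    # 3x g5.xlarge on average
savings = aws - ionet
savings_pct = savings / aws * 100

print(f"io.net:  ${ionet:,.0f}/month")
print(f"AWS:     ${aws:,.0f}/month")
print(f"Savings: ${savings:,.0f}/month ({savings_pct:.0f}%)")
```

Because both setups assume the same average GPU count and hours, the savings percentage depends only on the ratio of the two hourly prices.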
Real-World Agent Examples
1. Code Generation Agent:
io deploy --image codellama:latest \
--gpu A100 --autoscale min=1,max=5 \
--env AGENT_TYPE=code_assistant \
--name code-agent
# Use case: GitHub Copilot alternative
# Cost: $1.10-5.50/hour (dynamic)
# Supports: 50-250 concurrent developers
2. Research Agent:
io deploy --image research-agent:latest \
--gpu A100 --count 2 \
--env TOOLS=web_search,arxiv,wikipedia \
--name research-agent
# Use case: Automated literature review
# Cost: $2.20/hour
# Throughput: 20-30 research queries/hour
3. Sales AI Agent:
io deploy --image sales-agent:latest \
--gpu RTX4090 --replicas 3 \
--env CRM_INTEGRATION=salesforce \
--name sales-agent
# Use case: Lead qualification and outreach
# Cost: $0.54/hour (3x RTX 4090)
# Handles: 100+ concurrent sales conversations
Deploy AI agents on io.net with auto-scaling, low latency, and roughly 85% lower cost than the equivalent AWS setup in the scenario above.
