FAQ: Can I use io.net for AI agent inference?

Yes. io.net targets AI agent inference workloads that need low-latency, auto-scaling GPU infrastructure. You can deploy agent frameworks such as LangChain, AutoGPT, or CrewAI on a single GPU (RTX 4090 at $0.18/hr) or on multi-GPU clusters that scale horizontally with request volume. io.net handles inference serving, memory management, and load balancing, which covers the main infrastructure needs of a production AI agent application.

AI agents benefit from io.net's pay-per-use pricing model: scale from 1 to 100+ GPUs as agent traffic grows, with new capacity provisioned in under 2 minutes. Typical latency is 50-120ms per turn for 7B-13B models, and each GPU supports roughly 20-50 concurrent agent conversations, depending on model size and GPU type.
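As a rough sizing aid, the per-GPU concurrency figure above can be turned into a GPU count and an hourly cost. The sketch below is back-of-the-envelope arithmetic, not an io.net API: the function names are illustrative, and the default of 20 conversations per GPU is simply the conservative end of the quoted range.

```python
import math

def gpus_needed(peak_conversations: int, per_gpu_concurrency: int = 20) -> int:
    """GPUs required to serve a peak number of concurrent conversations."""
    if peak_conversations <= 0:
        return 0
    return math.ceil(peak_conversations / per_gpu_concurrency)

def hourly_cost(gpu_count: int, price_per_gpu_hour: float = 0.18) -> float:
    """Hourly cost in dollars at a flat per-GPU rate (RTX 4090 rate by default)."""
    return round(gpu_count * price_per_gpu_hour, 2)

n = gpus_needed(50)          # 50 concurrent conversations at peak
print(n, hourly_cost(n))     # 3 0.54
```

Measure the real per-GPU concurrency for your model and prompt lengths before relying on the result.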

AI Agent Inference Architecture

Single Agent (Basic):

User Request → LLM Inference (RTX 4090) → Response
Cost: $0.18/hour
Concurrency: 20-30 conversations
Latency: 80-120ms per turn

Multi-Agent System (Advanced):

User Request → Orchestrator Agent
                ├→ Researcher Agent (A100) → Web search + summarization
                ├→ Code Agent (A100) → Code generation + execution
                └→ Writer Agent (A100) → Final output synthesis
Cost: $3.30/hour (3x A100)
Concurrency: 100+ conversations
Latency: 200-400ms per multi-agent workflow
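
The fan-out in the diagram above can be sketched with asyncio. This is a minimal illustration, not io.net or framework code: call_agent is a hypothetical stand-in for a request to one agent's LLM endpoint, and running the researcher and code agents concurrently before the writer is an assumption, since the diagram does not specify ordering.

```python
import asyncio

async def call_agent(name: str, task: str, delay_s: float = 0.01) -> str:
    """Stand-in for an HTTP call to one agent's LLM endpoint."""
    await asyncio.sleep(delay_s)  # simulates network and inference time
    return f"[{name}] {task}"

async def orchestrate(user_request: str) -> str:
    # The researcher and code agents do not depend on each other,
    # so they are dispatched concurrently.
    research, code = await asyncio.gather(
        call_agent("researcher", f"summarize sources for: {user_request}"),
        call_agent("coder", f"write code for: {user_request}"),
    )
    # The writer agent synthesizes the final output from both results.
    return await call_agent("writer", f"combine <{research}> and <{code}>")

print(asyncio.run(orchestrate("plot GPU prices")))
```

With concurrent dispatch, workflow latency is roughly the slowest parallel agent plus the writer, rather than the sum of all three.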

Quick Start: Deploy AI Agent

Example: LangChain Agent with Llama 3

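# agent.py
import ast
import operator

import torch
from langchain.agents import initialize_agent, Tool
# On recent LangChain releases this class lives in langchain_community.llms
# (or the langchain_huggingface package); adjust the import to your version.
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load model in half precision; device_map="auto" places it on the available GPU
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Create pipeline (do_sample=True is needed for temperature to take effect)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7
)

llm = HuggingFacePipeline(pipeline=pipe)

# Operators the calculator tool is allowed to evaluate
ALLOWED_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def safe_eval(node):
    # Walk the arithmetic AST instead of calling eval() on model output
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in ALLOWED_OPS:
        return ALLOWED_OPS[type(node.op)](safe_eval(node.left), safe_eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in ALLOWED_OPS:
        return ALLOWED_OPS[type(node.op)](safe_eval(node.operand))
    raise ValueError("unsupported expression")

def calculate(expression: str) -> str:
    return str(safe_eval(ast.parse(expression, mode="eval").body))

def search_web(query: str) -> str:
    # Placeholder: replace with your search function
    raise NotImplementedError("plug in a search backend")

# Define agent tools
tools = [
    Tool(
        name="Calculator",
        func=calculate,
        description="Useful for math calculations"
    ),
    Tool(
        name="Search",
        func=search_web,
        description="Search the web for information"
    )
]

# Initialize agent
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent="zero-shot-react-description",
    verbose=True
)

# Run agent
response = agent.run("What is 25 * 4, and what's the weather in San Francisco?")
print(response)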

Deploy on io.net:

io deploy --image langchain:latest \
  --gpu RTX4090 \
  --port 8000 \
  --command "python agent.py" \
  --name langchain-agent

# Cost: $0.18/hour
# Throughput: 25-30 requests/minute

Production AI Agent Deployment

API Server with vLLM (Recommended):

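# serve_agent.py
import ast
import operator

from fastapi import FastAPI
from langchain import hub
from langchain.agents import AgentExecutor, create_react_agent
from langchain.tools import BaseTool
from langchain_community.llms import VLLM

app = FastAPI()

# Load model with vLLM for high throughput. LangChain's VLLM wrapper is used
# because a raw vllm.LLM object cannot be passed to create_react_agent.
llm = VLLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=1,
    temperature=0.7,
    top_p=0.95,
    max_new_tokens=512,
    vllm_kwargs={"max_model_len": 8192}
)

# Operators the calculator tool is allowed to evaluate
ALLOWED_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def safe_eval(node):
    # Walk the arithmetic AST instead of calling eval() on model output
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in ALLOWED_OPS:
        return ALLOWED_OPS[type(node.op)](safe_eval(node.left), safe_eval(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in ALLOWED_OPS:
        return ALLOWED_OPS[type(node.op)](safe_eval(node.operand))
    raise ValueError("unsupported expression")

class WebSearchTool(BaseTool):
    name: str = "web_search"
    description: str = "Search the web for information"

    def _run(self, query: str) -> str:
        # Your search implementation goes here
        raise NotImplementedError("plug in a search backend")

class CalculatorTool(BaseTool):
    name: str = "calculator"
    description: str = "Perform mathematical calculations"

    def _run(self, expression: str) -> str:
        return str(safe_eval(ast.parse(expression, mode="eval").body))

# Initialize agent with the standard ReAct prompt from the LangChain hub
tools = [WebSearchTool(), CalculatorTool()]
prompt = hub.pull("hwchase17/react")
agent = create_react_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, handle_parsing_errors=True)

@app.post("/agent")
async def run_agent(request: dict):
    query = request["query"]
    result = await agent_executor.ainvoke({"input": query})
    return {"response": result["output"]}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)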
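
To exercise the /agent endpoint defined above, a client only needs to POST a JSON body with a "query" field and read the "response" field back. The sketch below uses only the standard library. The helper names are illustrative, and the base URL is whatever host the deployment exposes (http://localhost:8000 is assumed for local testing).

```python
import json
import urllib.request

def build_agent_request(base_url: str, query: str) -> urllib.request.Request:
    """Build the POST request expected by the /agent endpoint."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/agent",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask_agent(base_url: str, query: str, timeout_s: float = 60.0) -> str:
    """Send the query and return the agent's answer text."""
    request = build_agent_request(base_url, query)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.loads(resp.read())["response"]
```

With the server running locally, ask_agent("http://localhost:8000", "What is 25 * 4?") returns the agent's answer as a string. The generous timeout reflects that an agent may run several tool calls per query.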

Deploy with auto-scaling:

io deploy --image agent-api:latest \
  --gpu A100 \
  --replicas 2 --autoscale min=1,max=10 \
  --port 8000 \
  --name production-agent

# Auto-scales based on request queue depth
# Cost: $1.10-11.00/hour (dynamic based on load)

Multi-Agent Orchestration

CrewAI Example:

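# crew_agents.py
import os

from crewai import Agent, Task, Crew
from langchain_community.llms import VLLMOpenAI

# Connect to io.net vLLM endpoints. LLM_ENDPOINT is passed in by the
# orchestrator deployment; the fallback is the placeholder URL.
llm_endpoint = os.environ.get("LLM_ENDPOINT", "https://xxx.ionet.cloud")

llm = VLLMOpenAI(
    openai_api_key="not-needed",
    openai_api_base=f"{llm_endpoint}/v1",
    model_name="meta-llama/Meta-Llama-3-8B-Instruct"
)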

# Define agents
researcher = Agent(
    role="Researcher",
    goal="Research and gather information on given topics",
    backstory="Expert at finding and synthesizing information",
    llm=llm,
    verbose=True
)

writer = Agent(
    role="Writer",
    goal="Write comprehensive articles based on research",
    backstory="Professional content writer with SEO expertise",
    llm=llm,
    verbose=True
)

editor = Agent(
    role="Editor",
    goal="Review and improve written content",
    backstory="Detail-oriented editor focused on quality",
    llm=llm,
    verbose=True
)

# Define tasks
research_task = Task(
    description="Research the topic: {topic}",
    agent=researcher,
    expected_output="Comprehensive research summary"
)

write_task = Task(
    description="Write an article based on research",
    agent=writer,
    expected_output="2000-word article"
)

edit_task = Task(
    description="Edit and polish the article",
    agent=editor,
    expected_output="Final polished article"
)

# Create crew
crew = Crew(
    agents=[researcher, writer, editor],
    tasks=[research_task, write_task, edit_task],
    verbose=True
)

# Run multi-agent workflow
result = crew.kickoff(inputs={"topic": "Decentralized GPU Computing"})
print(result)

Deploy multi-agent system:

# Deploy a pool of 3 LLM replicas (one per agent) behind a shared endpoint
io deploy --image vllm/vllm-openai:latest \
  --gpu A100 --env MODEL=meta-llama/Meta-Llama-3-8B-Instruct \
  --replicas 3 --name agent-llm-pool

# Deploy orchestrator
io deploy --image crewai:latest \
  --command "python crew_agents.py" \
  --env LLM_ENDPOINT=https://xxx.ionet.cloud \
  --name crew-orchestrator

# Total cost: ~$3.30/hour (3x A100)
# Handles 50+ concurrent multi-agent workflows

Memory Management for Agents

Vector Database Integration:

# agent_with_memory.py
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.memory import VectorStoreRetrieverMemory
from langchain.agents import initialize_agent

# `llm` and `tools` are assumed to be defined as in agent.py above

# Initialize embeddings (GPU-accelerated)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cuda'}
)

# Create vector store for agent memory
vectorstore = Chroma(
    collection_name="agent_memory",
    embedding_function=embeddings,
    persist_directory="/data/chroma"
)

# Memory backed by the vector store: retrieves the 5 most relevant past
# exchanges and exposes them under the key the conversational agent expects
memory = VectorStoreRetrieverMemory(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    memory_key="chat_history"
)

# Agent with long-term memory
agent = initialize_agent(
    tools=tools,
    llm=llm,
    agent="conversational-react-description",
    memory=memory,
    verbose=True
)

# Agent remembers context across conversations
response1 = agent.run("My name is Alex and I work on AI research")
# Later conversation:
response2 = agent.run("What do you know about me?")
# Agent recalls: "You're Alex, working on AI research"
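
To make the retrieval step concrete, here is a toy, dependency-free version of what a vector-store memory does: embed each saved message, then return the k stored messages most similar to the query. It is a sketch for intuition only. The word-count "embedding" stands in for the sentence-transformers model used above, and ToyMemory is a made-up class, not part of LangChain or Chroma.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. Real systems use a neural encoder."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class ToyMemory:
    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def save(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def recall(self, query: str, k: int = 5) -> list[str]:
        # Rank stored messages by similarity to the query and keep the top k
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

memory = ToyMemory()
memory.save("My name is Alex and I work on AI research")
memory.save("The weather in San Francisco is foggy")
print(memory.recall("What is my name and research area?", k=1))
# ['My name is Alex and I work on AI research']
```

The recalled text is what gets inserted into the agent's prompt, which is how the agent can answer a later question about earlier turns.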

Performance Optimization

Batched Inference for Multiple Agents:

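# batch_agents.py
import asyncio
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    max_model_len=8192
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=512
)

async def process_agent_requests(requests):
    # Collect the prompts from all pending agent requests (e.g. 10 at a time)
    prompts = [r["prompt"] for r in requests]

    # vLLM batches the prompts internally. generate() blocks, so run it in a
    # worker thread to keep the event loop responsive.
    outputs = await asyncio.to_thread(llm.generate, prompts, sampling_params)

    return [{"response": o.outputs[0].text} for o in outputs]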

# Throughput improvement:
# Sequential: 25 requests/minute
# Batched: 80 requests/minute (3.2x faster)
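
Batching only helps if concurrent requests actually reach the model together. One common way to arrange that is a micro-batcher that waits briefly for more requests before making a single generate call. The sketch below is an illustrative pattern, not part of vLLM or io.net: MicroBatcher and its parameters are made-up names, and the generate function is a stub so the example runs without a GPU.

```python
import asyncio
from typing import Callable

class MicroBatcher:
    """Groups concurrent requests into one call to a batch generate function."""

    def __init__(self, generate: Callable[[list[str]], list[str]],
                 max_batch: int = 10, max_wait_s: float = 0.02) -> None:
        self._generate = generate
        self._max_batch = max_batch
        self._max_wait_s = max_wait_s
        self._queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, prompt: str) -> str:
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((prompt, future))
        return await future

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # wait for the first request
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            # One batched call; with a real model, run this in a worker thread
            outputs = self._generate([prompt for prompt, _ in batch])
            for (_, future), text in zip(batch, outputs):
                future.set_result(text)

async def demo() -> tuple[list[str], list[int]]:
    batch_sizes: list[int] = []

    def fake_generate(prompts: list[str]) -> list[str]:
        batch_sizes.append(len(prompts))  # record how many prompts were grouped
        return [p.upper() for p in prompts]

    batcher = MicroBatcher(fake_generate)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(f"q{i}") for i in range(10)))
    worker.cancel()
    return list(results), batch_sizes

results, batch_sizes = asyncio.run(demo())
print(results, batch_sizes)
```

The max_wait_s value trades a small amount of added latency per request for larger batches. Tune it against your own latency budget.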

GPU Selection by Agent Complexity:

Agent Type    Model Size   Recommended GPU     Cost/Hour     Latency
Simple Q&A    7B           RTX 4090            $0.18         50-80ms
RAG Agent     8B-13B       RTX 4090 or A100    $0.18-1.10    80-120ms
Code Agent    13B-34B      A100                $1.10         150-250ms
Multi-Agent   8B-70B       2-4x A100           $2.20-4.40    200-500ms
Enterprise    70B+         H100                $1.49-2.20    100-200ms
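
A quick check when choosing a GPU from the table above is whether the model weights fit in GPU memory at all. The sketch below estimates weight memory as parameter count times bytes per parameter (2 bytes for float16, about 0.5 for 4-bit quantization) and compares it with each card's VRAM. The 80% headroom factor, reserved for KV cache and activations, is an assumption. The table does not say which A100 variant is offered, so both are listed.

```python
# VRAM per card in GB; the A100 ships in 40 GB and 80 GB variants
GPU_VRAM_GB = {"RTX 4090": 24, "A100 40GB": 40, "A100 80GB": 80, "H100 80GB": 80}

def weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Memory for model weights only, in GB (1 GB = 10^9 bytes)."""
    return params_billions * bytes_per_param

def gpus_that_fit(params_billions: float, bytes_per_param: float = 2.0,
                  headroom: float = 0.8) -> list[str]:
    """Single GPUs whose VRAM, after reserving headroom, holds the weights."""
    need = weight_memory_gb(params_billions, bytes_per_param)
    return [name for name, vram in GPU_VRAM_GB.items() if need <= vram * headroom]

print(gpus_that_fit(8))        # 8B in float16: 16 GB of weights
print(gpus_that_fit(70))       # 70B in float16: 140 GB, no single card fits
print(gpus_that_fit(70, 0.5))  # 70B at 4-bit: about 35 GB
```

By this estimate, a 70B model on a single H100 implies quantization, and otherwise needs several GPUs with tensor parallelism.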

Auto-Scaling Configuration

# autoscale_config.yaml
deployment:
  name: ai-agent-api
  gpu: A100
  min_replicas: 2
  max_replicas: 20

  scaling_metrics:
    - metric: request_queue_depth
      threshold: 50
      scale_up_increment: 2
      scale_down_delay: 5m

    - metric: gpu_utilization
      threshold: 80
      scale_up_increment: 1

    - metric: response_time_p95
      threshold: 500ms
      scale_up_increment: 2

  health_check:
    endpoint: /health
    interval: 30s
    timeout: 5s

Deploy with auto-scaling:

io deploy --config autoscale_config.yaml

# Scaling behavior:
# - Traffic spike: 10 req/s → 100 req/s
# - System auto-scales: 2 → 8 replicas in 3 minutes
# - Traffic drops: 100 → 20 req/s
# - System scales down: 8 → 3 replicas after 5-minute stabilization
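
The scale-up rules in autoscale_config.yaml can be read as a simple decision function. The sketch below is one plausible interpretation, not io.net's actual autoscaler. It assumes that when several metrics exceed their thresholds, the largest scale_up_increment wins, and it does not model scale-down or the 5-minute delay.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    metric: str
    threshold: float
    scale_up_increment: int

# Mirrors the scaling_metrics entries in autoscale_config.yaml
RULES = [
    Rule("request_queue_depth", 50, 2),
    Rule("gpu_utilization", 80, 1),     # percent
    Rule("response_time_p95", 500, 2),  # milliseconds
]

def desired_replicas(current: int, metrics: dict[str, float],
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Replica count after applying the largest triggered scale-up step."""
    triggered = [r.scale_up_increment for r in RULES
                 if metrics.get(r.metric, 0) > r.threshold]
    step = max(triggered, default=0)
    # Clamp to the configured replica bounds
    return max(min_replicas, min(max_replicas, current + step))

print(desired_replicas(2, {"request_queue_depth": 75, "gpu_utilization": 90}))  # 4
print(desired_replicas(19, {"response_time_p95": 800}))                         # 20
```

Repeated evaluations during a sustained spike would step the deployment up by 2 replicas at a time, consistent with the 2 → 8 ramp described above.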

Cost Analysis: AI Agent Hosting

Scenario: Customer support chatbot (AI agent with RAG)

Requirements:
- 1,000 conversations/day
- Avg 10 turns per conversation
- Peak: 50 concurrent conversations

io.net Setup:

GPU: 2x RTX 4090 (baseline) + auto-scale to 6x (peak)
Average utilization: 3x RTX 4090
Cost: $0.18/hour × 3 GPUs × 730 hours = $394/month

AWS Setup (equivalent):

2x g5.xlarge (A10G) + auto-scale to 6x
Average utilization: 3x g5.xlarge
Cost: $1.21/hour × 3 × 730 hours = $2,650/month

Savings: $2,256/month (85%)
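
The figures above can be reproduced with a few lines of arithmetic. The prices, GPU counts, and the 730 hours/month figure come from the scenario itself. The only addition is the average-load calculation at the end.

```python
HOURS_PER_MONTH = 730

def monthly_cost(price_per_gpu_hour: float, avg_gpus: float,
                 hours: int = HOURS_PER_MONTH) -> float:
    """Monthly cost in dollars for an average number of GPUs running all month."""
    return price_per_gpu_hour * avg_gpus * hours

ionet = monthly_cost(0.18, 3)   # 3x RTX 4090 on average
aws = monthly_cost(1.21, 3)     # 3x g5.xlarge on average
savings = aws - ionet
savings_pct = 100 * savings / aws

# Average load implied by the requirements: conversations/day x turns each
turns_per_minute = 1_000 * 10 / (24 * 60)

print(f"io.net:  ${ionet:,.0f}/month")                    # io.net:  $394/month
print(f"AWS:     ${aws:,.0f}/month")                      # AWS:     $2,650/month
print(f"Savings: ${savings:,.0f} ({savings_pct:.0f}%)")   # Savings: $2,256 (85%)
print(f"Average load: {turns_per_minute:.1f} turns/min")  # Average load: 6.9 turns/min
```

The average load of about 7 turns per minute is well below the 25-30 requests/minute quoted for a single RTX 4090 earlier, so the fleet size in this scenario is driven by the peak of 50 concurrent conversations rather than by average throughput.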

Real-World Agent Examples

1. Code Generation Agent:

io deploy --image codellama:latest \
  --gpu A100 --autoscale min=1,max=5 \
  --env AGENT_TYPE=code_assistant \
  --name code-agent

# Use case: GitHub Copilot alternative
# Cost: $1.10-5.50/hour (dynamic)
# Supports: 50-250 concurrent developers

2. Research Agent:

io deploy --image research-agent:latest \
  --gpu A100 --count 2 \
  --env TOOLS=web_search,arxiv,wikipedia \
  --name research-agent

# Use case: Automated literature review
# Cost: $2.20/hour
# Throughput: 20-30 research queries/hour

3. Sales AI Agent:

io deploy --image sales-agent:latest \
  --gpu RTX4090 --replicas 3 \
  --env CRM_INTEGRATION=salesforce \
  --name sales-agent

# Use case: Lead qualification and outreach
# Cost: $0.54/hour (3x RTX 4090)
# Handles: 100+ concurrent sales conversations

Deploy AI agents on io.net with auto-scaling and low latency. In the customer support scenario above, the cost is about 85% lower than the equivalent AWS setup.