Building AI Agents for Infrastructure
Learn the patterns for building specialized infrastructure agents that know YOUR systems
Who this is for: DevOps engineers, platform teams, CTOs evaluating AI
ChatGPT knows about DevOps. It can explain Kubernetes concepts, suggest Terraform patterns, and help debug error messages.
But it can’t tell you why YOUR pods are crashing at 3am.
This guide teaches you the patterns for building AI agents that go beyond generic advice—agents that understand your specific infrastructure and can take meaningful action.
Why Vertical Agents Beat Generic AI
The Problem with Generic AI
Ask ChatGPT “How do I fix a Kubernetes pod crash loop?” and you’ll get a comprehensive answer covering all possible causes. That’s helpful for learning, but not for solving YOUR problem at 2am.
Generic AI gives you:
- Broad explanations that apply to everyone
- Suggestions that may not fit your architecture
- No context about your specific systems
- Potential for hallucination on specifics
The Vertical Agent Approach
Vertical agents flip this model:
| Aspect | Generic AI | Vertical Agent |
|---|---|---|
| Scope | Everything | One domain |
| Context | None | Your infrastructure |
| Output | Generic advice | Actionable steps |
| Hallucination | Higher risk | Lower (constrained domain) |
A monitoring agent that knows YOUR infrastructure can tell you:
- “Pod X is crash looping because the database connection pool is exhausted”
- “This happened twice last month after the traffic spike on Tuesday”
- “Runbook 7.3 addresses this—should I apply it?”
Webera’s Agent Philosophy
We don’t build one AI that does everything. We build 8 specialists that each do one thing excellently:
- Sentinel — Monitoring and observability
- Guardian — Security and compliance
- Optimizer — Cost and performance
- Conductor — CI/CD and deployment
- Dispatcher — Alerting and routing
- Navigator — Discovery and documentation
- Keeper — Secrets and access management
- Warden — Audit and governance
Each agent is a deep expert in its domain, with context about YOUR systems.
Part 1: Anatomy of an Infrastructure Agent
Every effective agent needs three things: Why, How, and What.
The Why: Identity and Mission
Before writing any code, define:
- Who is this agent? Give it a name and personality
- What problem does it solve? One specific problem, not many
- What does success look like? Measurable outcomes
Example: Sentinel (Monitoring Agent)
identity:
  name: Sentinel
  tagline: "Watching while you sleep"
  mission: Ensure no issue goes undetected
  success_metrics:
    - Zero surprise outages
    - Mean time to detection < 5 minutes
    - False positive rate < 10%
The How: Operating Principles
Define how the agent makes decisions:
- What can it do autonomously? Read-only operations, non-production changes
- What requires approval? Production changes, deletions
- Who does it collaborate with? Other agents, humans
The What: Specific Responsibilities
List the concrete things this agent does:
- Domain-specific knowledge it needs
- Workflows it executes
- Outputs it produces
- Reference materials it uses
Part 2: Decision Authority—The Critical Pattern
Without clear authority boundaries, agents either ask permission for everything (useless) or act autonomously on everything (dangerous).
The Authority Matrix
| Action Type | Authority Level |
|---|---|
| Read-only discovery | Autonomous |
| Assessment and analysis | Autonomous |
| Non-production changes | Autonomous |
| Production proposals | Autonomous to propose |
| Production execution | Requires approval |
| Delete or remove anything | Requires approval |
Implementing the Matrix
Here’s how a well-designed agent handles a request:
Request: “Set up monitoring for production”
- AUTONOMOUS: Discover current infrastructure
- AUTONOMOUS: Assess what’s missing
- AUTONOMOUS: Propose monitoring stack
- APPROVAL REQUIRED: Execute changes to production
- AUTONOMOUS: Verify and document
The agent does the thinking; humans approve the action.
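The matrix and the request walkthrough above can be sketched in a few lines of Python. This is a minimal illustration, not Webera's implementation: the action-type keys, `may_proceed`, and `plan_monitoring_setup` are hypothetical names chosen to mirror the table rows, and the final "verify and document" step is left out for brevity.

```python
from enum import Enum


class Authority(Enum):
    AUTONOMOUS = "autonomous"
    REQUIRES_APPROVAL = "requires_approval"


# Hypothetical action-type keys mirroring the rows of the authority matrix.
AUTHORITY_MATRIX = {
    "read_only_discovery": Authority.AUTONOMOUS,
    "assessment": Authority.AUTONOMOUS,
    "non_production_change": Authority.AUTONOMOUS,
    "production_proposal": Authority.AUTONOMOUS,
    "production_execution": Authority.REQUIRES_APPROVAL,
    "delete": Authority.REQUIRES_APPROVAL,
}


def may_proceed(action_type: str, approved: bool = False) -> bool:
    """Return True if the action may run now.

    Unknown action types fall back to requiring approval (safe default).
    """
    level = AUTHORITY_MATRIX.get(action_type, Authority.REQUIRES_APPROVAL)
    return level is Authority.AUTONOMOUS or approved


def plan_monitoring_setup(approved: bool) -> list[str]:
    """Walk the 'set up monitoring for production' request step by step."""
    steps = [
        ("discover infrastructure", "read_only_discovery"),
        ("assess gaps", "assessment"),
        ("propose monitoring stack", "production_proposal"),
        ("apply changes to production", "production_execution"),
    ]
    log = []
    for name, action_type in steps:
        if may_proceed(action_type, approved):
            log.append(f"done: {name}")
        else:
            # Stop at the first step that needs a human decision.
            log.append(f"waiting for approval: {name}")
            break
    return log
```

Calling `plan_monitoring_setup(approved=False)` runs the three autonomous steps and then stops at the production change. Treating unknown action types as approval-required keeps a newly added action from slipping through unreviewed.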
Why This Matters
Consider two scenarios:
Scenario A: No authority matrix. The agent receives an alert about high CPU. Does it scale up? Does it investigate? Does it wake someone up? Without clear boundaries, it either does nothing useful or does something dangerous.
Scenario B: Clear authority matrix. The agent receives the same alert. It AUTONOMOUSLY investigates and correlates with recent deployments. It AUTONOMOUSLY proposes a rollback with supporting evidence. It REQUIRES APPROVAL before executing the rollback.
The second agent is useful AND safe.
Part 3: Context Injection—Your Infrastructure, Not Generic Advice
The difference between “ChatGPT knows DevOps” and “Our agents know YOUR infrastructure” is context.
The Context File Pattern
Agents need structured knowledge about YOUR systems:
# .webera/context.yaml
infrastructure:
  cloud: aws
  region: us-east-1
  account_id: "123456789012"
  kubernetes:
    version: "1.28"
    cluster: "production-eks"
    namespaces:
      - name: api
        criticality: high
      - name: workers
        criticality: medium
  databases:
    - type: postgresql
      version: "15"
      name: "primary-db"
      rds_instance: "db.r6g.xlarge"
services:
  - name: api
    repository: "company/api"
    criticality: high
    dependencies: [primary-db, redis]
    sla_target: "99.9%"
  - name: worker
    repository: "company/worker"
    criticality: medium
    dependencies: [primary-db, rabbitmq]
Why Context Beats Prompting
Without context file:
User: "Why is my API slow?"
Agent: "There could be many reasons. Check your database queries,
network latency, CPU usage..."
With context file:
User: "Why is my API slow?"
Agent: "Your API service depends on primary-db (PostgreSQL 15 on
db.r6g.xlarge). Checking CloudWatch metrics... Connection
pool is at 95% capacity. This matches the pattern from last
Tuesday's incident. Recommend increasing pool size per
runbook 4.2."
The difference is actionable specificity.
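One way to get from a context file to that kind of answer is to resolve the service's dependencies before the model sees the question. The sketch below is an illustration under assumptions: `CONTEXT` is a hand-written, already-parsed subset of the context file (with `databases` nested under `infrastructure`, which is a guess at the layout), and `describe_service` is a hypothetical helper, not part of any Webera API.

```python
# Hypothetical, already-parsed subset of the context file.
CONTEXT = {
    "infrastructure": {
        "databases": [
            {"type": "postgresql", "version": "15", "name": "primary-db",
             "rds_instance": "db.r6g.xlarge"},
        ],
    },
    "services": [
        {"name": "api", "criticality": "high",
         "dependencies": ["primary-db", "redis"], "sla_target": "99.9%"},
    ],
}


def describe_service(context: dict, service_name: str) -> str:
    """Build a short factual summary to prepend to the model prompt."""
    service = next(
        (s for s in context.get("services", []) if s["name"] == service_name),
        None,
    )
    if service is None:
        return f"No context recorded for service '{service_name}'."

    databases = {
        db["name"]: db
        for db in context.get("infrastructure", {}).get("databases", [])
    }
    lines = [f"Service {service['name']} (criticality: {service['criticality']})"]
    for dep in service.get("dependencies", []):
        db = databases.get(dep)
        if db:
            # Known database: include engine, version, and instance size.
            lines.append(
                f"- depends on {dep}: {db['type']} {db['version']} "
                f"on {db['rds_instance']}"
            )
        else:
            # Dependency named in the file but not described further.
            lines.append(f"- depends on {dep}")
    return "\n".join(lines)
```

The summary carries only facts from the file, so the agent can name `primary-db` and its instance size instead of listing every possible cause of a slow API.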
Keeping Context Updated
Context files should be:
- Versioned — In your git repository
- Auto-discovered — Agents can update them (with approval)
- Validated — Schema-checked to prevent errors
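The "Validated" point can be as simple as a function that runs in CI before a context change is merged. The following is a minimal sketch with hand-written checks; the required keys and the allowed criticality levels are assumptions made for illustration, and a real setup might use a schema library instead.

```python
ALLOWED_CRITICALITY = {"high", "medium", "low"}  # assumed set of levels


def validate_context(context: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    services = context.get("services")
    if not isinstance(services, list) or not services:
        return ["'services' must be a non-empty list"]

    errors = []
    for index, service in enumerate(services):
        if not isinstance(service, dict):
            errors.append(f"services[{index}] must be a mapping")
            continue
        name = service.get("name", f"services[{index}]")
        # Keys every service entry is assumed to need.
        for key in ("name", "criticality", "dependencies"):
            if key not in service:
                errors.append(f"{name}: missing required key '{key}'")
        level = service.get("criticality")
        if level is not None and level not in ALLOWED_CRITICALITY:
            errors.append(f"{name}: unknown criticality '{level}'")
        if not isinstance(service.get("dependencies", []), list):
            errors.append(f"{name}: 'dependencies' must be a list")
    return errors
```

Returning every problem at once, rather than raising on the first, makes the check more useful as a review comment on a pull request.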
Part 4: Inter-Agent Collaboration
Single agents are limited. Agent systems are powerful.
The Handoff Pattern
Agents work together through defined handoffs:
Sentinel (monitoring) ──detects issue──► Dispatcher (routing)
Dispatcher ──routes to──► On-call engineer
Guardian (security) ──secures──► Conductor (pipelines)
Optimizer (cost) ◄──metrics from── Sentinel (monitoring)
Designing Handoffs
Each handoff needs:
- Clear trigger — When does the handoff happen?
- Context passing — What information transfers?
- Acknowledgment — How does the receiving agent confirm?
Example handoff:
handoff:
  from: sentinel
  to: dispatcher
  trigger: alert_threshold_exceeded
  context:
    - alert_type
    - affected_service
    - metrics_snapshot
    - suggested_runbook
  acknowledgment: dispatcher_received
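In code, the three parts of a handoff (trigger, context, acknowledgment) might look like the sketch below. The `Handoff` class and `receive` function are hypothetical names for illustration; the required context keys are taken from the example above.

```python
from dataclasses import dataclass, field

# Context keys the example handoff says must be passed along.
REQUIRED_CONTEXT = (
    "alert_type",
    "affected_service",
    "metrics_snapshot",
    "suggested_runbook",
)


@dataclass
class Handoff:
    source: str
    target: str
    trigger: str
    context: dict = field(default_factory=dict)

    def missing_context(self) -> list[str]:
        """List required keys the sender did not provide."""
        return [key for key in REQUIRED_CONTEXT if key not in self.context]


def receive(handoff: Handoff) -> dict:
    """Receiving side: acknowledge only when the full context arrived."""
    missing = handoff.missing_context()
    if missing:
        return {"status": "rejected", "missing": missing}
    return {"status": f"{handoff.target}_received", "missing": []}
```

Rejecting an incomplete handoff forces the sending agent to supply the context up front, so the receiver never has to guess what triggered the alert.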
Real-World Example
1. Sentinel detects: High error rate on API service (5xx > 1%)
2. Sentinel outputs:
- Alert with context (service, metrics, timeframe)
- Correlation with recent events
- Suggested runbook
3. Sentinel suggests: "Engage Dispatcher to route this alert"
4. Dispatcher receives: Alert + context
5. Dispatcher checks: Runbook exists for this scenario
6. Dispatcher decides: Route to API team based on on-call schedule
7. Dispatcher notifies: Slack + PagerDuty with full context
No human is involved until the notification in step 7, and from that point the on-call engineer decides what happens next.
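The detection and routing steps above can be approximated as two small functions. This is a sketch under assumptions: the `ON_CALL` table, the `default-on-call` fallback, and the function names are invented for illustration, and a real Dispatcher would read the schedule from the paging tool rather than a dictionary.

```python
from typing import Optional

ERROR_RATE_THRESHOLD = 0.01  # 5xx > 1%, as in step 1

# Hypothetical on-call table keyed by service name.
ON_CALL = {"api": "api-team", "worker": "platform-team"}


def detect(service: str, total_requests: int, errors_5xx: int) -> Optional[dict]:
    """Sentinel side: return an alert when the 5xx rate exceeds the threshold."""
    if total_requests == 0:
        return None  # no traffic, nothing to measure
    rate = errors_5xx / total_requests
    if rate <= ERROR_RATE_THRESHOLD:
        return None
    return {"service": service, "error_rate": rate}


def route(alert: dict) -> dict:
    """Dispatcher side: pick the owning team and the channels to notify."""
    team = ON_CALL.get(alert["service"], "default-on-call")
    return {"team": team, "channels": ["slack", "pagerduty"], "alert": alert}
```

For example, `route(detect("api", 1000, 25))` produces a notification addressed to `api-team` on both channels, with the alert attached as context.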
Part 5: Client Customization
The same agent should behave differently for different contexts.
Why Customization Matters
- Client A: SOC 2 focused, strict change control, requires approval for everything
- Client B: Move fast, break things (but fix fast), autonomous for non-production
Same agent, different behavior.
The Customization Pattern
# Client-specific settings
agent_customization:
  sentinel:
    focus_areas:
      - "API latency"
      - "Database connections"
    ignore_namespaces:
      - "kube-system"
      - "monitoring"
    alert_threshold_multiplier: 1.5  # More lenient
    notes: "Previous P1 was API latency related - prioritize"
  guardian:
    compliance_focus:
      - "SOC2"
      - "HIPAA"
    backup_priority: "critical"
    approval_required_for: "all_changes"
    notes: "Healthcare client, strict compliance required"
Implementation Tips
- Check customization first — Before any action, load client settings
- Apply notes as context — Historical notes inform current decisions
- Default to safe — If no customization, use conservative defaults
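These tips amount to "merge client settings over conservative defaults". A minimal sketch, assuming the customization file has already been parsed into a dictionary; the `SAFE_DEFAULTS` values and function names are assumptions for illustration.

```python
import copy

# Conservative defaults used when a client has no settings (assumed values).
SAFE_DEFAULTS = {
    "approval_required_for": "all_changes",
    "alert_threshold_multiplier": 1.0,
    "ignore_namespaces": [],
    "notes": "",
}


def load_settings(customization: dict, agent: str) -> dict:
    """Start from safe defaults, then apply the client's overrides."""
    settings = copy.deepcopy(SAFE_DEFAULTS)  # never mutate the shared defaults
    settings.update(customization.get(agent, {}))
    return settings


def effective_threshold(base: float, settings: dict) -> float:
    """Scale a base alert threshold by the client's multiplier."""
    return base * settings["alert_threshold_multiplier"]
```

With the Sentinel settings shown earlier, a base latency threshold of 200 ms becomes 300 ms, while an agent with no client entry keeps requiring approval for all changes.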
Part 6: Building Your First Agent
Ready to build? Here’s the step-by-step process.
Step 1: Choose a Narrow Domain
NOT this: “Infrastructure agent”
DO this: “Monitoring and alerting agent”
Narrow scope = deep expertise = better results.
Step 2: Define Identity (Why)
identity:
  name: [Agent name]
  tagline: [One-line mission]
  problem_solved: [Specific problem]
  success_criteria:
    - [Measurable outcome 1]
    - [Measurable outcome 2]
Step 3: Define Authority (How)
authority:
  autonomous:
    - Read infrastructure state
    - Analyze metrics and logs
    - Generate reports
    - Propose changes
  requires_approval:
    - Execute production changes
    - Modify security settings
    - Delete resources
  handoffs_to:
    - [Other agent for related work]
Step 4: Define Knowledge (What)
knowledge:
  domain_expertise:
    - [Technical area 1]
    - [Technical area 2]
  reference_materials:
    - [Documentation source]
    - [Runbook location]
  output_formats:
    - [Report type]
    - [Alert format]
Step 5: Create Context Injection
Define what the agent needs to know about each infrastructure it works with:
- Service inventory
- Dependencies
- SLAs and criticality
- Historical incidents
Step 6: Test Incrementally
- Read-only first — Can it correctly understand the infrastructure?
- Analysis second — Are its assessments accurate?
- Proposals third — Are suggested actions appropriate?
- Execution last — Does it execute safely with approval?
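The staged rollout above can be enforced in code so that an agent cannot reach a later stage before passing the earlier ones. This is a hypothetical sketch: the stage names and `allowed_capabilities` are invented for illustration, and "passed" is assumed to be recorded by whoever reviews the test results.

```python
# Stages in the order they should be tested and unlocked.
STAGES = ["read_only", "analysis", "proposals", "execution"]


def allowed_capabilities(passed_stages: list[str]) -> list[str]:
    """Return the stages the agent may operate in right now.

    Every stage is unlocked up to and including the first one that has
    not been passed yet, so testing always proceeds in order.
    """
    unlocked = []
    for stage in STAGES:
        unlocked.append(stage)
        if stage not in passed_stages:
            break  # this stage is under test; later ones stay locked
    return unlocked
```

An agent that has only passed `read_only` is allowed to run analysis next, but proposals and execution stay locked until analysis has been signed off.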
Part 7: Common Pitfalls
| Pitfall | Problem | Solution |
|---|---|---|
| Too broad | Agent doesn’t know when to engage | Narrow the domain |
| No authority matrix | Asks permission for everything, or acts on everything | Define autonomous and approval-required actions |
| No context | Generic advice, not specific | Inject infrastructure context |
| No handoffs | Agent works in isolation | Define relationships |
| No customization | Same behavior for all clients | Add client-specific settings |
| No testing | Dangerous in production | Test incrementally |
Why We Built 8 Agents, Not 1
The Temptation
“Build one AI that handles all DevOps.”
It sounds efficient. One system to rule them all.
The Reality
Vertical beats horizontal for specialized domains:
- Monitoring requires different expertise than security
- Cost optimization requires different context than deployment
- 8 specialists > 1 generalist
Each of our agents is a deep expert in one thing. They collaborate when needed, but they don’t try to do everything.
Our Agents Know YOUR Infrastructure Because:
- They read your context files
- They apply your customizations
- They follow your runbooks
- They integrate with your tools
- They learn from your incidents
ChatGPT knows about DevOps. Our agents know YOUR infrastructure.
Next Steps
Option 1: Build Your Own
Use this guide to create agents for your specific needs. Start narrow, test thoroughly, expand gradually.
Option 2: Use Ours
Our 8 agents are already built, tested, and battle-hardened across dozens of client infrastructures.
Book a discovery call to see how our AI team can work with yours.
Option 3: Hybrid
Some clients use our agents while building their own for specific domains. We’re happy to share patterns and collaborate.
Ready to accelerate your infrastructure?
Our team of senior engineers + AI agents can implement these practices in days, not months.