# Smart Routing (RL Sidecar)
`aide` includes a CPU-only reinforcement learning sidecar that learns which agent handles which type of task best. It runs entirely on your local machine, with zero cloud LLM cost for routing decisions.
## How it works
The sidecar uses a LinUCB contextual bandit:
- Each task is classified into 12 categories (infra, gpu, memory, disk, network, pipeline, parser, web, test, debug, deploy, security) via keyword matching
- Each agent has a bandit "arm" that tracks its performance per category
- When routing, the bandit scores each agent: exploitation (past success) + exploration (uncertainty bonus)
- After each dispatch, the reward signal updates the bandit: `r = 0.7 * success + 0.3 * token_efficiency`
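The keyword-matching step can be sketched as follows. This is illustrative only: the keyword lists and the fallback category are assumptions, and most of the 12 categories are omitted.

```python
# Hypothetical keyword-to-category matcher (not aide's actual keyword lists).
CATEGORY_KEYWORDS = {
    "gpu": ("gpu", "cuda", "vram"),
    "network": ("network", "dns", "latency"),
    "test": ("test", "pytest", "flaky"),
    # ...the remaining nine categories would follow the same pattern
}

def classify(task: str) -> str:
    """Return the first category whose keywords appear in the task text."""
    text = task.lower()
    for category, words in CATEGORY_KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return "infra"  # assumed fallback when nothing matches

print(classify("check GPU utilization on all nodes"))  # gpu
```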
The entire computation is a 12x12 matrix inverse, which takes microseconds on any CPU.
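The scoring and update rules above can be sketched in a few lines. This is a minimal illustration, not aide's actual implementation; the agent names and the diagonal simplification are assumptions. With one-hot category contexts, each arm's A matrix stays diagonal, so the 12x12 inverse reduces to per-category division:

```python
import math

CATEGORIES = ["infra", "gpu", "memory", "disk", "network", "pipeline",
              "parser", "web", "test", "debug", "deploy", "security"]
ALPHA = 0.5  # exploration weight

class Arm:
    """LinUCB arm for one agent, specialized to one-hot contexts."""
    def __init__(self):
        # Diagonal of A = I + sum of x x^T, and the b vector.
        self.a_diag = [1.0] * len(CATEGORIES)
        self.b = [0.0] * len(CATEGORIES)

    def score(self, cat: int) -> float:
        # exploitation (estimated reward) + exploration (uncertainty bonus)
        theta = self.b[cat] / self.a_diag[cat]
        return theta + ALPHA * math.sqrt(1.0 / self.a_diag[cat])

    def update(self, cat: int, reward: float) -> None:
        self.a_diag[cat] += 1.0
        self.b[cat] += reward

def route(arms: dict, cat: int) -> str:
    # Pick the agent with the highest UCB score for this category.
    return max(arms, key=lambda name: arms[name].score(cat))

# Hypothetical agents; reward = 0.7 * success + 0.3 * token_efficiency
arms = {"infra-guardian": Arm(), "pipeline-doctor": Arm()}
gpu = CATEGORIES.index("gpu")
arms["infra-guardian"].update(gpu, 0.7 * 1.0 + 0.3 * 0.8)
print(route(arms, gpu))  # infra-guardian
```

In the general LinUCB formulation the context need not be one-hot: the score is theta^T x + alpha * sqrt(x^T A^-1 x) with theta = A^-1 b, and the code above is that formula specialized to unit basis vectors.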
## MEDS failure clustering (Phase C)
When running `aide policy-update --full`, the sidecar also clusters past failures using embedding similarity:
- Failed tasks are embedded via local ollama (`nomic-embed-text`, 768 dims)
- Greedy cosine-similarity clustering groups recurring failure patterns
- At dispatch time, tasks similar to known failure patterns get a penalty for the agent that failed them
This is inspired by MEDS (Memory-Enhanced Dynamic Reward Shaping), in which density-based clustering of recurring error patterns penalizes repeated failures.
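The greedy clustering pass can be sketched like this. It is an illustrative implementation: the similarity threshold and the running-mean centroid update are assumptions, not documented behavior.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def greedy_cluster(embeddings, threshold=0.85):
    """Assign each embedding to the first centroid above the similarity
    threshold, updating that centroid as a running mean; otherwise start
    a new cluster."""
    centroids, members = [], []
    for emb in embeddings:
        for i, c in enumerate(centroids):
            if cosine(emb, c) >= threshold:
                members[i].append(emb)
                n = len(members[i])
                centroids[i] = [(x * (n - 1) + y) / n for x, y in zip(c, emb)]
                break
        else:
            centroids.append(list(emb))
            members.append([emb])
    return centroids, members

# Toy 2-d "embeddings" (the real ones are 768-dim nomic-embed-text vectors)
centroids, members = greedy_cluster([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]])
print(len(centroids))  # 2
```

At dispatch time, a new task whose embedding lands close to one of these centroids would incur the penalty for the agent that produced the failures in that cluster.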
## Files
| File | Purpose |
|---|---|
| `~/.aide/bandit.json` | Per-agent LinUCB state (A matrices, b vectors) |
| `~/.aide/policy.toml` | Routing weights per category, updated by `policy-update` |
| `~/.aide/failure_patterns.json` | Failure cluster centroids (Phase C) |
## Usage
### Automatic routing

```shell
# Let the bandit pick the best agent for a task
aide dispatch --auto "check GPU utilization on all nodes"
```

Output:

```
auto-selected: infra-guardian (score: 1.26)
runner-up: pipeline-doctor (1.12)
```
### Manual policy update

```shell
# Quick update: bandit only
aide policy-update

# Full update: bandit + failure clustering (requires ollama)
aide policy-update --full
```
### Daemon integration

The daemon (`aide up`) runs `policy-update` automatically:
- Once on startup
- Every hour thereafter
## Requirements
| Feature | Requirement |
|---|---|
| Bandit routing | None (pure CPU) |
| `--auto` dispatch | Registered agents + events history |
| Failure clustering | ollama + `ollama pull nomic-embed-text` |
## Configuration
No configuration is needed: the sidecar learns from your dispatch history automatically. The exploration parameter (`alpha = 0.5`) balances trying new agents against exploiting known-good ones.