Smart Routing (RL Sidecar)

aide includes a CPU-only reinforcement learning sidecar that learns which agent handles which type of task best. It runs entirely on your local machine — zero cloud LLM cost for routing decisions.

How it works

The sidecar uses a LinUCB contextual bandit:

  1. Each task is classified into 12 categories (infra, gpu, memory, disk, network, pipeline, parser, web, test, debug, deploy, security) via keyword matching
  2. Each agent has a bandit "arm" that tracks its performance per category
  3. When routing, the bandit scores each agent: exploitation (past success) + exploration (uncertainty bonus)
  4. After each dispatch, the reward signal updates the bandit: r = 0.7 * success + 0.3 * token_efficiency

The entire computation is a 12x12 matrix inverse — microseconds on any CPU.
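The steps above can be sketched in a few lines. This is a minimal illustration, not aide's actual code: names like Arm and classify are hypothetical, and because the context is a one-hot vector over the 12 categories, each agent's A matrix stays diagonal, so the 12x12 inverse reduces to elementwise reciprocals.

```python
import math

CATEGORIES = ["infra", "gpu", "memory", "disk", "network", "pipeline",
              "parser", "web", "test", "debug", "deploy", "security"]
ALPHA = 0.5  # exploration weight, as noted under Configuration

class Arm:
    """One LinUCB arm per agent (illustrative, not aide's API)."""
    def __init__(self, d=len(CATEGORIES)):
        self.A_diag = [1.0] * d  # A = I initially (ridge term); diagonal for one-hot x
        self.b = [0.0] * d

    def score(self, x):
        # theta = A^-1 b; score = theta.x + alpha * sqrt(x^T A^-1 x)
        theta = [bi / ai for ai, bi in zip(self.A_diag, self.b)]
        exploit = sum(t * xi for t, xi in zip(theta, x))
        explore = math.sqrt(sum(xi * xi / ai for xi, ai in zip(x, self.A_diag)))
        return exploit + ALPHA * explore

    def update(self, x, reward):
        # A += x x^T (stays diagonal for one-hot x); b += reward * x
        for i, xi in enumerate(x):
            self.A_diag[i] += xi * xi
            self.b[i] += reward * xi

def classify(task):
    # crude keyword matching into a one-hot context vector
    x = [0.0] * len(CATEGORIES)
    for i, cat in enumerate(CATEGORIES):
        if cat in task.lower():
            x[i] = 1.0
            break
    return x

x = classify("check gpu utilization on all nodes")
arms = {"infra-guardian": Arm(), "pipeline-doctor": Arm()}
# reward r = 0.7 * success + 0.3 * token_efficiency
arms["infra-guardian"].update(x, 0.7 * 1.0 + 0.3 * 0.8)
best = max(arms, key=lambda a: arms[a].score(x))
```

After one rewarded dispatch, infra-guardian's exploitation term outweighs pipeline-doctor's untouched arm, while the exploration bonus keeps unseen agents competitive until they accumulate history.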

MEDS failure clustering (Phase C)

When running aide policy-update --full, the sidecar also clusters past failures using embedding similarity:

  1. Failed tasks are embedded via local ollama (nomic-embed-text, 768 dims)
  2. Greedy cosine-similarity clustering groups recurring failure patterns
  3. At dispatch time, a task that resembles a known failure pattern applies a score penalty to the agent that previously failed on that pattern

This is inspired by MEDS (Memory-Enhanced Dynamic Reward Shaping) — density-based clustering of recurring error patterns penalizes repeated failures.
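The greedy clustering step can be sketched as follows. This is an assumption-laden toy: the similarity threshold and 2-dim vectors are made up for illustration (real embeddings are 768-dim from nomic-embed-text), and greedy_cluster is a hypothetical name, not aide's internals.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def greedy_cluster(embeddings, threshold=0.8):
    """Assign each failure embedding to the most similar existing
    centroid if similarity >= threshold, else start a new cluster."""
    centroids, members = [], []
    for e in embeddings:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(e, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(e))
            members.append([e])
        else:
            members[best].append(e)
            # recompute the centroid as the mean of its members
            n = len(members[best])
            centroids[best] = [sum(v[j] for v in members[best]) / n
                               for j in range(len(e))]
    return centroids

# two near-duplicate failures plus one distinct failure -> 2 clusters
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
cents = greedy_cluster(emb)
```

A single pass like this is O(failures x clusters), which is cheap enough to run inside policy-update --full; the resulting centroids are what gets persisted to failure_patterns.json.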

Files

  File                            Purpose
  ~/.aide/bandit.json             Per-agent LinUCB state (A matrices, b vectors)
  ~/.aide/policy.toml             Routing weights per category, updated by policy-update
  ~/.aide/failure_patterns.json   Failure cluster centroids (Phase C)

Usage

Automatic routing

# Let the bandit pick the best agent for a task
aide dispatch --auto "check GPU utilization on all nodes"

Output:

auto-selected: infra-guardian (score: 1.26)
  runner-up: pipeline-doctor (1.12)

Manual policy update

# Quick update: bandit only
aide policy-update

# Full update: bandit + failure clustering (requires ollama)
aide policy-update --full

Daemon integration

The daemon (aide up) runs policy-update automatically:

  • Once on startup
  • Every hour thereafter

Requirements

  Feature              Requirement
  Bandit routing       None (pure CPU)
  --auto dispatch      Registered agents + events history
  Failure clustering   ollama + ollama pull nomic-embed-text

Configuration

No configuration needed. The sidecar learns from your dispatch history automatically. The exploration parameter (alpha = 0.5) balances trying new agents vs exploiting known-good ones.