Smart Routing (RL Sidecar)

aide includes a CPU-only reinforcement learning sidecar that learns which agent handles which type of task best. It runs entirely on your local machine — zero cloud LLM cost for routing decisions.

How it works

The sidecar uses a LinUCB contextual bandit:

  1. Each task is classified into 12 categories (infra, gpu, memory, disk, network, pipeline, parser, web, test, debug, deploy, security) via keyword matching
  2. Each agent has a bandit "arm" that tracks its performance per category
  3. When routing, the bandit scores each agent: exploitation (past success) + exploration (uncertainty bonus)
  4. After each dispatch, the reward signal updates the bandit: r = 0.7 * success + 0.3 * token_efficiency

The entire computation is a 12x12 matrix inverse — microseconds on any CPU.
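The steps above can be sketched in a few lines. This is a minimal illustration, not aide's actual code: names like Arm and classify are hypothetical, and because the context is a one-hot vector over the 12 categories, each agent's A matrix stays diagonal, so the 12x12 inverse reduces to elementwise reciprocals.

```python
import math

CATEGORIES = ["infra", "gpu", "memory", "disk", "network", "pipeline",
              "parser", "web", "test", "debug", "deploy", "security"]
ALPHA = 0.5  # exploration weight, as noted under Configuration

class Arm:
    """One LinUCB arm per agent (illustrative, not aide's API)."""
    def __init__(self, d=len(CATEGORIES)):
        self.A_diag = [1.0] * d  # A = I initially (ridge term); diagonal for one-hot x
        self.b = [0.0] * d

    def score(self, x):
        # theta = A^-1 b; score = theta.x + alpha * sqrt(x^T A^-1 x)
        theta = [bi / ai for ai, bi in zip(self.A_diag, self.b)]
        exploit = sum(t * xi for t, xi in zip(theta, x))
        explore = math.sqrt(sum(xi * xi / ai for xi, ai in zip(x, self.A_diag)))
        return exploit + ALPHA * explore

    def update(self, x, reward):
        # A += x x^T (stays diagonal for one-hot x); b += reward * x
        for i, xi in enumerate(x):
            self.A_diag[i] += xi * xi
            self.b[i] += reward * xi

def classify(task):
    # crude keyword matching into a one-hot context vector
    x = [0.0] * len(CATEGORIES)
    for i, cat in enumerate(CATEGORIES):
        if cat in task.lower():
            x[i] = 1.0
            break
    return x

x = classify("check gpu utilization on all nodes")
arms = {"infra-guardian": Arm(), "pipeline-doctor": Arm()}
# reward r = 0.7 * success + 0.3 * token_efficiency
arms["infra-guardian"].update(x, 0.7 * 1.0 + 0.3 * 0.8)
best = max(arms, key=lambda a: arms[a].score(x))
```

After one rewarded dispatch, infra-guardian's exploitation term outweighs pipeline-doctor's untouched arm, while the exploration bonus keeps unseen agents competitive until they accumulate history.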

MEDS failure clustering (Phase C)

When running aide policy-update --full, the sidecar also clusters past failures using embedding similarity:

  1. Failed tasks are embedded via local ollama (nomic-embed-text, 768 dims)
  2. Greedy cosine-similarity clustering groups recurring failure patterns
  3. At dispatch time, a task that resembles a known failure pattern applies a score penalty to the agent that previously failed on that pattern

This is inspired by MEDS (Memory-Enhanced Dynamic Reward Shaping) — density-based clustering of recurring error patterns penalizes repeated failures.
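The greedy clustering step can be sketched as follows. This is an assumption-laden toy: the similarity threshold and 2-dim vectors are made up for illustration (real embeddings are 768-dim from nomic-embed-text), and greedy_cluster is a hypothetical name, not aide's internals.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def greedy_cluster(embeddings, threshold=0.8):
    """Assign each failure embedding to the most similar existing
    centroid if similarity >= threshold, else start a new cluster."""
    centroids, members = [], []
    for e in embeddings:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(e, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(e))
            members.append([e])
        else:
            members[best].append(e)
            # recompute the centroid as the mean of its members
            n = len(members[best])
            centroids[best] = [sum(v[j] for v in members[best]) / n
                               for j in range(len(e))]
    return centroids

# two near-duplicate failures plus one distinct failure -> 2 clusters
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
cents = greedy_cluster(emb)
```

A single pass like this is O(failures x clusters), which is cheap enough to run inside policy-update --full; the resulting centroids are what gets persisted to failure_patterns.json.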

Files

  File                            Purpose
  ~/.aide/bandit.json             Per-agent LinUCB state (A matrices, b vectors)
  ~/.aide/policy.toml             Routing weights per category, updated by policy-update
  ~/.aide/failure_patterns.json   Failure cluster centroids (Phase C)

Usage

Automatic routing

# Let the bandit pick the best agent for a task
aide dispatch --auto "check GPU utilization on all nodes"

Output:

auto-selected: infra-guardian (score: 1.26)
  runner-up: pipeline-doctor (1.12)

Manual policy update

# Quick update: bandit only
aide policy-update

# Full update: bandit + failure clustering (requires ollama)
aide policy-update --full

Daemon integration

The daemon (aide up) runs policy-update automatically:

  • Once on startup
  • Every hour thereafter

Requirements

  Feature              Requirement
  Bandit routing       None (pure CPU)
  --auto dispatch      Registered agents + events history
  Failure clustering   ollama + ollama pull nomic-embed-text

Configuration

No configuration needed. The sidecar learns from your dispatch history automatically. The exploration parameter (alpha = 0.5) balances trying new agents vs exploiting known-good ones.