aide ab

A/B comparison harness for routing strategies (ICSE issue #95).

Compares two arms over finished events in ~/.aide/events.jsonl:

  • Treatment: bandit — LinUCB with a MEDS failure penalty (set when dispatching via aide dispatch --auto).
  • Control: round-robin — flat cycling over candidates.

Each dispatched event carries an optional routing field; the matching finished event inherits it by looking up the dispatched event for the same issue.
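The issue-lookup join can be sketched in a few lines. This is a hedged illustration, not the harness's actual code: the field names (`type`, `issue`, `routing`) and the helper names are assumptions about the JSONL schema.

```python
import json

def load_events(path):
    """Parse one JSON object per line from an events JSONL file."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def attach_routing(events):
    """Copy the optional routing field from each dispatched event onto the
    finished event for the same issue (field names are assumed here)."""
    routing_by_issue = {
        e["issue"]: e["routing"]
        for e in events
        if e.get("type") == "dispatched" and "routing" in e
    }
    for e in events:
        if e.get("type") == "finished" and e.get("issue") in routing_by_issue:
            e.setdefault("routing", routing_by_issue[e["issue"]])
    return events
```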

Subcommands

aide ab analyze [--events PATH]

Partition finished events by arm and print a per-metric table plus deltas. Warns if either arm has fewer than 5 events; the target is 25+ per arm.

ARM              N   SUCCESS_RATE   AVG_TOKENS   AVG_TIME_MS   AVG_COMPRESSION  AVG_REWARD
──────────────────────────────────────────────────────────────────────────────────────
round-robin     50          0.620      33102.0       40221.0            0.0146       0.730
bandit          50          0.840      18011.0       28019.0            0.0140       0.851

deltas (bandit − round-robin; rel = (b-r)/|r|):
  success_rate       abs=     +0.2200  rel= +35.48%
  avg_tokens         abs= -15091.0000  rel= -45.59%
  ...

Default events path: ~/.aide/events.jsonl.
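The delta columns follow directly from the formula in the header (abs = b − r, rel = (b − r)/|r|). A minimal sketch of that computation, with a hypothetical function name:

```python
def deltas(bandit, rr):
    """Per-metric deltas: abs = b - r; rel = (b - r) / |r|,
    matching the analyze output header."""
    out = {}
    for metric in bandit:
        b, r = bandit[metric], rr[metric]
        out[metric] = {
            "abs": b - r,
            "rel": (b - r) / abs(r) if r else float("nan"),
        }
    return out
```

For example, the success_rate row above is deltas({"success_rate": 0.84}, {"success_rate": 0.62}): abs = +0.22, rel = +35.48%.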

aide ab export [--events PATH] [--out PATH.csv]

Emit a CSV with one row per finished event. Columns:

timestamp,arm,agent,task_category,success,cloud_tokens,local_cpu_ms,compression_ratio,reward,duration_ms

Default output: ./ab-export.csv. task_category is keyword-inferred from the issue + task snippet.

aide ab simulate --n N [--seed N]

Generate N synthetic dispatched+finished event pairs per arm, write to a fresh tempfile (NEVER touches production), and run analyze on it. Useful for validating the analysis pipeline before real experiment data lands.

aide ab simulate --n 50 --seed 42

The simulator is hardcoded so that bandit beats round-robin in success rate and tokens, matching the hypothesis under test.
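A seeded generator producing paired dispatched+finished events per arm might look like the following. This is a sketch only: the per-arm success rates and token means are illustrative numbers (loosely echoing the example table), not the simulator's real parameters.

```python
import random

def simulate(n, seed=None):
    """Generate n synthetic dispatched+finished event pairs per arm.
    Bandit is hardcoded to win on success rate and tokens (assumed rates)."""
    rng = random.Random(seed)  # seeded for reproducible runs
    events = []
    for arm, p_success, tok_mu in [("round-robin", 0.62, 33000),
                                   ("bandit", 0.84, 18000)]:
        for i in range(n):
            issue = f"{arm}-{i}"
            events.append({"type": "dispatched", "issue": issue,
                           "routing": arm})
            events.append({
                "type": "finished", "issue": issue, "routing": arm,
                "success": rng.random() < p_success,
                "cloud_tokens": max(0, int(rng.gauss(tok_mu, tok_mu * 0.2))),
            })
    return events
```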

Statistical approach

The harness reports raw absolute and relative deltas (no t-test or bootstrap CI). Rationale: with the target n=25 per arm, classical NHST is underpowered for small effect sizes anyway; deltas + per-event CSV export let downstream notebooks (R / pandas) run whatever test the paper demands.
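As one example of what a downstream notebook might do with the exported per-event data, a percentile bootstrap CI on the success-rate delta needs only the stdlib. This is not part of the harness; the function name and bootstrap parameters are assumptions.

```python
import random

def bootstrap_ci(xs, ys, n_boot=2000, seed=0):
    """95% percentile bootstrap CI for mean(xs) - mean(ys),
    e.g. per-event success indicators for bandit vs round-robin."""
    rng = random.Random(seed)
    mean = lambda v: sum(v) / len(v)
    diffs = sorted(
        mean(rng.choices(xs, k=len(xs))) - mean(rng.choices(ys, k=len(ys)))
        for _ in range(n_boot)
    )
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]
```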

See also