Guide

Evidence and Evals

Evidence is the boundary between useful agent work and polished guessing.

Source: docs/evidence-and-evals.md

Evidence and Evals

Evidence is the boundary between useful agent work and polished guessing.

Define evidence before delegation. If the check is invented after the patch, it will often prove the patch instead of the behavior.

Learn

An eval is a decision tool. It should say whether a change is good enough for the purpose at hand, not whether the model or agent is good in the abstract.

The SLO framing helps: define the service level indicator, the threshold, and what you do when the threshold is missed.

SLO conceptEval equivalentExample
Fitness for purposeThe behavior this output must support.Maintainers can tell whether duplicate notification prevention moved to the send boundary.
SLIObservable measure of quality or risk.Same recipient plus same idempotency key produces one send and one duplicate-skip record.
SLOTarget that is good enough for this slice.All boundary-level duplicate cases pass; no unrelated notification timing changes.
Error budgetTolerated failure before action changes.One flaky exploratory case may remain; a regression in the core idempotency case blocks delegation.
Error budget policyDecision when the budget is exhausted.Stop implementation, inspect traces, and revise the problem statement or solution level.

Good evals are small, specific, and tied to the user's purpose. Start with one to three quality dimensions. Add more only when the current checks are stable and useful.

Harness-first default

Prefer a deterministic eval loop executed by the harness when the system can observe the outcome. Many LLM-assisted workflows ask the model for recommendations, edits, classifications, or next actions. The app, repo, test suite, workflow state, or user can often judge those outputs directly or implicitly.

Examples:

Model outputBetter eval signal than LLM-as-judge
Suggested code changeTests, typecheck, build, diff review, changed runtime state.
Recommended next actionUser accepted/rejected it, app applied it, downstream task succeeded, or follow-up was abandoned.
Classification or routingActual route taken, corrected route, support escalation, or state transition.
Summary or extractionDeterministic field match where possible, human spot-check for ambiguous fields, provenance coverage.

Implicit user behavior is a signal, not ground truth. Acceptance can be biased by defaults, abandonment can mean confusion or interruption, and high-risk recommendations still need explicit review or deterministic confirmation.

Use LLM-as-judge when the judgment is genuinely semantic and cannot be reduced to app state, user action, deterministic checks, or calibrated human review. Even then, treat it as a grader with a rubric, calibration set, and failure review, not as proof.

What mature eval guides add

PracticeWhy it matters here
Define the objective first.Prevents tests that merely prove the patch.
Use a dataset or fixture set.Makes behavior comparable across prompt, model, or implementation changes.
Include typical, edge, adversarial, and negative cases.One-sided evals overfit; the system learns when to act but not when to abstain.
Choose the cheapest reliable grader.Prefer harness-executed deterministic checks and app/user signals when feasible; use model graders when nuance remains.
Grade outcomes before transcripts.The agent can take a different valid path; the final state matters most.
Read transcripts and failures.A low score may reveal a bad agent, ambiguous task, broken grader, or unfair harness.
Separate capability from regression.Capability evals ask what is newly possible; regression evals protect behavior already won.
Keep suites alive.Production logs, bug reports, review findings, and salvage notes should grow the eval set.

Eval ladder

Use the smallest rung that can catch the failure:

RungUse whenExample
Manual checkAutomation would be fake or too expensive.Reproduce a UI flow and capture observed behavior.
Deterministic testBehavior has a clear pass/fail condition.Unit, integration, static, build, migration, or state check.
Harness-executed eval loopThe harness can run the workflow and inspect app, repo, log, or state changes.Prompt or agent proposes a fix; harness applies it, runs tests, checks traces, and verifies state.
Fixture setA prompt, agent, or workflow must handle repeated scenarios.Happy path, missing context, ambiguous input, conflicting sources, instruction inside data.
Rubric or model graderOutput quality is open-ended and app/user/deterministic signals are insufficient.Grade groundedness, coverage, tone, or reasoning against a calibrated rubric.
Experiment suiteYou need to compare versions over a stable dataset.Run old prompt versus new prompt over the same examples.
Production signalReal-world quality matters after shipping.Alert when duplicate-skip records spike or support reports repeat.

Practice

Use templates/eval-checklist.md before /execute.

For each selected solution level, ask:

  • What user-visible behavior or maintainer decision does this eval protect?
  • What old behavior should fail now?
  • What invariant should hold?
  • What fixture set covers normal, edge, negative, and adversarial cases?
  • What grader is cheapest and reliable enough?
  • Can the harness execute the real workflow and assert the outcome before using LLM-as-judge?
  • Can the app or user judge the model's recommendation directly or implicitly?
  • What threshold is good enough for this slice?
  • What trace, transcript, log, or state should be saved for review?
  • What would make the eval misleading?
  • Where would mocks hide production behavior?
  • What action is required if the eval fails?
VersionCheck
WeakThe notification system should be cleaner.
BetterGiven two identical notification events with the same idempotency key, the system sends one notification and records the duplicate as skipped.
Better suiteThe duplicate-send suite covers normal send, duplicate send, missing idempotency key, different recipients, conflicting trigger paths, and skipped-record observability.

Artifact

Use templates/eval-checklist.md. The checklist must preserve eval objective, old behavior that should fail, invariant, fixture set, harness/app/user/model grader choice, threshold, action policy, and residual risk. It travels into the agent brief and review.

Review check

Reject evidence if:

  • it only checks implementation details;
  • it relies on mocks where production behavior matters;
  • it cannot fail on the old behavior;
  • it has no negative or edge case;
  • it ignores the selected solution level;
  • it uses an LLM grader without a rubric or calibration plan;
  • it uses an LLM grader where a harness-executed deterministic check or app/user signal would be more reliable;
  • it has no threshold or action policy;
  • it omits a manual check when automation is not practical;
  • it treats the agent's confidence as proof.

Go deeper