Evidence and Evals
Evidence is the boundary between useful agent work and polished guessing.
Define evidence before delegation. A check invented after the patch tends to confirm what the patch does rather than test the behavior you wanted.
Learn
An eval is a decision tool. It should say whether a change is good enough for the purpose at hand, not whether the model or agent is good in the abstract.
The service level objective (SLO) framing helps: define the service level indicator (SLI), the threshold, and what you do when the threshold is missed.
| SLO concept | Eval equivalent | Example |
|---|---|---|
| Fitness for purpose | The behavior this output must support. | Maintainers can tell whether duplicate notification prevention moved to the send boundary. |
| SLI | Observable measure of quality or risk. | Same recipient plus same idempotency key produces one send and one duplicate-skip record. |
| SLO | Target that is good enough for this slice. | All boundary-level duplicate cases pass; no unrelated notification timing changes. |
| Error budget | Tolerated failure before action changes. | One flaky exploratory case may remain; a regression in the core idempotency case blocks delegation. |
| Error budget policy | Decision when the budget is exhausted. | Stop implementation, inspect traces, and revise the problem statement or solution level. |
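The rows of the table can be read as a small decision rule. The following is a minimal sketch, not a prescribed implementation: `EvalSlo`, `decide`, and the case names are hypothetical, and it assumes each eval case reports a plain pass/fail result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSlo:
    """Hypothetical container for one eval's SLO-style decision rule."""

    core_cases: frozenset  # cases whose failure always blocks (the SLO)
    exploratory_budget: int  # tolerated failures outside the core set (error budget)


def decide(slo: EvalSlo, results: dict) -> str:
    """Map per-case pass/fail results (the SLI) to an action (the policy)."""
    failed = {name for name, passed in results.items() if not passed}
    missing_core = slo.core_cases - set(results)
    # A core case that failed, or never ran, exhausts the budget immediately.
    if missing_core or (failed & slo.core_cases):
        return "block"
    # Non-core failures are tolerated only up to the stated budget.
    if len(failed) > slo.exploratory_budget:
        return "block"
    return "proceed"


# Assumed example values: one core case, one tolerated exploratory failure.
slo = EvalSlo(core_cases=frozenset({"duplicate_send"}), exploratory_budget=1)
outcome = decide(slo, {"duplicate_send": True, "flaky_exploratory": False})
```

Here `"block"` stands for the policy row: stop implementation, inspect traces, and revise the problem statement or solution level. Treating a core case that never ran as a failure is a design choice of this sketch, so that a silently skipped check cannot pass the gate.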
Good evals are small, specific, and tied to the user's purpose. Start with one to three quality dimensions. Add more only when the current checks are stable and useful.
Harness-first default
Prefer a deterministic eval loop executed by the harness when the system can observe the outcome. Many LLM-assisted workflows ask the model for recommendations, edits, classifications, or next actions. The app, repo, test suite, workflow state, or user can often judge those outputs directly or implicitly.
Examples:
| Model output | Better eval signal than LLM-as-judge |
|---|---|
| Suggested code change | Tests, typecheck, build, diff review, changed runtime state. |
| Recommended next action | User accepted/rejected it, app applied it, downstream task succeeded, or follow-up was abandoned. |
| Classification or routing | Actual route taken, corrected route, support escalation, or state transition. |
| Summary or extraction | Deterministic field match where possible, human spot-check for ambiguous fields, provenance coverage. |
Implicit user behavior is a signal, not ground truth. Acceptance can be biased by defaults, abandonment can mean confusion or interruption, and high-risk recommendations still need explicit review or deterministic confirmation.
Use LLM-as-judge when the judgment is genuinely semantic and cannot be reduced to app state, user action, deterministic checks, or calibrated human review. Even then, treat it as a grader with a rubric, calibration set, and failure review, not as proof.
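One way to treat a model grader as a grader rather than as proof is to measure how often it agrees with human labels on a calibration set before trusting its scores. The sketch below assumes boolean pass/fail labels keyed by example id; the function names and sample labels are illustrative only.

```python
def judge_agreement(human: dict, judge: dict) -> float:
    """Fraction of calibration examples where the model grader matches the human label."""
    if not human:
        raise ValueError("calibration set is empty")
    ungraded = set(human) - set(judge)
    if ungraded:
        raise ValueError(f"judge did not grade: {sorted(ungraded)}")
    matches = sum(1 for key, label in human.items() if judge[key] == label)
    return matches / len(human)


def disagreements(human: dict, judge: dict) -> list:
    """Example ids to read by hand during failure review."""
    return sorted(key for key, label in human.items() if judge[key] != label)


# Illustrative labels: True means "output is acceptable".
human_labels = {"a": True, "b": False, "c": True, "d": False}
judge_labels = {"a": True, "b": False, "c": False, "d": False}

rate = judge_agreement(human_labels, judge_labels)
to_review = disagreements(human_labels, judge_labels)
```

The agreement rate needed before relying on the grader is a threshold to set per slice, not something this sketch decides. The disagreement list feeds the failure review: each item may point to a weak rubric, an ambiguous example, or a wrong human label.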
What mature eval guides add
| Practice | Why it matters here |
|---|---|
| Define the objective first. | Prevents tests that merely prove the patch. |
| Use a dataset or fixture set. | Makes behavior comparable across prompt, model, or implementation changes. |
| Include typical, edge, adversarial, and negative cases. | One-sided evals overfit; the system learns when to act but not when to abstain. |
| Choose the cheapest reliable grader. | Prefer harness-executed deterministic checks and app/user signals when feasible; use model graders when nuance remains. |
| Grade outcomes before transcripts. | The agent can take a different valid path; the final state matters most. |
| Read transcripts and failures. | A low score may reveal a bad agent, ambiguous task, broken grader, or unfair harness. |
| Separate capability from regression. | Capability evals ask what is newly possible; regression evals protect behavior already won. |
| Keep suites alive. | Production logs, bug reports, review findings, and salvage notes should grow the eval set. |
Eval ladder
Use the smallest rung that can catch the failure:
| Rung | Use when | Example |
|---|---|---|
| Manual check | Automation would give false assurance or cost too much. | Reproduce a UI flow and capture observed behavior. |
| Deterministic test | Behavior has a clear pass/fail condition. | Unit, integration, static, build, migration, or state check. |
| Harness-executed eval loop | The harness can run the workflow and inspect app, repo, log, or state changes. | Prompt or agent proposes a fix; harness applies it, runs tests, checks traces, and verifies state. |
| Fixture set | A prompt, agent, or workflow must handle repeated scenarios. | Happy path, missing context, ambiguous input, conflicting sources, instruction inside data. |
| Rubric or model grader | Output quality is open-ended and app/user/deterministic signals are insufficient. | Grade groundedness, coverage, tone, or reasoning against a calibrated rubric. |
| Experiment suite | You need to compare versions over a stable dataset. | Run old prompt versus new prompt over the same examples. |
| Production signal | Real-world quality matters after shipping. | Alert when duplicate-skip records spike or support reports repeat. |
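The fixture-set and experiment-suite rungs can be combined: run the old and new versions over the same fixtures with a deterministic grader, then separate regressions from fixes. This is a minimal sketch under assumptions: the routing functions, fixture inputs, and exact-match grading are all hypothetical stand-ins for a real prompt or agent.

```python
from typing import Callable


def run_suite(candidate: Callable[[str], str], fixtures: list) -> dict:
    """Grade each fixture by exact match against its expected output."""
    return {text: candidate(text) == expected for text, expected in fixtures}


def compare(old: Callable[[str], str], new: Callable[[str], str], fixtures: list) -> dict:
    """Split changes into regressions (passed before, fails now) and fixes."""
    before = run_suite(old, fixtures)
    after = run_suite(new, fixtures)
    return {
        "regressions": sorted(k for k in before if before[k] and not after[k]),
        "fixes": sorted(k for k in before if not before[k] and after[k]),
    }


def old_route(text: str) -> str:
    # Stand-in for the previous version of a routing prompt.
    return "billing" if "invoice" in text else "general"


def new_route(text: str) -> str:
    # Stand-in for the revised version, which also recognizes refunds.
    return "billing" if ("invoice" in text or "refund" in text) else "general"


# Each fixture is (input, expected route); the last one is a negative case.
fixtures = [
    ("invoice overdue", "billing"),
    ("refund please", "billing"),
    ("hello there", "general"),
]

report = compare(old_route, new_route, fixtures)
```

Keeping the two lists separate mirrors the capability versus regression split: a fix is evidence of something newly possible, while any regression touches behavior already won and deserves a look before the new version ships.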
Practice
Use templates/eval-checklist.md before /execute.
For each selected solution level, ask:
- What user-visible behavior or maintainer decision does this eval protect?
- What old behavior should fail now?
- What invariant should hold?
- What fixture set covers normal, edge, negative, and adversarial cases?
- What grader is cheapest and reliable enough?
- Can the harness execute the real workflow and assert the outcome before using LLM-as-judge?
- Can the app or user judge the model's recommendation directly or implicitly?
- What threshold is good enough for this slice?
- What trace, transcript, log, or state should be saved for review?
- What would make the eval misleading?
- Where would mocks hide production behavior?
- What action is required if the eval fails?
| Version | Check |
|---|---|
| Weak | The notification system should be cleaner. |
| Better | Given two identical notification events with the same idempotency key, the system sends one notification and records the duplicate as skipped. |
| Better suite | The duplicate-send suite covers normal send, duplicate send, missing idempotency key, different recipients, conflicting trigger paths, and skipped-record observability. |
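The "Better" row translates directly into a deterministic check that passes on the intended behavior and fails on the old one. The sketch below uses two hypothetical in-memory notifiers to show this; the class names, recipients, and keys are invented for illustration. The handling of a missing idempotency key (send without deduplication) is an assumption, since the source names the case but does not define the expected behavior.

```python
class NaiveNotifier:
    """Stand-in for the old behavior: every event is sent."""

    def __init__(self):
        self.sent = []
        self.skipped = []

    def send(self, recipient, key, body):
        self.sent.append((recipient, body))
        return "sent"


class IdempotentNotifier:
    """Stand-in for the new behavior: dedupe on (recipient, key) at the send boundary."""

    def __init__(self):
        self.sent = []
        self.skipped = []
        self._seen = set()

    def send(self, recipient, key, body):
        if key is None:
            # Assumption: events without a key are sent without deduplication.
            self.sent.append((recipient, body))
            return "sent"
        token = (recipient, key)
        if token in self._seen:
            self.skipped.append(token)  # observable duplicate-skip record
            return "skipped"
        self._seen.add(token)
        self.sent.append((recipient, body))
        return "sent"


def duplicate_case_passes(notifier) -> bool:
    """Same recipient plus same key: one send and one duplicate-skip record."""
    notifier.send("ana@example.com", "order-42", "Your order shipped")
    notifier.send("ana@example.com", "order-42", "Your order shipped")
    return len(notifier.sent) == 1 and len(notifier.skipped) == 1


def different_recipients_case_passes(notifier) -> bool:
    """Same key but different recipients: both send, nothing is skipped."""
    notifier.send("ana@example.com", "order-42", "Your order shipped")
    notifier.send("ben@example.com", "order-42", "Your order shipped")
    return len(notifier.sent) == 2 and not notifier.skipped
```

Running `duplicate_case_passes` against both notifiers answers the question "what old behavior should fail now?": the check is only useful evidence if the naive version fails it.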
Artifact
Use templates/eval-checklist.md. The checklist must preserve eval objective, old behavior that should fail, invariant, fixture set, harness/app/user/model grader choice, threshold, action policy, and residual risk. It travels into the agent brief and review.
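If the checklist is captured as structured data, a small gate can confirm nothing was dropped before it travels into the agent brief. This is a sketch only: the field names below are hypothetical labels for the items listed above, and the real layout of templates/eval-checklist.md may differ.

```python
# Hypothetical field names mirroring the items the checklist must preserve.
REQUIRED_FIELDS = (
    "objective",
    "old_behavior_that_should_fail",
    "invariant",
    "fixture_set",
    "grader",
    "threshold",
    "action_policy",
    "residual_risk",
)


def missing_fields(checklist: dict) -> list:
    """Return required fields that are absent or blank, in checklist order."""
    return [
        name
        for name in REQUIRED_FIELDS
        if not str(checklist.get(name, "")).strip()
    ]


# Illustrative draft with the threshold left blank and residual risk omitted.
draft = {
    "objective": "Duplicate notification prevention moved to the send boundary",
    "old_behavior_that_should_fail": "Two identical events produce two sends",
    "invariant": "Same recipient plus same key produces one send",
    "fixture_set": "normal, duplicate, missing key, different recipients",
    "grader": "harness-executed deterministic test",
    "threshold": "   ",
    "action_policy": "Stop, inspect traces, revise the problem statement",
}

gaps = missing_fields(draft)
```

A non-empty result is a reason to finish the checklist before delegating. It says nothing about whether the filled-in answers are good, which is what the review check is for.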
Review check
Reject evidence if:
- it only checks implementation details;
- it relies on mocks where production behavior matters;
- it cannot fail on the old behavior;
- it has no negative or edge case;
- it ignores the selected solution level;
- it uses an LLM grader without a rubric or calibration plan;
- it uses an LLM grader where a harness-executed deterministic check or app/user signal would be more reliable;
- it has no threshold or action policy;
- it omits a manual check when automation is not practical;
- it treats the agent's confidence as proof.
Go deeper
- templates/eval-checklist.md — evidence template used by the tutorial.
- Implementing SLOs for Data Quality — SLI, SLO, error budget, policy, and fitness-for-purpose framing applied to quality.
- OpenAI evaluation best practices — eval objective, dataset, metrics, iteration, continuous evaluation, and anti-patterns.
- Anthropic: Demystifying evals for AI agents — tasks, trials, graders, transcripts, outcomes, capability versus regression evals, and eval maintenance.
- Anthropic eval cookbook — prompt, output, golden answer, score, and code/model/human grading methods.
- OpenAI eval build guide — datasets, reference answers, eval templates, model-graded evals, and meta-evals.
- Phoenix Iterative Evaluation & Experimentation Workflow — tracing, datasets, evaluators, experiments, and iterative improvement.
- Dissent Mode — why passing checks still deserves adversarial review.