Guide

Eval Tutorial

Use this tutorial when “the agent passed” is not enough evidence.

Source: docs/evals-tutorial.md

The builder loop already asks for evidence before delegation. This path goes deeper: turn a vague check into a small eval suite with a purpose, fixtures, grader, threshold, and action policy.

flowchart LR
    aim[Aim] --> purpose[Eval purpose]
    purpose --> sli[Signal / SLI]
    sli --> fixtures[Fixture set]
    fixtures --> grader[Grader]
    grader --> threshold[Threshold + action]
    threshold --> review[Read failures]
    review --> grow[Grow suite from real failures]

Use this when

  • a prompt, workflow, or agent behavior will be reused;
  • passing one manual check would create false confidence;
  • a model grader or rubric is being considered;
  • the team disagrees about what “good” means;
  • production failures or review findings should become repeatable cases;
  • recommendations or suggestions can be judged by app state, workflow completion, or user acceptance instead of another LLM;
  • a change needs a quality bar, not just a yes/no test.

Skip this path when a normal regression test or manual reproduction is enough.

Step 1: Name fitness for purpose

Read Evidence and Evals, then use templates/eval-checklist.md.

Borrow the SLO framing: quality only makes sense relative to the purpose the output serves.

| Question | Example |
| --- | --- |
| Who relies on this behavior? | Maintainers changing notification logic. |
| What decision or outcome should the eval protect? | Duplicate prevention moved to the send boundary. |
| What failure matters? | A patch suppresses one caller path while another still duplicates sends. |
| What is good enough for this slice? | Core idempotency cases pass and no unrelated timing behavior changes. |

Do not start by choosing a metric. Start by naming the user, maintainer, or system decision the eval is supposed to protect.

Step 2: Choose the signal

Turn the purpose into an observable measure.

| SLO term | Eval term | Example |
| --- | --- | --- |
| SLI | Observable signal | Same recipient plus same idempotency key produces one send and one duplicate-skip record. |
| SLO | Target threshold | All boundary-level duplicate cases pass. |
| Error budget | Tolerated failure | One flaky exploratory case can remain; core regression failure blocks delegation. |
| Policy | Action on miss | Stop implementation, inspect traces, revise framing or solution level. |

This is the part the framework was already doing implicitly through aim, mechanism, feedback, guardrails, review, dissent, and salvage. The eval tutorial makes it explicit.
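As a minimal sketch of how the SLI from the table could become an observable check, the Python below assumes a hypothetical in-memory `Sender` that deduplicates on recipient plus idempotency key. The class, its fields, and `measure_duplicate_signal` are illustrative names, not part of any real notification system.

```python
from dataclasses import dataclass, field


@dataclass
class Sender:
    """Hypothetical send boundary that deduplicates on (recipient, key)."""
    sent: list = field(default_factory=list)
    skipped: list = field(default_factory=list)
    _seen: set = field(default_factory=set)

    def send(self, recipient: str, key: str, body: str) -> bool:
        token = (recipient, key)
        if token in self._seen:
            # Record the duplicate so the eval can observe the skip.
            self.skipped.append(token)
            return False
        self._seen.add(token)
        self.sent.append((recipient, key, body))
        return True


def measure_duplicate_signal(sender, recipient, key, attempts=2):
    """SLI: count sends and duplicate-skip records for repeated identical requests."""
    for _ in range(attempts):
        sender.send(recipient, key, "hello")
    sends = sum(1 for r, k, _ in sender.sent if (r, k) == (recipient, key))
    skips = sender.skipped.count((recipient, key))
    return {"sends": sends, "duplicate_skips": skips}


signal = measure_duplicate_signal(Sender(), "user-1", "key-abc")
# SLO for this case: exactly one send and one duplicate-skip record.
slo_met = signal == {"sends": 1, "duplicate_skips": 1}
```

The point of the sketch is that the signal is a count read from state, not a judgment about a transcript; a real suite would run the same measurement against the actual send boundary.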

Step 3: Build a small fixture set

Start with a handful of cases. Mature eval guides consistently warn against vague or one-sided evals.

| Fixture type | What it catches |
| --- | --- |
| Happy path | The expected behavior works. |
| Old behavior / regression | The prior failure no longer passes. |
| Edge case | The boundary is real, not accidental. |
| Negative case | The system does not act when it should not. |
| Ambiguous input | The system asks, scopes, or reports uncertainty instead of guessing. |
| Conflicting evidence | The system reports conflict and authority instead of smoothing it over. |
| Instruction inside data | User or retrieved text cannot override stable instructions. |

A good fixture is unambiguous enough that two reviewers would make the same pass/fail call.
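One way to keep fixtures unambiguous is to store each case as data with an explicit expected outcome. The sketch below is an assumption-laden illustration for the duplicate-send example: the `Fixture` fields, the fixture names, and the `count_sends` stand-in for the behavior under test are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fixture:
    name: str
    kind: str            # one of the fixture types listed above
    requests: tuple      # (recipient, idempotency_key) pairs, in order
    expected_sends: int  # a single number two reviewers can agree on


FIXTURES = [
    Fixture("single send", "happy_path", (("a", "k1"),), 1),
    Fixture("retry same key", "regression", (("a", "k1"), ("a", "k1")), 1),
    Fixture("same key, different recipient", "edge_case",
            (("a", "k1"), ("b", "k1")), 2),
    Fixture("no requests", "negative", (), 0),
]


def count_sends(requests):
    """Stand-in for the system under test: one send per distinct (recipient, key)."""
    return len(set(requests))


# Map each fixture name to a plain pass/fail result.
results = {f.name: count_sends(f.requests) == f.expected_sends for f in FIXTURES}
```

Each fixture carries its type, so a reviewer can see at a glance which rows of the table are still uncovered.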

Step 4: Pick the cheapest reliable grader

Prefer harness-executed outcome checks over transcript policing or offline LLM-as-judge. Valid solutions may take paths you did not predict, and recommendations are often judged by what the app or user does next.

| Grader | Use when | Watch out for |
| --- | --- | --- |
| Code or state check | There is a clear pass/fail outcome. | Brittle checks that reject valid variation. |
| Harness-executed eval loop | The harness can run the workflow and inspect app, repo, trace, or state changes. | Simulated flows that miss production behavior or user context. |
| App or user signal | The model gives a recommendation, suggestion, route, or candidate action. | Implicit signals are noisy; high-risk decisions still need explicit review. |
| Human review | Judgment requires domain expertise. | Cost and inconsistency. Use it to calibrate other graders. |
| Model grader | The output is open-ended and cannot be judged reliably by harness, app/user signal, deterministic check, or calibrated human review. | Vague rubrics, grader drift, no “insufficient evidence” option. |
| Production signal | Real users and real conditions matter after shipping. | Reactive signals without a pre-ship regression suite. |

For recommendation systems and assistant suggestions, ask whether the system already has a judgment signal: accepted recommendation, edited suggestion, applied patch, completed workflow, corrected route, abandoned suggestion, or repeated user override. Those signals usually beat asking another LLM whether the suggestion looked good.
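A rough sketch of reading such a judgment signal from logged events follows. The event labels and the decision to treat "edited" as neutral are assumptions made for illustration; a real system would define its own labels and weighting.

```python
from collections import Counter

# Hypothetical outcome labels, loosely based on the signals named above.
POSITIVE = {"accepted", "applied", "completed"}
NEGATIVE = {"abandoned", "overridden"}
# "edited" is treated as neutral here; that choice is an assumption.


def acceptance_rate(events):
    """Share of judged suggestions with a positive outcome, or None if none were judged."""
    counts = Counter(events)
    positive = sum(counts[label] for label in POSITIVE)
    negative = sum(counts[label] for label in NEGATIVE)
    judged = positive + negative
    return None if judged == 0 else positive / judged


events = ["accepted", "edited", "abandoned", "applied", "accepted", "overridden"]
rate = acceptance_rate(events)  # 3 positive out of 5 judged
```

Returning `None` when nothing was judged keeps an empty log from looking like a perfect or a failing score.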

If a model grader is still needed, give it a rubric, examples, and a way to say the evidence is insufficient. Periodically compare it with human judgment and deterministic outcomes.
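The calibration step can be sketched as an agreement check between grader verdicts and human labels, with an explicit abstain option. The `Verdict` enum and `agreement` function are hypothetical; how the verdicts are produced (the model call and rubric) is deliberately left out.

```python
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    INSUFFICIENT = "insufficient_evidence"  # the grader may decline to judge


def agreement(grader, human):
    """Return (agreement rate over non-abstained cases, number of abstentions)."""
    pairs = [(g, h) for g, h in zip(grader, human) if g is not Verdict.INSUFFICIENT]
    abstained = len(grader) - len(pairs)
    if not pairs:
        return None, abstained
    matches = sum(1 for g, h in pairs if g is h)
    return matches / len(pairs), abstained


grader_verdicts = [Verdict.PASS, Verdict.FAIL, Verdict.INSUFFICIENT, Verdict.PASS]
human_verdicts = [Verdict.PASS, Verdict.FAIL, Verdict.FAIL, Verdict.FAIL]
rate, abstained = agreement(grader_verdicts, human_verdicts)
```

Tracking abstentions separately matters: a grader that abstains often may show high agreement on the few cases it does judge.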

Step 5: Set threshold and action

An eval without a decision rule becomes a dashboard. Define what happens when it fails.

| Result | Action |
| --- | --- |
| Core regression fails | Do not delegate or ship. Fix the behavior or revisit framing. |
| Edge fixture fails | Decide whether the edge is in scope; if yes, fix; if no, record residual risk. |
| Model grader and human disagree | Calibrate the rubric before trusting the score. |
| Production signal violates threshold | Spend the error budget deliberately: pause new work, inspect traces, or open a focused task. |

Avoid gold-plating. The SLO lesson is that 100% quality is neither always possible nor always worth the cost. Choose the bar that protects the purpose.
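A decision rule of this kind can be written down as a small function so the action is not left to interpretation. This is a sketch under assumptions: the tier names (`core`, `edge`, `exploratory`), the action strings, and the fixture names are invented for illustration, and exploratory failures are tolerated as the error budget.

```python
def decide(results, edge_in_scope=frozenset()):
    """Map fixture results to an action.

    results: fixture name -> (tier, passed), tier in {"core", "edge", "exploratory"}.
    edge_in_scope: names of edge fixtures the team has decided are in scope.
    """
    core_failed = [n for n, (tier, ok) in results.items() if tier == "core" and not ok]
    if core_failed:
        return "block", core_failed  # do not delegate or ship
    edge_failed = [n for n, (tier, ok) in results.items() if tier == "edge" and not ok]
    in_scope = [n for n in edge_failed if n in edge_in_scope]
    if in_scope:
        return "fix", in_scope
    if edge_failed:
        return "record_residual_risk", edge_failed
    # Exploratory failures fall within the tolerated error budget.
    return "proceed", []


results = {
    "retry same key": ("core", True),
    "clock skew": ("edge", False),
    "flaky timing": ("exploratory", False),
}
action, cases = decide(results)
```

Passing `edge_in_scope={"clock skew"}` would change the action from recording residual risk to fixing the case, which mirrors the scope decision in the table.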

Step 6: Read failures

Scores are not self-explanatory. Read traces, transcripts, outputs, and grader explanations.

A failure can mean:

  • the agent failed;
  • the prompt or context is unclear;
  • the task is ambiguous;
  • the grader is unfair;
  • the eval used an offline model judge even though the harness or app could have checked the outcome;
  • the harness hides production behavior;
  • the selected solution level was wrong.

Feed real failures back into the suite. Review findings, dissent memos, production bugs, and salvage notes are high-value eval cases.
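To make the reading step concrete, a reviewer's cause for each failure can be recorded and tallied. In this sketch the cause labels paraphrase the list above, the case IDs are made up, and the rule that only genuine agent failures become regression candidates is an assumption; other causes point at repairing the eval itself.

```python
from collections import Counter

# Hypothetical labels, one per failure meaning listed above.
CAUSES = {
    "agent", "prompt_or_context", "ambiguous_task", "unfair_grader",
    "wrong_grader_type", "harness_gap", "wrong_solution_level",
}


def triage(failures):
    """failures: list of (case_id, cause) assigned after reading the trace."""
    unknown = sorted({cause for _, cause in failures if cause not in CAUSES})
    if unknown:
        raise ValueError(f"unknown causes: {unknown}")
    by_cause = Counter(cause for _, cause in failures)
    # Assumption: genuine agent failures become regression fixtures.
    regression_candidates = [case for case, cause in failures if cause == "agent"]
    return by_cause, regression_candidates


failures = [
    ("case-07", "agent"),
    ("case-12", "unfair_grader"),
    ("case-15", "agent"),
    ("case-21", "harness_gap"),
]
by_cause, regression_candidates = triage(failures)
```

A tally dominated by grader or harness causes suggests fixing the eval before drawing conclusions about the agent.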

Output

By the end, you should have:

  • eval objective;
  • fixture set;
  • grader choice;
  • harness-executed deterministic loop, if feasible;
  • app or user judgment signal for recommendations and suggestions;
  • threshold and action policy;
  • traces or transcripts to inspect;
  • residual risk;
  • cases to add later when real failures appear.

Source references