Templates

Evidence and Eval Checklist

Define the evidence before asking the agent to implement.

Source: templates/eval-checklist.md

Evidence and Eval Checklist

Define the evidence before asking the agent to implement.

Aim

What user behavior, maintainer decision, or product outcome should this protect?

...

Selected problem statement

...

Selected solution level

  • [ ] Band-Aid
  • [ ] Local Optimum
  • [ ] Reframe
  • [ ] Redesign

Eval objective

The eval answers this question:

...

The eval is useful because:

  • ...

SLO-style quality bar

FieldValue
Fitness for purpose
SLI / observable measure
SLO / threshold
Error budget / tolerated failure
Action if budget is exhausted
Owner

Behavior or invariant

The system must:

  • ...
  • ...
  • ...

The system must reject, prevent, or avoid:

  • ...
  • ...
  • ...

Fixture set

FixtureInput or setupExpected outcomeGrader
Happy path
Old behavior / regression
Edge case
Negative case
Ambiguous input
Conflicting source or trigger path
Instruction inside data

Checks to add or update

  • [ ] Unit test:
  • [ ] Integration test:
  • [ ] End-to-end or manual reproduction:
  • [ ] Harness-executed workflow assertion:
  • [ ] App/user acceptance, correction, or abandonment signal:
  • [ ] Static check / lint / typecheck:
  • [ ] Build or migration check:
  • [ ] Documentation or runbook update:
  • [ ] Trace, transcript, log, or state capture:

Grading plan

Grader typeUse?Notes
Code or deterministic state check
Harness-executed workflow assertion
App or user signal
Human review
Model grader with rubric
Production signal

Before using a model grader:

  • [ ] harness cannot execute the workflow and assert the outcome;
  • [ ] app or user signals cannot judge the recommendation directly or implicitly;
  • [ ] deterministic checks would miss the quality dimension that matters;

If using a model grader:

  • [ ] rubric is explicit;
  • [ ] grader has a way to return uncertain or insufficient evidence;
  • [ ] sample outputs are calibrated against human judgment;
  • [ ] model grader is not the only proof for high-risk behavior.
  • [ ] model grader is compared with harness/app/user outcomes when those signals exist.

Failure modes

The output fails if it:

  • suppresses the symptom without addressing the selected problem;
  • changes unrelated behavior;
  • adds a second way to do the same thing;
  • weakens existing tests;
  • relies on mocks where production behavior matters;
  • hides errors behind broad fallbacks;
  • passes the grader while failing the user-visible outcome;
  • uses LLM-as-judge where harness, app state, or user behavior could judge the outcome;
  • cannot explain what would prove the patch wrong.

Human checks

I still have to check:

  • [ ] whether the selected solution level was right;
  • [ ] whether tests prove behavior, not implementation details;
  • [ ] whether the eval has positive, negative, and edge cases;
  • [ ] whether the threshold is good enough without gold-plating;
  • [ ] whether the patch reduces future change risk;
  • [ ] whether a maintainer would understand the boundary;
  • [ ] whether I am accepting confidence instead of evidence.

Fast feedback signal

I will know this is working if:

...