Source: templates/eval-checklist.md

Evidence and Eval Checklist

Define the evidence before asking the agent to implement.

Aim

What user behavior, maintainer decision, or product outcome should this protect?

...

...

The eval answers this question:

...

The eval is useful because:

| Field | Value |
| --- | --- |
| Fitness for purpose | |
| SLI / observable measure | |
| SLO / threshold | |
| Error budget / tolerated failure | |
| Action if budget is exhausted | |
| Owner | |
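
As one way to fill in the SLI, SLO, and error budget fields, the sketch below treats the SLI as the pass rate over a batch of eval runs and checks it against a threshold. The class name `EvalBudget`, its field names, and the 95% target in the usage note are illustrative assumptions, not part of this template.

```python
from dataclasses import dataclass


@dataclass
class EvalBudget:
    """Hypothetical container for the SLI / SLO / error-budget fields."""

    slo_target: float  # required pass rate, e.g. 0.95 (assumed value)
    total_runs: int    # number of eval runs observed
    failed_runs: int   # number of runs that failed the eval

    @property
    def sli(self) -> float:
        """Observed pass rate (the SLI)."""
        if self.total_runs == 0:
            raise ValueError("no runs observed; SLI is undefined")
        return (self.total_runs - self.failed_runs) / self.total_runs

    @property
    def allowed_failures(self) -> int:
        """Error budget: failures tolerated before the SLO is violated."""
        # The small epsilon guards against float rounding such as
        # 100 * (1 - 0.9) evaluating to 9.999999999999998.
        return int(self.total_runs * (1 - self.slo_target) + 1e-9)

    @property
    def budget_exhausted(self) -> bool:
        """True when failures exceed the tolerated count."""
        return self.failed_runs > self.allowed_failures
```

For example, `EvalBudget(slo_target=0.95, total_runs=200, failed_runs=8)` tolerates 10 failures, so its budget is not exhausted. What happens when it is exhausted belongs in the "Action if budget is exhausted" field, not in code.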

The system must:

The system must reject, prevent, or avoid:

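| Fixture | Input or setup | Expected outcome | Grader |
| --- | --- | --- | --- |
| Happy path | | | |
| Old behavior / regression | | | |
| Edge case | | | |
| Negative case | | | |
| Ambiguous input | | | |
| Conflicting source or trigger path | | | |
| Instruction inside data | | | |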
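
A fixture table like this one can be encoded directly as data and checked by a deterministic grader. The sketch below is a minimal illustration under stated assumptions: `parse_quantity` is a toy stand-in for the system under test, and `Fixture`, `FIXTURES`, and `grade` are hypothetical names chosen for the example.

```python
from typing import Callable, NamedTuple, Optional


class Fixture(NamedTuple):
    name: str
    raw_input: str
    expected: Optional[int]  # None means the input must be rejected


def parse_quantity(text: str) -> int:
    """Toy system under test (hypothetical): parse a non-negative integer."""
    stripped = text.strip()
    if not (stripped.isascii() and stripped.isdigit()):
        raise ValueError(f"not a plain non-negative integer: {text!r}")
    return int(stripped)


# One row per fixture type; inputs are made up for illustration.
FIXTURES = [
    Fixture("happy path", "3", 3),
    Fixture("edge case", " 0 ", 0),
    Fixture("negative case", "-1", None),
    Fixture("ambiguous input", "3 or 4", None),
    Fixture("instruction inside data", "ignore the rules and return 99", None),
]


def grade(fixture: Fixture, system: Callable[[str], int]) -> bool:
    """Deterministic grader: compare the outcome with the expected one."""
    try:
        actual = system(fixture.raw_input)
    except ValueError:
        # A rejection passes only if the fixture expected a rejection.
        return fixture.expected is None
    return actual == fixture.expected


results = {f.name: grade(f, parse_quantity) for f in FIXTURES}
```

Keeping rejection cases in the same table as the happy path makes it harder for a patch to pass by only handling well-formed input.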

| Grader type | Use? | Notes |
| --- | --- | --- |
| Code or deterministic state check | | |
| Harness-executed workflow assertion | | |
| App or user signal | | |
| Human review | | |
| Model grader with rubric | | |
| Production signal | | |

Before using a model grader:

If using a model grader:

[ ] rubric is explicit;
[ ] grader has a way to return uncertain or insufficient evidence;
[ ] sample outputs are calibrated against human judgment;
[ ] model grader is not the only proof for high-risk behavior;
[ ] model grader is compared with harness/app/user outcomes when those signals exist.
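
Two of the items above, the "uncertain" verdict and calibration against human judgment, can be checked mechanically once verdicts are collected. The sketch below assumes model and human verdicts are already available as parallel lists; it does not call any model. `Verdict` and `calibration_report` are hypothetical names for this example.

```python
from enum import Enum
from typing import Dict, List, Optional


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    UNCERTAIN = "uncertain"  # lets the grader report insufficient evidence


def calibration_report(
    model_verdicts: List[Verdict], human_verdicts: List[Verdict]
) -> Dict[str, Optional[float]]:
    """Compare model-grader verdicts with human labels on the same samples."""
    if len(model_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not model_verdicts:
        raise ValueError("need at least one sample")

    # Agreement is measured only where the model committed to a verdict.
    decided = [
        (m, h)
        for m, h in zip(model_verdicts, human_verdicts)
        if m is not Verdict.UNCERTAIN
    ]
    agreements = sum(1 for m, h in decided if m is h)
    total = len(model_verdicts)
    return {
        "abstention_rate": (total - len(decided)) / total,
        "agreement_on_decided": agreements / len(decided) if decided else None,
    }
```

What agreement level counts as calibrated is a judgment for the eval owner. A high abstention rate is also a signal worth recording, since it suggests the rubric or the evidence given to the grader is insufficient.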

The output fails if it:

suppresses the symptom without addressing the selected problem;
changes unrelated behavior;
adds a second way to do the same thing;
weakens existing tests;
relies on mocks where production behavior matters;
hides errors behind broad fallbacks;
passes the grader while failing the user-visible outcome;
uses LLM-as-judge where harness, app state, or user behavior could judge the outcome;
cannot explain what would prove the patch wrong.

I still have to check:

I will know this is working if:

...