Muness tutorial

Ship with AI without losing judgment.

A practical curriculum for builders using LLMs inside real systems: state the intent, select context, compare solution levels, verify the work, and preserve what the next session needs.

The failure mode

Fluent output is not the same thing as aligned work.

The tutorial starts from the failure everyone recognizes: the model produces a decent-looking patch, the chat summary sounds confident, and nobody can tell whether the work served the aim.

The fix is not more ceremony. It is a compact loop where every artifact preserves a decision, an assumption, an evidence check, and the next consumer.

Choose a path

Three ways through the material.

Start with the full builder loop when you have a project slice. Use the focused tracks when the active bottleneck is reusable interfaces or eval design.

01 / Full path

Builder loop

Clarify the aim, map the terrain, choose a solution level, define evidence, execute, review, and preserve what survives.

  • Best for a real codebase improvement
  • Produces a complete artifact set
  • Teaches when to stop, dissent, or salvage
02 / Interface path

Context to agent

Turn selected context into a checkable prompt, then decide what stays dynamic and what deserves a skill or subagent boundary.

  • Best for repeated prompt shapes
  • Separates context, prompt, skill, and role
  • Prevents stale project facts becoming policy
03 / Evidence path

Eval design

Turn “the agent passed” into a purpose, signal, fixture set, grader, threshold, action policy, and failure-reading loop.

  • Best for reused prompts and agent behavior
  • Prefers harness or app signals over vibes
  • Feeds real failures back into the suite
Curriculum map

The four loops that make AI work inspectable.

The tutorial uses the Open Horizons loop at different altitudes. Review, dissent, and salvage are available at any point; they are checks, not final ceremony.

Loop 1Ground the ask

Name the behavior change, fit the model to the task, select context, and assemble a prompt with reviewer checks.

Loop 2Frame the work

Map constraints and terrain, choose a problem statement, then compare band-aid, local optimum, reframe, and redesign options.

Loop 3Execute with evidence

Define checks before implementation, delegate one bounded slice, and review against the aim rather than the summary.

Loop 4Preserve what survives

Promote repeated procedures to skills, bounded roles to subagents, and hard-won lessons to durable knowledge.

Builder loop tutorial

Use one real project slice.

Do not use a blank repo. Judgment needs terrain: tests, multiple subsystems, a known annoyance, and enough history that technical debt is not hypothetical.

Intent engineering

Write one sentence that names the outcome, not the activity. Pause after a short burst of likely causes, files, checks, and failure modes.

Model-fit framing

Name the work the model is suited to do, the context it needs, what it must not infer, and how a reviewer can check it.

Context pack

Select context with provenance: project shape, relevant files, constraints, prior attempts, landmines, checks, and stop triggers.

Prompt assembly

Combine stable instructions, dynamic context, success criteria, examples where needed, output contract, and missing-evidence behavior.

/aim

Turn the intent note into an aim statement with current state, desired state, mechanism, assumptions, feedback signal, and guardrails.

/problem-space

Map systems, actors, repeated symptoms, constraints, assumptions to test, existing evidence, prior attempts, and blast radius.

/problem-statement

Compare symptom, systems, and maintainer-outcome framings. Select one, reject the others, and name what evidence would invalidate it.

/solution-space

Generate breadth before judgment. Score band-aid, local optimum, reframe, and redesign options against impact, cost, testability, reversibility, and maintenance burden.

Evidence before delegation

Define the old behavior that should now fail, the invariant that must hold, positive and negative cases, threshold, action policy, and residual risk.

Agent brief, execute, review

Hand off a bounded slice, run the checks, review the diff against the aim, use dissent when confidence outruns scrutiny, and salvage when work drifts.

Focused tutorial

Context to agent interface.

Use this path when a one-off prompt starts becoming an interface the next session should reuse. The rule is simple: context stays current, prompts assemble the current request, skills preserve repeated procedures, and subagents preserve bounded roles.

Context packTask identity, selected sources, provenance, constraints, assumptions, and stop triggers.
Prompt assemblyStable instruction, dynamic context, output contract, fixtures, and reviewer checks.
Project skillRepeated inspection sequence, checklist, command order, evidence gate, or stop condition.
Subagent roleIndependent reviewer, scout, implementer, extractor, or domain validator with a narrow contract.
Focused tutorial

Eval design when “passed” is too vague.

Start by naming fitness for purpose, not by choosing a metric. Prefer harness-executed outcome checks and app or user signals before reaching for model graders.

PurposeWho relies on the behavior, what decision it protects, and which failure matters.
SignalObservable SLI, target threshold, tolerated failure, and action on miss.
FixturesHappy path, regression, edge, negative, ambiguous, conflicting evidence, and instruction-in-data cases.
Failure readingInspect traces and outputs; failures may indicate agent, prompt, context, grader, harness, or framing problems.
What good looks like

The artifact set should let the next session continue.

A future agent or maintainer should recover the aim, constraints, chosen framing, evidence checks, role boundaries, and failure modes without rediscovering the whole problem.

Intent noteDesired behavior change, burst findings, pause questions, and invalidation signal.
Problem statementChosen framing, rejected framings, scope boundary, and handoff question.
Solution comparisonOptions at each solution level, scoring criteria, selected level, and rejected paths.
Evidence checklistObjective, fixtures, grader or command, threshold, action policy, and residual risk.
Review findingsSpecific drift, missing evidence, incomplete changes, or reason to continue.
Knowledge artifactMetis, signal, guardrail, outcome update, ADR, or salvage note that changes the next run.
Use it now

Install the skills, pick one real slice, and start with intent.

The tutorial is designed to be used while working, not consumed as homework. Read the deep dive when that step becomes the active bottleneck.

npx skills add open-horizon-labs/skills -g -a claude-code -y