Production patterns · Advanced · 6h

Evals & tests.

Measuring whether an AI system actually works.

What are evals?

Evals are tests for AI systems: a set of inputs with expected outcomes or scoring criteria, run against your model or pipeline to measure quality. Because LLM output is open-ended and non-deterministic, evals are how you know whether a change made the system better or worse.
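
As a rough illustration of that definition, here is a minimal sketch of an eval run with exact-match scoring. The dataset, the `fake_model` stub, and the `normalize` helper are all hypothetical stand-ins; a real setup would call your own model or pipeline in place of the stub.

```python
from typing import Callable

# Hypothetical eval set: each case pairs an input with its expected output.
EVAL_SET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "opposite of hot", "expected": "cold"},
]


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption: swap in your own client)."""
    canned = {
        "2 + 2": "4",
        "capital of France": "paris",
        "opposite of hot": "warm",
    }
    return canned.get(prompt, "")


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not fail a case."""
    return " ".join(text.lower().split())


def run_eval(model: Callable[[str], str], cases: list) -> float:
    """Return the fraction of cases whose normalized output equals the expected value."""
    passed = sum(
        normalize(model(case["input"])) == normalize(case["expected"])
        for case in cases
    )
    return passed / len(cases)


score = run_eval(fake_model, EVAL_SET)
print(f"accuracy: {score:.2f}")  # prints "accuracy: 0.67"
```

The stub gets two of the three cases right after normalization, so the run produces a single number you can compare across changes.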

Why it matters

Without evals, "improving" an LLM feature is guesswork: you tweak a prompt and hope. Evals turn that into measurement, so you can ship changes with evidence instead of vibes. As AI systems go to production, a solid eval suite is what separates a reliable feature from one that silently regresses.

What to learn

  • Building an eval dataset of inputs and expected results
  • Exact-match versus rubric-based scoring
  • LLM-as-judge and its caveats
  • Regression testing prompts and pipelines
  • Measuring RAG retrieval and answer quality
  • Catching regressions before deploy
  • Evals in CI
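
To make the exact-match versus rubric distinction from the list above concrete, here is a small sketch. The `Criterion` class, the rubric contents, and the sample reply are hypothetical. Each check here is a plain deterministic function; an LLM judge could stand in for a check, though its verdicts would presumably need spot-checking against human labels before you trust them.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]  # True when the output satisfies this criterion


# Hypothetical rubric for a customer-support reply feature.
RUBRIC = [
    Criterion("mentions refund", lambda out: "refund" in out.lower()),
    Criterion("apologizes", lambda out: "sorry" in out.lower()),
    Criterion("under 50 words", lambda out: len(out.split()) < 50),
]


def exact_match(output: str, expected: str) -> bool:
    """Strict scoring: passes only if the strings are identical after trimming."""
    return output.strip() == expected.strip()


def rubric_score(output: str, rubric: list) -> float:
    """Fraction of rubric criteria the output satisfies (0.0 to 1.0)."""
    met = [criterion.check(output) for criterion in rubric]
    return sum(met) / len(rubric)


reply = "We're sorry about the delay. Your refund was issued today."
reference = "Sorry, your refund is on its way."

print(exact_match(reply, reference))  # False: the wording differs
print(rubric_score(reply, RUBRIC))    # 1.0: every criterion is met
```

The same reply fails exact match but satisfies the whole rubric, which is why open-ended output usually calls for criteria rather than string equality.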

Common pitfall

Eyeballing a few outputs, deciding a prompt change "seems better," and shipping it. Manual spot-checks miss regressions on the cases you did not look at, and LLM output varies run to run. Build a repeatable eval set and score against it, so "better" is a measured number, not an impression.
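
Because output varies run to run, a single pass or fail on one case can mislead. One way to account for that, sketched here under assumptions, is to repeat each case and report a pass rate. `make_flaky_model` is a hypothetical stand-in that simulates a non-deterministic model with a seeded random generator; the 80% figure and the trial count are arbitrary.

```python
import random
from typing import Callable


def make_flaky_model(seed: int, p_correct: float = 0.8) -> Callable[[str], str]:
    """Build a stub that answers correctly with probability p_correct (simulation only)."""
    rng = random.Random(seed)

    def model(prompt: str) -> str:
        return "4" if rng.random() < p_correct else "5"

    return model


def pass_rate(model: Callable[[str], str], prompt: str, expected: str,
              trials: int = 20) -> float:
    """Run the same case several times and return the fraction of passing runs."""
    hits = sum(model(prompt) == expected for _ in range(trials))
    return hits / trials


rate = pass_rate(make_flaky_model(seed=0), "2 + 2", "4")
print(f"pass rate over 20 trials: {rate:.2f}")
```

A pass rate per case, rather than one sample, makes it harder to mistake a lucky run for an improvement.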

Practice

Build a small eval set for an LLM feature: a dozen inputs with expected outputs or scoring criteria. Run your current prompt against it for a baseline score, make a change, and re-run to see if the score moved. Done when you can decide a prompt change with a number, not a guess.
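
The baseline-then-change loop described above might look like the following sketch. The cases and the two stub functions standing in for "current prompt" and "changed prompt" are hypothetical. It also reports which individual cases flipped, since an unchanged aggregate score can hide a regression.

```python
CASES = [
    {"id": "greet", "input": "hi", "expected": "hello"},
    {"id": "bye", "input": "bye", "expected": "goodbye"},
    {"id": "thanks", "input": "thanks", "expected": "you're welcome"},
]


def baseline_system(text: str) -> str:
    """Stub for the pipeline running the current prompt (assumption)."""
    return {"hi": "hello", "bye": "goodbye", "thanks": "no problem"}[text]


def candidate_system(text: str) -> str:
    """Stub for the pipeline running the changed prompt (assumption)."""
    return {"hi": "hello", "bye": "see you", "thanks": "you're welcome"}[text]


def evaluate(system, cases) -> dict:
    """Map each case id to True (pass) or False (fail) using exact match."""
    return {c["id"]: system(c["input"]) == c["expected"] for c in cases}


def compare(before: dict, after: dict):
    """Return (regressions, fixes): case ids that flipped between the two runs."""
    regressions = sorted(k for k in before if before[k] and not after[k])
    fixes = sorted(k for k in before if not before[k] and after[k])
    return regressions, fixes


before = evaluate(baseline_system, CASES)
after = evaluate(candidate_system, CASES)
regressions, fixes = compare(before, after)

print(f"baseline:  {sum(before.values()) / len(CASES):.2f}")  # 0.67
print(f"candidate: {sum(after.values()) / len(CASES):.2f}")   # 0.67
print("regressions:", regressions)  # ['bye']
print("fixes:", fixes)              # ['thanks']
```

Here both runs score 0.67, yet one case regressed and another was fixed, so it is worth reading the per-case diff alongside the headline number.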

Outcomes

  • Build an eval dataset with expected outcomes.
  • Score open-ended output with rubrics or LLM-as-judge.
  • Catch regressions before deploying a change.
  • Run evals as part of CI.
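
For the CI outcome, one simple pattern is a gate that turns eval results into a process exit code, since most CI systems fail a job on a non-zero exit. This is a sketch only: the `exit_code` function name, the sample results, and the 0.7 threshold are hypothetical, and the right threshold depends on your feature.

```python
def exit_code(case_results: dict, min_pass_rate: float) -> int:
    """Return 0 if the pass rate meets the threshold, else 1 (non-zero fails the job)."""
    rate = sum(case_results.values()) / len(case_results)
    status = "PASS" if rate >= min_pass_rate else "FAIL"
    print(f"eval pass rate {rate:.2f} (minimum {min_pass_rate:.2f}): {status}")
    return 0 if rate >= min_pass_rate else 1


# Hypothetical per-case results, as produced by an eval run.
results = {"greet": True, "bye": True, "thanks": False, "refund": True}

code = exit_code(results, min_pass_rate=0.7)
# In a real CI script, the last step would be: sys.exit(code)
```

With three of four cases passing, the rate is 0.75 and the gate returns 0; raising the minimum to 0.9 would return 1 and block the deploy.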