LLM Evals
Evals are the repeatable measurement system used to compare LLM behavior over time. They make it possible to tell whether prompt, model, retrieval, or workflow changes actually improved the system or just moved failures around.
LLM systems are variable enough that a few spot checks can look good while real failures still grow elsewhere. If quality is judged only by manual impressions, regressions hide easily, especially in specific user slices or failure categories that do not show up in a small demo set.
Early teams often changed prompts or models and checked a few examples by hand. As systems became more complex and changed more frequently, that approach failed to catch regressions. Evals emerged as the practical replacement for intuition-only quality checks.
A team defines a representative dataset and scoring criteria that reflect real failure modes. Candidate systems are run on that set, and outputs are scored using exact rules, model judges, human review, or a combination. The most useful result is often not the average score but the slice-level scores: results on specific subsets of requests (a user segment, a language, a failure category) that reveal where the system still breaks.
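The run-and-score loop above can be sketched in a few lines. This is a minimal illustration, not a real eval library: the dataset rows, the exact-match scorer, and the stub candidate are all invented for the example, and a model-judge or human-review scorer would slot in behind the same interface.

```python
from collections import defaultdict

def exact_match(output: str, reference: str) -> float:
    """Exact-rule scorer: 1.0 on a normalized string match, else 0.0."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0

def run_eval(candidate, dataset, scorer=exact_match):
    """Score a candidate on every row, aggregating overall and per slice."""
    totals, counts = defaultdict(float), defaultdict(int)
    for row in dataset:
        score = scorer(candidate(row["input"]), row["reference"])
        for key in ("overall", row["slice"]):
            totals[key] += score
            counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Hypothetical eval set: each row carries an input, a reference, and a slice tag.
dataset = [
    {"input": "2+2", "reference": "4", "slice": "math"},
    {"input": "capital of France", "reference": "Paris", "slice": "geo"},
    {"input": "3*3", "reference": "9", "slice": "math"},
]

# A stub standing in for the real LLM call; it gets one math question wrong.
def candidate(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "paris", "3*3": "6"}.get(prompt, "")

scores = run_eval(candidate, dataset)
# The overall average hides that the "math" slice scores only 0.5.
```

The per-slice breakdown is the payoff: the same run that reports a passable overall average also pinpoints the subset where the system still fails.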
Evals, observability, and guardrails all deal with failures, but they act at different times. If you need a controlled before-and-after comparison before release, use evals. If you need to understand why production is failing now, use observability. If you need to block or soften failures at runtime, use guardrails. Strong eval scores still do not guarantee that new live traffic will behave the same way.
Teams use evals before prompt changes, retrieval tuning, model routing changes, and safety updates. Average metrics alone are rarely enough. The better practice is to keep adding newly discovered production failures into the eval set so the test bed stays aligned with reality.
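Folding newly discovered production failures back into the eval set can be as simple as appending deduplicated rows to a JSONL file. A sketch under assumed conventions (the file path, field names, and slice tags are illustrative, not a standard):

```python
import json

def add_failure(eval_path: str, input_text: str, reference: str, slice_tag: str) -> bool:
    """Append a production failure to a JSONL eval set, skipping duplicates.

    Returns True if a new row was written, False if the input was already present.
    """
    try:
        with open(eval_path) as f:
            seen = {json.loads(line)["input"] for line in f if line.strip()}
    except FileNotFoundError:
        seen = set()  # first failure ever recorded: the file does not exist yet
    if input_text in seen:
        return False
    with open(eval_path, "a") as f:
        row = {"input": input_text, "reference": reference, "slice": slice_tag}
        f.write(json.dumps(row) + "\n")
    return True
```

Tagging each captured failure with a slice keeps the growing set useful: the next eval run reports exactly how the system performs on the failure categories production has already revealed.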