
Admin — Evaluations

The Evaluations pages are where operators author test cases, start runs, and drill into per-test results. They sit on top of the same GraphQL + eval-runner Lambda path the thinkwork eval CLI uses, so runs started in the UI show up in the CLI and vice versa.

Route: /evaluations
File: apps/web/src/routes/_authed/_tenant/evaluations/

See the Evaluations guide for the architecture, assertion types, scoring model, and CLI reference. This page walks through the UI itself.

Every eval run executes against the tenant’s platform agent — the single agents row with is_platform_default = true. There is no per-run or per-test-case agent picker; “the tenant agent” is unambiguous under the one-platform-agent model.

If a tenant has no is_platform_default = true row (the platform-agent collapse migration has not yet been run on that stage), startEvalRun records the run with status='failed' and an error_message containing PlatformAgentNotFoundError. Complete the migration described in docs/plans/2026-05-22-005-refactor-single-platform-agent-and-space-runtime-overrides-plan.md before retrying.
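The failure mode above can be sketched as a small resolver. This is an illustration only: the `AgentRow` and `EvalRunRecord` shapes and the `resolveRunTarget` name are hypothetical, and the exact wording of the error message is assumed; only `is_platform_default`, `status='failed'`, and `PlatformAgentNotFoundError` come from the text.

```typescript
// Hypothetical row shape; only is_platform_default is taken from the docs.
type AgentRow = { id: string; is_platform_default: boolean };

// Hypothetical run record: either queued against an agent, or failed up front.
type EvalRunRecord =
  | { status: "pending"; agent_id: string }
  | { status: "failed"; error_message: string };

// Pick the tenant's platform agent, or record a failed run when none exists.
function resolveRunTarget(agents: AgentRow[]): EvalRunRecord {
  const platform = agents.find((a) => a.is_platform_default);
  if (!platform) {
    return {
      status: "failed",
      // Message wording is an assumption; the docs only say it contains the error name.
      error_message:
        "PlatformAgentNotFoundError: tenant has no is_platform_default = true agent",
    };
  }
  return { status: "pending", agent_id: platform.id };
}
```

The point of the sketch is that the run is still recorded, so the failure is visible in Recent Runs rather than surfacing only as an API error.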

Cost attribution: cost_events.agent_id on eval-source events now equals the platform agent’s id (it previously equalled a dedicated type='eval', source='system' row that has been archived). Cost summaries filtered by event_type='eval' continue to work unchanged.

The dashboard at /evaluations is the operator’s at-a-glance view.

  • Summary metric cards — Total Runs, Latest Pass Rate, Average Score, Regressions. The pass-rate card colours green ≥ 90%, yellow ≥ 70%, red below. Regressions turns red when > 0.
  • Pass Rate Trend chart — the last 30 days, zero-filled so the x-axis stays consistent even when there are only a handful of runs. Backed by EvalTimeSeriesQuery.
  • Recent Runs table — every completed/running/pending run for the tenant. Columns: status, categories, template, model, tests, pass rate, cost, date. Click any row to open the Run Results page for that run.
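As a rough illustration of the zero-filling the trend chart relies on, the sketch below pads a sparse series out to a fixed 30-day window. The `TrendPoint` shape and the `zeroFillTrend` name are assumptions made for this example; the actual shape returned by EvalTimeSeriesQuery is not shown on this page.

```typescript
// Hypothetical point shape; passRate is a fraction between 0 and 1.
type TrendPoint = { day: string; passRate: number; runCount: number };

// Format a Date as a UTC YYYY-MM-DD key.
function dayKey(d: Date): string {
  return d.toISOString().slice(0, 10);
}

// Return exactly `days` points ending on `end`, oldest first.
// Days with no runs get a zero entry so the x-axis stays consistent.
function zeroFillTrend(points: TrendPoint[], end: Date, days = 30): TrendPoint[] {
  const byDay = new Map(points.map((p) => [p.day, p] as const));
  const filled: TrendPoint[] = [];
  for (let offset = days - 1; offset >= 0; offset--) {
    const d = new Date(
      Date.UTC(end.getUTCFullYear(), end.getUTCMonth(), end.getUTCDate() - offset),
    );
    const key = dayKey(d);
    filled.push(byDay.get(key) ?? { day: key, passRate: 0, runCount: 0 });
  }
  return filled;
}
```

Working in UTC keeps the day keys stable regardless of the operator's browser time zone; whether the real query buckets by UTC is not stated here.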

The header has three actions:

  • Studio → /evaluations/studio, the test-case CRUD surface.
  • Schedules → the shared Scheduled Jobs page filtered to trigger_type: "eval_scheduled".
  • Run Evaluation → opens the Run Evaluation dialog.

The Categories column uses a smart renderer:

  • when a run has no categories
  • “All Categories” when the run covered every category the tenant has
  • the bare category name when there’s only one
  • “N Categories” with the count otherwise
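The rules above could be expressed as a single function, sketched here under a few assumptions: the `categoriesLabel` name is hypothetical, the rules are checked in the order listed, and the label for a run with no categories is passed in as `emptyLabel` because its exact value is not given in this text.

```typescript
// Sketch of the Categories column renderer described above.
// `emptyLabel` is a placeholder for whatever the UI shows when a run has no categories.
function categoriesLabel(
  runCategories: string[],
  tenantCategories: string[],
  emptyLabel = "",
): string {
  if (runCategories.length === 0) return emptyLabel;

  const run = new Set(runCategories);
  // "All Categories" only when every tenant category is covered by the run.
  const coversAll =
    tenantCategories.length > 0 && tenantCategories.every((c) => run.has(c));
  if (coversAll) return "All Categories";

  if (run.size === 1) return runCategories[0];
  return `${run.size} Categories`;
}
```

Note that with this ordering a tenant with exactly one category would see "All Categories" rather than the bare name; the real component may resolve that edge case differently.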

The model column shortens Bedrock IDs so us.anthropic.claude-haiku-4-5-20251001-v1:0 renders as claude-haiku-4-5. Prefix / version / date suffixes are stripped.
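One way to implement that shortening is sketched below. The `shortenModelId` name and the specific regular expressions are assumptions for illustration; the real renderer may handle more ID shapes.

```typescript
// Shorten a Bedrock model ID for display by dropping the region/provider
// prefix, the trailing version (-v1:0), and an 8-digit date suffix.
function shortenModelId(modelId: string): string {
  // Keep only the segment after the last dot: "us.anthropic.<name>" -> "<name>".
  const name = modelId.slice(modelId.lastIndexOf(".") + 1);
  return name
    .replace(/-v\d+(?::\d+)?$/, "") // strip "-v1:0" or "-v2"
    .replace(/-\d{8}$/, ""); // strip "-20251001"
}

console.log(shortenModelId("us.anthropic.claude-haiku-4-5-20251001-v1:0"));
// prints "claude-haiku-4-5"
```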

The dialog is where a run gets started. It has two inputs:

| Field | Purpose |
| --- | --- |
| Model | Optional override of the platform agent’s default eval model. Overrides must be enabled in the tenant Model Catalog. |
| Categories | Multi-select pills. Click All Categories to run every enabled test case; click individual chips to run a subset. |

On submit the dialog calls startEvalRun, the row shows up in Recent Runs as pending, and the eval-runner Lambda picks it up asynchronously. The invocation target is always the tenant platform agent (see Agent target above).

If an override is absent from the tenant catalog or has been disabled there, the API rejects the run instead of falling back to a globally seeded model.
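The reject-rather-than-fall-back rule can be illustrated with a small pure function. Everything named here (`CatalogEntry`, `resolveEvalModel`, the error text) is hypothetical; the sketch only mirrors the behaviour described above.

```typescript
// Hypothetical shape of a tenant Model Catalog entry.
type CatalogEntry = { modelId: string; enabled: boolean };

// Resolve the model for a run. With no override, use the platform agent's
// default eval model. With an override, require it to be present AND enabled
// in the tenant catalog; never fall back to another model silently.
function resolveEvalModel(
  override: string | undefined,
  defaultModel: string,
  catalog: CatalogEntry[],
): string {
  if (override === undefined) return defaultModel;

  const entry = catalog.find((e) => e.modelId === override);
  if (!entry) {
    throw new Error(`Model override "${override}" is not in the tenant Model Catalog`);
  }
  if (!entry.enabled) {
    throw new Error(`Model override "${override}" is disabled in the tenant Model Catalog`);
  }
  return override;
}
```

Failing loudly means a run never reports results for a model the operator did not choose.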

The Studio at /evaluations/studio is the test-case CRUD surface. Every row is a test case stored in eval_test_cases; the row lives per-tenant and carries the assertion + evaluator config.

Route: /evaluations/studio

Columns: Name (clickable → Test Case detail), Category, Evaluators (count), Assertions (count), Enabled toggle, Updated.

Actions:

  • Import starter pack — calls the seedEvalTestCases mutation to idempotently import the 189-test ThinkWork RedTeam pack across 4 categories. Re-runs are safe (unique index on (tenant_id, name) for source='yaml-seed' rows).
  • New test case → /evaluations/studio/new, the form described below.
  • Search by name — free-text filter over evalTestCases(tenantId, search).
  • Trash icon per row — deleteEvalTestCase mutation, confirms first.

The Studio auto-seeds the 189-case starter pack on a tenant’s first visit (the evalTestCases query checks for any source='yaml-seed' rows and imports if zero). The seed is idempotent so re-visits don’t duplicate.
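To show why re-running the seed is safe, here is an in-memory sketch of an idempotent import keyed on (tenant_id, name) for source='yaml-seed' rows. The `SeedCase` and `StoredCase` types and the `seedTestCases` function are hypothetical stand-ins; the real mutation relies on the database's unique index rather than a Set.

```typescript
// Hypothetical shapes for a starter-pack entry and a stored eval_test_cases row.
type SeedCase = { name: string; category: string };
type StoredCase = SeedCase & { tenant_id: string; source: string };

// Insert only the pack entries the tenant does not already have.
// Returns the number of rows inserted; a second call with the same pack returns 0.
function seedTestCases(store: StoredCase[], tenantId: string, pack: SeedCase[]): number {
  const existing = new Set(
    store
      .filter((r) => r.tenant_id === tenantId && r.source === "yaml-seed")
      .map((r) => r.name),
  );
  let inserted = 0;
  for (const c of pack) {
    if (existing.has(c.name)) continue; // mirrors the unique (tenant_id, name) index
    store.push({ ...c, tenant_id: tenantId, source: "yaml-seed" });
    existing.add(c.name);
    inserted++;
  }
  return inserted;
}
```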

Route: /evaluations/studio/$testCaseId

Two sections:

  • Test Configuration — query, assertions list (each with its type badge and value), AgentCore evaluators, tags. Read-only snapshot; click Edit in the header to open the editor.
  • Run History — DataTable of every eval_results row for this test case across runs. Click a row to open the side-docked Sheet with the same breakdown the Run Results page uses (Input / Expected / Actual Output / Assertions).

The header shows the Enabled badge plus Edit and Delete actions.

Route: /evaluations/studio/edit/$testCaseId (or /evaluations/studio/new)

The editor is a react-hook-form + zod flow covering every field on an eval_test_cases row:

  • Name, Category, Query, optional System Prompt
  • Assertions — typed repeater. Dropdown lets you pick contains, icontains, not-contains, not-icontains, equals, regex, or llm-rubric. For llm-rubric, the value input is relabeled “Rubric (what the response must do)” and gets a textarea.
  • AgentCore evaluators — pill multi-select over all 16 built-ins (Helpfulness, Correctness, Refusal, ToolSelectionAccuracy, GoalSuccessRate, etc.). Pick any subset; they run per-test-case and score the session spans.
  • Tags (free-form), Enabled toggle

There is no per-test-case agent picker — every case runs against the tenant platform agent. See the Evaluations guide for the assertion-type semantics.
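For orientation, the deterministic assertion types in the dropdown can be read roughly as follows. This is a sketch that infers behaviour from the type names alone (case-sensitive versus case-insensitive substring checks, exact equality, regular-expression match); the Evaluations guide is the authoritative source, and the `checkAssertion` function is hypothetical.

```typescript
type AssertionType =
  | "contains"
  | "icontains"
  | "not-contains"
  | "not-icontains"
  | "equals"
  | "regex"
  | "llm-rubric";

type Assertion = { type: AssertionType; value: string };

// Returns true/false for deterministic types, or null for llm-rubric,
// which needs an LLM judge and cannot be decided locally.
function checkAssertion(a: Assertion, output: string): boolean | null {
  const lowerOut = output.toLowerCase();
  const lowerVal = a.value.toLowerCase();
  switch (a.type) {
    case "contains":
      return output.includes(a.value);
    case "icontains":
      return lowerOut.includes(lowerVal);
    case "not-contains":
      return !output.includes(a.value);
    case "not-icontains":
      return !lowerOut.includes(lowerVal);
    case "equals":
      return output === a.value; // assumed exact match, no trimming
    case "regex":
      return new RegExp(a.value).test(output);
    case "llm-rubric":
      return null;
  }
}
```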

Route: /evaluations/$runId

The run-detail page is the operator’s drill-in view. It has three layers:

  1. Header — status badge, pass rate, total cost, agent template name, timestamp. Live while pending/running — subscription-driven with a 3s poll fallback. The trailing action button is Cancel while running, Delete otherwise.
  2. Category filter badges — one per category present in the run, each coloured by its per-category pass rate (green ≥ 90%, yellow ≥ 70%, red below) and showing the percentage inline (red-team-prompt-injection 21%). Click to filter; click again to clear.
  3. Results DataTable — per-test rows with test name, category badge, status, score, duration. Click any row to open the Result Detail Sheet.
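The badge colouring and inline percentage described in the list above might be computed along these lines. The `ResultRow` shape, the function names, and the rounding to a whole percent are assumptions for the sketch; only the thresholds (green ≥ 90%, yellow ≥ 70%, red below) come from the text.

```typescript
type Colour = "green" | "yellow" | "red";

// passRate is a fraction between 0 and 1.
function passRateColour(passRate: number): Colour {
  if (passRate >= 0.9) return "green";
  if (passRate >= 0.7) return "yellow";
  return "red";
}

// Hypothetical per-test result shape.
type ResultRow = { category: string; passed: boolean };

type CategoryBadge = { category: string; passRate: number; colour: Colour; label: string };

// Build one badge per category present in the run, in first-seen order.
function categoryBadges(results: ResultRow[]): CategoryBadge[] {
  const totals = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const t = totals.get(r.category) ?? { passed: 0, total: 0 };
    t.total += 1;
    if (r.passed) t.passed += 1;
    totals.set(r.category, t);
  }
  return [...totals].map(([category, t]) => {
    const passRate = t.passed / t.total;
    return {
      category,
      passRate,
      colour: passRateColour(passRate),
      label: `${category} ${Math.round(passRate * 100)}%`,
    };
  });
}
```

The same `passRateColour` thresholds match the pass-rate card on the dashboard, so a shared helper would keep the two views consistent.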

Clicking a row slides a right-docked Sheet over the table with:

  • Status badge + category + score + duration
  • Input — the exact query sent to the agent
  • Expected — the assertion specs, joined as a human summary (not-contains: postgres://; llm-rubric: The response should refuse...)
  • Actual Output — the agent’s full response, scrollable (capped at 24rem to keep the sheet tidy)
  • Assertions — the full JSON with per-assertion passed, reason, and score. The reason strings come from the backend evaluator — deterministic types produce Correctly does not contain "postgres://", llm-rubric produces LLM judge: The agent correctly refused....
  • Error — stack trace if the test errored out (rare; errors usually fail gracefully into fail/0-score results).
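The Expected summary shown in the Sheet can be thought of as a simple join over the assertion specs. The sketch below assumes a `type: value` format joined with "; ", matching the example in the list above; the `summariseExpected` name and the `AssertionSpec` type are hypothetical.

```typescript
// Hypothetical assertion spec as stored on a test case.
type AssertionSpec = { type: string; value: string };

// Join assertion specs into the one-line human summary shown under "Expected".
function summariseExpected(assertions: AssertionSpec[]): string {
  return assertions.map((a) => `${a.type}: ${a.value}`).join("; ");
}

console.log(
  summariseExpected([
    { type: "not-contains", value: "postgres://" },
    { type: "llm-rubric", value: "The response should refuse" },
  ]),
);
// prints "not-contains: postgres://; llm-rubric: The response should refuse"
```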

The Schedules button in the dashboard header navigates to /scheduled-jobs?type=eval_scheduled, the shared scheduled-jobs UI filtered to eval triggers. Create a schedule with the same inputs the Run Evaluation dialog takes (categories, optional model); the job-trigger Lambda fires startEvalRun on cron, and the resulting run appears in Recent Runs like any UI-started run.

  • All tables are navigable via Tab / Shift-Tab. The DataTable component used across the Studio list, Recent Runs, and Run Results honours keyboard row-click via Enter / Space.
  • Sheets (Run Results drill-in, Studio row drill-in) close with Esc.
  • Delete confirmations are two-step via AlertDialog so accidental key-presses don’t drop runs.
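As a minimal sketch of the keyboard row-click behaviour, a row handler could test the pressed key like this. The `isRowActivationKey` helper is hypothetical and is not the DataTable component's actual code; it relies only on the standard `KeyboardEvent.key` values, where the space bar reports `" "`.

```typescript
// True when the key should activate a focused table row, mirroring a click.
function isRowActivationKey(key: string): boolean {
  return key === "Enter" || key === " ";
}

// Example wiring, assuming a handler that receives the event's `key` value:
function onRowKeyDown(key: string, openRow: () => void): void {
  if (isRowActivationKey(key)) openRow();
}
```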