Admin — Evaluations
The Evaluations pages are where operators author test cases, start runs, and drill into per-test results. They sit on top of the same GraphQL + eval-runner Lambda path the thinkwork eval CLI uses, so runs started in the UI show up in the CLI and vice versa.
Route: /evaluations
File: apps/web/src/routes/_authed/_tenant/evaluations/
See the Evaluations guide for the architecture, assertion types, scoring model, and CLI reference. This page walks through the UI itself.
Agent target
Every eval run executes against the tenant’s platform agent: the single agents row with is_platform_default = true. There is no per-run or per-test-case agent picker; “the tenant agent” is unambiguous under the one-platform-agent model.
If a tenant has no is_platform_default = true row (the platform-agent collapse migration has not yet been run on that stage), startEvalRun records the run with status='failed' and an error_message containing PlatformAgentNotFoundError. Complete the migration described in docs/plans/2026-05-22-005-refactor-single-platform-agent-and-space-runtime-overrides-plan.md before retrying.
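The failure path above can be sketched as a small pure function. This is an illustration only: the `AgentRow` and `EvalRunRecord` shapes and the `createRunRecord` name are hypothetical, and only `is_platform_default`, `status='failed'`, and the `PlatformAgentNotFoundError` text come from this page.

```typescript
// Hypothetical row shape; only is_platform_default is taken from this page.
interface AgentRow {
  id: string;
  is_platform_default: boolean;
}

// Hypothetical run record; the real startEvalRun persists more fields.
interface EvalRunRecord {
  status: "pending" | "failed";
  agent_id: string | null;
  error_message: string | null;
}

// Resolve the tenant's platform agent, or record the run as failed.
function createRunRecord(agents: AgentRow[]): EvalRunRecord {
  const platformAgent = agents.find((a) => a.is_platform_default);
  if (!platformAgent) {
    return {
      status: "failed",
      agent_id: null,
      error_message:
        "PlatformAgentNotFoundError: no agents row with is_platform_default = true",
    };
  }
  return { status: "pending", agent_id: platformAgent.id, error_message: null };
}
```

A run recorded this way never reaches the eval-runner, which is why the migration has to be completed before retrying.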
Cost attribution: cost_events.agent_id on eval-source events now equals the platform agent’s id (it previously equalled a dedicated type='eval', source='system' row that has been archived). Cost summaries filtered by event_type='eval' continue to work unchanged.
Dashboard
The dashboard at /evaluations is the operator’s at-a-glance view.
- Summary metric cards: Total Runs, Latest Pass Rate, Average Score, Regressions. The pass-rate card is green at ≥ 90%, yellow at ≥ 70%, and red below that. The Regressions card turns red when the count is above 0.
- Pass Rate Trend chart: the last 30 days, zero-filled so the x-axis stays consistent even when there are only a handful of runs. Backed by `EvalTimeSeriesQuery`.
- Recent Runs table: every completed/running/pending run for the tenant. Columns: status, categories, template, model, tests, pass rate, cost, date. Click any row to open the Run Results page for that run.
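Zero-filling the trend series might look roughly like the sketch below. The `TrendPoint` shape, the `zeroFillTrend` name, and the choice of UTC day buckets are assumptions made for illustration; the actual `EvalTimeSeriesQuery` response shape is not documented on this page.

```typescript
// Hypothetical point shape; the real query result may differ.
interface TrendPoint {
  day: string; // UTC date, YYYY-MM-DD
  passRate: number; // 0..1
  runs: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Return exactly `days` points ending on `endDate`, inserting a zero point
// for every day that has no runs so the x-axis never has gaps.
function zeroFillTrend(points: TrendPoint[], endDate: Date, days = 30): TrendPoint[] {
  const byDay = new Map<string, TrendPoint>();
  for (const p of points) byDay.set(p.day, p);

  const endUtc = Date.UTC(
    endDate.getUTCFullYear(),
    endDate.getUTCMonth(),
    endDate.getUTCDate(),
  );
  const filled: TrendPoint[] = [];
  for (let i = days - 1; i >= 0; i--) {
    const day = new Date(endUtc - i * DAY_MS).toISOString().slice(0, 10);
    filled.push(byDay.get(day) ?? { day, passRate: 0, runs: 0 });
  }
  return filled;
}
```

Keeping a `runs` count on each point lets a chart tell a genuinely empty day apart from a day where every test failed.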
The header has three actions:
- Studio → `/evaluations/studio`: the test-case CRUD surface.
- Schedules → the shared Scheduled Jobs page filtered to `trigger_type: "eval_scheduled"`.
- Run Evaluation → opens the Run Evaluation dialog.
Categories column
The Categories column uses a smart renderer:
- `—` when a run has no categories
- “All Categories” when the run covered every category the tenant has
- the bare category name when there’s only one
- “N Categories” with the count otherwise
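The four rules above can be expressed as one function. This is a sketch, not the component's actual code: the `renderCategories` name and its arguments are hypothetical, and it assumes the rules are checked in the order listed (so a run covering a tenant's only category renders as “All Categories”).

```typescript
// runCategories: categories the run covered.
// tenantCategories: every category the tenant has.
function renderCategories(runCategories: string[], tenantCategories: string[]): string {
  if (runCategories.length === 0) return "—";

  const coversAll =
    tenantCategories.length > 0 &&
    tenantCategories.every((c) => runCategories.indexOf(c) !== -1);
  if (coversAll) return "All Categories";

  if (runCategories.length === 1) return runCategories[0];
  return `${runCategories.length} Categories`;
}
```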
Model column
The model column shortens Bedrock IDs so us.anthropic.claude-haiku-4-5-20251001-v1:0 renders as claude-haiku-4-5. The region and vendor prefix, the date stamp, and the version suffix are stripped.
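One way to implement that shortening is sketched below. The `shortenModelId` name and the exact patterns are assumptions; the real renderer may handle more ID shapes than the single example given here.

```typescript
// Shorten a Bedrock model ID for display.
function shortenModelId(id: string): string {
  // Drop the region/vendor prefix: keep whatever follows the last dot.
  const name = id.slice(id.lastIndexOf(".") + 1);
  return name
    .replace(/-v\d+(:\d+)?$/, "") // version suffix such as -v1:0
    .replace(/-\d{8}$/, ""); // eight-digit date stamp such as -20251001
}
```

The version suffix is removed first so that the date stamp ends up at the end of the string, where the second pattern can match it.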
Run Evaluation dialog
The dialog is where a run gets started. It has two inputs:
| Field | Purpose |
|---|---|
| Model | Optional override of the platform agent’s default eval model. Overrides must be enabled in the tenant Model Catalog. |
| Categories | Multi-select pills. Click All Categories to run every enabled test case; click individual chips to run a subset. |
On submit the dialog calls startEvalRun, the row shows up in Recent Runs as pending, and the eval-runner Lambda picks it up asynchronously. The invocation target is always the tenant platform agent (see Agent target above).
If an override is absent from the tenant catalog or has been disabled there, the API rejects the run instead of falling back to a globally seeded model.
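The no-fallback rule can be illustrated with a small resolver. Everything named here (`CatalogEntry`, `resolveEvalModel`, the error text) is hypothetical; only the behaviour, rejecting an override that is missing from or disabled in the tenant catalog, is taken from the text above.

```typescript
// Hypothetical shape of a tenant Model Catalog entry.
interface CatalogEntry {
  modelId: string;
  enabled: boolean;
}

// Pick the model for a run. An override is honoured only when the tenant
// catalog contains it and it is enabled; otherwise the run is rejected
// instead of silently falling back to another model.
function resolveEvalModel(
  override: string | undefined,
  tenantCatalog: CatalogEntry[],
  agentDefaultModel: string,
): string {
  if (override === undefined) return agentDefaultModel;

  const entry = tenantCatalog.find((e) => e.modelId === override);
  if (!entry || !entry.enabled) {
    throw new Error(`Model override "${override}" is not enabled in the tenant Model Catalog`);
  }
  return override;
}
```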
Studio — test case list
The Studio at /evaluations/studio is the test-case CRUD surface. Every row is a test case stored in eval_test_cases; each row is scoped to a tenant and carries the assertion and evaluator config.
Route: /evaluations/studio
Columns: Name (clickable → Test Case detail), Category, Evaluators (count), Assertions (count), Enabled toggle, Updated.
Actions:
- Import starter pack: calls the `seedEvalTestCases` mutation to idempotently import the 189-test ThinkWork RedTeam pack across 4 categories. Re-runs are safe (unique index on `(tenant_id, name)` for `source='yaml-seed'` rows).
- New test case → `/evaluations/studio/new`: the form described below.
- Search by name: free-text filter over `evalTestCases(tenantId, search)`.
- Trash icon per row: calls the `deleteEvalTestCase` mutation after a confirmation prompt.
First-visit auto-seed
The Studio auto-seeds the 189-case starter pack on a tenant’s first visit (the evalTestCases query checks for any source='yaml-seed' rows and imports the pack if there are none). The seed is idempotent, so repeat visits don’t create duplicates.
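The first-visit check and the idempotent import can be modelled in memory as follows. This is a sketch under assumptions: `TestCaseRow`, `seedStarterPack`, and `autoSeedOnFirstVisit` are hypothetical names, and a `Set` stands in for the database unique index on `(tenant_id, name)` for `source='yaml-seed'` rows.

```typescript
// Hypothetical subset of an eval_test_cases row.
interface TestCaseRow {
  tenant_id: string;
  name: string;
  source: string; // "yaml-seed" for starter-pack rows
}

// Add only the pack entries the tenant does not already have as seed rows.
function seedStarterPack(rows: TestCaseRow[], tenantId: string, packNames: string[]): TestCaseRow[] {
  const seeded = new Set<string>();
  for (const row of rows) {
    if (row.tenant_id === tenantId && row.source === "yaml-seed") seeded.add(row.name);
  }
  const next = rows.slice();
  for (const name of packNames) {
    if (!seeded.has(name)) {
      seeded.add(name);
      next.push({ tenant_id: tenantId, name, source: "yaml-seed" });
    }
  }
  return next;
}

// First-visit rule: import only when the tenant has no yaml-seed rows at all.
function autoSeedOnFirstVisit(rows: TestCaseRow[], tenantId: string, packNames: string[]): TestCaseRow[] {
  const hasSeedRows = rows.some((r) => r.tenant_id === tenantId && r.source === "yaml-seed");
  return hasSeedRows ? rows : seedStarterPack(rows, tenantId, packNames);
}
```

Note that under this reading a tenant who deletes some seeded cases does not get them back automatically; the explicit Import starter pack action would restore them.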
Test Case detail
Route: /evaluations/studio/$testCaseId
Two sections:
- Test Configuration: query, assertions list (each with its type badge and value), AgentCore evaluators, tags. Read-only snapshot; click Edit in the header to open the editor.
- Run History: DataTable of every `eval_results` row for this test case across runs. Click a row to open the side-docked Sheet with the same breakdown the Run Results page uses (Input / Expected / Actual Output / Assertions).
The header shows the Enabled badge plus Edit and Delete actions.
Test Case editor
Route: /evaluations/studio/edit/$testCaseId (or /evaluations/studio/new)
The editor is a react-hook-form + zod flow covering every field on an eval_test_cases row:
- Name, Category, Query, optional System Prompt
- Assertions: typed repeater. The dropdown lets you pick `contains`, `icontains`, `not-contains`, `not-icontains`, `equals`, `regex`, or `llm-rubric`. For `llm-rubric`, the value input is relabeled “Rubric (what the response must do)” and becomes a textarea.
- AgentCore evaluators: pill multi-select over all 16 built-ins (Helpfulness, Correctness, Refusal, ToolSelectionAccuracy, GoalSuccessRate, etc.). Pick any subset; they run per test case and score the session spans.
- Tags (free-form), Enabled toggle
There is no per-test-case agent picker — every case runs against the tenant platform agent. See the Evaluations guide for the assertion-type semantics.
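As a rough orientation while filling in the assertions repeater, the deterministic types could behave like the sketch below. The semantics are assumed from the type names (for example, `icontains` as a case-insensitive `contains`) and are not confirmed by this page; the Evaluations guide is authoritative. The `Assertion` shape and `checkDeterministic` name are hypothetical.

```typescript
type AssertionType =
  | "contains"
  | "icontains"
  | "not-contains"
  | "not-icontains"
  | "equals"
  | "regex"
  | "llm-rubric";

interface Assertion {
  type: AssertionType;
  value: string;
}

// Returns true/false for deterministic types and null for llm-rubric,
// which needs an LLM judge rather than a string check.
function checkDeterministic(assertion: Assertion, output: string): boolean | null {
  const has = output.indexOf(assertion.value) !== -1;
  const hasIgnoringCase =
    output.toLowerCase().indexOf(assertion.value.toLowerCase()) !== -1;

  switch (assertion.type) {
    case "contains":
      return has;
    case "icontains":
      return hasIgnoringCase;
    case "not-contains":
      return !has;
    case "not-icontains":
      return !hasIgnoringCase;
    case "equals":
      return output === assertion.value;
    case "regex":
      return new RegExp(assertion.value).test(output);
    case "llm-rubric":
      return null;
  }
}
```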
Run Results
Route: /evaluations/$runId
The run-detail page is the operator’s drill-in view. It has three layers:
- Header: status badge, pass rate, total cost, agent template name, timestamp. Updates live while the run is `pending` or `running`, driven by a subscription with a 3s poll fallback. The trailing action button is Cancel while running, Delete otherwise.
- Category filter badges: one per category present in the run, each coloured by its per-category pass rate (green ≥ 90%, yellow ≥ 70%, red below) and showing the percentage inline (`red-team-prompt-injection 21%`). Click to filter; click again to clear.
- Results DataTable: per-test rows with test name, category badge, status, score, duration. Click any row to open the Result Detail Sheet.
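The per-category badges can be derived from the result rows as sketched here. The `ResultRow` and `CategoryBadge` shapes and both function names are hypothetical, and rounding the percentage to a whole number is an assumption; the thresholds are the ones stated above.

```typescript
type BadgeColour = "green" | "yellow" | "red";

// Thresholds from this page: green at >= 90%, yellow at >= 70%, red below.
function passRateColour(passRate: number): BadgeColour {
  if (passRate >= 0.9) return "green";
  if (passRate >= 0.7) return "yellow";
  return "red";
}

// Hypothetical subset of an eval result row.
interface ResultRow {
  category: string;
  passed: boolean;
}

interface CategoryBadge {
  category: string;
  passRate: number;
  colour: BadgeColour;
  label: string;
}

// One badge per category present in the run, in first-seen order.
function categoryBadges(results: ResultRow[]): CategoryBadge[] {
  const totals = new Map<string, { passed: number; total: number }>();
  for (const r of results) {
    const t = totals.get(r.category) ?? { passed: 0, total: 0 };
    t.total += 1;
    if (r.passed) t.passed += 1;
    totals.set(r.category, t);
  }

  const badges: CategoryBadge[] = [];
  totals.forEach((t, category) => {
    const passRate = t.passed / t.total;
    badges.push({
      category,
      passRate,
      colour: passRateColour(passRate),
      label: `${category} ${Math.round(passRate * 100)}%`,
    });
  });
  return badges;
}
```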
Result Detail Sheet
Clicking a row slides a right-docked Sheet over the table with:
- Status badge + category + score + duration
- Input: the exact query sent to the agent
- Expected: the assertion specs, joined as a human summary (`not-contains: postgres://; llm-rubric: The response should refuse...`)
- Actual Output: the agent’s full response, scrollable (capped at 24rem to keep the sheet tidy)
- Assertions: the full JSON with per-assertion `passed`, `reason`, and `score`. The reason strings come from the backend evaluator: deterministic types produce `Correctly does not contain "postgres://"`, `llm-rubric` produces `LLM judge: The agent correctly refused...`.
- Error: stack trace if the test errored out (rare; errors usually degrade gracefully into failed results with a score of 0).
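The Expected summary format shown above amounts to joining each assertion as `type: value` with a semicolon separator. A minimal sketch, with a hypothetical `summarizeAssertions` helper:

```typescript
interface AssertionSpec {
  type: string;
  value: string;
}

// Join assertion specs into the one-line summary shown under Expected.
function summarizeAssertions(assertions: AssertionSpec[]): string {
  return assertions.map((a) => `${a.type}: ${a.value}`).join("; ");
}
```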
Scheduled runs
The Schedules button in the dashboard header navigates to /scheduled-jobs?type=eval_scheduled, the shared scheduled-jobs UI filtered to eval triggers. Create a schedule with the same inputs the Run Evaluation dialog takes (template, categories, optional model); the job-trigger Lambda fires startEvalRun on cron, and the resulting run appears in Recent Runs like any UI-started run.
Keyboard / accessibility
- All tables are navigable via Tab / Shift-Tab. The DataTable component used across the Studio list, Recent Runs, and Run Results honours keyboard row-click via Enter / Space.
- Sheets (Run Results drill-in, Studio row drill-in) close with `Esc`.
- Delete confirmations are two-step via `AlertDialog` so accidental key-presses don’t drop runs.
Related
- Evaluations guide: architecture, assertion types, scoring, CLI reference
- Automations: the cron UI that powers eval schedules
- Agent Templates: the templates the Run Evaluation dialog picks from