Admin — Evaluations

The Evaluations pages are where operators author test cases, start runs, and drill into per-test results. They sit on top of the same GraphQL + eval-runner Lambda path the thinkwork eval CLI uses, so runs started in the UI show up in the CLI and vice versa.

Route: /evaluations
File: apps/admin/src/routes/_authed/_tenant/evaluations/

See the Evaluations guide for the architecture, assertion types, scoring model, and CLI reference. This page walks through the UI itself.

Dashboard

The dashboard at /evaluations is the operator’s at-a-glance view.

Summary metric cards — Total Runs, Latest Pass Rate, Average Score, Regressions. The pass-rate card colours green ≥ 90%, yellow ≥ 70%, red below. Regressions turns red when > 0.
Pass Rate Trend chart — the last 30 days, zero-filled so the x-axis stays consistent even when there are only a handful of runs. Backed by EvalTimeSeriesQuery.
Recent Runs table — every completed/running/pending run for the tenant. Columns: status, categories, template, model, tests, pass rate, cost, date. Click any row to open the Run Results page for that run.
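
The zero-filling behind the Pass Rate Trend chart can be sketched as follows. This is a minimal illustration, not the admin app's code: the `TrendPoint` shape and the `zeroFillTrend` name are assumptions, and the real EvalTimeSeriesQuery payload may use different fields.

```typescript
// Hypothetical shape of one aggregated day; the real query fields may differ.
interface TrendPoint {
  day: string;      // UTC calendar day, YYYY-MM-DD
  passRate: number; // 0..1
}

const MS_PER_DAY = 86_400_000;

// Return exactly `days` points ending on `end` (UTC), inserting a zero point
// for every day that has no data so the x-axis is always the same length.
function zeroFillTrend(points: TrendPoint[], end: Date, days = 30): TrendPoint[] {
  const byDay = new Map<string, TrendPoint>();
  points.forEach((p) => byDay.set(p.day, p));

  const endUtc = Date.UTC(end.getUTCFullYear(), end.getUTCMonth(), end.getUTCDate());
  const out: TrendPoint[] = [];
  for (let i = days - 1; i >= 0; i--) {
    const day = new Date(endUtc - i * MS_PER_DAY).toISOString().slice(0, 10);
    out.push(byDay.get(day) ?? { day, passRate: 0 });
  }
  return out;
}
```

With a single run on one day, the result still contains 30 points, 29 of them at zero, which is what keeps the chart's axis stable for sparse tenants.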

The header has three actions:

Studio → /evaluations/studio — the test-case CRUD surface.
Schedules → the shared Scheduled Jobs page filtered to trigger_type: "eval_scheduled".
Run Evaluation → opens the Run Evaluation dialog.

Categories column

The Categories column uses a smart renderer:

“—” when a run has no categories
“All Categories” when the run covered every category the tenant has
the bare category name when there’s only one
“N Categories” with the count otherwise
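
The four rules above could be expressed roughly like this. The function name and parameters are hypothetical, and the precedence (checking "All Categories" before the single-category case) is an assumption taken from the order of the list.

```typescript
// Hypothetical helper; the real renderer lives in the admin app and may differ.
function formatCategories(runCategories: string[], tenantCategoryCount: number): string {
  const unique = Array.from(new Set(runCategories));

  if (unique.length === 0) return "\u2014"; // dash placeholder for "no categories"
  if (tenantCategoryCount > 0 && unique.length >= tenantCategoryCount) {
    return "All Categories"; // the run covered every category the tenant has
  }
  if (unique.length === 1) return unique[0] as string; // bare category name
  return `${unique.length} Categories`;
}
```

Under this ordering, a tenant with exactly one category would see "All Categories" rather than the bare name; swap the second and third checks if the opposite is wanted.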

Model column

The Model column shortens Bedrock IDs, so us.anthropic.claude-haiku-4-5-20251001-v1:0 renders as claude-haiku-4-5. The region and vendor prefix, the date suffix, and the version suffix are stripped.
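
One way to implement that shortening is shown below. It is a sketch under assumptions: the function name is hypothetical, and the regular expressions only cover IDs shaped like the example (dot-separated prefix, optional 8-digit date, optional `-vN:M` version).

```typescript
// Hypothetical helper illustrating the stripping rules described above.
function shortenModelId(modelId: string): string {
  // Drop everything up to the last dot ("us.anthropic." in the example).
  const name = modelId.slice(modelId.lastIndexOf(".") + 1);
  return name
    .replace(/-v\d+(?::\d+)?$/, "") // version suffix, e.g. "-v1:0"
    .replace(/-\d{8}$/, "");        // date suffix, e.g. "-20251001"
}
```

IDs that do not follow this pattern pass through with only the parts that match removed, so an already short name such as claude-haiku-4-5 is returned unchanged.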

Run Evaluation dialog

The dialog is where a run gets started. It has four inputs, in this order:

Field	Purpose
Agent template (required)	The eval test agent is a generic AgentCore runtime; this template determines the workspace, tools, and default model it loads. Different templates expose different tool surfaces — that matters for tests like “should refuse to web-search.”
Model	Optional override of the template’s default model. Leave blank to use the template’s.
Invocation Mode	`End-to-End (full agent runtime)` is the default and the only mode currently wired. `Direct (Bedrock only)` is a UI-only scaffold for a future path that skips the agent runtime.
Categories	Multi-select pills. Click All Categories to run every enabled test case; click individual chips to run a subset.

Start Evaluation is disabled until a template is picked. On submit the dialog calls startEvalRun, the row shows up in Recent Runs as pending, and the eval-runner Lambda picks it up asynchronously.
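
The submit path can be pictured as a small mapping from dialog state to mutation variables. Every field name here is hypothetical; the actual startEvalRun input type is defined by the GraphQL schema and may look different. The sketch only illustrates the rules stated above: a template is required, a blank model means "use the template's default", and no selected chips means "All Categories".

```typescript
// All names below are illustrative assumptions, not the real schema.
interface RunDialogState {
  templateId: string | null; // required before submit
  model: string;             // "" = use the template's default model
  categories: string[];      // [] = All Categories
}

interface StartEvalRunVariables {
  templateId: string;
  model: string | null;
  categories: string[] | null;
}

// Mirrors "Start Evaluation is disabled until a template is picked".
function canStart(state: RunDialogState): boolean {
  return state.templateId !== null && state.templateId.length > 0;
}

function toStartEvalRunVariables(state: RunDialogState): StartEvalRunVariables {
  if (!canStart(state)) throw new Error("Pick an agent template first");
  const model = state.model.trim();
  return {
    templateId: state.templateId as string,
    model: model === "" ? null : model,
    categories: state.categories.length === 0 ? null : [...state.categories],
  };
}
```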

Studio — test case list

The Studio at /evaluations/studio is the test-case CRUD surface. Each row is a test case stored in eval_test_cases, scoped to the tenant, and carries that case’s assertion and evaluator config.

Route: /evaluations/studio

Columns: Name (clickable → Test Case detail), Category, Evaluators (count), Assertions (count), Enabled toggle, Updated.

Actions:

Import starter pack — calls the seedEvalTestCases mutation to idempotently import the 96-test maniflow pack across 9 categories. Re-runs are safe (unique index on (tenant_id, name) for source='yaml-seed' rows).
New test case → /evaluations/studio/new — the form described below.
Search by name — free-text filter over evalTestCases(tenantId, search).
Trash icon per row — deleteEvalTestCase mutation, confirms first.

First-visit auto-seed

The Studio auto-seeds the 96-case starter pack on a tenant’s first visit: the evalTestCases query checks for source='yaml-seed' rows and imports the pack if there are none. The seed is idempotent, so repeat visits don’t create duplicates.
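
To make the idempotency concrete, here is an in-memory sketch in which a Map stands in for the unique index on (tenant_id, name) over source='yaml-seed' rows. The types and function names are assumptions for illustration; the real mutation does this against the database.

```typescript
interface SeedCase {
  name: string;
  category: string;
}

// Stand-in for the unique index on (tenant_id, name) for yaml-seed rows.
type SeedIndex = Map<string, SeedCase>;

const seedKey = (tenantId: string, name: string): string =>
  JSON.stringify([tenantId, name]);

// Insert only the cases that are not already present; return how many were added.
function seedStarterPack(index: SeedIndex, tenantId: string, pack: SeedCase[]): number {
  let inserted = 0;
  for (const testCase of pack) {
    const key = seedKey(tenantId, testCase.name);
    if (index.has(key)) continue; // already imported: skip instead of duplicating
    index.set(key, testCase);
    inserted += 1;
  }
  return inserted;
}
```

Running the import twice for the same tenant inserts rows the first time and none the second, which is the property both the auto-seed and the Import starter pack button rely on.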

Test Case detail

Route: /evaluations/studio/$testCaseId

Two sections:

Test Configuration — query, assertions list (each with its type badge and value), AgentCore evaluators, tags. Read-only snapshot; click Edit in the header to open the editor.
Run History — DataTable of every eval_results row for this test case across runs. Click a row to open the side-docked Sheet with the same breakdown the Run Results page uses (Input / Expected / Actual Output / Assertions).

The header shows the Enabled badge plus Edit and Delete actions.

Test Case editor

Route: /evaluations/studio/edit/$testCaseId (or /evaluations/studio/new)

The editor is a react-hook-form + zod flow covering every field on an eval_test_cases row:

Name, Category, Query, optional System Prompt
Assertions — typed repeater. Dropdown lets you pick contains, icontains, not-contains, not-icontains, equals, regex, or llm-rubric. For llm-rubric, the value input is relabeled “Rubric (what the response must do)” and gets a textarea.
AgentCore evaluators — pill multi-select over all 16 built-ins (Helpfulness, Correctness, Refusal, ToolSelectionAccuracy, GoalSuccessRate, etc.). Pick any subset; they run per-test-case and score the session spans.
Tags (free-form), Enabled toggle, optional Agent template override

See the Evaluations guide for the assertion-type semantics.
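
As a rough orientation, the deterministic assertion types might be evaluated as below. The semantics are assumptions based on the type names (the i prefix read as case-insensitive, the not- prefix as negation, a score of 1 or 0); the Evaluations guide remains the authoritative source, and llm-rubric is omitted because it needs a judge model.

```typescript
type DeterministicType =
  | "contains"
  | "icontains"
  | "not-contains"
  | "not-icontains"
  | "equals"
  | "regex";

interface Assertion {
  type: DeterministicType;
  value: string;
}

// One predicate per type; each receives the agent output and the assertion value.
const checks: Record<DeterministicType, (output: string, value: string) => boolean> = {
  contains: (o, v) => o.includes(v),
  icontains: (o, v) => o.toLowerCase().includes(v.toLowerCase()),
  "not-contains": (o, v) => !o.includes(v),
  "not-icontains": (o, v) => !o.toLowerCase().includes(v.toLowerCase()),
  equals: (o, v) => o === v,
  regex: (o, v) => new RegExp(v).test(o),
};

// Assumed scoring: 1 for a pass, 0 for a fail.
function checkAssertion(output: string, assertion: Assertion): { passed: boolean; score: number } {
  const passed = checks[assertion.type](output, assertion.value);
  return { passed, score: passed ? 1 : 0 };
}
```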

Run Results

Route: /evaluations/$runId

The run-detail page is the operator’s drill-in view. It has three layers:

Header — status badge, pass rate, total cost, agent template name, timestamp. Live while pending/running — subscription-driven with a 3s poll fallback. The trailing action button is Cancel while running, Delete otherwise.
Category filter badges — one per category present in the run, each coloured by its per-category pass rate (green ≥ 90%, yellow ≥ 70%, red below) and showing the percentage inline (red-team 21%). Click to filter; click again to clear.
Results DataTable — per-test rows with test name, category badge, status, score, duration. Click any row to open the Result Detail Sheet.
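
The category badges can be derived from the result rows with a single grouping pass. The sketch below is illustrative: the row shape and status values are assumptions, and it treats anything other than a pass (including errors) as a failure when computing the rate.

```typescript
type BadgeColour = "green" | "yellow" | "red";

// Thresholds from the page: green at 90% and up, yellow at 70% and up, red below.
function passRateColour(rate: number): BadgeColour {
  if (rate >= 0.9) return "green";
  if (rate >= 0.7) return "yellow";
  return "red";
}

// Hypothetical row shape; the real eval_results fields may differ.
interface ResultRow {
  category: string;
  status: "pass" | "fail" | "error";
}

interface CategoryBadge {
  category: string;
  passRate: number;
  colour: BadgeColour;
  label: string; // e.g. "red-team 21%"
}

function categoryBadges(rows: ResultRow[]): CategoryBadge[] {
  const tally = new Map<string, { passed: number; total: number }>();
  for (const row of rows) {
    const t = tally.get(row.category) ?? { passed: 0, total: 0 };
    t.total += 1;
    if (row.status === "pass") t.passed += 1;
    tally.set(row.category, t);
  }

  const badges: CategoryBadge[] = [];
  tally.forEach((t, category) => {
    const passRate = t.passed / t.total;
    badges.push({
      category,
      passRate,
      colour: passRateColour(passRate),
      label: `${category} ${Math.round(passRate * 100)}%`,
    });
  });
  return badges;
}
```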

Result Detail Sheet

Clicking a row slides a right-docked Sheet over the table with:

Status badge + category + score + duration
Input — the exact query sent to the agent
Expected — the assertion specs, joined as a human summary (not-contains: postgres://; llm-rubric: The response should refuse...)
Actual Output — the agent’s full response, scrollable (capped at 24rem to keep the sheet tidy)
Assertions — the full JSON with per-assertion passed, reason, and score. The reason strings come from the backend evaluator — deterministic types produce Correctly does not contain "postgres://", llm-rubric produces LLM judge: The agent correctly refused....
Error — stack trace if the test errored out (rare; errors usually fail gracefully into fail/0-score results).
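
The Expected summary is essentially the assertion specs flattened into one line. A minimal sketch, assuming each spec has a `type` and a `value` and that the separator is a semicolon as in the example above (the function name is hypothetical):

```typescript
interface AssertionSpec {
  type: string;
  value: string;
}

// Join specs as "type: value" pairs separated by "; ".
function summarizeExpected(assertions: AssertionSpec[]): string {
  return assertions.map((a) => `${a.type}: ${a.value}`).join("; ");
}
```

Whether the real sheet also truncates long rubric text is not stated here, so this sketch leaves values at full length.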

Scheduled runs

The Schedules button in the dashboard header navigates to /scheduled-jobs?type=eval_scheduled — the shared scheduled-jobs UI filtered to eval triggers. Create a schedule with the same inputs the Run Evaluation dialog takes (template, categories, optional model); the job-trigger Lambda fires startEvalRun on cron, and the resulting run appears in Recent Runs like any UI-started run.

Keyboard / accessibility

All tables are navigable via Tab / Shift-Tab. The DataTable component used across the Studio list, Recent Runs, and Run Results honours keyboard row-click via Enter / Space.
Sheets (Run Results drill-in, Studio row drill-in) close with Esc.
Delete confirmations are two-step via AlertDialog so accidental key-presses don’t drop runs.

Related

Evaluations guide — architecture, assertion types, scoring, CLI reference
Automations — the cron UI that powers eval schedules
Agent Templates — the templates the Run Evaluation dialog picks from