# Admin — Evaluations

The Evaluations pages are where operators author test cases, start runs, and drill into per-test results. They sit on top of the same GraphQL + eval-runner Lambda path the `thinkwork eval` CLI uses, so runs started in the UI show up in the CLI and vice versa.
Route: `/evaluations`
File: `apps/admin/src/routes/_authed/_tenant/evaluations/`
See the Evaluations guide for the architecture, assertion types, scoring model, and CLI reference. This page walks through the UI itself.
## Dashboard

The dashboard at `/evaluations` is the operator’s at-a-glance view.
- Summary metric cards — Total Runs, Latest Pass Rate, Average Score, Regressions. The pass-rate card is coloured green at ≥ 90%, yellow at ≥ 70%, and red below that. Regressions turns red when > 0.
- Pass Rate Trend chart — the last 30 days, zero-filled so the x-axis stays consistent even when there are only a handful of runs (sketched after this list). Backed by `EvalTimeSeriesQuery`.
- Recent Runs table — every completed/running/pending run for the tenant. Columns: status, categories, template, model, tests, pass rate, cost, date. Click any row to open the Run Results page for that run.
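A sketch of the zero-fill step, assuming a simplified `TrendPoint` shape rather than the actual `EvalTimeSeriesQuery` result:

```ts
// Simplified point shape; the real EvalTimeSeriesQuery result differs.
interface TrendPoint {
  date: string;     // "YYYY-MM-DD" (UTC)
  passRate: number; // 0..100
}

// Fill the last 30 days so days with no runs chart as zero instead of
// disappearing from the x-axis.
function zeroFillLast30Days(points: TrendPoint[]): TrendPoint[] {
  const byDate = new Map(points.map((p) => [p.date, p] as const));
  const filled: TrendPoint[] = [];
  for (let i = 29; i >= 0; i--) {
    const d = new Date();
    d.setUTCDate(d.getUTCDate() - i);
    const key = d.toISOString().slice(0, 10);
    filled.push(byDate.get(key) ?? { date: key, passRate: 0 });
  }
  return filled;
}
```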
The header has three actions:
- Studio → `/evaluations/studio` — the test-case CRUD surface.
- Schedules → the shared Scheduled Jobs page filtered to `trigger_type: "eval_scheduled"`.
- Run Evaluation → opens the Run Evaluation dialog.
### Categories column

The Categories column uses a smart renderer (a sketch follows the list):

- “—” when a run has no categories
- “All Categories” when the run covered every category the tenant has
- the bare category name when there’s only one
- “N Categories” with the count otherwise
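A minimal sketch of that decision logic, using a hypothetical `categoriesLabel` helper (the real cell renderer and its props will differ):

```ts
// Hypothetical helper mirroring the Categories cell rules above.
function categoriesLabel(
  runCategories: string[],
  tenantCategories: string[],
): string {
  if (runCategories.length === 0) return "—";               // no categories
  if (runCategories.length === tenantCategories.length) {
    return "All Categories";                                // covered everything
  }
  if (runCategories.length === 1) return runCategories[0];  // bare name
  return `${runCategories.length} Categories`;              // "N Categories"
}
```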
### Model column

The model column shortens Bedrock IDs so `us.anthropic.claude-haiku-4-5-20251001-v1:0` renders as `claude-haiku-4-5`. Prefix / version / date suffixes are stripped.
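One plausible implementation of that rule (the admin app’s actual parsing may differ):

```ts
// Strip the region/vendor prefix, the trailing date stamp, and the
// "-vN:0" suffix. Regexes here are an assumption, not the real code.
function shortModelName(modelId: string): string {
  return modelId
    .replace(/^.*\./, "")                   // drop "us.anthropic."
    .replace(/-\d{8}(-v\d+)?(:\d+)?$/, ""); // drop "-20251001-v1:0"
}

shortModelName("us.anthropic.claude-haiku-4-5-20251001-v1:0");
// => "claude-haiku-4-5"
```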
## Run Evaluation dialog

The dialog is where a run gets started. It has four inputs, in this order:
| Field | Purpose |
|---|---|
| Agent template (required) | The eval test agent is a generic AgentCore runtime; this template determines the workspace, tools, and default model it loads. Different templates expose different tool surfaces — that matters for tests like “should refuse to web-search.” |
| Model | Optional override of the template’s default model. Leave blank to use the template’s. |
| Invocation Mode | End-to-End (full agent runtime) is the default and the only mode currently wired. Direct (Bedrock only) is a UI-only scaffold for a future path that skips the agent runtime. |
| Categories | Multi-select pills. Click All Categories to run every enabled test case; click individual chips to run a subset. |
Start Evaluation is disabled until a template is picked. On submit the dialog calls `startEvalRun`, the row shows up in Recent Runs as pending, and the eval-runner Lambda picks it up asynchronously.
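For orientation, here is roughly how that submit maps onto the GraphQL layer. The `startEvalRun` name comes from this page; the input shape is an assumption:

```ts
// Sketch only: the mutation name is from the docs, the input shape is
// a guess at the real schema.
const START_EVAL_RUN = /* GraphQL */ `
  mutation StartEvalRun($input: StartEvalRunInput!) {
    startEvalRun(input: $input) {
      id
      status # "pending" until the eval-runner Lambda picks the run up
    }
  }
`;

// Example variables mirroring the dialog's four inputs:
const variables = {
  input: {
    agentTemplateId: "tmpl_abc123",    // required; hypothetical ID
    model: null,                       // blank keeps the template default
    invocationMode: "END_TO_END",      // the only mode currently wired
    categories: ["red-team", "tools"], // hypothetical subset; empty = all
  },
};
```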
## Studio — test case list

The Studio at `/evaluations/studio` is the test-case CRUD surface. Every row is a test case stored in `eval_test_cases`; rows are per-tenant and carry the assertion + evaluator config.
Route: `/evaluations/studio`
Columns: Name (clickable → Test Case detail), Category, Evaluators (count), Assertions (count), Enabled toggle, Updated.
Actions:
- Import starter pack — calls the `seedEvalTestCases` mutation to idempotently import the 96-test maniflow pack across 9 categories. Re-runs are safe (unique index on `(tenant_id, name)` for `source='yaml-seed'` rows).
- New test case → `/evaluations/studio/new` — the form described below.
- Search by name — free-text filter over `evalTestCases(tenantId, search)`.
- Trash icon per row — `deleteEvalTestCase` mutation, confirms first.
### First-visit auto-seed

The Studio auto-seeds the 96-case starter pack on a tenant’s first visit (the `evalTestCases` query checks for any `source='yaml-seed'` rows and imports the pack if there are none). The seed is idempotent, so re-visits don’t duplicate.
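In sketch form, assuming hypothetical wrappers around the two GraphQL operations:

```ts
// Hypothetical wrappers over the evalTestCases query and the
// seedEvalTestCases mutation named above.
declare function countSeededTestCases(tenantId: string): Promise<number>; // source='yaml-seed'
declare function seedEvalTestCases(tenantId: string): Promise<void>;

async function ensureStarterPack(tenantId: string): Promise<void> {
  const seeded = await countSeededTestCases(tenantId);
  if (seeded === 0) {
    // Idempotent: the unique index on (tenant_id, name) makes re-runs no-ops.
    await seedEvalTestCases(tenantId);
  }
}
```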
## Test Case detail

Route: `/evaluations/studio/$testCaseId`
Two sections:
- Test Configuration — query, assertions list (each with its type badge and value), AgentCore evaluators, tags. Read-only snapshot; click Edit in the header to open the editor.
- Run History — a DataTable of every `eval_results` row for this test case across runs. Click a row to open the side-docked Sheet with the same breakdown the Run Results page uses (Input / Expected / Actual Output / Assertions).
The header shows the Enabled badge plus Edit and Delete actions.
## Test Case editor

Route: `/evaluations/studio/edit/$testCaseId` (or `/evaluations/studio/new`)

The editor is a react-hook-form + zod flow covering every field on an `eval_test_cases` row (a zod sketch follows the list):
- Name, Category, Query, optional System Prompt
- Assertions — typed repeater. A dropdown lets you pick `contains`, `icontains`, `not-contains`, `not-icontains`, `equals`, `regex`, or `llm-rubric`. For `llm-rubric`, the value input is relabeled “Rubric (what the response must do)” and becomes a textarea.
- AgentCore evaluators — pill multi-select over all 16 built-ins (Helpfulness, Correctness, Refusal, ToolSelectionAccuracy, GoalSuccessRate, etc.). Pick any subset; they run per test case and score the session spans.
- Tags (free-form), Enabled toggle, optional Agent template override
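A minimal zod sketch of the form’s shape; the field names track the bullets above, but the app’s actual schema may differ:

```ts
import { z } from "zod";

// Assertion types listed in the editor dropdown.
const assertionType = z.enum([
  "contains", "icontains", "not-contains", "not-icontains",
  "equals", "regex", "llm-rubric",
]);

// Hypothetical schema mirroring the editor fields above.
const testCaseSchema = z.object({
  name: z.string().min(1),
  category: z.string().min(1),
  query: z.string().min(1),
  systemPrompt: z.string().optional(),
  assertions: z.array(z.object({
    type: assertionType,
    value: z.string().min(1), // rubric text when type is "llm-rubric"
  })).min(1),
  evaluators: z.array(z.string()), // e.g. "Helpfulness", "Correctness"
  tags: z.array(z.string()),
  enabled: z.boolean().default(true),
  agentTemplateId: z.string().optional(), // per-case template override
});
```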
See the Evaluations guide for the assertion-type semantics.
## Run Results

Route: `/evaluations/$runId`
The run-detail page is the operator’s drill-in view. It has three layers:
- Header — status badge, pass rate, total cost, agent template name, timestamp. Live while `pending`/`running` — subscription-driven with a 3s poll fallback (sketched after this list). The trailing action button is Cancel while running, Delete otherwise.
- Category filter badges — one per category present in the run, each coloured by its per-category pass rate (green ≥ 90%, yellow ≥ 70%, red below) and showing the percentage inline (`red-team 21%`). Click to filter; click again to clear.
- Results DataTable — per-test rows with test name, category badge, status, score, duration. Click any row to open the Result Detail Sheet.
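A sketch of that live-update pattern, with hypothetical `subscribeToRun`/`refetchRun` helpers standing in for the real subscription and query hooks:

```ts
// Hypothetical helpers; the admin app's real hooks will differ.
declare function subscribeToRun(
  runId: string,
  onUpdate: () => void,
): { unsubscribe(): void } | null; // null when subscriptions are unavailable

declare function refetchRun(runId: string): Promise<void>;

// Prefer the subscription; fall back to a 3-second poll while the run
// is still pending/running. Returns a cleanup function.
function watchRun(runId: string): () => void {
  const sub = subscribeToRun(runId, () => void refetchRun(runId));
  if (sub) return () => sub.unsubscribe();
  const timer = setInterval(() => void refetchRun(runId), 3_000);
  return () => clearInterval(timer);
}
```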
### Result Detail Sheet

Clicking a row slides a right-docked Sheet over the table with:
- Status badge + category + score + duration
- Input — the exact query sent to the agent
- Expected — the assertion specs, joined as a human summary (`not-contains: postgres://; llm-rubric: The response should refuse...`)
- Actual Output — the agent’s full response, scrollable (capped at 24rem to keep the sheet tidy)
- Assertions — the full JSON with per-assertion `passed`, `reason`, and `score` (a sketch of the shape follows this list). The reason strings come from the backend evaluator — deterministic types produce `Correctly does not contain "postgres://"`, `llm-rubric` produces `LLM judge: The agent correctly refused...`.
- Error — stack trace if the test errored out (rare; errors usually fail gracefully into fail/0-score results).
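A plausible TypeScript shape for one entry in that JSON, inferred from the fields named above rather than taken from the actual schema:

```ts
// Inferred shape; fields beyond passed/reason/score are assumptions.
interface AssertionResult {
  type: string;     // e.g. "not-contains" or "llm-rubric"
  value: string;    // the assertion spec, e.g. "postgres://"
  passed: boolean;
  score: number;    // per-assertion score (semantics vary by type)
  reason: string;   // e.g. 'Correctly does not contain "postgres://"'
}
```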
## Scheduled runs

The Schedules button in the dashboard header navigates to `/scheduled-jobs?type=eval_scheduled` — the shared scheduled-jobs UI filtered to eval triggers. Create a schedule with the same inputs the Run Evaluation dialog takes (template, categories, optional model); the job-trigger Lambda fires `startEvalRun` on cron, and the resulting run appears in Recent Runs like any UI-started run.
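For illustration, a schedule record might look like the following; the field names are assumptions, not the actual scheduled-jobs schema:

```ts
// Illustrative only: real scheduled-jobs fields may differ.
const evalSchedule = {
  triggerType: "eval_scheduled",    // the filter the Schedules page applies
  cron: "0 6 * * 1",                // e.g. every Monday 06:00 UTC
  payload: {
    agentTemplateId: "tmpl_abc123", // required, as in the dialog
    categories: ["red-team"],       // subset, or omit for all categories
    model: undefined,               // optional model override
  },
};
```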
## Keyboard / accessibility

- All tables are navigable via Tab / Shift-Tab. The DataTable component used across the Studio list, Recent Runs, and Run Results honours keyboard row-click via Enter / Space.
- Sheets (Run Results drill-in, Studio row drill-in) close with `Esc`.
- Delete confirmations are two-step via `AlertDialog` so accidental key-presses don’t drop runs.
## Related

- Evaluations guide — architecture, assertion types, scoring, CLI reference
- Automations — the cron UI that powers eval schedules
- Agent Templates — the templates the Run Evaluation dialog picks from