Admin — Evaluations

The Evaluations pages are where operators author test cases, start runs, and drill into per-test results. They sit on top of the same GraphQL + eval-runner Lambda path the thinkwork eval CLI uses, so runs started in the UI show up in the CLI and vice versa.

Route: /evaluations
File: apps/admin/src/routes/_authed/_tenant/evaluations/

See the Evaluations guide for the architecture, assertion types, scoring model, and CLI reference. This page walks through the UI itself.

The dashboard at /evaluations is the operator’s at-a-glance view.

  • Summary metric cards — Total Runs, Latest Pass Rate, Average Score, Regressions. The pass-rate card colours green ≥ 90%, yellow ≥ 70%, red below. Regressions turns red when > 0.
  • Pass Rate Trend chart — the last 30 days, zero-filled so the x-axis stays consistent even when there are only a handful of runs. Backed by EvalTimeSeriesQuery.
  • Recent Runs table — every completed/running/pending run for the tenant. Columns: status, categories, template, model, tests, pass rate, cost, date. Click any row to open the Run Results page for that run.
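
The zero-fill detail matters: the trend chart always plots a full 30-day window and substitutes 0 for days with no runs. A minimal sketch of that behaviour, assuming the time-series query returns one bucket per day keyed by ISO date (the shape here is illustrative, not the actual EvalTimeSeriesQuery payload):

```ts
// Illustrative only: the real EvalTimeSeriesQuery payload shape may differ.
type DayBucket = { date: string; passRate: number }; // date as "YYYY-MM-DD"

function zeroFillLast30Days(buckets: DayBucket[]): DayBucket[] {
  const byDate = new Map(buckets.map((b) => [b.date, b.passRate]));
  const out: DayBucket[] = [];
  const today = new Date();
  for (let i = 29; i >= 0; i--) {
    const d = new Date(today);
    d.setDate(today.getDate() - i);
    const key = d.toISOString().slice(0, 10);
    // Days with no completed runs plot as 0 so the x-axis stays continuous.
    out.push({ date: key, passRate: byDate.get(key) ?? 0 });
  }
  return out;
}
```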

The header has three actions:

  • Studio → /evaluations/studio — the test-case CRUD surface.
  • Schedules → the shared Scheduled Jobs page filtered to trigger_type: "eval_scheduled".
  • Run Evaluation → opens the Run Evaluation dialog.

The Categories column uses a smart renderer:

  • an empty cell when a run has no categories
  • “All Categories” when the run covered every category the tenant has
  • the bare category name when there’s only one
  • “N Categories” with the count otherwise
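
A minimal sketch of that renderer logic, assuming the run row exposes its category list and the tenant's total category count (the names are illustrative, not the actual component):

```ts
// Illustrative names; the real run row / column renderer may differ.
function renderCategories(runCategories: string[], tenantCategoryCount: number): string {
  if (runCategories.length === 0) return "";                        // rendered as an empty cell
  if (runCategories.length === tenantCategoryCount) return "All Categories";
  if (runCategories.length === 1) return runCategories[0];
  return `${runCategories.length} Categories`;
}
```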

The model column shortens Bedrock IDs so us.anthropic.claude-haiku-4-5-20251001-v1:0 renders as claude-haiku-4-5. Prefix / version / date suffixes are stripped.
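
A rough sketch of that shortening, assuming Bedrock IDs follow the prefix.vendor.name-date-version pattern; the regexes are illustrative, not the component's actual implementation:

```ts
// Illustrative: keeps the last dot-separated segment, then strips the
// ":<revision>", "-v<n>", and "-<YYYYMMDD>" suffixes.
function shortenBedrockModelId(modelId: string): string {
  const lastSegment = modelId.split(".").pop() ?? modelId; // "claude-haiku-4-5-20251001-v1:0"
  return lastSegment
    .replace(/:\d+$/, "")    // drop ":0"
    .replace(/-v\d+$/, "")   // drop "-v1"
    .replace(/-\d{8}$/, ""); // drop "-20251001"
}

// shortenBedrockModelId("us.anthropic.claude-haiku-4-5-20251001-v1:0") === "claude-haiku-4-5"
```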

The Run Evaluation dialog is where a run gets started. It has four inputs, in this order:

  • Agent template (required) — The eval test agent is a generic AgentCore runtime; this template determines the workspace, tools, and default model it loads. Different templates expose different tool surfaces — that matters for tests like “should refuse to web-search.”
  • Model — Optional override of the template’s default model. Leave blank to use the template’s.
  • Invocation Mode — End-to-End (full agent runtime) is the default and the only mode currently wired. Direct (Bedrock only) is a UI-only scaffold for a future path that skips the agent runtime.
  • Categories — Multi-select pills. Click All Categories to run every enabled test case; click individual chips to run a subset.

Start Evaluation is disabled until a template is picked. On submit the dialog calls startEvalRun, the row shows up in Recent Runs as pending, and the eval-runner Lambda picks it up asynchronously.
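
As a rough sketch of what a submit amounts to, here is a hypothetical call to startEvalRun via graphql-request; the mutation document, input field names, and endpoint are assumptions, not the real schema:

```ts
import { GraphQLClient } from "graphql-request";

// Illustrative document and input shape; the real startEvalRun signature may differ.
const START_EVAL_RUN = /* GraphQL */ `
  mutation StartEvalRun($input: StartEvalRunInput!) {
    startEvalRun(input: $input) { id status }
  }
`;

const client = new GraphQLClient("/graphql");

export async function startRunFromDialog() {
  return client.request(START_EVAL_RUN, {
    input: {
      agentTemplateId: "tmpl_123",  // required: the picked agent template
      modelId: undefined,           // optional override; blank uses the template default
      invocationMode: "END_TO_END", // "DIRECT" is UI-only scaffolding today
      categories: ["red-team"],     // a subset, or all enabled categories
    },
  });
}
```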

The Studio at /evaluations/studio is the test-case CRUD surface. Every row is a test case stored in eval_test_cases; rows are per-tenant and carry the assertion + evaluator config.

Route: /evaluations/studio

Columns: Name (clickable → Test Case detail), Category, Evaluators (count), Assertions (count), Enabled toggle, Updated.

Actions:

  • Import starter pack — calls the seedEvalTestCases mutation to idempotently import the 96-test maniflow pack across 9 categories. Re-runs are safe (unique index on (tenant_id, name) for source='yaml-seed' rows).
  • New test case → /evaluations/studio/new — the form described below.
  • Search by name — free-text filter over evalTestCases(tenantId, search).
  • Trash icon per row — deleteEvalTestCase mutation, confirms first.

The Studio auto-seeds the 96-case starter pack on a tenant’s first visit (the evalTestCases query checks for any source='yaml-seed' rows and imports if zero). The seed is idempotent so re-visits don’t duplicate.
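
Roughly, that first-visit guard looks like the sketch below; the function and dependency names are illustrative, not the actual resolver:

```ts
// Illustrative sketch of the first-visit auto-seed guard; the real resolver differs.
type Deps = {
  countSeededTestCases: (tenantId: string) => Promise<number>; // rows with source = 'yaml-seed'
  seedEvalTestCases: (tenantId: string) => Promise<void>;      // imports the 96-case starter pack
  listTestCases: (tenantId: string, search?: string) => Promise<unknown[]>;
};

async function evalTestCasesWithAutoSeed(deps: Deps, tenantId: string, search?: string) {
  if ((await deps.countSeededTestCases(tenantId)) === 0) {
    // Safe to re-run: the unique index on (tenant_id, name) for source='yaml-seed'
    // rows keeps the import idempotent.
    await deps.seedEvalTestCases(tenantId);
  }
  return deps.listTestCases(tenantId, search);
}
```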

Route: /evaluations/studio/$testCaseId

Two sections:

  • Test Configuration — query, assertions list (each with its type badge and value), AgentCore evaluators, tags. Read-only snapshot; click Edit in the header to open the editor.
  • Run History — DataTable of every eval_results row for this test case across runs. Click a row to open the side-docked Sheet with the same breakdown the Run Results page uses (Input / Expected / Actual Output / Assertions).

The header shows the Enabled badge plus Edit and Delete actions.

Route: /evaluations/studio/edit/$testCaseId (or /evaluations/studio/new)

The editor is a react-hook-form + zod flow covering every field on an eval_test_cases row:

  • Name, Category, Query, optional System Prompt
  • Assertions — typed repeater. Dropdown lets you pick contains, icontains, not-contains, not-icontains, equals, regex, or llm-rubric. For llm-rubric, the value input is relabeled “Rubric (what the response must do)” and gets a textarea.
  • AgentCore evaluators — pill multi-select over all 16 built-ins (Helpfulness, Correctness, Refusal, ToolSelectionAccuracy, GoalSuccessRate, etc.). Pick any subset; they run per-test-case and score the session spans.
  • Tags (free-form), Enabled toggle, optional Agent template override
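
A rough zod sketch of what the assertion repeater and the rest of the form validate; the field names are assumptions, not the editor's actual schema:

```ts
import { z } from "zod";

// Illustrative; the editor's actual react-hook-form + zod schema may differ.
const assertionSchema = z.object({
  type: z.enum([
    "contains",
    "icontains",
    "not-contains",
    "not-icontains",
    "equals",
    "regex",
    "llm-rubric", // value holds the rubric text ("what the response must do")
  ]),
  value: z.string().min(1),
});

const testCaseFormSchema = z.object({
  name: z.string().min(1),
  category: z.string().min(1),
  query: z.string().min(1),
  systemPrompt: z.string().optional(),
  assertions: z.array(assertionSchema).min(1),
  evaluators: z.array(z.string()),        // AgentCore built-ins, e.g. "Helpfulness"
  tags: z.array(z.string()),
  enabled: z.boolean(),
  agentTemplateId: z.string().optional(), // per-test-case template override
});
```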

See the Evaluations guide for the assertion-type semantics.

Route: /evaluations/$runId

The run-detail page is the operator’s drill-in view. It has three layers:

  1. Header — status badge, pass rate, total cost, agent template name, timestamp. Live while pending/running — subscription-driven with a 3s poll fallback. The trailing action button is Cancel while running, Delete otherwise.
  2. Category filter badges — one per category present in the run, each coloured by its per-category pass rate (green ≥ 90%, yellow ≥ 70%, red below) and showing the percentage inline (red-team 21%). Click to filter; click again to clear.
  3. Results DataTable — per-test rows with test name, category badge, status, score, duration. Click any row to open the Result Detail Sheet.
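
The badge colours follow the same thresholds as the dashboard pass-rate card; a tiny illustrative helper:

```ts
// Illustrative helper; same thresholds as the dashboard pass-rate card.
type BadgeColour = "green" | "yellow" | "red";

function passRateColour(passRatePct: number): BadgeColour {
  if (passRatePct >= 90) return "green";
  if (passRatePct >= 70) return "yellow";
  return "red";
}

// passRateColour(21) === "red"  // e.g. the red-team 21% badge
```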

Clicking a row slides a right-docked Sheet over the table with:

  • Status badge + category + score + duration
  • Input — the exact query sent to the agent
  • Expected — the assertion specs, joined as a human summary (not-contains: postgres://; llm-rubric: The response should refuse...)
  • Actual Output — the agent’s full response, scrollable (capped at 24rem to keep the sheet tidy)
  • Assertions — the full JSON with per-assertion passed, reason, and score. The reason strings come from the backend evaluator — deterministic types produce Correctly does not contain "postgres://", llm-rubric produces LLM judge: The agent correctly refused....
  • Error — stack trace if the test errored out (rare; errors usually fail gracefully into fail/0-score results).
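
For orientation, the per-assertion entries in that JSON look roughly like the sketch below; the exact field names and score range are assumptions, not the backend's schema (passed, reason, and score are the fields the UI surfaces):

```ts
// Illustrative shape; the backend evaluator's actual payload may differ.
type AssertionResult = {
  type: string;    // e.g. "not-contains" or "llm-rubric"
  value: string;   // the assertion spec, e.g. "postgres://"
  passed: boolean;
  score: number;   // per-assertion score (range here is assumed)
  reason: string;
};

const example: AssertionResult[] = [
  {
    type: "not-contains",
    value: "postgres://",
    passed: true,
    score: 1,
    reason: 'Correctly does not contain "postgres://"',
  },
  {
    type: "llm-rubric",
    value: "The response should refuse...",
    passed: true,
    score: 1,
    reason: "LLM judge: The agent correctly refused...",
  },
];
```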

The Schedules button in the dashboard header navigates to /scheduled-jobs?type=eval_scheduled — the shared scheduled-jobs UI filtered to eval triggers. Create a schedule with the same inputs the Run Evaluation dialog takes (template, categories, optional model); the job-trigger Lambda fires startEvalRun on cron, and the resulting run appears in Recent Runs like any UI-started run.

  • All tables are navigable via Tab / Shift-Tab. The DataTable component used across the Studio list, Recent Runs, and Run Results honours keyboard row-click via Enter / Space.
  • Sheets (Run Results drill-in, Studio row drill-in) close with Esc.
  • Delete confirmations are two-step via AlertDialog so accidental key-presses don’t drop runs.