# Eval Packs
Eval packs let you write systematic evaluations for your agents as code. A dataset is a set of test cases. A scorer is a Python function that grades each response. The CLI runs the eval and writes results to S3 for analysis.
Evals run via `EVAL-` threads — each test case becomes a real thread invocation against your live agent, so evals measure actual production behavior, including skill packs, knowledge bases, and memory.
## Eval pack structure

```
my-eval-pack/
├── eval.yaml          # Pack metadata and configuration
├── dataset.jsonl      # Test cases (one JSON object per line)
├── scorers/
│   ├── accuracy.py    # Custom scorer: checks factual accuracy
│   ├── format.py      # Custom scorer: checks response format
│   └── __init__.py
└── README.md
```
## eval.yaml

```yaml
name: support-bot-accuracy
version: "1.0.0"
description: "Evaluates Support Bot accuracy on tier-1 support questions"

agent_id: agent-support
dataset: dataset.jsonl

scorers:
  - name: exact_match
    type: built_in   # Use a built-in scorer
  - name: contains_answer
    type: built_in
  - name: accuracy
    type: custom
    module: scorers.accuracy
    function: score_accuracy
  - name: format_check
    type: custom
    module: scorers.format
    function: check_format

# How many test cases to run concurrently
concurrency: 5

# Timeout per test case (seconds)
timeout: 120

# Stop after this many consecutive failures
max_consecutive_failures: 3
```
## Dataset format

Each line in `dataset.jsonl` is a test case:
```jsonl
{"id": "tc-001", "input": "How do I reset my password?", "expected": "Go to the login page and click 'Forgot Password'", "tags": ["password", "auth"]}
{"id": "tc-002", "input": "What's your refund policy?", "expected": "30-day full refund, no questions asked", "tags": ["billing", "refund"]}
{"id": "tc-003", "input": "How do I export my data?", "expected": "Settings → Data Export → Download CSV", "tags": ["data", "export"]}
{"id": "tc-004", "input": "Is there a mobile app?", "expected": "Yes, available on iOS and Android", "tags": ["mobile", "product"]}
```

Fields:
| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique test case identifier |
| `input` | Yes | The user message to send to the agent |
| `expected` | No | Expected answer or ground truth (used by built-in scorers) |
| `context` | No | Additional context injected into the thread before the input |
| `tags` | No | Labels for filtering results |
| `metadata` | No | Arbitrary JSON metadata |
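Malformed dataset lines are easier to catch before a run than midway through one. The CLI may validate on its own, but a standalone check like the sketch below (not part of the toolkit; the field names come from the table above) flags invalid JSON, missing required fields, and duplicate ids early:

```python
import json

REQUIRED = {"id", "input"}
OPTIONAL = {"expected", "context", "tags", "metadata"}

def validate_dataset(path: str) -> list[str]:
    """Return a list of human-readable problems found in a dataset.jsonl file."""
    problems = []
    seen_ids = set()
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # Allow blank lines between cases
            try:
                case = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append(f"line {lineno}: invalid JSON ({e})")
                continue
            missing = REQUIRED - case.keys()
            if missing:
                problems.append(f"line {lineno}: missing fields {sorted(missing)}")
            unknown = case.keys() - REQUIRED - OPTIONAL
            if unknown:
                problems.append(f"line {lineno}: unknown fields {sorted(unknown)}")
            if case.get("id") in seen_ids:
                problems.append(f"line {lineno}: duplicate id {case['id']!r}")
            seen_ids.add(case.get("id"))
    return problems
```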
## Writing a custom scorer

A scorer is a Python function that receives the test case and the agent’s response and returns a score between 0.0 and 1.0.
```python
from typing import TypedDict

class TestCase(TypedDict):
    id: str
    input: str
    expected: str
    tags: list[str]

class AgentResponse(TypedDict):
    body: str           # Full text response
    tool_calls: list    # Any tool calls made
    token_count: int
    duration_ms: int

def score_accuracy(test_case: TestCase, response: AgentResponse) -> float:
    """
    Score the factual accuracy of the response against the expected answer.
    Returns 1.0 for correct, 0.5 for partial, 0.0 for incorrect.
    """
    expected = test_case["expected"].lower().strip()
    actual = response["body"].lower().strip()

    # Exact match
    if expected in actual:
        return 1.0

    # Partial match: at least 80% of key words present
    key_words = [w for w in expected.split() if len(w) > 3]
    matches = sum(1 for w in key_words if w in actual)
    if key_words and matches / len(key_words) >= 0.8:
        return 0.5

    return 0.0
```
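Before wiring a scorer into a pack, it is worth driving it directly with hand-built dicts. A minimal harness sketch (the function is inlined here so the snippet runs standalone; in a real pack you would import it from `scorers/accuracy.py`):

```python
def score_accuracy(test_case: dict, response: dict) -> float:
    """Same logic as scorers/accuracy.py, inlined for a standalone check."""
    expected = test_case["expected"].lower().strip()
    actual = response["body"].lower().strip()
    if expected in actual:
        return 1.0
    key_words = [w for w in expected.split() if len(w) > 3]
    matches = sum(1 for w in key_words if w in actual)
    if key_words and matches / len(key_words) >= 0.8:
        return 0.5
    return 0.0

case = {"id": "tc-001", "input": "How do I reset my password?",
        "expected": "Go to the login page and click 'Forgot Password'"}

# Exact containment of the expected answer scores 1.0
full = {"body": "Go to the login page and click 'Forgot Password' to continue."}
# Most key words present (but not the full phrase) scores 0.5
partial = {"body": "Open the login page and use the 'Forgot Password' link."}
# Off-topic answer scores 0.0
miss = {"body": "Please contact support for help."}

print(score_accuracy(case, full), score_accuracy(case, partial), score_accuracy(case, miss))
# → 1.0 0.5 0.0
```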
## Scorer with LLM grading

For open-ended questions, use an LLM to grade responses:
```python
import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

GRADING_PROMPT = """You are grading an AI assistant's response to a support question.

Question: {question}
Expected answer: {expected}
Actual response: {actual}

Grade the response on a scale from 0 to 10:
- 10: Perfect, complete, and accurate
- 7-9: Mostly correct with minor gaps
- 4-6: Partially correct but missing key information
- 1-3: Mostly incorrect or misleading
- 0: Completely wrong or harmful

Respond with only a JSON object: {{"score": <number>, "reason": "<brief explanation>"}}"""

def grade_with_llm(test_case, response) -> float:
    prompt = GRADING_PROMPT.format(
        question=test_case["input"],
        expected=test_case["expected"],
        actual=response["body"],
    )

    result = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )

    output = json.loads(result["body"].read())
    grade = json.loads(output["content"][0]["text"])
    return grade["score"] / 10.0  # Normalize to 0.0–1.0
```
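Models do not always return bare JSON, and `grade["score"]` can be missing or out of range. A defensive parsing helper like this sketch (the fallback-to-0.0 policy is an assumption, not part of the toolkit) keeps one malformed grade from crashing a whole run:

```python
import json
import re

def parse_grade(text: str) -> float:
    """
    Extract a {"score": N, ...} object from model output and normalize
    it to 0.0-1.0. Tolerates prose around the JSON and clamps the score.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return 0.0  # No JSON at all: treat as a failed grade rather than crash
    try:
        grade = json.loads(match.group(0))
        score = float(grade["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0  # Malformed JSON or missing/non-numeric score
    return min(max(score / 10.0, 0.0), 1.0)
```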
## Built-in scorers

| Scorer name | Description |
|---|---|
| `exact_match` | Response contains the exact expected string (case-insensitive) |
| `contains_answer` | Response contains all key phrases from `expected` |
| `no_refusal` | Response doesn’t contain refusal phrases (“I cannot”, “I’m unable to”) |
| `json_valid` | Response is valid JSON |
| `no_hallucination` | Response doesn’t contain phrases that contradict the expected answer |
| `tool_called` | Agent called at least one tool (non-empty `tool_calls`) |
| `latency_sla` | Response completed within a threshold (set `threshold_ms` in scorer config) |
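The table describes behavior, not implementation. If the exact semantics matter for your dataset, a few of these are easy to approximate locally (a sketch of plausible implementations; the actual built-ins may differ in details such as phrase extraction or punctuation handling):

```python
import json

def exact_match(test_case: dict, response: dict) -> float:
    """Case-insensitive: the full expected string appears in the response."""
    return 1.0 if test_case["expected"].lower() in response["body"].lower() else 0.0

def no_refusal(test_case: dict, response: dict) -> float:
    """Fails if the response contains a known refusal phrase."""
    refusals = ("i cannot", "i'm unable to", "i am unable to")
    body = response["body"].lower()
    return 0.0 if any(p in body for p in refusals) else 1.0

def json_valid(test_case: dict, response: dict) -> float:
    """Passes only if the whole response body parses as JSON."""
    try:
        json.loads(response["body"])
        return 1.0
    except json.JSONDecodeError:
        return 0.0
```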
## Running evals

```sh
# Run the eval pack against your dev deployment
thinkwork eval run ./my-eval-pack -s dev

# Run only test cases with specific tags
thinkwork eval run ./my-eval-pack -s dev --tags billing,refund

# Run a single test case
thinkwork eval run ./my-eval-pack -s dev --id tc-001

# Dry run: print what would be run without invoking the agent
thinkwork eval run ./my-eval-pack -s dev --dry-run
```

Output during a run:
```
Running eval: support-bot-accuracy (32 test cases)
Concurrency: 5

  ✓ tc-001  [1.23s]  accuracy=1.0  format=1.0
  ✓ tc-002  [0.98s]  accuracy=1.0  format=1.0
  ✗ tc-003  [2.14s]  accuracy=0.0  format=1.0  ← FAIL
  ✓ tc-004  [1.05s]  accuracy=0.5  format=1.0
  ...

Results (32/32 complete):
  Pass rate: 87.5% (28/32)
  Mean accuracy: 0.84
  Mean format: 0.97
  Mean latency: 1,340ms (p50), 2,890ms (p95)

Results saved to: s3://dev-thinkwork-audit-logs/evals/support-bot-accuracy/2024-04-10T09-23-45Z/
```
## Viewing eval results

Results are stored as JSON in S3 and surfaced in the admin app under **Evals → Runs**.
```sh
# Download results
aws s3 cp s3://dev-thinkwork-audit-logs/evals/support-bot-accuracy/2024-04-10T09-23-45Z/results.json .

# Query results with jq
jq '.cases[] | select(.scores.accuracy < 0.5) | {id, input, response: .response.body}' results.json
```

Results structure:
```json
{
  "evalPack": "support-bot-accuracy",
  "runId": "eval-run-abc123",
  "agentId": "agent-support",
  "startedAt": "2024-04-10T09:23:45Z",
  "completedAt": "2024-04-10T09:24:12Z",
  "summary": {
    "total": 32,
    "passed": 28,
    "failed": 4,
    "passRate": 0.875,
    "meanScores": { "accuracy": 0.84, "format": 0.97 }
  },
  "cases": [
    {
      "id": "tc-001",
      "input": "How do I reset my password?",
      "expected": "Go to the login page and click 'Forgot Password'",
      "response": {
        "body": "To reset your password, navigate to the login page and click the 'Forgot Password' link below the sign-in form.",
        "durationMs": 1230,
        "tokenCount": 142
      },
      "scores": { "accuracy": 1.0, "format": 1.0 },
      "passed": true
    }
  ]
}
```
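The per-case `scores` map makes triage scriptable. For example, a sketch (a hypothetical helper, using only the fields shown above) that lists each failing case alongside its weakest scorer:

```python
import json

def summarize_failures(results: dict) -> list[dict]:
    """For each failing case, report the scorer that dragged it down."""
    rows = []
    for case in results["cases"]:
        if case["passed"]:
            continue
        worst = min(case["scores"], key=case["scores"].get)
        rows.append({
            "id": case["id"],
            "worst_scorer": worst,
            "score": case["scores"][worst],
        })
    return rows

# Usage: summarize_failures(json.load(open("results.json")))
```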
## CI integration

Run evals in your deployment pipeline:
```yaml
- name: Deploy to staging
  run: thinkwork deploy -s staging --auto-approve

- name: Run eval suite
  run: thinkwork eval run ./evals/support-bot-accuracy -s staging
  env:
    EVAL_FAIL_THRESHOLD: "0.80"   # Fail CI if pass rate < 80%

- name: Promote to production (if evals pass)
  run: thinkwork deploy -s prod --auto-approve
```
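If your pipeline cannot rely on the CLI's `EVAL_FAIL_THRESHOLD` handling, the same gate is easy to reproduce from `results.json` (a sketch assuming the results layout shown earlier; the script name and wiring are hypothetical):

```python
import json
import os
import sys

def passes_gate(results: dict, threshold: float) -> bool:
    """True when the run's pass rate meets the CI threshold."""
    return results["summary"]["passRate"] >= threshold

if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_gate.py results.json
    threshold = float(os.environ.get("EVAL_FAIL_THRESHOLD", "0.80"))
    with open(sys.argv[1]) as f:
        results = json.load(f)
    rate = results["summary"]["passRate"]
    if not passes_gate(results, threshold):
        print(f"Eval gate failed: pass rate {rate:.1%} < threshold {threshold:.0%}")
        sys.exit(1)
    print(f"Eval gate passed: pass rate {rate:.1%}")
```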