Eval Packs

Eval packs let you write systematic evaluations for your agents as code. A dataset is a set of test cases. A scorer is a Python function that grades each response. The CLI runs the eval and writes results to S3 for analysis.

Evals run as EVAL-prefixed threads — each test case becomes a real thread invoke against your live agent, so evals measure actual production behavior, including skill packs, knowledge bases, and memory.

An eval pack is a directory with the following layout:

```
my-eval-pack/
├── eval.yaml          # Pack metadata and configuration
├── dataset.jsonl      # Test cases (one JSON object per line)
├── scorers/
│   ├── accuracy.py    # Custom scorer: checks factual accuracy
│   ├── format.py      # Custom scorer: checks response format
│   └── __init__.py
└── README.md
```
eval.yaml

```yaml
name: support-bot-accuracy
version: "1.0.0"
description: "Evaluates Support Bot accuracy on tier-1 support questions"
agent_id: agent-support
dataset: dataset.jsonl

scorers:
  - name: exact_match
    type: built_in # Use a built-in scorer
  - name: contains_answer
    type: built_in
  - name: accuracy
    type: custom
    module: scorers.accuracy
    function: score_accuracy
  - name: format_check
    type: custom
    module: scorers.format
    function: check_format

# How many test cases to run concurrently
concurrency: 5

# Timeout per test case (seconds)
timeout: 120

# Stop after this many consecutive failures
max_consecutive_failures: 3
```
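The custom scorer entries pair a `module` with a `function`. A minimal sketch of how such an entry could be resolved to a Python callable — illustrative only, and not the CLI's actual loader:

```python
import importlib


def resolve_scorer(entry: dict):
    """Resolve a custom scorer entry from eval.yaml to a Python callable.

    Sketch: assumes entries shaped like
    {"type": "custom", "module": "scorers.accuracy", "function": "score_accuracy"}.
    """
    if entry.get("type") != "custom":
        raise ValueError("only custom scorers resolve to Python functions")
    module = importlib.import_module(entry["module"])
    return getattr(module, entry["function"])


# Demo with a stdlib module so the sketch runs standalone:
fn = resolve_scorer({"type": "custom", "module": "json", "function": "loads"})
print(fn('{"ok": true}'))  # → {'ok': True}
```

Resolving by dotted module path is why the pack needs `scorers/__init__.py` — the directory must be importable as a package.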

Each line in dataset.jsonl is a test case:

{"id": "tc-001", "input": "How do I reset my password?", "expected": "Go to the login page and click 'Forgot Password'", "tags": ["password", "auth"]}
{"id": "tc-002", "input": "What's your refund policy?", "expected": "30-day full refund, no questions asked", "tags": ["billing", "refund"]}
{"id": "tc-003", "input": "How do I export my data?", "expected": "Settings → Data Export → Download CSV", "tags": ["data", "export"]}
{"id": "tc-004", "input": "Is there a mobile app?", "expected": "Yes, available on iOS and Android", "tags": ["mobile", "product"]}

Fields:

| Field | Required | Description |
| --- | --- | --- |
| `id` | Yes | Unique test case identifier |
| `input` | Yes | The user message to send to the agent |
| `expected` | No | Expected answer or ground truth (used by built-in scorers) |
| `context` | No | Additional context injected into the thread before the input |
| `tags` | No | Labels for filtering results |
| `metadata` | No | Arbitrary JSON metadata |
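The field table above can be turned into a quick pre-flight check for a dataset file. A sketch — the CLI's own validation may differ:

```python
import json

REQUIRED = {"id", "input"}
OPTIONAL = {"expected", "context", "tags", "metadata"}


def validate_line(line: str) -> dict:
    """Parse one dataset.jsonl line and check it against the field table."""
    case = json.loads(line)
    missing = REQUIRED - case.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    unknown = case.keys() - REQUIRED - OPTIONAL
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    return case


case = validate_line('{"id": "tc-001", "input": "How do I reset my password?", "tags": ["auth"]}')
print(case["id"])  # → tc-001
```

Running a check like this before `eval run` catches schema mistakes without burning agent invocations.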

A scorer is a Python function that receives the test case and the agent’s response and returns a score between 0.0 and 1.0.

scorers/accuracy.py

```python
from typing import TypedDict


class TestCase(TypedDict):
    id: str
    input: str
    expected: str
    tags: list[str]


class AgentResponse(TypedDict):
    body: str          # Full text response
    tool_calls: list   # Any tool calls made
    token_count: int
    duration_ms: int


def score_accuracy(test_case: TestCase, response: AgentResponse) -> float:
    """
    Score the factual accuracy of the response against the expected answer.
    Returns 1.0 for correct, 0.5 for partial, 0.0 for incorrect.
    """
    expected = test_case["expected"].lower().strip()
    actual = response["body"].lower().strip()

    # Exact match
    if expected in actual:
        return 1.0

    # Partial match: most key words present
    key_words = [w for w in expected.split() if len(w) > 3]
    if key_words:  # guard against division by zero for short expected strings
        matches = sum(1 for w in key_words if w in actual)
        if matches / len(key_words) >= 0.8:
            return 0.5

    return 0.0
```
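Scorers are plain functions, so they can be exercised locally with hand-built cases before a full eval run. A standalone sketch (the scorer body mirrors scorers/accuracy.py above so this runs on its own):

```python
def score_accuracy(test_case: dict, response: dict) -> float:
    """Mirror of the scorers/accuracy.py logic, inlined for a local smoke test."""
    expected = test_case["expected"].lower().strip()
    actual = response["body"].lower().strip()
    if expected in actual:
        return 1.0
    key_words = [w for w in expected.split() if len(w) > 3]
    if key_words:
        matches = sum(1 for w in key_words if w in actual)
        if matches / len(key_words) >= 0.8:
            return 0.5
    return 0.0


# Hand-built cases modeled on the dataset above:
case = {"expected": "30-day full refund, no questions asked"}
good = {"body": "We offer a 30-day full refund, no questions asked."}
bad = {"body": "Refunds are handled case by case."}

print(score_accuracy(case, good))  # → 1.0
print(score_accuracy(case, bad))   # → 0.0
```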

For open-ended questions, use an LLM to grade responses:

scorers/llm_grader.py

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

GRADING_PROMPT = """
You are grading an AI assistant's response to a support question.

Question: {question}
Expected answer: {expected}
Actual response: {actual}

Grade the response on a scale from 0 to 10:
- 10: Perfect, complete, and accurate
- 7-9: Mostly correct with minor gaps
- 4-6: Partially correct but missing key information
- 1-3: Mostly incorrect or misleading
- 0: Completely wrong or harmful

Respond with only a JSON object: {{"score": <number>, "reason": "<brief explanation>"}}
"""


def grade_with_llm(test_case, response) -> float:
    prompt = GRADING_PROMPT.format(
        question=test_case["input"],
        expected=test_case["expected"],
        actual=response["body"],
    )
    result = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    output = json.loads(result["body"].read())
    grade = json.loads(output["content"][0]["text"])
    return grade["score"] / 10.0  # Normalize to 0.0–1.0
```
The following built-in scorers are available:

| Scorer name | Description |
| --- | --- |
| `exact_match` | Response contains the exact expected string (case-insensitive) |
| `contains_answer` | Response contains all key phrases from `expected` |
| `no_refusal` | Response doesn't contain refusal phrases ("I cannot", "I'm unable to") |
| `json_valid` | Response is valid JSON |
| `no_hallucination` | Response doesn't contain phrases that contradict the expected answer |
| `tool_called` | Agent called at least one tool (non-zero `tool_calls`) |
| `latency_sla` | Response completed within a threshold (set `threshold_ms` in scorer config) |
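To make the table concrete, here is how a `contains_answer`-style scorer might look. This is an illustration only — the platform's built-in implementation is not shown in these docs and may define "key phrases" differently (here, comma-separated segments of `expected`):

```python
def contains_answer(test_case: dict, response: dict) -> float:
    """Illustrative sketch: pass when every comma-separated key phrase
    from `expected` appears in the response, case-insensitively."""
    actual = response["body"].lower()
    phrases = [p.strip().lower() for p in test_case["expected"].split(",")]
    return 1.0 if all(p in actual for p in phrases if p) else 0.0


case = {"expected": "30-day full refund, no questions asked"}
resp = {"body": "Yes, a 30-day full refund applies, no questions asked."}
print(contains_answer(case, resp))  # → 1.0
```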
```sh
# Run the eval pack against your dev deployment
thinkwork eval run ./my-eval-pack -s dev

# Run only test cases with specific tags
thinkwork eval run ./my-eval-pack -s dev --tags billing,refund

# Run a single test case
thinkwork eval run ./my-eval-pack -s dev --id tc-001

# Dry run: print what would be run without invoking the agent
thinkwork eval run ./my-eval-pack -s dev --dry-run
```

Output during a run:

```
Running eval: support-bot-accuracy (32 test cases)
Concurrency: 5

✓ tc-001  [1.23s]  accuracy=1.0  format=1.0
✓ tc-002  [0.98s]  accuracy=1.0  format=1.0
✗ tc-003  [2.14s]  accuracy=0.0  format=1.0  ← FAIL
✓ tc-004  [1.05s]  accuracy=0.5  format=1.0
...

Results (32/32 complete):
  Pass rate:      87.5% (28/32)
  Mean accuracy:  0.84
  Mean format:    0.97
  Mean latency:   1,340ms (p50), 2,890ms (p95)

Results saved to: s3://dev-thinkwork-audit-logs/evals/support-bot-accuracy/2024-04-10T09-23-45Z/
```

Results are stored as JSON in S3 and surfaced in the admin app under Evals → Runs.

```sh
# Download results
aws s3 cp s3://dev-thinkwork-audit-logs/evals/support-bot-accuracy/2024-04-10T09-23-45Z/results.json .

# Query results with jq
jq '.cases[] | select(.scores.accuracy < 0.5) | {id, input, response: .response.body}' results.json
```

Results structure:

```json
{
  "evalPack": "support-bot-accuracy",
  "runId": "eval-run-abc123",
  "agentId": "agent-support",
  "startedAt": "2024-04-10T09:23:45Z",
  "completedAt": "2024-04-10T09:24:12Z",
  "summary": {
    "total": 32,
    "passed": 28,
    "failed": 4,
    "passRate": 0.875,
    "meanScores": {
      "accuracy": 0.84,
      "format": 0.97
    }
  },
  "cases": [
    {
      "id": "tc-001",
      "input": "How do I reset my password?",
      "expected": "Go to the login page and click 'Forgot Password'",
      "response": {
        "body": "To reset your password, navigate to the login page and click the 'Forgot Password' link below the sign-in form.",
        "durationMs": 1230,
        "tokenCount": 142
      },
      "scores": {
        "accuracy": 1.0,
        "format": 1.0
      },
      "passed": true
    }
  ]
}
```
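The same low-accuracy query from the jq example can be done in Python against this structure. A sketch using an inline sample trimmed to one failing case:

```python
import json

# Sample mirroring the results.json structure documented above.
results_json = """
{
  "summary": {"total": 32, "passed": 28, "failed": 4, "passRate": 0.875},
  "cases": [
    {"id": "tc-003", "input": "How do I export my data?",
     "scores": {"accuracy": 0.0, "format": 1.0}, "passed": false}
  ]
}
"""

results = json.loads(results_json)

# Cases with low accuracy scores, equivalent to the jq filter.
low_accuracy = [c["id"] for c in results["cases"] if c["scores"]["accuracy"] < 0.5]

print(low_accuracy)                    # → ['tc-003']
print(results["summary"]["passRate"])  # → 0.875
```

For a real run, replace `results_json` with the contents of the downloaded `results.json`.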

Run evals in your deployment pipeline:

.github/workflows/deploy.yml

```yaml
- name: Deploy to staging
  run: thinkwork deploy -s staging --auto-approve

- name: Run eval suite
  run: thinkwork eval run ./evals/support-bot-accuracy -s staging
  env:
    EVAL_FAIL_THRESHOLD: "0.80" # Fail CI if pass rate < 80%

- name: Promote to production (if evals pass)
  run: thinkwork deploy -s prod --auto-approve
```