# Eval Packs
Eval packs let you write systematic evaluations for your agents as code. A dataset is a set of test cases. A scorer is a Python function that grades each response. The CLI runs the eval and writes results to S3 for analysis.
Evals run via `EVAL-` threads — each test case becomes a real thread invocation against your live agent, so evals measure actual production behavior, including skill packs, knowledge bases, and memory.
## Eval pack structure

```
my-eval-pack/
├── eval.yaml          # Pack metadata and configuration
├── dataset.jsonl      # Test cases (one JSON object per line)
├── scorers/
│   ├── accuracy.py    # Custom scorer: checks factual accuracy
│   ├── format.py      # Custom scorer: checks response format
│   └── __init__.py
└── README.md
```
## eval.yaml

```yaml
name: support-bot-accuracy
version: "1.0.0"
description: "Evaluates Support Bot accuracy on tier-1 support questions"

agent_id: agent-support
dataset: dataset.jsonl

scorers:
  - name: exact_match
    type: built_in   # Use a built-in scorer
  - name: contains_answer
    type: built_in
  - name: accuracy
    type: custom
    module: scorers.accuracy
    function: score_accuracy
  - name: format_check
    type: custom
    module: scorers.format
    function: check_format

# How many test cases to run concurrently
concurrency: 5

# Timeout per test case (seconds)
timeout: 120

# Stop after this many consecutive failures
max_consecutive_failures: 3
```
## Dataset format

Each line in `dataset.jsonl` is a test case:
```jsonl
{"id": "tc-001", "input": "How do I reset my password?", "expected": "Go to the login page and click 'Forgot Password'", "tags": ["password", "auth"]}
{"id": "tc-002", "input": "What's your refund policy?", "expected": "30-day full refund, no questions asked", "tags": ["billing", "refund"]}
{"id": "tc-003", "input": "How do I export my data?", "expected": "Settings → Data Export → Download CSV", "tags": ["data", "export"]}
{"id": "tc-004", "input": "Is there a mobile app?", "expected": "Yes, available on iOS and Android", "tags": ["mobile", "product"]}
```

Fields:
| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique test case identifier |
| `input` | Yes | The user message to send to the agent |
| `expected` | No | Expected answer or ground truth (used by built-in scorers) |
| `context` | No | Additional context injected into the thread before the input |
| `tags` | No | Labels for filtering results |
| `metadata` | No | Arbitrary JSON metadata |
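Malformed dataset lines are easier to catch before a run than midway through one. The CLI may validate on its own, but a standalone check like the sketch below (not part of the toolkit; the field names come from the table above) flags invalid JSON, missing required fields, and duplicate ids early:

```python
import json

REQUIRED = {"id", "input"}
OPTIONAL = {"expected", "context", "tags", "metadata"}

def validate_dataset(path: str) -> list[str]:
    """Return a list of human-readable problems found in a dataset.jsonl file."""
    problems = []
    seen_ids = set()
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # Allow blank lines between cases
            try:
                case = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append(f"line {lineno}: invalid JSON ({e})")
                continue
            missing = REQUIRED - case.keys()
            if missing:
                problems.append(f"line {lineno}: missing fields {sorted(missing)}")
            unknown = case.keys() - REQUIRED - OPTIONAL
            if unknown:
                problems.append(f"line {lineno}: unknown fields {sorted(unknown)}")
            if case.get("id") in seen_ids:
                problems.append(f"line {lineno}: duplicate id {case['id']!r}")
            seen_ids.add(case.get("id"))
    return problems
```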
## Writing a custom scorer

A scorer is a Python function that receives the test case and the agent’s response and returns a score between 0.0 and 1.0.
```python
from typing import TypedDict

class TestCase(TypedDict):
    id: str
    input: str
    expected: str
    tags: list[str]

class AgentResponse(TypedDict):
    body: str           # Full text response
    tool_calls: list    # Any tool calls made
    token_count: int
    duration_ms: int

def score_accuracy(test_case: TestCase, response: AgentResponse) -> float:
    """
    Score the factual accuracy of the response against the expected answer.
    Returns 1.0 for correct, 0.5 for partial, 0.0 for incorrect.
    """
    expected = test_case["expected"].lower().strip()
    actual = response["body"].lower().strip()

    # Exact match
    if expected in actual:
        return 1.0

    # Partial match: at least 80% of key words present
    key_words = [w for w in expected.split() if len(w) > 3]
    matches = sum(1 for w in key_words if w in actual)
    if key_words and matches / len(key_words) >= 0.8:
        return 0.5

    return 0.0
```
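Before wiring a scorer into a pack, it is worth driving it directly with hand-built dicts. A minimal harness sketch (the function is inlined here so the snippet runs standalone; in a real pack you would import it from `scorers/accuracy.py`):

```python
def score_accuracy(test_case: dict, response: dict) -> float:
    """Same logic as scorers/accuracy.py, inlined for a standalone check."""
    expected = test_case["expected"].lower().strip()
    actual = response["body"].lower().strip()
    if expected in actual:
        return 1.0
    key_words = [w for w in expected.split() if len(w) > 3]
    matches = sum(1 for w in key_words if w in actual)
    if key_words and matches / len(key_words) >= 0.8:
        return 0.5
    return 0.0

case = {"id": "tc-001", "input": "How do I reset my password?",
        "expected": "Go to the login page and click 'Forgot Password'"}

# Exact containment of the expected answer scores 1.0
full = {"body": "Go to the login page and click 'Forgot Password' to continue."}
# Most key words present (but not the full phrase) scores 0.5
partial = {"body": "Open the login page and use the 'Forgot Password' link."}
# Off-topic answer scores 0.0
miss = {"body": "Please contact support for help."}

print(score_accuracy(case, full), score_accuracy(case, partial), score_accuracy(case, miss))
# → 1.0 0.5 0.0
```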
## Scorer with LLM grading

For open-ended questions, use an LLM to grade responses:
```python
import boto3
import json

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

GRADING_PROMPT = """You are grading an AI assistant's response to a support question.

Question: {question}
Expected answer: {expected}
Actual response: {actual}

Grade the response on a scale from 0 to 10:
- 10: Perfect, complete, and accurate
- 7-9: Mostly correct with minor gaps
- 4-6: Partially correct but missing key information
- 1-3: Mostly incorrect or misleading
- 0: Completely wrong or harmful

Respond with only a JSON object: {{"score": <number>, "reason": "<brief explanation>"}}"""

def grade_with_llm(test_case, response) -> float:
    prompt = GRADING_PROMPT.format(
        question=test_case["input"],
        expected=test_case["expected"],
        actual=response["body"],
    )

    result = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 256,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )

    output = json.loads(result["body"].read())
    grade = json.loads(output["content"][0]["text"])
    return grade["score"] / 10.0  # Normalize to 0.0–1.0
```
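Models do not always return bare JSON, and `grade["score"]` can be missing or out of range. A defensive parsing helper like this sketch (the fallback-to-0.0 policy is an assumption, not part of the toolkit) keeps one malformed grade from crashing a whole run:

```python
import json
import re

def parse_grade(text: str) -> float:
    """
    Extract a {"score": N, ...} object from model output and normalize
    it to 0.0-1.0. Tolerates prose around the JSON and clamps the score.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return 0.0  # No JSON at all: treat as a failed grade rather than crash
    try:
        grade = json.loads(match.group(0))
        score = float(grade["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0  # Malformed JSON or missing/non-numeric score
    return min(max(score / 10.0, 0.0), 1.0)
```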
## Built-in scorers

| Scorer name | Description |
|---|---|
| `exact_match` | Response contains the exact expected string (case-insensitive) |
| `contains_answer` | Response contains all key phrases from `expected` |
| `no_refusal` | Response doesn’t contain refusal phrases (“I cannot”, “I’m unable to”) |
| `json_valid` | Response is valid JSON |
| `no_hallucination` | Response doesn’t contain phrases that contradict the expected answer |
| `tool_called` | Agent called at least one tool (non-empty `tool_calls`) |
| `latency_sla` | Response completed within a threshold (set `threshold_ms` in scorer config) |
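The table describes behavior, not implementation. If the exact semantics matter for your dataset, a few of these are easy to approximate locally (a sketch of plausible implementations; the actual built-ins may differ in details such as phrase extraction or punctuation handling):

```python
import json

def exact_match(test_case: dict, response: dict) -> float:
    """Case-insensitive: the full expected string appears in the response."""
    return 1.0 if test_case["expected"].lower() in response["body"].lower() else 0.0

def no_refusal(test_case: dict, response: dict) -> float:
    """Fails if the response contains a known refusal phrase."""
    refusals = ("i cannot", "i'm unable to", "i am unable to")
    body = response["body"].lower()
    return 0.0 if any(p in body for p in refusals) else 1.0

def json_valid(test_case: dict, response: dict) -> float:
    """Passes only if the whole response body parses as JSON."""
    try:
        json.loads(response["body"])
        return 1.0
    except json.JSONDecodeError:
        return 0.0
```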
## Running evals

```sh
# Run the eval pack against your dev deployment
thinkwork eval run ./my-eval-pack -s dev

# Run only test cases with specific tags
thinkwork eval run ./my-eval-pack -s dev --tags billing,refund

# Run a single test case
thinkwork eval run ./my-eval-pack -s dev --id tc-001

# Dry run: print what would be run without invoking the agent
thinkwork eval run ./my-eval-pack -s dev --dry-run
```

Output during a run:
```
Running eval: support-bot-accuracy (32 test cases)
Concurrency: 5

  ✓ tc-001  [1.23s]  accuracy=1.0  format=1.0
  ✓ tc-002  [0.98s]  accuracy=1.0  format=1.0
  ✗ tc-003  [2.14s]  accuracy=0.0  format=1.0  ← FAIL
  ✓ tc-004  [1.05s]  accuracy=0.5  format=1.0
  ...

Results (32/32 complete):
  Pass rate: 87.5% (28/32)
  Mean accuracy: 0.84
  Mean format: 0.97
  Mean latency: 1,340ms (p50), 2,890ms (p95)

Results saved to: s3://dev-thinkwork-audit-logs/evals/support-bot-accuracy/2024-04-10T09-23-45Z/
```
## Viewing eval results

Results are stored as JSON in S3 and surfaced in the admin app under **Evals → Runs**.
```sh
# Download results
aws s3 cp s3://dev-thinkwork-audit-logs/evals/support-bot-accuracy/2024-04-10T09-23-45Z/results.json .

# Query results with jq
jq '.cases[] | select(.scores.accuracy < 0.5) | {id, input, response: .response.body}' results.json
```

Results structure:
```json
{
  "evalPack": "support-bot-accuracy",
  "runId": "eval-run-abc123",
  "agentId": "agent-support",
  "startedAt": "2024-04-10T09:23:45Z",
  "completedAt": "2024-04-10T09:24:12Z",
  "summary": {
    "total": 32,
    "passed": 28,
    "failed": 4,
    "passRate": 0.875,
    "meanScores": { "accuracy": 0.84, "format": 0.97 }
  },
  "cases": [
    {
      "id": "tc-001",
      "input": "How do I reset my password?",
      "expected": "Go to the login page and click 'Forgot Password'",
      "response": {
        "body": "To reset your password, navigate to the login page and click the 'Forgot Password' link below the sign-in form.",
        "durationMs": 1230,
        "tokenCount": 142
      },
      "scores": { "accuracy": 1.0, "format": 1.0 },
      "passed": true
    }
  ]
}
```
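The per-case `scores` map makes triage scriptable. For example, a sketch (a hypothetical helper, using only the fields shown above) that lists each failing case alongside its weakest scorer:

```python
import json

def summarize_failures(results: dict) -> list[dict]:
    """For each failing case, report the scorer that dragged it down."""
    rows = []
    for case in results["cases"]:
        if case["passed"]:
            continue
        worst = min(case["scores"], key=case["scores"].get)
        rows.append({
            "id": case["id"],
            "worst_scorer": worst,
            "score": case["scores"][worst],
        })
    return rows

# Usage: summarize_failures(json.load(open("results.json")))
```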
## CI integration

Run evals in your deployment pipeline:
```yaml
- name: Deploy to staging
  run: thinkwork deploy -s staging --auto-approve

- name: Run eval suite
  run: thinkwork eval run ./evals/support-bot-accuracy -s staging
  env:
    EVAL_FAIL_THRESHOLD: "0.80"   # Fail CI if pass rate < 80%

- name: Promote to production (if evals pass)
  run: thinkwork deploy -s prod --auto-approve
```
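If your pipeline cannot rely on the CLI's `EVAL_FAIL_THRESHOLD` handling, the same gate is easy to reproduce from `results.json` (a sketch assuming the results layout shown earlier; the script name and wiring are hypothetical):

```python
import json
import os
import sys

def passes_gate(results: dict, threshold: float) -> bool:
    """True when the run's pass rate meets the CI threshold."""
    return results["summary"]["passRate"] >= threshold

if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_gate.py results.json
    threshold = float(os.environ.get("EVAL_FAIL_THRESHOLD", "0.80"))
    with open(sys.argv[1]) as f:
        results = json.load(f)
    rate = results["summary"]["passRate"]
    if not passes_gate(results, threshold):
        print(f"Eval gate failed: pass rate {rate:.1%} < threshold {threshold:.0%}")
        sys.exit(1)
    print(f"Eval gate passed: pass rate {rate:.1%}")
```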