On-call notes
When a compliance alarm fires, find it in the table below and follow the linked playbook. For background read overview. For operator-side procedures (export request flow, GOVERNANCE→COMPLIANCE flip, password rotation) read operator-runbook.
Alarm → playbook
| Alarm | Playbook |
|---|---|
| thinkwork-{stage}-compliance-anchor-dlq-depth (or U8a/U8b equivalent) | Anchor DLQ depth non-zero |
| thinkwork-{stage}-compliance-exports-dlq-depth | Exports DLQ depth non-zero |
| thinkwork-{stage}-compliance-anchor-watchdog-heartbeat-missing | Anchor watchdog heartbeat missing |
| thinkwork-{stage}-compliance-anchor-gap | Anchor gap too large |
| (CI gate, not CloudWatch) deploy fails on drift gate | Drift gate fails on deploy |
| (No alarm; surface via DB monitoring) audit_outbox growing unbounded | audit_outbox runaway |
| (No alarm; surface via Strands runtime errors) audit emits silently failing | Strands runtime emit silent failure |
Anchor DLQ depth non-zero
Symptom: SQS queue thinkwork-{stage}-compliance-anchor-dlq has at least one message visible.
Likely cause: The anchor Lambda crashed on a scheduled invocation. EventBridge Scheduler routed the failure to the DLQ via the aws_lambda_function_event_invoke_config.compliance_anchor destination_config.on_failure block. MaximumRetryAttempts=0 on the function means a single failure goes straight to DLQ.
Resolution:
- Open CloudWatch logs for thinkwork-{stage}-api-compliance-anchor. Find the most recent ERROR-level entries around the SQS message’s SentTimestamp.
- Diagnose the failure class:
  - Aurora connection drop / KMS rate limit / S3 throttle: transient. The next 15-minute cadence will succeed; the chain self-heals. Purge the DLQ message.
  - Code regression (handler exception, missing env var): patch first. Redeploy. Then purge.
  - WORM write rejected (Object Lock retention shortened, KMS key denied): check IAM and bucket policy. Anchor objects already written are immutable; the next cadence picks up where this one failed.
- The anchor Lambda is idempotent on cadence_id: cadence_id is a deterministic hash of the chain-head fingerprint, so re-running for the same chain head produces the same anchor. Do not replay anchor DLQ messages by hand; the next scheduled tick will produce the right anchor at the right cadence boundary.
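The idempotency claim above can be illustrated with a minimal sketch. It assumes, purely for illustration, that cadence_id is a SHA-256 digest of the fingerprint string and that anchors live in a keyed store; the real derivation and storage in compliance-anchor.ts may differ, and write_anchor is a hypothetical helper.

```python
import hashlib


def cadence_id(chain_head_fingerprint: str) -> str:
    """Hypothetical derivation: SHA-256 of the chain-head fingerprint."""
    return hashlib.sha256(chain_head_fingerprint.encode("utf-8")).hexdigest()


def write_anchor(store: dict, chain_head_fingerprint: str) -> bool:
    """Write an anchor keyed by cadence_id.

    Returns True only when a new anchor was created; a repeat run for the
    same chain head finds the existing key and changes nothing.
    """
    key = cadence_id(chain_head_fingerprint)
    if key in store:
        return False
    store[key] = chain_head_fingerprint
    return True


anchors = {}  # stand-in for the anchor bucket
first_run = write_anchor(anchors, "head-abc")
second_run = write_anchor(anchors, "head-abc")  # same chain head, no new anchor
```

Because the key depends only on the chain head, a manual replay adds nothing that the next scheduled tick would not produce anyway.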
Anchor Lambda: packages/lambda/compliance-anchor.ts.
Exports DLQ depth non-zero
Symptom: SQS queue thinkwork-{stage}-compliance-exports-dlq has at least one message visible.
Likely cause: The export runner Lambda crashed (handler exception, env var unset, IAM regression) on a real export request. The runner’s design path for business failures writes failed to compliance.export_jobs and returns SQS success — so DLQ messages are reserved for handler crashes.
Resolution:
- Inspect a DLQ message body; it should be {"jobId": "<uuidv7>"}.
- Look up the row in compliance.export_jobs:

      SELECT job_id, status, started_at, completed_at, job_error
      FROM compliance.export_jobs
      WHERE job_id = '<uuidv7>';

- Decision tree by current status:
  - running for >15 minutes: runner crashed mid-stream. Mark failed:

        UPDATE compliance.export_jobs
        SET status = 'failed',
            job_error = 'runner crashed; see DLQ',
            completed_at = now()
        WHERE job_id = '<uuidv7>'
          AND status = 'running';

    Purge the DLQ message. The operator then submits a fresh export with the same filter.
  - failed: runner already wrote the failure before exiting. DLQ message is redundant. Purge.
  - queued: the runner never claimed the job (CAS guard didn’t fire). The Lambda likely crashed at boot (env var missing, Aurora connection failed at module load). Check CloudWatch logs for the function. After fixing, replay is safe because the CAS guard makes re-delivery a no-op when the job is no longer queued; but typically the underlying problem (env or IAM) needs a code/Terraform fix first.
- Once the root cause is fixed and any stuck rows are reconciled, purge the DLQ.
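The decision tree above can be written down as a small lookup, which may help when triaging several DLQ messages in a row. This is a sketch only: triage_export_job is a hypothetical helper, and the "wait" and "escalate" branches are assumptions for cases the runbook does not cover.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

STUCK_AFTER = timedelta(minutes=15)  # threshold taken from the runbook


def triage_export_job(status: str, started_at: Optional[datetime], now: datetime) -> str:
    """Map an export_jobs row to the runbook action for its DLQ message."""
    if status == "running":
        if started_at is not None and now - started_at > STUCK_AFTER:
            return "mark failed, purge DLQ message, operator resubmits"
        # Assumption: a job running for under 15 minutes is left alone.
        return "wait: job may still be in progress"
    if status == "failed":
        return "purge DLQ message (redundant)"
    if status == "queued":
        return "fix boot failure (env/IAM), then replay or purge"
    # Assumption: any other status is outside the runbook.
    return "unrecognized status: escalate"


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
stuck = triage_export_job("running", now - timedelta(minutes=20), now)
fresh = triage_export_job("running", now - timedelta(minutes=5), now)
```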
Runner Lambda: packages/lambda/compliance-export-runner.ts. DLQ + alarm provisioned by terraform/modules/app/lambda-api/handlers.tf.
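For readers unfamiliar with the CAS guard mentioned in the queued case, the sketch below shows the general compare-and-set pattern using an in-memory SQLite table. The table shape and the claim_job helper are illustrative assumptions; the real runner targets compliance.export_jobs in Aurora and may differ in detail.

```python
import sqlite3


def claim_job(conn: sqlite3.Connection, job_id: str) -> bool:
    """Claim a job only if it is still queued (compare-and-set).

    The WHERE clause is the guard: when the row has already left 'queued',
    the UPDATE matches zero rows and the delivery becomes a no-op.
    """
    cur = conn.execute(
        "UPDATE export_jobs SET status = 'running' "
        "WHERE job_id = ? AND status = 'queued'",
        (job_id,),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE export_jobs (job_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO export_jobs VALUES ('job-1', 'queued')")

first_delivery = claim_job(conn, "job-1")  # claims the job
redelivery = claim_job(conn, "job-1")      # guard rejects the replay
```

This is why replaying a DLQ message for a job that is no longer queued does no harm.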
Anchor watchdog heartbeat missing
Symptom: Alarm thinkwork-{stage}-compliance-anchor-watchdog-heartbeat-missing fires (ComplianceAnchorWatchdogHeartbeat metric missing for 2+ evaluation periods of 5 minutes each).
Likely cause (in priority order):
- Watchdog Lambda dead. Check CloudWatch logs for thinkwork-{stage}-api-compliance-anchor-watchdog. Schedule should fire every 5 minutes via EventBridge.
- CloudWatch metric publish failing. The watchdog uses cloudwatch:PutMetricData scoped to namespace Thinkwork/Compliance (per the IAM policy in terraform/modules/data/compliance-audit-bucket/main.tf). A namespace condition mismatch silently drops metrics.
- EventBridge Scheduler disabled or misconfigured. Confirm the schedule is Enabled in the AWS console.
Resolution:
- Open CloudWatch logs for the watchdog. If logs show recent invocations and the metric publish call but the alarm still fires, suspect a metric-name typo or namespace mismatch.
- If logs show no recent invocations, suspect the schedule. Re-enable it in the console; if the scheduled trigger is correct but the Lambda doesn’t fire, check that the Lambda’s IAM trust policy still allows events.amazonaws.com AssumeRole.
- The watchdog itself has an explicit Deny on s3:DeleteObject, s3:PutObject, etc. against the anchor bucket (per U8b’s sibling-role design; see terraform/modules/data/compliance-audit-bucket/main.tf). A heartbeat failure does NOT mean the watchdog is mutating data; the worst case is a silent observability gap.
Watchdog Lambda: packages/lambda/compliance-anchor-watchdog.ts.
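When reading watchdog logs, it can help to check by hand whether the heartbeat timestamps you see would explain the alarm. The sketch below is a rough local approximation of "missing for 2 periods of 5 minutes"; it uses a sliding window rather than CloudWatch's period-aligned evaluation, so treat it as a sanity check, not a model of the alarm. heartbeat_missing is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable

PERIOD = timedelta(minutes=5)   # evaluation period from the alarm description
MISSING_PERIODS_TO_ALARM = 2    # "2+ evaluation periods"


def heartbeat_missing(heartbeats: Iterable[datetime], now: datetime) -> bool:
    """True when no heartbeat falls within the last two periods before now."""
    window_start = now - PERIOD * MISSING_PERIODS_TO_ALARM
    return not any(window_start < ts <= now for ts in heartbeats)


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
silent = heartbeat_missing([now - timedelta(minutes=20)], now)
alive = heartbeat_missing([now - timedelta(minutes=4)], now)
```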
Anchor gap too large
Symptom: Alarm fires when more than 30 minutes have elapsed since the last successful anchor (the watchdog publishes ComplianceAnchorGap, measuring now - last_anchored_recorded_at).
Likely cause:
- Anchor Lambda dead for ≥2 cadences. Check the anchor Lambda’s CloudWatch logs first.
- Cadence drift: the Lambda is alive but invocations are running >15 min, so the next one is delayed. Check duration metrics.
- Deploy window: if a terraform-apply was running at the cadence boundary, the schedule may have been briefly disabled. False positive; will self-resolve at the next 15-min tick.
Resolution:
- Check anchor Lambda logs + duration metrics. If logs are empty for >30 min, suspect schedule / Lambda failure path; jump to Anchor DLQ depth non-zero.
- If logs show successful anchors but last_anchored_recorded_at isn’t moving forward, the drainer is the bottleneck: check the compliance.audit_outbox row count (see audit_outbox runaway). The anchor only advances when the drainer publishes new chain heads.
- Resolution is the same as anchor-DLQ recovery: fix root cause, let the next cadence self-heal. The anchor catches up automatically.
The 30-min threshold is the master plan’s documented detection target (“attacker disables job” risk row). Anchor cadence is rate(15min); two missed cadences cross the threshold.
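The relationship between the gap, the 15-minute cadence, and the 30-minute threshold is simple arithmetic, sketched below. The function names are hypothetical; only the two durations and the now - last_anchored_recorded_at definition come from this page.

```python
from datetime import datetime, timedelta, timezone

CADENCE = timedelta(minutes=15)        # anchor schedule
GAP_THRESHOLD = timedelta(minutes=30)  # alarm threshold


def anchor_gap(last_anchored_recorded_at: datetime, now: datetime) -> timedelta:
    """Gap as the watchdog defines it: now - last_anchored_recorded_at."""
    return now - last_anchored_recorded_at


def cadences_elapsed(gap: timedelta) -> int:
    """Whole cadences that have passed since the last anchor."""
    return gap // CADENCE  # timedelta // timedelta yields an int


def gap_alarm(gap: timedelta) -> bool:
    """True once the gap exceeds the 30-minute threshold."""
    return gap > GAP_THRESHOLD


last = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
gap = anchor_gap(last, last + timedelta(minutes=31))
```

With a last anchor at 12:00, the ticks at 12:15 and 12:30 must both be missed before the gap can pass 30 minutes, which matches the two-cadence reasoning above.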
Drift gate fails on deploy
Symptom: GitHub Actions deploy job fails on the drift-gate step with output like:

    ERROR: hand-rolled migration 0074_compliance_event_hash_index.sql declares -- creates: compliance.idx_audit_events_event_hash
    but the object is missing on dev.

Likely cause: A PR added or modified a hand-rolled migration under packages/database-pg/drizzle/00NN_*.sql (one not registered in meta/_journal.json) but the operator did not apply it to dev before merging. The drift gate runs pnpm db:migrate-manual post-Terraform-apply; missing objects fail the deploy.
Resolution:
- Confirm which migration: pnpm db:migrate-manual --stage dev outputs the gap.
- Resolve dev DATABASE_URL:

      aws secretsmanager get-secret-value \
        --region us-east-1 \
        --secret-id thinkwork-dev-db-credentials \
        --query SecretString --output text

  Construct the URL: postgresql://<username>:<URL-encoded-password>@<host>:<port>/<dbname>?sslmode=require. The ! character URL-encodes to %21 if present in the password.
- Apply the migration: psql "$DATABASE_URL" -f packages/database-pg/drizzle/00NN_<name>.sql.
- Re-run the failed GitHub Actions job. The drift gate should now pass.
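Hand-encoding the password is error-prone, so the URL construction step can be scripted. The sketch below assumes the secret's SecretString is a JSON object with username, password, host, port, and dbname keys; that shape is an assumption, so check the actual secret before relying on it. build_database_url is a hypothetical helper.

```python
import json
from urllib.parse import quote


def build_database_url(secret_string: str) -> str:
    """Build a Postgres URL from a secret JSON string (key names assumed)."""
    secret = json.loads(secret_string)
    # safe="" percent-encodes every reserved character, including ! and @.
    user = quote(secret["username"], safe="")
    password = quote(secret["password"], safe="")
    return (
        f"postgresql://{user}:{password}"
        f"@{secret['host']}:{secret['port']}/{secret['dbname']}?sslmode=require"
    )


# Made-up values for illustration only.
example_secret = json.dumps({
    "username": "admin_user",
    "password": "p@ss!word",
    "host": "db.example.internal",
    "port": 5432,
    "dbname": "thinkwork",
})
url = build_database_url(example_secret)
```

The output of the aws secretsmanager command can be passed to build_database_url directly, and the result exported as DATABASE_URL.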
The drift gate exists because hand-rolled migrations (those with explicit -- creates: markers) are deliberately outside Drizzle’s db:push scope — they encode constraints (CHECK, partial indexes, FK ordering) that Drizzle doesn’t represent cleanly. The gate ensures the production migration log stays consistent.
Reference: feedback_handrolled_migrations_apply_to_dev. Drift gate temporarily disabled in #905; re-enable when stable.
audit_outbox runaway
Symptom: compliance.audit_outbox row count grows unbounded; queries against it slow down; chain advancement stalls (the anchor’s last_anchored_recorded_at doesn’t move).
Likely cause: The drainer Lambda is dead or stuck. Healthy drainer holds row count near zero (FOR UPDATE SKIP LOCKED + UPDATE drained_at runs every 5 seconds; reserved-concurrency=1 guarantees no parallel drainers).
Resolution:
- Confirm with a row count:

      SELECT count(*) FROM compliance.audit_outbox WHERE drained_at IS NULL;

  Healthy: less than 100. Concerning: more than 10,000. Critical: more than 100,000 (index degradation impacts emit-side performance).
- Check drainer Lambda CloudWatch logs. Look for ERROR-level entries; these are usually:
  - Aurora reader credential expired: rotate via STAGE=<stage> bash scripts/bootstrap-compliance-roles.sh. The drainer reads as compliance_drainer; secret name is thinkwork-{stage}-compliance-drainer-credentials.
  - Reserved-concurrency exhausted: another invocation is hung. Check the Lambda’s “Throttles” metric; if non-zero, force-stop hung executions via the AWS console.
  - Hash-chain integrity failure: the drainer panics if prev_hash of the next outbox row doesn’t match the current chain head in audit_events. This indicates a manual mutation of audit_events or a race in concurrent drainers (which the reserved-concurrency=1 guarantee should prevent). Stop the drainer; investigate the chain manually.
- Once the drainer is healthy, it self-heals: the FOR UPDATE SKIP LOCKED loop drains the backlog at ~1000 rows / second.
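The row-count thresholds and the drain rate above combine into a quick estimate of severity and recovery time. In the sketch below the function names are hypothetical, and the "elevated" label for counts between 100 and 10,000 is an assumption, since the runbook does not name that range.

```python
HEALTHY_BELOW = 100             # "Healthy: less than 100"
CONCERNING_ABOVE = 10_000       # "Concerning: more than 10,000"
CRITICAL_ABOVE = 100_000        # "Critical: more than 100,000"
DRAIN_ROWS_PER_SECOND = 1000    # approximate rate quoted above


def classify_backlog(undrained_rows: int) -> str:
    """Bucket the undrained row count using the runbook thresholds."""
    if undrained_rows > CRITICAL_ABOVE:
        return "critical"
    if undrained_rows > CONCERNING_ABOVE:
        return "concerning"
    if undrained_rows < HEALTHY_BELOW:
        return "healthy"
    return "elevated"  # assumption: the runbook leaves this range unnamed


def estimated_drain_seconds(undrained_rows: int) -> float:
    """Rough time to clear the backlog once the drainer is healthy."""
    return undrained_rows / DRAIN_ROWS_PER_SECOND


severity = classify_backlog(250_000)
seconds_to_drain = estimated_drain_seconds(250_000)
```

At the quoted rate, even a critical backlog of 250,000 rows should clear in roughly four minutes, so a backlog that is not shrinking after the fix suggests the drainer is still unhealthy.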
Drainer Lambda: packages/lambda/compliance-outbox-drainer.ts. Drainer DLQ: thinkwork-{stage}-compliance-drainer-dlq.
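The hash-chain integrity failure described above follows a common append-only chain pattern, sketched here for orientation. The hashing scheme (SHA-256 over prev_hash plus payload) and both function names are assumptions for illustration; the real drainer's hash inputs are not documented on this page.

```python
import hashlib


def event_hash(prev_hash: str, payload: str) -> str:
    """Hypothetical chain hash: SHA-256 over prev_hash plus payload."""
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_from_outbox(chain_head: str, outbox_row: dict) -> str:
    """Append one outbox row and return the new chain head.

    Raises ValueError when the row's prev_hash does not match the current
    head, which is the integrity failure the runbook describes.
    """
    if outbox_row["prev_hash"] != chain_head:
        raise ValueError("hash-chain integrity failure: prev_hash != chain head")
    return event_hash(chain_head, outbox_row["payload"])


new_head = append_from_outbox(
    "genesis", {"prev_hash": "genesis", "payload": "event-1"}
)
```

A row that still points at an older head after the chain has advanced is rejected, which is how a manual mutation or a concurrent drainer would surface.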
Strands runtime emit silent failure
Symptom: Audit events from Strands runtime (folder-specialist operations, tool invocations from Python containers) stop appearing in audit_events while Yoga-emitted events continue normally. Strands logs show no obvious errors.
Likely cause: Env-var shadowing on the Strands container. The ComplianceClient reads THINKWORK_API_URL + API_AUTH_SECRET once at module load; if Lambda’s warm-container update path re-injects these env vars mid-process, a re-read mid-handler can pick up stale values that don’t match the live API endpoint.
This was observed on AgentCore deploys mid-2026 — see project_agentcore_deploy_race_env.
Resolution:
- Check compliance-events REST handler logs for failed bearer validations. If you see auth failures from Strands, the API_AUTH_SECRET env var on the Strands container is stale.
- Force a warm pool flush on the Strands AgentCore Endpoint. The DEFAULT endpoint does not expose a flush API (project_agentcore_default_endpoint_no_flush); the 15-minute reconciler is the only flush mechanism.
- Verify the Python ComplianceClient snapshots env at coroutine entry, not on each call. The fix landed in U6 but is fragile to refactor: if a contributor inadvertently moves the env read into a per-request function, audit emits silently break on the next AgentCore deploy.
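When reviewing the client for the regression described in the last step, the pattern to look for is a single env read whose result is passed around as an immutable value. The sketch below is not the actual ComplianceClient; the ComplianceEnv class, the constructor shape, and the Bearer header format are all assumptions used to show the snapshot idea.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ComplianceEnv:
    """Immutable copy of the two env vars, taken once at entry."""
    api_url: str
    auth_secret: str

    @classmethod
    def snapshot(cls) -> "ComplianceEnv":
        return cls(
            api_url=os.environ["THINKWORK_API_URL"],
            auth_secret=os.environ["API_AUTH_SECRET"],
        )


class ComplianceClient:
    """Hypothetical client that never re-reads os.environ after construction."""

    def __init__(self, env: ComplianceEnv) -> None:
        self._env = env

    def auth_header(self) -> dict:
        # Uses the snapshot; a per-call os.environ read here would be the regression.
        return {"Authorization": f"Bearer {self._env.auth_secret}"}


os.environ["THINKWORK_API_URL"] = "https://api.example.invalid"
os.environ["API_AUTH_SECRET"] = "secret-v1"
client = ComplianceClient(ComplianceEnv.snapshot())

os.environ["API_AUTH_SECRET"] = "secret-v2"  # simulated mid-process re-injection
header = client.auth_header()                # still built from the snapshot
```

A quick review check is to search the client for os.environ reads outside the snapshot function.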
Cross-runtime emit reference: developer-guide.md → Cross-runtime emit path.
Irreversibility warnings
These operations are NOT autonomous on-call decisions. Page the operator-tier owner first:
- GOVERNANCE → COMPLIANCE Object Lock cutover on the anchor bucket. Once any object is written under COMPLIANCE retention, that object cannot be deleted, and its retention cannot be shortened, by anyone (including AWS root) until retention expires (default 365 days). A wrong-stage flip is an unrecoverable disaster. See operator-runbook.md → GOVERNANCE → COMPLIANCE flip.
- TRUNCATE / DELETE on compliance.audit_events. The schema’s immutability triggers (defined in packages/database-pg/drizzle/0069_compliance_schema.sql) raise EXCEPTION on any DELETE / TRUNCATE attempt. If a privileged role bypasses the trigger, the chain is broken; auditors lose verification ability.
- Lowering var.retention_days on the anchor bucket while in COMPLIANCE mode. Object Lock COMPLIANCE only honors retention extensions, not reductions. Lowering retention has no effect on existing objects; new objects get the new (lower) value, producing inconsistent retention across the chain.
- Deleting a row from compliance.export_jobs. Loses the data.export_initiated audit trail’s downstream artifact. Failed exports stay in the table for traceability. Never DELETE; mark failed and let the row persist.
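The retention rule behind the first and third warnings can be stated as a small predicate. This is a simplified model of S3 Object Lock behavior for a single object, not a call into any AWS API; retention_change_allowed and its parameters are hypothetical names.

```python
from datetime import datetime, timezone


def retention_change_allowed(
    mode: str,
    current_until: datetime,
    requested_until: datetime,
    has_governance_bypass: bool = False,
) -> bool:
    """Simplified model of changing one object's retain-until date."""
    if requested_until >= current_until:
        return True  # extensions are accepted in either mode
    if mode == "GOVERNANCE":
        # Shortening needs the bypass permission (s3:BypassGovernanceRetention).
        return has_governance_bypass
    return False  # COMPLIANCE: no one can shorten, not even the root user


current = datetime(2027, 1, 1, tzinfo=timezone.utc)
shorter = datetime(2026, 6, 1, tzinfo=timezone.utc)
compliance_shorten = retention_change_allowed("COMPLIANCE", current, shorter)
governance_shorten = retention_change_allowed("GOVERNANCE", current, shorter, True)
```

The asymmetry between the two modes is the reason the cutover cannot be undone once the first COMPLIANCE object is written.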
Where to escalate
- Operator-tier action needed → operator-runbook.
- Code regression suspected → page the compliance-module owner; revert the suspected PR; redeploy.
- Auditor engagement is in progress and an alarm is firing → page the engagement lead; document the alarm + resolution in the engagement evidence ZIP.
- Master plan reference for any “is this expected behavior?” question → docs/plans/2026-05-06-011-feat-compliance-audit-event-log-plan.md.