Operator runbook
How to do things to the running compliance module. For background on why the module exists, read overview. For the architectural shape, read architecture. When an alarm fires, jump to oncall.
Procedures
- Inspect compliance events in admin
- Request a compliance export
- Apply a hand-rolled compliance migration to dev before merging
- Bootstrap or rotate Aurora compliance role passwords
- Flip S3 Object Lock GOVERNANCE → COMPLIANCE for an audit engagement
- Drain the compliance-anchor DLQ
- Drain the compliance-exports DLQ
- Re-run a failed export
Inspect compliance events in admin
When to use this: You need to see what audit events the platform recorded for a tenant or time window, for an internal review or to prepare for an auditor walkthrough.
- Sign in to admin as an operator (your email must be in THINKWORK_PLATFORM_OPERATOR_EMAILS on the graphql-http Lambda).
- Click Compliance in the sidebar (between Settings and Billing).
- The list page defaults to the last 7 days. Use the filter bar to narrow by event_type, actor_type, or a custom since/until window.
- Click any row to open the event detail page. The detail page shows:
  - Event metadata (event_id, occurred_at, recorded_at, actor, source).
  - Chain position panel: the event’s own hash + previous-event hash (clickable to walk backward).
  - Anchor status: ANCHORED with cadence_id once the next 15-minute cadence completes; PENDING in the meantime.
  - Payload (redacted per the per-event-type allow-list at write time; see developer-guide).
- To browse cross-tenant (operators only): toggle Cross-tenant view in the filter bar. Off by default; the toggle adds the tenant Combobox to the filter row.
The /compliance read API source: packages/api/src/graphql/resolvers/compliance/query.ts.
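The backward walk offered by the chain position panel amounts to following previous-event hashes until the chain starts or a predecessor is not available. The sketch below illustrates that idea only; the field names (eventId, hash, prevHash), the null marker for the first event, and the in-memory lookup are assumptions, not the shape of the real read API.

```typescript
// Hypothetical event shape; the real field names in the read API may differ.
interface ChainEvent {
  eventId: string;
  hash: string;
  prevHash: string | null; // ASSUMPTION: null marks the first event in the chain
}

// Follow prevHash pointers from `startHash`, newest to oldest.
// Stops at the chain start, at a hash missing from the lookup, or after `maxSteps`.
function walkBackward(
  byHash: Record<string, ChainEvent>,
  startHash: string,
  maxSteps: number = 100,
): ChainEvent[] {
  const visited: ChainEvent[] = [];
  let cursor: string | null = startHash;
  while (cursor !== null && visited.length < maxSteps) {
    const current: ChainEvent | undefined = byHash[cursor];
    if (current === undefined) break; // gap: predecessor not loaded
    visited.push(current);
    cursor = current.prevHash;
  }
  return visited;
}

// Three linked events, keyed by their own hash (values are placeholders).
const sampleChain: Record<string, ChainEvent> = {
  h3: { eventId: "e3", hash: "h3", prevHash: "h2" },
  h2: { eventId: "e2", hash: "h2", prevHash: "h1" },
  h1: { eventId: "e1", hash: "h1", prevHash: null },
};

// Walking back from the newest event visits e3, then e2, then e1.
const walked = walkBackward(sampleChain, "h3").map((e) => e.eventId);
```

The maxSteps bound is there so that a malformed chain with a cycle cannot loop forever.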
Request a compliance export
When to use this: An auditor asks for a CSV/NDJSON of audit events matching a filter.
- From the Compliance events list, set the filter to the slice the auditor wants (date range + event types).
- Click Export this view in the page header. The Exports dialog opens pre-filled with the current filter; the URL carries ?from=current-filter.
- Choose CSV (auditor-friendly default) or JSON (NDJSON wire format).
- Click Queue export. The job appears in the table at /compliance/exports/ with status Queued.
- The page polls every 3 seconds while any job is Queued or Running. Status transitions: Queued → Running → Complete (or Failed).
- When status flips to Complete, click Download in the Action column. The file downloads directly from S3 via a 15-minute presigned URL.
- If the URL expires before download (the presigned_url_expires_at column is past now()), the action cell shows “Download link expired — re-export.” Submit a fresh export with the same filter.
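The status flow, the polling condition, and the link-expiry rule from the steps above can be written as three small predicates. This is a minimal sketch under stated assumptions: the function names are hypothetical, and only the transitions the runbook names are allowed (a direct Queued to Failed transition is not documented, so it is excluded here).

```typescript
type ExportStatus = "Queued" | "Running" | "Complete" | "Failed";

// Transitions named in the procedure: Queued -> Running -> Complete (or Failed).
// ASSUMPTION: Failed is reachable from Running only.
const allowedNext: Record<ExportStatus, ExportStatus[]> = {
  Queued: ["Running"],
  Running: ["Complete", "Failed"],
  Complete: [],
  Failed: [],
};

function canTransition(from: ExportStatus, to: ExportStatus): boolean {
  return allowedNext[from].indexOf(to) !== -1;
}

// The page keeps polling only while at least one job is still in flight.
function shouldKeepPolling(statuses: ExportStatus[]): boolean {
  return statuses.some((s) => s === "Queued" || s === "Running");
}

// A download link is usable only for a Complete job whose presigned URL has not expired.
function downloadLinkUsable(
  jobStatus: ExportStatus,
  presignedUrlExpiresAt: Date | null,
  now: Date,
): boolean {
  return (
    jobStatus === "Complete" &&
    presignedUrlExpiresAt !== null &&
    presignedUrlExpiresAt.getTime() > now.getTime()
  );
}

// A link issued at 00:00 with the 15-minute lifetime is usable at 00:10 but not at 00:20.
const issuedAt = new Date("2024-01-01T00:00:00Z");
const linkExpiresAt = new Date(issuedAt.getTime() + 15 * 60 * 1000);
const usableAtTenMinutes = downloadLinkUsable("Complete", linkExpiresAt, new Date("2024-01-01T00:10:00Z")); // true
const usableAtTwentyMinutes = downloadLinkUsable("Complete", linkExpiresAt, new Date("2024-01-01T00:20:00Z")); // false
```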
Hard caps:
- Max filter window: 90 days (until - since). Wider rejects with FILTER_RANGE_TOO_WIDE.
- Max filter byte size: 4 KB serialized. Larger rejects with FILTER_TOO_LARGE.
- Rate limit: 10 exports per hour per operator email. 11th rejects with RATE_LIMIT_EXCEEDED.
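To make the three caps concrete, here is a sketch of a pre-flight check that returns the error code a request would hit. It is not the resolver's implementation. Assumptions: the filter shape and function names are hypothetical, "4 KB" is read as 4096 bytes of UTF-8 JSON, the rate limit is treated as a sliding one-hour window, and the order in which caps are checked is arbitrary.

```typescript
// Hypothetical filter shape, using the filter-bar fields named in this runbook.
interface ExportFilter {
  since: string; // ISO 8601 timestamp
  until: string; // ISO 8601 timestamp
  event_type?: string[];
  actor_type?: string[];
}

type CapError = "FILTER_RANGE_TOO_WIDE" | "FILTER_TOO_LARGE" | "RATE_LIMIT_EXCEEDED";

const DAY_MS = 24 * 60 * 60 * 1000;
const HOUR_MS = 60 * 60 * 1000;
const MAX_WINDOW_MS = 90 * DAY_MS;
const MAX_FILTER_BYTES = 4 * 1024; // ASSUMPTION: "4 KB" means 4096 bytes
const MAX_EXPORTS_PER_HOUR = 10;

// UTF-8 length of a string, counted without platform-specific APIs.
function utf8ByteLength(s: string): number {
  let bytes = 0;
  for (let i = 0; i < s.length; i++) {
    const c = s.charCodeAt(i);
    if (c < 0x80) bytes += 1;
    else if (c < 0x800) bytes += 2;
    else if (c >= 0xd800 && c <= 0xdbff) {
      bytes += 4; // surrogate pair encodes one 4-byte code point
      i++;
    } else bytes += 3;
  }
  return bytes;
}

// Returns the first cap the request violates, or null when it passes.
// `recentExportTimes` holds the operator's earlier export times in ms since epoch.
function checkExportCaps(
  filter: ExportFilter,
  recentExportTimes: number[],
  nowMs: number,
): CapError | null {
  const windowMs = Date.parse(filter.until) - Date.parse(filter.since);
  if (isNaN(windowMs)) throw new Error("since/until must be parseable timestamps");
  if (windowMs > MAX_WINDOW_MS) return "FILTER_RANGE_TOO_WIDE";
  if (utf8ByteLength(JSON.stringify(filter)) > MAX_FILTER_BYTES) return "FILTER_TOO_LARGE";
  const inLastHour = recentExportTimes.filter((t) => nowMs - t < HOUR_MS).length;
  if (inLastHour >= MAX_EXPORTS_PER_HOUR) return "RATE_LIMIT_EXCEEDED";
  return null;
}

// A window of exactly 90 days passes; one more day is rejected.
const windowStart = new Date(0).toISOString();
const ninetyDays = checkExportCaps({ since: windowStart, until: new Date(90 * DAY_MS).toISOString() }, [], 0); // null
const ninetyOneDays = checkExportCaps({ since: windowStart, until: new Date(91 * DAY_MS).toISOString() }, [], 0); // "FILTER_RANGE_TOO_WIDE"
```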
The export emits a data.export_initiated audit event with the filter as payload — exporting is itself audited.
Resolver source: packages/api/src/graphql/resolvers/compliance/exports.ts.
Apply a hand-rolled compliance migration to dev before merging
When to use this: Your PR adds a hand-rolled SQL file under packages/database-pg/drizzle/00NN_*.sql that is not in meta/_journal.json. The post-deploy drift gate checks that every -- creates: marker in such files resolves on the target DB. Skipping this step fails the deploy.
- Read your migration carefully. It should declare -- creates: schema.object markers in the header for every object it provisions.
- Resolve the dev DATABASE_URL: aws secretsmanager get-secret-value --region us-east-1 --secret-id thinkwork-dev-db-credentials --query SecretString --output text (or use the bootstrap helper if rotating roles).
- Apply: psql "$DATABASE_URL" -f packages/database-pg/drizzle/00NN_<your-migration>.sql.
- Verify with pnpm db:migrate-manual; it should report APPLIED for the new objects.
- Open the PR. The terraform-apply job’s drift-gate step will pass.
If you forget this step, the PR’s deploy job fails on the drift gate with a list of missing objects. The fix is to apply the migration to dev and re-run the failed job — no code change needed.
Reference: feedback_handrolled_migrations_apply_to_dev. Recent compliance migrations applied this way: 0069, 0070, 0073, 0074.
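As an illustration of what the drift gate has to work with, the sketch below pulls -- creates: markers out of a migration's text and lists the ones that do not resolve against a set of known objects. It is not the gate's actual code. Assumptions: one schema.object per marker line, the function names are hypothetical, and the object names in the sample are invented.

```typescript
// Extract `-- creates: schema.object` markers from a hand-rolled migration's SQL text.
// ASSUMPTION: each marker line names exactly one object.
function parseCreatesMarkers(sql: string): string[] {
  const markers: string[] = [];
  for (const line of sql.split(/\r?\n/)) {
    const match = /^\s*--\s*creates:\s*(\S+)\s*$/.exec(line);
    if (match !== null && match[1] !== undefined) markers.push(match[1]);
  }
  return markers;
}

// Markers that do not resolve against the objects found on the target DB.
function missingObjects(markers: string[], existing: string[]): string[] {
  return markers.filter((m) => existing.indexOf(m) === -1);
}

// Invented example migration; the object names are placeholders.
const exampleMigration = [
  "-- creates: compliance.example_table",
  "-- creates: compliance.example_index",
  "CREATE TABLE compliance.example_table (id uuid PRIMARY KEY);",
].join("\n");

const declared = parseCreatesMarkers(exampleMigration);
// With only the table present on the target DB, the index is reported as missing.
const stillMissing = missingObjects(declared, ["compliance.example_table"]);
```

A non-empty result corresponds to the list of missing objects the failed deploy job prints.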
Bootstrap or rotate Aurora compliance role passwords
When to use this: First-time provisioning of a stage’s compliance roles, or rotating any of the three role passwords.
- The bootstrap helper wraps the SQL apply with Secrets Manager population:
  STAGE=dev bash scripts/bootstrap-compliance-roles.sh
- The script generates fresh passwords for compliance_writer, compliance_drainer, and compliance_reader, runs 0070_compliance_aurora_roles.sql against the stage’s DB, and writes each password to its corresponding Secrets Manager secret (thinkwork-{stage}-compliance-{role}-credentials).
- Re-running the script rotates passwords idempotently: the SQL DO $$ ... ALTER ROLE ... PASSWORD %L block updates without breaking existing connections (next connect picks up the new password).
- After rotation, the graphql-http Lambda + the anchor + drainer Lambdas all pick up the new passwords on cold start. There is no warm-flush API; if you must force a warm pool to refresh immediately, redeploy the affected functions.
Direct apply (advanced; must supply all three passwords explicitly):
  psql "$DATABASE_URL" \
    -v writer_pass="$COMPLIANCE_WRITER_PASS" \
    -v drainer_pass="$COMPLIANCE_DRAINER_PASS" \
    -v reader_pass="$COMPLIANCE_READER_PASS" \
    -f packages/database-pg/drizzle/0070_compliance_aurora_roles.sql
Migration: packages/database-pg/drizzle/0070_compliance_aurora_roles.sql. Helper: scripts/bootstrap-compliance-roles.sh.
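If the direct apply is ever scripted rather than typed, the "all three passwords" requirement is worth enforcing before psql runs. The following is a minimal sketch of that guard; the function and type names are hypothetical, and it only builds the argument list (for something like an execFile-style call) without running anything.

```typescript
// The three psql variables used by the direct apply.
interface RolePasswords {
  writer_pass?: string;
  drainer_pass?: string;
  reader_pass?: string;
}

const ROLES_MIGRATION_PATH = "packages/database-pg/drizzle/0070_compliance_aurora_roles.sql";

// Build the psql argument list, refusing to proceed unless
// all three passwords are present and non-empty.
function buildDirectApplyArgs(databaseUrl: string, passwords: RolePasswords): string[] {
  const variableNames: (keyof RolePasswords)[] = ["writer_pass", "drainer_pass", "reader_pass"];
  const args: string[] = [databaseUrl];
  for (const variableName of variableNames) {
    const value = passwords[variableName];
    if (value === undefined || value === "") {
      throw new Error("missing password variable: " + variableName);
    }
    args.push("-v", variableName + "=" + value);
  }
  args.push("-f", ROLES_MIGRATION_PATH);
  return args;
}

// Placeholder values only; real passwords should come from the environment.
const exampleArgs = buildDirectApplyArgs("postgres://example-host/example-db", {
  writer_pass: "example-writer",
  drainer_pass: "example-drainer",
  reader_pass: "example-reader",
});
```

Passing the arguments as an array rather than a single shell string sidesteps quoting problems when a generated password contains shell metacharacters.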
Flip S3 Object Lock GOVERNANCE → COMPLIANCE for an audit engagement
When to use this: A production audit engagement is starting. The anchor bucket needs to be in COMPLIANCE mode to satisfy the auditor’s “can anyone bypass this?” question.
DANGER — IRREVERSIBLE. Once any object is written under COMPLIANCE mode with retention X, that object cannot be deleted or shortened by anyone (including AWS root) until X expires. Default retention is 365 days. Verify the stage twice before applying.
- Confirm the stage is prod (or a non-prod stage you have explicitly approved for COMPLIANCE; see step 4).
- Edit the stage’s tfvars: compliance_anchor_object_lock_mode = "COMPLIANCE".
- Run terraform plan from the composite root. The plan should show only a change on aws_s3_bucket_object_lock_configuration.anchor (mode GOVERNANCE → COMPLIANCE). If the plan shows other resource changes, stop: something else is dirty.
- Non-prod safeguard: the module’s lifecycle.precondition blocks COMPLIANCE on non-prod stages by default. To intentionally enable on a non-prod stage, also set allow_compliance_in_non_prod = true in tfvars. Master plan U7/U8b documented this guard explicitly.
- Apply with manual confirmation. Do not pipe yes | or use --auto-approve.
- After apply, verify in the S3 console that the anchor bucket’s Object Lock retention mode reads COMPLIANCE.
The Terraform module README has the full playbook: terraform/modules/data/compliance-audit-bucket/README.md.
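Two of the checks in this procedure lend themselves to a mechanical sketch: confirming the plan touches nothing but the Object Lock configuration, and the non-prod guard. The code below assumes the plan has been exported with terraform show -json and parsed into an object; only the resource_changes fields used here are typed. The function names and the module path in the example are hypothetical, and the guard function mirrors the documented behaviour rather than the module's actual precondition expression.

```typescript
// Subset of the `terraform show -json <planfile>` output used here.
interface ResourceChange {
  address: string;
  change: { actions: string[] };
}
interface PlanJson {
  resource_changes?: ResourceChange[];
}

const EXPECTED_RESOURCE = "aws_s3_bucket_object_lock_configuration.anchor";

// True when `address` is the expected resource, at the root or inside any module.
function isExpectedResource(address: string): boolean {
  const suffix = "." + EXPECTED_RESOURCE;
  return address === EXPECTED_RESOURCE || address.slice(-suffix.length) === suffix;
}

// Addresses the plan would change other than the Object Lock configuration.
function unexpectedChanges(plan: PlanJson): string[] {
  const changes = plan.resource_changes === undefined ? [] : plan.resource_changes;
  return changes
    .filter((rc) => !(rc.change.actions.length === 1 && rc.change.actions[0] === "no-op"))
    .map((rc) => rc.address)
    .filter((address) => !isExpectedResource(address));
}

// Mirrors the documented guard: COMPLIANCE is blocked off prod unless explicitly allowed.
function complianceModeAllowed(
  stage: string,
  mode: "GOVERNANCE" | "COMPLIANCE",
  allowComplianceInNonProd: boolean,
): boolean {
  if (mode !== "COMPLIANCE") return true;
  return stage === "prod" || allowComplianceInNonProd;
}

// Example plan: one untouched resource plus the expected Object Lock update.
const examplePlan: PlanJson = {
  resource_changes: [
    { address: "module.example.aws_s3_bucket.anchor", change: { actions: ["no-op"] } },
    { address: "module.example." + EXPECTED_RESOURCE, change: { actions: ["update"] } },
  ],
};
const dirtyAddresses = unexpectedChanges(examplePlan); // []
```

A non-empty result is the "stop, something else is dirty" case from step 3.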
Drain the compliance-anchor DLQ
When to use this: CloudWatch alarm thinkwork-{stage}-compliance-anchor-dlq (or its U8a/U8b equivalent) fires. The anchor Lambda crashed on at least one invocation; messages are sitting in the DLQ.
- Open the SQS console; locate thinkwork-{stage}-compliance-anchor-dlq.
- Use Send and receive messages → Poll for messages. Inspect each message body. Anchor invocations carry no payload (scheduler-triggered) so the body is the raw EventBridge event.
- Cross-reference with CloudWatch logs for thinkwork-{stage}-api-compliance-anchor around the message’s SentTimestamp. The error in the log is the actionable signal.
- Decision tree:
  - Transient infra error (Aurora connection drop, S3 throttle, KMS rate limit): the next scheduled cadence will produce a fresh anchor; the DLQ message can be purged. The chain catches up automatically.
  - Code regression (handler panic, unhandled exception): patch first, deploy, then purge.
  - Configuration error (env var unset, IAM permission gap): fix terraform first, redeploy, then purge.
- To purge a single message after handling: select it in the console and click Delete. To purge the entire queue (only when you have confirmed every message is recoverable): Purge queue.
Do not “replay” anchor DLQ messages — the schedule is rate-based, so replaying produces a duplicate cadence at an unintended time. Let the next 15-minute tick re-anchor.
Anchor Lambda: packages/lambda/compliance-anchor.ts.
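The log cross-reference and the decision tree can be summarised in a few lines. SQS reports SentTimestamp as epoch milliseconds in a string attribute, which is the same unit CloudWatch Logs uses for time-range queries. The sketch below is illustrative only: the five-minute padding, the category labels, and the function names are assumptions, and classifying a failure still requires reading the log.

```typescript
type AnchorFailure = "transient_infra" | "code_regression" | "configuration_error";

// Ordered operator actions per failure category, following the decision tree.
// Replay is deliberately absent: the next scheduled cadence re-anchors.
const anchorDlqPlaybook: Record<AnchorFailure, string[]> = {
  transient_infra: ["purge message"],
  code_regression: ["patch handler", "deploy", "purge message"],
  configuration_error: ["fix terraform", "redeploy", "purge message"],
};

// Turn a message's SentTimestamp into a log search window in epoch milliseconds.
// ASSUMPTION: five minutes either side is enough to find the failing invocation.
function logWindowAround(
  sentTimestamp: string,
  padMinutes: number = 5,
): { startTime: number; endTime: number } {
  const sentMs = parseInt(sentTimestamp, 10);
  if (isNaN(sentMs)) throw new Error("SentTimestamp is not a number: " + sentTimestamp);
  const padMs = padMinutes * 60 * 1000;
  return { startTime: sentMs - padMs, endTime: sentMs + padMs };
}

// A message sent at 1700000000000 yields a window of 1699999700000 to 1700000300000.
const exampleWindow = logWindowAround("1700000000000");
const regressionSteps = anchorDlqPlaybook["code_regression"];
```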
Drain the compliance-exports DLQ
When to use this: CloudWatch alarm thinkwork-{stage}-compliance-exports-dlq-depth fires.
- Open the SQS console; locate thinkwork-{stage}-compliance-exports-dlq.
- Each message body is {"jobId": "<uuidv7>"}. Look up the job in compliance.export_jobs to see what filter the operator requested.
- Check the job’s status:
  - If running for >15 minutes: the runner crashed. Mark failed with job_error = 'runner crashed; see DLQ', then purge the message. Operator submits a fresh export.
  - If failed: the runner already wrote the failure to the DB before exiting; the DLQ message is redundant. Purge.
  - If queued: the runner never claimed the job (CAS guard didn’t fire). Investigate the Lambda; likely an env-var regression. Purge after fix.
- Replaying the message is safe because the runner’s CAS guard (UPDATE … WHERE status='queued') makes re-delivery a no-op when the job is no longer queued. But replay is rarely useful: the underlying problem (handler crash) needs a code fix first.
Runner: packages/lambda/compliance-export-runner.ts.
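The status triage and the reason replay is harmless can both be shown with an in-memory stand-in for the job row. This is a sketch under stated assumptions, not the runner's code: the startedAtMs field, the triage labels, and the function names are hypothetical, and tryClaim imitates the effect of UPDATE … WHERE status='queued' on a single object rather than issuing SQL.

```typescript
type JobStatus = "queued" | "running" | "complete" | "failed";

interface ExportJob {
  jobId: string;
  status: JobStatus;
  startedAtMs: number | null; // hypothetical column: when the runner claimed the job
  job_error: string | null;
}

type Triage =
  | "mark_failed_then_purge"
  | "purge"
  | "investigate_lambda_then_purge"
  | "not_covered_by_runbook";

const STUCK_AFTER_MS = 15 * 60 * 1000;

// Map a DLQ message's job to the action the procedure prescribes.
function triageDlqMessage(job: ExportJob, nowMs: number): Triage {
  if (job.status === "failed") return "purge";
  if (job.status === "queued") return "investigate_lambda_then_purge";
  if (
    job.status === "running" &&
    job.startedAtMs !== null &&
    nowMs - job.startedAtMs > STUCK_AFTER_MS
  ) {
    return "mark_failed_then_purge";
  }
  return "not_covered_by_runbook";
}

// Compare-and-set claim: succeeds only while the job is still queued,
// so a re-delivered message for a claimed or finished job changes nothing.
function tryClaim(job: ExportJob, nowMs: number): boolean {
  if (job.status !== "queued") return false;
  job.status = "running";
  job.startedAtMs = nowMs;
  return true;
}

const exampleJob: ExportJob = { jobId: "example-job", status: "queued", startedAtMs: null, job_error: null };
const firstDelivery = tryClaim(exampleJob, 0); // true: the job moves to running
const redelivery = tryClaim(exampleJob, 1000); // false: the guard makes this a no-op
```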
Re-run a failed export
When to use this: An operator reports a Failed export and wants the same slice again.
- Open /compliance/exports/. The failed job’s row shows the truncated job_error in the Status column (hover for the full message via the title attribute).
- From the events list page (/compliance), reconstruct the original filter (the failed job’s filter JSON in the table tells you what to set).
- Click Export this view with the reconstructed filter; submit. A fresh job queues; the failed row stays for traceability.
There is no “clone failed job” mutation at v1. Reconstructing the filter from the row keeps the audit trail clean (each data.export_initiated event is its own provable request).
Where to escalate
- An alarm fired and the playbook didn’t fix it → oncall.
- A migration failed mid-apply or the drift gate is stuck → oncall.
- An operator email needs adding to the allowlist → set THINKWORK_PLATFORM_OPERATOR_EMAILS on the graphql-http Lambda’s environment via Terraform; redeploy graphql-http.
- An auditor needs the verifier CLI run against an attested slice → packages/audit-verifier/README.md.