Code Sandbox

The Code Sandbox is the harness’s deterministic-execution surface for non-deterministic plans: the execute_code tool an agent uses to run Python against real data inside your AWS account. It’s peer-class with Managed Agents: ThinkWork handles provisioning, the per-turn lifecycle, and the audit trail; you assign it to a template and the agent picks it up on the next invocation. Two operating guarantees converge here. Security: each tenant gets its own Bedrock AgentCore Code Interpreter instance, so IAM isolation is structural rather than application-enforced, and the sandbox runs as a pure-compute primitive that never carries per-user OAuth credentials. Reliability: every invocation is captured in sandbox_invocations with exit status, output byte counts, and session id (the full output goes to CloudWatch), so a failed run is debuggable without re-running it.

The sandbox runs on Bedrock AgentCore Code Interpreter, one instance per tenant. That per-tenant fanout is load-bearing: it’s what makes the IAM boundary between tenants a structural property of AWS itself, not a thing ThinkWork has to enforce in application code.

execute_code is a pure-compute primitive. It computes on data the agent already has or can reach via the session’s per-tenant IAM role — it does not carry per-user OAuth credentials. Agents that need OAuth-ed work (post to Slack, open a GitHub issue) call a composable-skill connector script instead.

This page is the conceptual deep dive. For the operator-side runbook — toggling policy, triaging failure modes, reading the residual-threat classes — see Sandbox environments (runbook).

When to reach for the sandbox

The sandbox is the right tool when:

The agent needs to compute, not just call. “Join these two result sets and summarise by region” is pandas + matplotlib, not a string-munging prompt.
The data lives in your AWS account. S3 buckets, Secrets Manager secrets, other tenant-scoped infrastructure. The sandbox runs under a per-tenant IAM role inside your account, so it can reach them without an egress API.
You need the audit trail. Every invocation writes a sandbox_invocations row with tenant_id, agent_id, session_id, exit_status, byte counts, and a SHA-256 of the executed code.

The sandbox is not the right tool when:

You want to ship a general-purpose REPL to end users. It’s an agent-facing tool, not an IDE. There is no out-of-turn persistence.
You need OAuth-ed API work. Posting to Slack or opening a GitHub issue belongs in a composable-skill connector script that the agent calls as its own tool. Those scripts carry per-user credentials cleanly; the sandbox doesn’t.
You’re on a regulated compliance tier. The v1 substrate is explicitly not HIPAA-certified; regulated tenants have sandbox_enabled = false as the platform default. See the residual-threat list below.
You need long-running compute. Each turn creates a fresh session, runs one or a few execute_code calls, and stops the session. If you need minutes of compute, reach for Routines or a background job.

How a template opts in

Sandbox enrollment is a template-level field, not an agent-level flag. The whole template population opts in or out together; individual agents inherit the template’s choice.

sandbox:
  environment: default-public

environment — networking policy for the Code Interpreter instance. default-public means the sandbox has public egress (needed for fetching open data, calling no-auth APIs, reaching S3 endpoints). internal-only has no egress and is reserved for compute-only workloads where the session reads only pre-mounted data.

If a template declares sandbox but the tenant has sandbox_enabled = false, dispatch fails closed: the tool never registers, and the agent gets a structured SandboxProvisioning error it can explain to the user.
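The fail-closed rule can be sketched as a small decision function. This is a minimal illustration, not the dispatcher’s actual code: the Template and Tenant classes and the resolve_sandbox name are hypothetical stand-ins, and only the sandbox_enabled flag, the environment value, and the SandboxProvisioning error name come from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Template:
    # Hypothetical stand-in for a parsed template. None means the template
    # has no `sandbox:` block at all.
    sandbox_environment: Optional[str] = None


@dataclass
class Tenant:
    # Hypothetical stand-in for the tenant policy record.
    sandbox_enabled: bool = False


def resolve_sandbox(template: Template, tenant: Tenant) -> dict:
    """Decide whether execute_code should be registered for this turn."""
    if template.sandbox_environment is None:
        # The template never opted in: nothing to register, nothing to report.
        return {"register": False, "error": None}
    if not tenant.sandbox_enabled:
        # Fail closed: the template asked for the sandbox, tenant policy says no.
        return {"register": False, "error": "SandboxProvisioning"}
    return {
        "register": True,
        "error": None,
        "environment": template.sandbox_environment,
    }
```

The point of the shape is that the tool is only ever registered on the last branch; every other path leaves the agent without execute_code rather than with a tool that fails later.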

For the admin surface that toggles per-tenant policy, see the operator runbook’s Toggling tenant policy section.

What the agent sees

At invocation time, when a template declaring sandbox runs for a tenant with sandbox_enabled = true, the dispatcher registers a single Strands tool:

execute_code(code: str) -> dict

The agent passes a block of Python; the tool returns a structured result:

{
  "ok": true,
  "stdout": "...",
  "stderr": "...",
  "exit_status": "ok",
  "duration_ms": 1842
}

On error paths — cap breach, provisioning failure, timeout — ok is false and error carries a named class the agent can react to:

`error` value	Meaning
`SandboxProvisioning`	Tenant has the policy on but interpreter IDs aren’t populated yet. Transient during cold deploys.
`SandboxCapExceeded`	Circuit breaker fired. `error_message` carries `dimension` (`tenant_daily` / `agent_hourly`) and `resets_at`.

Agents are expected to recover gracefully. For a cap breach, the expected reply looks like “I’ve hit the daily cap, please try again at 00:00 UTC.”
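A sketch of how agent-side code might turn a result into a user-facing message follows. The summarize_result helper is hypothetical, and it assumes error_message is a dict with dimension and resets_at keys; the page only says that error_message carries those two values, not how they are encoded.

```python
def summarize_result(result: dict) -> str:
    """Turn an execute_code result into a one-line message for the user."""
    if result.get("ok"):
        return result.get("stdout", "").strip() or "(no output)"

    error = result.get("error")
    if error == "SandboxCapExceeded":
        # Assumed shape: error_message is a dict with `dimension` and `resets_at`.
        detail = result.get("error_message", {})
        dimension = detail.get("dimension", "sandbox")
        resets_at = detail.get("resets_at", "a later time")
        return f"I've hit the {dimension} cap; please try again at {resets_at}."
    if error == "SandboxProvisioning":
        # Described above as transient during cold deploys.
        return "The sandbox is still being set up; please try again shortly."
    return f"The code run failed ({error or 'unknown error'})."
```

For a cap breach the helper acknowledges the limit and stops, which matches the expectation that the agent reports the reset time instead of retrying.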

The first thing any sandbox session does is a one-line readiness check: executeCode call #1 imports sitecustomize and confirms that the stdio redactor has wrapped sys.stdout. If the check fails, the session aborts, so user code never runs on an image without the redactor in place. User code runs as executeCode call #2 and later.
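The readiness check can be pictured roughly as follows. This is a sketch under assumptions: RedactingStdout is a hypothetical pass-through marker class (the real redactor’s logic is not shown on this page), and installed() and readiness_preamble() are illustrative names for the check that sitecustomize exposes.

```python
import sys


class RedactingStdout:
    """Hypothetical marker wrapper; real redaction logic is out of scope here."""

    def __init__(self, inner):
        self.inner = inner

    def write(self, text):
        return self.inner.write(text)

    def flush(self):
        self.inner.flush()


def installed() -> bool:
    # The image counts as ready only if sys.stdout is currently the wrapper.
    return isinstance(sys.stdout, RedactingStdout)


def readiness_preamble() -> None:
    """Raise before any user code runs if the redactor is missing."""
    if not installed():
        raise RuntimeError("stdio redactor not installed; aborting session")
```

Running the check as its own call, before any user code, means a misbuilt image surfaces as an aborted session and not as unredacted output.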

Per-turn lifecycle

One line per step, in order, for one turn that calls execute_code once:

Dispatch — user message lands, agent + template resolved.
Pre-flight — policy check + interpreter-ready check. Dispatcher threads sandbox_interpreter_id + sandbox_environment onto the invocation payload.
Tool register — execute_code appears in the Strands tool surface for this turn only.
Agent calls execute_code(code) — the substrate call begins.
Quota check — atomic WHERE count < cap increment against sandbox_tenant_daily_counters + sandbox_agent_hourly_counters. Breach ⇒ SandboxCapExceeded, no session created.
Session start — raw boto3 bedrock-agentcore.StartCodeInterpreterSession against the tenant’s interpreter, followed by the readiness preamble (sitecustomize.installed() check) as InvokeCodeInterpreter call #1.
User code — InvokeCodeInterpreter with name="executeCode" runs the agent’s Python. Response is an event stream of MCP tool-result envelopes (result.content[] for streaming text + result.structuredContent for {stdout, stderr, exitCode}).
Audit row — one sandbox_invocations row written with exit_status, byte counts, executed_code_hash, session_id.
Session stop — StopCodeInterpreterSession.

The turn then continues: the agent sees the tool result and decides whether to answer, call another tool, or call execute_code again. Every subsequent execute_code call in the same turn reuses the same session, re-runs the quota check, and writes a fresh audit row, so the session stop happens once, after the turn’s final call.
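The session start, user code, and session stop steps can be sketched against a boto3-style client. The method names below are the snake_case forms of the operations named in the steps; the parameter names reflect the AgentCore API as best understood here and should be checked against the boto3 reference. The run_once helper, the session name, and the timeout value are illustrative assumptions, and the quota check, readiness preamble, and audit write are deliberately left out.

```python
def run_once(client, interpreter_id: str, code: str) -> dict:
    """Start a session, run one block of Python, and always stop the session."""
    session_id = client.start_code_interpreter_session(
        codeInterpreterIdentifier=interpreter_id,
        name="turn-session",        # illustrative session name
        sessionTimeoutSeconds=300,  # illustrative timeout
    )["sessionId"]
    try:
        response = client.invoke_code_interpreter(
            codeInterpreterIdentifier=interpreter_id,
            sessionId=session_id,
            name="executeCode",
            arguments={"language": "python", "code": code},
        )
        structured = {}
        # The response is an event stream; keep the last structured result seen.
        for event in response["stream"]:
            result = event.get("result")
            if result and "structuredContent" in result:
                structured = result["structuredContent"]
        return {"session_id": session_id, **structured}
    finally:
        # Stop the session even if the invoke call or stream parsing raised.
        client.stop_code_interpreter_session(
            codeInterpreterIdentifier=interpreter_id,
            sessionId=session_id,
        )
```

In a real deployment the client would come from boto3.client("bedrock-agentcore"); taking it as a parameter keeps the sketch testable with a stub and makes the try/finally guarantee easy to verify.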

Circuit breakers

Two caps, both per-tenant, both enforced by the sandbox-quota-check Lambda:

Dimension	Default cap	Stored in
`tenant_daily`	1000 calls / UTC day	`sandbox_tenant_daily_counters`
`agent_hourly`	100 calls / UTC hour	`sandbox_agent_hourly_counters`

Both are raised via SSM (/thinkwork/{stage}/sandbox/caps/*), not by redeploy. Setting either to 0 is a legitimate kill-switch — the tool registers but every call rejects, which is the behaviour you want if an incident demands killing sandbox traffic without redeploying.

Breaches surface to the agent as SandboxCapExceeded with resets_at; the agent is expected to acknowledge in chat rather than retry.
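The atomic check-and-increment behind both caps can be illustrated with a single conditional UPDATE. This sketch uses an in-memory SQLite table as a stand-in; the counters table, its bucket and count columns, and the make_db and try_consume helpers are hypothetical, and the real counter tables and the Lambda’s SQL are not shown on this page.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # One illustrative table standing in for the daily and hourly counter tables.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE counters (bucket TEXT PRIMARY KEY, count INTEGER NOT NULL)"
    )
    return conn


def try_consume(conn: sqlite3.Connection, bucket: str, cap: int) -> bool:
    """Increment the bucket's counter unless it has already reached `cap`."""
    # Ensure a row exists for this bucket (e.g. one tenant on one UTC day).
    conn.execute(
        "INSERT OR IGNORE INTO counters (bucket, count) VALUES (?, 0)", (bucket,)
    )
    # The WHERE clause makes the check and the increment one statement, so two
    # concurrent callers cannot both pass the check at count == cap - 1.
    cur = conn.execute(
        "UPDATE counters SET count = count + 1 WHERE bucket = ? AND count < ?",
        (bucket, cap),
    )
    conn.commit()
    return cur.rowcount == 1  # no row updated means the cap was hit
```

With a cap of 0 the WHERE clause can never match, so every call is rejected, which is how the kill-switch behaviour falls out of the same statement.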

The audit row

Every call writes exactly one sandbox_invocations row. The columns worth knowing:

session_id — join key to /aws/bedrock-agentcore/runtimes/* log streams.
executed_code_hash — SHA-256 of the user code. Stable across tenants, so repeat invocations of the same code correlate cleanly.
stdout_bytes / stderr_bytes — raw pre-truncation sizes. Stdout is truncated at 256 KB in the agent-visible result; stderr at 32 KB. The full content lives in CloudWatch.
stdout_truncated / stderr_truncated — true when the limits fired.
exit_status — ok | error | timeout | oom | cap_exceeded | provisioning.
failure_reason — populated when ok = false; carries the tool-level error message.

Retention defaults to 30 days with a 180-day ceiling enforced by a DB CHECK constraint. The table is append-only; there’s no update path.
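How the hash, byte-count, and truncation columns relate to one call can be sketched as below. The audit_fields helper is hypothetical, and two details are assumptions: that “256 KB” and “32 KB” mean multiples of 1024 bytes, and that sizes are measured on the UTF-8 encoding of the output.

```python
import hashlib

STDOUT_LIMIT = 256 * 1024  # assumes "256 KB" means 256 * 1024 bytes
STDERR_LIMIT = 32 * 1024   # assumes "32 KB" means 32 * 1024 bytes


def audit_fields(code: str, stdout: str, stderr: str) -> dict:
    """Derive the size, truncation, and hash columns for one invocation."""
    out_bytes = stdout.encode("utf-8")
    err_bytes = stderr.encode("utf-8")
    return {
        # Hash of the code alone, so identical code hashes identically
        # regardless of which tenant or agent ran it.
        "executed_code_hash": hashlib.sha256(code.encode("utf-8")).hexdigest(),
        # Raw sizes, measured before any truncation is applied.
        "stdout_bytes": len(out_bytes),
        "stderr_bytes": len(err_bytes),
        "stdout_truncated": len(out_bytes) > STDOUT_LIMIT,
        "stderr_truncated": len(err_bytes) > STDERR_LIMIT,
    }
```

Because the byte counts are taken before truncation, a row with stdout_truncated set still records how much output the run actually produced.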

Residual threats — named up front

The sandbox substrate ships with a short list of explicitly named threat classes. They are not bugs; they are hardening tracks on the v2 roadmap. Naming them up front is the practice peer harnesses (Anthropic Managed Agents, LangChain Deep Agents) have converged on.

Track	Class	v2 fix
T2	Malicious `pip install` — runtime `pip install` has no allowlist; a typo-squatted or compromised package executes at import time with access to whatever data the session reads	Private PyPI mirror + install allowlist
T3	PHI/PII handling — the sandbox isn’t HIPAA-certified; regulated-tenant platform default is `sandbox_enabled = false`	Regulated-tenant-specific environment with per-log-group encryption and shorter retention

There’s also a stdout-bypass class — os.write(fd, ...), C-extension direct writes, multiprocessing workers in fresh processes, adversarial split-writes across the redactor’s rolling-buffer window. The CloudWatch subscription-filter backstop covers the subset whose values match known OAuth prefixes (in case an agent prints a token-shaped string it fetched from an API response); the primary stdio redactor covers everything flowing through Python’s normal print path.
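The difference between the normal print path and a direct file-descriptor write can be demonstrated with a pipe. This is an illustration only: RedactingWriter is a hypothetical, much-simplified stand-in for the stdio redactor (no rolling buffer), and the two token prefixes in TOKEN_RE are examples, not the platform’s actual pattern set.

```python
import os
import re

# Illustrative token shapes only; the real pattern set is not documented here.
TOKEN_RE = re.compile(r"(xoxb-|ghp_)[A-Za-z0-9-]+")


class RedactingWriter:
    """Wraps a text stream and masks token-shaped substrings on write()."""

    def __init__(self, inner):
        self.inner = inner

    def write(self, text):
        return self.inner.write(TOKEN_RE.sub("[REDACTED]", text))

    def flush(self):
        self.inner.flush()


def demo() -> str:
    read_fd, write_fd = os.pipe()
    raw = os.fdopen(write_fd, "w")
    out = RedactingWriter(raw)

    print("token=ghp_abc123", file=out)        # normal print path: redacted
    out.flush()
    os.write(write_fd, b"token=ghp_abc123\n")  # direct fd write: skips the wrapper

    raw.close()
    with os.fdopen(read_fd) as reader:
        return reader.read()
```

Reading the pipe back shows the first line masked and the second intact, which is why writes that skip Python’s stream objects need a backstop outside the process.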

Operators triaging a “did we leak a token?” incident should check the residual-threat list first. If the leak matches a named class, it is a known limitation whose fix is tracked on the v2 roadmap. If it matches no class, that is a real regression in the stdio redactor and platform security should be paged; see the runbook’s When to call platform security section.

How it sits next to the rest of the stack

Skills that need execute_code declare it via the template’s sandbox block. They don’t register the tool themselves. See Skills.
Templates control which tenant populations get the sandbox. Flipping a template’s sandbox block on rolls out to every agent instantiated from it. See Templates.
Connectors supply OAuth tokens for other agent-facing tools (typed skills, MCP bridges, composable-skill connector scripts). They do not flow into the sandbox — execute_code stays a pure-compute primitive. See Connectors.
Guardrails evaluate the agent’s responses, not the sandbox’s Python. If you need to gate what code the agent writes, that’s a system-prompt and template-level decision, not a sandbox feature.
Budgets include sandbox calls — the cost-per-call is a configurable line item in the tenant’s usage accounting. See Budgets, Usage, and Audit.