Code Sandbox
The Code Sandbox is the harness’s deterministic-execution surface for non-deterministic plans — the execute_code tool an agent uses to run Python against real data inside your AWS account. It’s peer-class with Managed Agents: ThinkWork handles provisioning, the per-turn lifecycle, and the audit trail; you assign it to a template and the agent picks it up on the next invocation. Two operating guarantees converge here: Security (per-tenant Bedrock AgentCore Code Interpreter instance — IAM isolation is structural, not application-enforced; the sandbox runs as a pure-compute primitive that never carries per-user OAuth credentials) and Reliability (every invocation is captured in sandbox_invocations with full stdout, exit status, and session id, so a failed run is debuggable without re-running it).
The sandbox runs on Bedrock AgentCore Code Interpreter, one instance per tenant. That per-tenant fanout is load-bearing: it’s what makes the IAM boundary between tenants a structural property of AWS itself, not a thing ThinkWork has to enforce in application code.
execute_code is a pure-compute primitive. It computes on data the agent already has or can reach via the session’s per-tenant IAM role — it does not carry per-user OAuth credentials. Agents that need OAuth-ed work (post to Slack, open a GitHub issue) call a composable-skill connector script instead.
This page is the conceptual deep dive. For the operator-side runbook — toggling policy, triaging failure modes, reading the residual-threat classes — see Sandbox environments (runbook).
When to reach for the sandbox
The sandbox is the right tool when:
- The agent needs to compute, not just call. “Join these two result sets and summarise by region” is pandas + matplotlib, not a string-munging prompt.
- The data lives in your AWS account. S3 buckets, Secrets Manager secrets, other tenant-scoped infrastructure. The sandbox runs inside your VPC’s IAM world, so it can reach them without an egress API.
- You need the audit trail. Every invocation writes a sandbox_invocations row with tenant_id, agent_id, session_id, exit_status, byte counts, and a SHA-256 of the executed code.
The sandbox is not the right tool when:
- You want to ship a general-purpose REPL to end users. It’s an agent-facing tool, not an IDE. There is no out-of-turn persistence.
- You need OAuth-ed API work. Posting to Slack or opening a GitHub issue belongs in a composable-skill connector script that the agent calls as its own tool. Those scripts carry per-user credentials cleanly; the sandbox doesn’t.
- You’re on a regulated compliance tier. The v1 substrate is explicitly not HIPAA-certified; regulated tenants have sandbox_enabled = false as the platform default. See the residual-threat list below.
- You need long-running compute. Each turn creates a fresh session, runs one or a few execute_code calls, and stops the session. If you need minutes of compute, reach for Routines or a background job.
How a template opts in
Sandbox enrollment is a template-level field, not an agent-level flag. The whole template population opts in or out together; individual agents inherit the template’s choice.
    sandbox:
      environment: default-public

environment sets the networking policy for the Code Interpreter instance. default-public means the sandbox has public egress (needed for fetching open data, calling no-auth APIs, reaching S3 endpoints). internal-only has no egress and is reserved for compute-only workloads where the session reads only pre-mounted data.
A template that declares sandbox for a tenant with sandbox_enabled = false fails closed at dispatch time: the tool never registers, and the agent gets a structured SandboxProvisioning error it can explain to the user.
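The fail-closed rule can be sketched as a small dispatch-time decision. This is an illustrative sketch, not ThinkWork's dispatcher: the function name resolve_sandbox_tool, the Template and TenantPolicy types, and the returned dict shape are all assumptions made for the example. Only sandbox_enabled, the two environment values, and the SandboxProvisioning error name come from this page.

```python
from dataclasses import dataclass
from typing import Optional

# Environment names taken from this page; everything else is hypothetical.
KNOWN_ENVIRONMENTS = {"default-public", "internal-only"}


@dataclass
class Template:
    # None means the template has no sandbox block at all.
    sandbox_environment: Optional[str] = None


@dataclass
class TenantPolicy:
    sandbox_enabled: bool = False


def resolve_sandbox_tool(template: Template, policy: TenantPolicy) -> dict:
    """Decide at dispatch time whether execute_code is registered."""
    if template.sandbox_environment is None:
        # Template never opted in: nothing to register, nothing to report.
        return {"register": False, "error": None}
    if template.sandbox_environment not in KNOWN_ENVIRONMENTS:
        raise ValueError(f"unknown sandbox environment: {template.sandbox_environment}")
    if not policy.sandbox_enabled:
        # Fail closed: no tool, plus a structured error the agent can explain.
        return {"register": False, "error": "SandboxProvisioning"}
    return {"register": True, "error": None}
```

The point of the shape is that the tenant policy is checked after the template opt-in, so a disabled tenant never sees the tool regardless of what the template declares.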
For the admin surface that toggles per-tenant policy, see the operator runbook’s Toggling tenant policy section.
What the agent sees
At invocation time, when a template declaring sandbox runs for a tenant with sandbox_enabled = true, the dispatcher registers a single Strands tool:

    execute_code(code: str) -> dict

The agent passes a block of Python; the tool returns a structured result:

    {
      "ok": true,
      "stdout": "...",
      "stderr": "...",
      "exit_status": "ok",
      "duration_ms": 1842
    }

On error paths (cap breach, provisioning failure, timeout), ok is false and error carries a named class the agent can react to:
| error value | Meaning |
|---|---|
| SandboxProvisioning | Tenant has the policy on but interpreter IDs aren’t populated yet. Transient during cold deploys. |
| SandboxCapExceeded | Circuit breaker fired. error_message carries dimension (tenant_daily / agent_hourly) and resets_at. |
Agents are expected to recover gracefully — the cap-breach shape is “I’ve hit the daily cap, please try again at 00:00 UTC.”
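As a rough illustration of reacting to the named error classes, the sketch below maps a tool result to a user-facing message. The function name describe_result is hypothetical, and it assumes error_message arrives as a dict with dimension and resets_at keys; the page names those fields but does not specify their exact encoding.

```python
def describe_result(result: dict) -> str:
    """Turn an execute_code result into a short message for the user."""
    if result.get("ok"):
        return result.get("stdout", "")
    error = result.get("error")
    if error == "SandboxCapExceeded":
        # Assumption: error_message is a dict carrying dimension and resets_at.
        detail = result.get("error_message", {})
        dimension = detail.get("dimension", "unknown")
        resets_at = detail.get("resets_at", "a later time")
        # Acknowledge the cap instead of retrying the call.
        return f"I've hit the {dimension} sandbox cap; please try again at {resets_at}."
    if error == "SandboxProvisioning":
        return "The sandbox is still being provisioned; please try again shortly."
    return f"The sandbox call failed ({error or 'unknown error'})."
```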
The first thing any sandbox session does is a one-line readiness check: executeCode call #1 imports sitecustomize and confirms the stdio redactor wrapped sys.stdout. If the check fails, the session aborts before user code runs on an unmitigated image. User code runs as executeCode call #2+.
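To make the readiness check concrete, here is a minimal sketch of what a stdout-wrapping module with an installed() probe could look like. It is not the actual sitecustomize shipped in the sandbox image: the RedactingStream class, the install() helper, and the single GitHub-style token pattern are assumptions for illustration. Only the idea of an installed() check on a wrapped sys.stdout comes from this page.

```python
import re
import sys

# Hypothetical pattern: a stand-in for whatever the real redactor matches.
TOKEN_PATTERN = re.compile(r"ghp_[A-Za-z0-9]{36}")


class RedactingStream:
    """Wraps a text stream and masks token-shaped strings on write."""

    def __init__(self, inner):
        self.inner = inner

    def write(self, text: str) -> int:
        self.inner.write(TOKEN_PATTERN.sub("[REDACTED]", text))
        return len(text)

    def flush(self) -> None:
        self.inner.flush()


def install() -> None:
    """Wrap sys.stdout once; a second call is a no-op."""
    if not isinstance(sys.stdout, RedactingStream):
        sys.stdout = RedactingStream(sys.stdout)


def installed() -> bool:
    """Readiness check: True only if sys.stdout is currently wrapped."""
    return isinstance(sys.stdout, RedactingStream)
```

Under these assumptions, the preamble only has to evaluate installed() and abort the session when it returns False, which keeps user code off an image where the wrapper never loaded.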
Per-turn lifecycle
One line per step, in order, for one turn that calls execute_code once:
- Dispatch: user message lands, agent + template resolved.
- Pre-flight: policy check + interpreter-ready check. Dispatcher threads sandbox_interpreter_id + sandbox_environment onto the invocation payload.
- Tool register: execute_code appears in the Strands tool surface for this turn only.
- Agent calls execute_code(code): the substrate call begins.
- Quota check: atomic WHERE count < cap increment against sandbox_tenant_daily_counters + sandbox_agent_hourly_counters. Breach ⇒ SandboxCapExceeded, no session created.
- Session start: raw boto3 bedrock-agentcore.StartCodeInterpreterSession against the tenant’s interpreter, followed by the readiness preamble (sitecustomize.installed() check) as InvokeCodeInterpreter call #1.
- User code: InvokeCodeInterpreter with name="executeCode" runs the agent’s Python. Response is an event stream of MCP tool-result envelopes (result.content[] for streaming text + result.structuredContent for {stdout, stderr, exitCode}).
- Audit row: one sandbox_invocations row written with exit_status, byte counts, executed_code_hash, session_id.
- Session stop: StopCodeInterpreterSession.
The turn continues — the agent sees the tool result and decides whether to answer, call another tool, or call execute_code again. Every subsequent execute_code call on the same turn reuses the same session but re-runs the quota check and writes a fresh audit row.
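The session-level steps for a single-call turn can be sketched roughly as follows. This is a simplified illustration, not the platform's code: run_execute_code and the write_audit_row callback are hypothetical names, the quota check is left out to keep it short, and raising on a failed readiness check is an assumption about how the abort surfaces. The client is expected to behave like boto3.client("bedrock-agentcore"); the parameter and response-field names reflect my understanding of that API and should be checked against the boto3 reference before use.

```python
import hashlib


def run_execute_code(client, interpreter_id: str, code: str, write_audit_row) -> dict:
    """Start a session, run the readiness check, run user code, audit, stop."""
    session = client.start_code_interpreter_session(
        codeInterpreterIdentifier=interpreter_id,
    )
    session_id = session["sessionId"]

    def invoke(source: str) -> dict:
        response = client.invoke_code_interpreter(
            codeInterpreterIdentifier=interpreter_id,
            sessionId=session_id,
            name="executeCode",
            arguments={"language": "python", "code": source},
        )
        structured = {}
        for event in response["stream"]:
            # Keep the last structuredContent seen in the event stream.
            structured = event.get("result", {}).get("structuredContent", structured)
        return structured

    try:
        # Call #1: readiness preamble. Abort before user code if it fails.
        ready = invoke("import sitecustomize; assert sitecustomize.installed()")
        if ready.get("exitCode", 1) != 0:
            raise RuntimeError("sandbox readiness check failed")

        # Call #2: the agent's Python.
        out = invoke(code)
        exit_code = out.get("exitCode", 1)
        result = {
            "ok": exit_code == 0,
            "stdout": out.get("stdout", ""),
            "stderr": out.get("stderr", ""),
            "exit_status": "ok" if exit_code == 0 else "error",
        }
        write_audit_row({
            "session_id": session_id,
            "exit_status": result["exit_status"],
            "stdout_bytes": len(result["stdout"].encode("utf-8")),
            "stderr_bytes": len(result["stderr"].encode("utf-8")),
            "executed_code_hash": hashlib.sha256(code.encode("utf-8")).hexdigest(),
        })
        return result
    finally:
        # Runs after the audit row is written, and also on any failure above.
        client.stop_code_interpreter_session(
            codeInterpreterIdentifier=interpreter_id,
            sessionId=session_id,
        )
```

Passing the client in as a parameter keeps the sketch testable with a stub and makes the ordering (audit row before session stop) easy to verify.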
Circuit breakers
Two caps, both per-tenant, both enforced by the sandbox-quota-check Lambda:
| Dimension | Default cap | Stored in |
|---|---|---|
| tenant_daily | 1000 calls / UTC day | sandbox_tenant_daily_counters |
| agent_hourly | 100 calls / UTC hour | sandbox_agent_hourly_counters |
Both are raised via SSM (/thinkwork/{stage}/sandbox/caps/*), not by redeploy. Setting either to 0 is a legitimate kill-switch — the tool registers but every call rejects, which is the behaviour you want if an incident demands killing sandbox traffic without redeploying.
Breaches surface to the agent as SandboxCapExceeded with resets_at; the agent is expected to acknowledge in chat rather than retry.
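The atomic WHERE count < cap increment can be illustrated with a small self-contained example. SQLite stands in for whatever store the sandbox-quota-check Lambda actually uses, and the column names (tenant_id, day, count) and the try_increment helper are assumptions for this sketch; only the table name and the check-and-increment idea come from this page.

```python
import sqlite3


def try_increment(db: sqlite3.Connection, tenant_id: str, day: str, cap: int) -> bool:
    """Atomically count one call; return False if the cap is already reached."""
    with db:  # commit on success, roll back on error
        # Make sure a counter row exists for this tenant and UTC day.
        db.execute(
            "INSERT OR IGNORE INTO sandbox_tenant_daily_counters (tenant_id, day, count) "
            "VALUES (?, ?, 0)",
            (tenant_id, day),
        )
        # The cap check and the increment happen in one statement, so two
        # concurrent callers cannot both take the last slot.
        cur = db.execute(
            "UPDATE sandbox_tenant_daily_counters SET count = count + 1 "
            "WHERE tenant_id = ? AND day = ? AND count < ?",
            (tenant_id, day, cap),
        )
        return cur.rowcount == 1


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE sandbox_tenant_daily_counters ("
    "tenant_id TEXT, day TEXT, count INTEGER, PRIMARY KEY (tenant_id, day))"
)
# With a cap of 2, the third call in the same day is rejected.
results = [try_increment(db, "tenant-a", "2025-01-01", cap=2) for _ in range(3)]
# A cap of 0 rejects every call, which is the kill-switch behaviour.
blocked = try_increment(db, "tenant-b", "2025-01-01", cap=0)
```

Because the comparison lives in the UPDATE itself, there is no read-then-write window in which a breach could slip through.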
The audit row
Every call writes exactly one sandbox_invocations row. The columns worth knowing:
- session_id: join key to /aws/bedrock-agentcore/runtimes/* log streams.
- executed_code_hash: SHA-256 of the user code. Stable across tenants, so repeat invocations of the same code correlate cleanly.
- stdout_bytes / stderr_bytes: raw pre-truncation sizes. Stdout is truncated at 256 KB in the agent-visible result; stderr at 32 KB. The full content lives in CloudWatch.
- stdout_truncated / stderr_truncated: true when the limits fired.
- exit_status: ok | error | timeout | oom | cap_exceeded | provisioning.
- failure_reason: populated when ok = false; carries the tool-level error message.
Retention defaults to 30 days with a 180-day ceiling enforced by a DB CHECK constraint. The table is append-only; there’s no update path.
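As an illustration of how the hash, size, and truncation columns relate to each other, the sketch below derives them from one invocation's raw output. The helper names are hypothetical, and treating 1 KB as 1024 bytes is an assumption; the page only states the 256 KB and 32 KB limits.

```python
import hashlib

# Limits from this page; treating 1 KB as 1024 bytes is an assumption.
STDOUT_LIMIT = 256 * 1024
STDERR_LIMIT = 32 * 1024


def truncate(data: bytes, limit: int) -> tuple[bytes, bool]:
    """Return the agent-visible bytes and whether the limit fired."""
    return data[:limit], len(data) > limit


def build_audit_fields(code: str, stdout: bytes, stderr: bytes) -> dict:
    """Compute the hash, size, and truncation columns for one invocation."""
    _, stdout_truncated = truncate(stdout, STDOUT_LIMIT)
    _, stderr_truncated = truncate(stderr, STDERR_LIMIT)
    return {
        "executed_code_hash": hashlib.sha256(code.encode("utf-8")).hexdigest(),
        "stdout_bytes": len(stdout),  # raw, pre-truncation size
        "stderr_bytes": len(stderr),
        "stdout_truncated": stdout_truncated,
        "stderr_truncated": stderr_truncated,
    }
```

Since the hash depends only on the code text, the same snippet run by two different tenants yields the same executed_code_hash, which is what makes cross-tenant correlation possible.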
Residual threats — named up front
The sandbox substrate ships with a short list of threat classes explicitly called out. They are not bugs; they are hardening tracks on the v2 roadmap. Surfacing them here up front is the model peer harnesses (Anthropic Managed Agents, LangChain Deep Agents) have converged on.
| Track | Class | v2 fix |
|---|---|---|
| T2 | Malicious pip install — runtime pip install has no allowlist; a typo-squatted or compromised package executes at import time with access to whatever data the session reads | Private PyPI mirror + install allowlist |
| T3 | PHI/PII handling — the sandbox isn’t HIPAA-certified; regulated-tenant platform default is sandbox_enabled = false | Regulated-tenant-specific environment with per-log-group encryption and shorter retention |
There’s also a stdout-bypass class — os.write(fd, ...), C-extension direct writes, multiprocessing workers in fresh processes, adversarial split-writes across the redactor’s rolling-buffer window. The CloudWatch subscription-filter backstop covers the subset whose values match known OAuth prefixes (in case an agent prints a token-shaped string it fetched from an API response); the primary stdio redactor covers everything flowing through Python’s normal print path.
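The split-write case is easiest to see with a toy redactor that treats every write independently. This sketch is deliberately naive and is not the platform's redactor: the GitHub-style token pattern and the redact_per_write helper are assumptions for illustration. It shows why a redactor needs some buffering across writes, and by extension why writes split wider than that buffer remain a residual class.

```python
import re

# Hypothetical token shape (GitHub-style prefix), used only for illustration.
TOKEN_PATTERN = re.compile(r"ghp_[A-Za-z0-9]{36}")


def redact_per_write(chunks: list[str]) -> str:
    """Redact each write independently, with no memory between writes."""
    return "".join(TOKEN_PATTERN.sub("[REDACTED]", chunk) for chunk in chunks)


token = "ghp_" + "a" * 36

# One write: the whole token is visible to the pattern and gets masked.
single = redact_per_write([f"token={token}\n"])

# Two writes: neither half matches on its own, so the token survives.
split = redact_per_write([f"token={token[:10]}", f"{token[10:]}\n"])
```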
Operators triaging a “did we leak a token?” incident should check the residual-threat list first. If the leak matches a named class, the incident is expected and will land under v2. If it matches no class, that’s a real regression in the stdio redactor and platform security should be paged — see the runbook’s When to call platform security section.
How it sits next to the rest of the stack
- Skills that need execute_code declare it via the template’s sandbox block. They don’t register the tool themselves. See Skills.
- Templates control which tenant populations get the sandbox. Flipping a template’s sandbox block on rolls out to every agent instanced from it. See Templates.
- Connectors supply OAuth tokens for other agent-facing tools (typed skills, MCP bridges, composable-skill connector scripts). They do not flow into the sandbox; execute_code stays a pure-compute primitive. See Connectors.
- Guardrails evaluate the agent’s responses, not the sandbox’s Python. If you need to gate what code the agent writes, that’s a system-prompt and template-level decision, not a sandbox feature.
- Budgets include sandbox calls — the cost-per-call is a configurable line item in the tenant’s usage accounting. See Budgets, Usage, and Audit.
See also
- Sandbox environments (operator runbook): toggling policy, triaging failure modes, what to monitor in CloudWatch, when to call platform security.
- AgentCore Code Sandbox plan: the 13-unit implementation plan with per-unit test scenarios.
- Sandbox E2E harness: live-infra test suite with the required env vars + run commands.
- sandbox-pilot reference skill: a skill pack exercising the full sandbox path end-to-end, used by operators to validate a fresh stage.