
Source Routing

A ThinkWork turn is not just “user message goes to the model.” Before the model sees anything, Memory assembles context from eligible sources, budgets them against the model’s context window, and decides what to include vs. truncate. This source routing is the Perception phase of PPAF: the moment the harness decides what the model gets to see.

Any given turn can draw from:

  • Thread history. The recent conversation in this thread — prior user messages, agent responses, tool calls, and their outputs. This is the short-term context.
  • Knowledge bases. Chunks retrieved from assigned Bedrock Knowledge Bases, based on semantic similarity to the user’s message.
  • Memory. Facts and preferences recalled from the memory adapter (Hindsight or AgentCore Memory). These are carry-forward learnings, not chat history.
  • Compiled pages. Relevant entity, topic, and decision pages produced by the Memory compile pipeline.
  • Workspace files. Files in the relevant agent, template, or tenant-default workspace.
  • Approved MCP tools. Search-safe external tools that an admin has made eligible.

Not every source feeds every turn. A focused chatbot may use only thread history and document knowledge. A long-running research agent adds memory. An agent deeply embedded in a user’s workspace may draw on every source listed above.

Context assembly happens inside the AgentCore runtime before any Bedrock call. The shape:

User message arrives in thread
Load agent + template (which sources are enabled?)
Query the enabled sources in parallel:
├─ thread.getRecentTurns(threadId, limit)
├─ knowledge.retrieve(query=message, kbIds, topK)
├─ memory.recall(query=message, ownerId, k)
└─ wiki.recall(query=message, ownerId)
Budget + merge: trim each source until the combined token
count fits the model's context window with headroom for
the system prompt, tool definitions, and response.
Assembled context → Bedrock converse call

Source queries run in parallel. Whichever is slowest sets the turn’s baseline latency — usually Bedrock KB retrieval or memory recall. Compiled page lookup is fast (a single structured query against wiki_pages + wiki_page_sections).
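The parallel fan-out described above can be sketched as follows. This is a minimal illustration, not the runtime's code: the four async functions are hypothetical stand-ins for the source queries in the pipeline, the sleeps simulate I/O latency, and the return shapes are assumptions.

```python
import asyncio

# Hypothetical stand-ins for the four source queries. The real calls,
# parameters, and return shapes are not documented here.
async def get_recent_turns(thread_id: str, limit: int) -> list[str]:
    await asyncio.sleep(0.01)
    return [f"turn-{i}" for i in range(limit)]

async def retrieve_chunks(query: str, kb_ids: list[str], top_k: int) -> list[str]:
    await asyncio.sleep(0.03)  # simulated as the slowest source
    return [f"{kb}:chunk-{i}" for kb in kb_ids for i in range(top_k)]

async def recall_memories(query: str, owner_id: str, k: int) -> list[str]:
    await asyncio.sleep(0.02)
    return [f"memory-{i}" for i in range(k)]

async def recall_pages(query: str, owner_id: str) -> list[str]:
    await asyncio.sleep(0.005)
    return ["page:example-entity"]

async def query_sources(message: str, thread_id: str, owner_id: str) -> dict[str, list[str]]:
    # gather() runs the coroutines concurrently, so wall time is roughly
    # that of the slowest query rather than the sum of all four.
    thread, chunks, memories, pages = await asyncio.gather(
        get_recent_turns(thread_id, limit=3),
        retrieve_chunks(message, kb_ids=["kb-1"], top_k=2),
        recall_memories(message, owner_id, k=2),
        recall_pages(message, owner_id),
    )
    return {"thread": thread, "knowledge": chunks, "memory": memories, "wiki": pages}

results = asyncio.run(query_sources("what changed?", "thread-1", "owner-1"))
```

Because the queries overlap in time, speeding up a fast source (such as compiled page lookup) does little for turn latency; the slowest source is the one worth tuning.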

The model’s context window is a hard limit. Every token spent on retrieved context is a token unavailable for the response. The harness budgets aggressively:

  • System prompt — template base + agent-specific prompt. Typically 500–2000 tokens.
  • Tool definitions — registered tools from skill packs, integrations, MCP. Typically 200–1500 tokens depending on agent surface.
  • Response headroom — maxTokens from the template, reserved so the model has room to generate.
  • Retrieved context — what’s left after the above. The four sources compete for this pool.
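The budget breakdown above is simple subtraction, sketched here. The class and field names are hypothetical, and the numbers (including the 200,000-token context window) are illustrative values chosen for the example, not ThinkWork defaults.

```python
from dataclasses import dataclass

@dataclass
class TurnBudget:
    """Hypothetical container for the fixed costs of a turn, in tokens."""
    context_window: int    # the model's hard limit
    system_prompt: int     # template base + agent-specific prompt
    tool_definitions: int  # registered tools
    max_tokens: int        # response headroom reserved from the template

    def retrieved_context_pool(self) -> int:
        # Whatever is left after the fixed costs is shared by the sources.
        reserved = self.system_prompt + self.tool_definitions + self.max_tokens
        return max(0, self.context_window - reserved)

budget = TurnBudget(
    context_window=200_000,  # illustrative
    system_prompt=1_500,
    tool_definitions=1_000,
    max_tokens=4_096,
)
pool = budget.retrieved_context_pool()  # 200_000 - 6_596 = 193_404
```

A larger maxTokens or a wider tool surface shrinks the pool directly, which is why trimming unused tools can matter as much as lowering retrieval counts.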

When the retrieval layer produces more content than the budget allows, it truncates by priority:

  1. Most recent thread turns are kept in full.
  2. Top-K retrieved document chunks are kept in full until the budget runs out.
  3. Recalled memories and wiki pages get summarized (first 2–3 lines of each) before being included.
  4. Older thread history is summarized or dropped.
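The priority order above can be sketched as a single pass over the sources. This is a simplified illustration under stated assumptions: `estimate_tokens` uses a rough four-characters-per-token heuristic, "summarized" is approximated by keeping the first three lines, and older history is dropped rather than summarized. None of these names come from the harness.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic (about 4 characters per token); an assumption,
    # not the harness's actual estimator.
    return max(1, len(text) // 4)

def first_lines(text: str, n: int = 3) -> str:
    return "\n".join(text.splitlines()[:n])

def fit_to_budget(recent_turns, chunks, recalled, older_turns, budget):
    """Return the items kept, in priority order, within `budget` tokens."""
    kept = []
    remaining = budget

    # 1. Most recent thread turns are kept in full, even if that alone
    #    exhausts the budget (remaining may go negative).
    for turn in recent_turns:
        kept.append(turn)
        remaining -= estimate_tokens(turn)

    # 2. Document chunks, in rank order, kept in full until the budget runs out.
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if cost > remaining:
            break
        kept.append(chunk)
        remaining -= cost

    # 3. Recalled memories and wiki pages are shortened before inclusion.
    for item in recalled:
        summary = first_lines(item)
        cost = estimate_tokens(summary)
        if cost <= remaining:
            kept.append(summary)
            remaining -= cost

    # 4. Older thread history is included only if room remains.
    for turn in older_turns:
        cost = estimate_tokens(turn)
        if cost <= remaining:
            kept.append(turn)
            remaining -= cost

    return kept

kept = fit_to_budget(
    recent_turns=["a" * 40],             # 10 tokens
    chunks=["b" * 80, "c" * 80],         # 20 tokens each
    recalled=["l1\nl2\nl3\nl4\nl5"],     # shortened to three lines
    older_turns=["d" * 400],             # 100 tokens
    budget=40,
)
```

In this example the second chunk and the older turn are left out: the first chunk leaves 10 tokens, which is enough for the shortened memory but not for anything larger.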

This priority is tunable through the runtime configuration that feeds the current turn. In Space-aware operation, tune retrieval for the Space that needs it: a Space that should lean heavily on retrieved documents uses a higher KB topK, while a Space that should lean on memory uses a higher memory recall count, where that setting is exposed.

It’s tempting to collapse everything into one big “just feed the model context” blob. Three reasons not to:

  • Separate cost accounting. Tokens from retrieved docs vs. recalled memories vs. thread history show up separately in audit records. An operator debugging “why did this turn cost $2” can see whether the document KB was bloated or whether the thread history was too long.
  • Separate failure modes. If the KB sync is broken today, document retrieval fails silently — and the rest of the turn still works. A single blob would fail the whole turn.
  • Separate tuning knobs. You change the topK for KB retrieval without touching memory recall. Without the separation, tuning is all-or-nothing.
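The first two points above, per-source accounting and isolated failures, can be illustrated together. In this sketch a simulated KB outage yields an empty result for that source while the other sources still contribute, and token counts are tracked per source. The function names, the error, and the token heuristic are all assumptions made for illustration.

```python
import asyncio

async def thread_history() -> list[str]:
    return ["user: hi", "agent: hello"]

async def kb_retrieve() -> list[str]:
    raise ConnectionError("KB sync is broken")  # simulated outage

async def memory_recall() -> list[str]:
    return ["prefers concise answers"]

def estimate_tokens(text: str) -> int:
    # Rough 4-characters-per-token heuristic; an assumption.
    return max(1, len(text) // 4)

async def gather_sources() -> tuple[dict[str, list[str]], dict[str, int]]:
    names = ["thread", "knowledge", "memory"]
    # return_exceptions=True hands back each failure as a value instead of
    # raising, so one broken source does not abort the others.
    outcomes = await asyncio.gather(
        thread_history(), kb_retrieve(), memory_recall(),
        return_exceptions=True,
    )
    context: dict[str, list[str]] = {}
    tokens_by_source: dict[str, int] = {}
    for name, outcome in zip(names, outcomes):
        items = [] if isinstance(outcome, Exception) else outcome
        context[name] = items
        tokens_by_source[name] = sum(estimate_tokens(i) for i in items)
    return context, tokens_by_source

context, tokens_by_source = asyncio.run(gather_sources())
```

Keeping a separate token total per source is what lets an operator see at a glance that, in this run, the knowledge source contributed nothing.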

A distinction that bites people:

  • Short-term context = thread history. It’s the canonical record of what happened in this conversation. The model reads it verbatim (within budget).
  • Long-term memory = facts recalled from prior work. The model reads these as selective summaries, not verbatim chat logs.

Long-term memory should never masquerade as the thread record. If an operator asks “what did the agent say in thread X,” the answer is in the thread’s turn rows, not in the memory adapter. Memory is derived context; the thread is canonical record.

  • No mid-turn changes to what was assembled. The admin thread detail shows the final context shape for a turn (which KB returned which chunks, which memories were recalled), but the assembly cannot be modified mid-turn. Tune at the template level and re-run.
  • Budgeting is heuristic. The harness approximates token counts; the model’s tokenizer is authoritative. A turn that appears to fit the budget can still truncate if the estimate is off by a few percent. Keep headroom.
  • Compiled page recall is optional per-invocation. If the compile job is behind, compiled page lookup returns older content. Operators can force a compile if this matters.
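The "keep headroom" advice for heuristic budgeting can be expressed as a check that inflates the estimate before comparing it to the pool. The function name and the 5% default margin are assumptions for illustration; pick a margin that matches how far your estimates drift from the model's tokenizer.

```python
def fits_with_headroom(estimated_tokens: int, pool: int, margin: float = 0.05) -> bool:
    """True if the estimate still fits after inflating it by `margin`."""
    return estimated_tokens * (1 + margin) <= pool

# An estimate of 9,800 tokens looks like it fits a 10,000-token pool,
# but a 5% undercount would push it over.
tight = fits_with_headroom(9_800, 10_000)
safe = fits_with_headroom(9_000, 10_000)
```

Here `tight` is False and `safe` is True: the first estimate fits only if the heuristic is nearly exact, which is the case the limit above warns about.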

Context assembly happens inside the AgentCore runtime (packages/agentcore-strands/agent-container/) before any Bedrock call. The four sources are queried in parallel, token-counted, and merged into the final input. Budget knobs (retrieval top_k, memory recall count, max thread-history window) live on the agent template and can be tuned without a redeploy.

Each source emits OpenTelemetry spans recording latency, result count, and token contribution — surfaced in the turn trace view. See Admin: Threads for the turn-trace operator view.