Agentic Architecture & Orchestration
27% of the exam. The heaviest domain, focusing on how Claude-powered agents are designed to operate autonomously: the core agentic loop, multi-agent coordination patterns, context passing, workflow enforcement, and session management.
The Agentic Loop
An agentic loop is the mechanism that lets Claude autonomously execute multi-step tasks by iteratively calling tools and observing results. The loop continues until Claude decides it has finished.
Two `stop_reason` values drive the loop:
- stop_reason === "tool_use": execute requested tools, append results to history → go to step 1
- stop_reason === "end_turn": present final response to user → loop ends

```python
# Minimal agentic loop (Python-style pseudocode)
def run_agentic_loop(client, model, tools, user_input):
    messages = [{"role": "user", "content": user_input}]
    while True:
        response = client.messages.create(model=model, tools=tools, messages=messages)
        if response.stop_reason == "end_turn":
            # Done — return the assistant's text
            return response.content[0].text
        if response.stop_reason == "tool_use":
            # Execute tools and append results
            messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "user", "content": tool_results})
```
Common anti-patterns:
- Parsing text for completion signals — checking whether the assistant said "Done!" or "Task complete" is unreliable; Claude may say these mid-task.
- Fixed iteration caps as the primary stopping mechanism — a hard limit of "max 10 iterations" is a safety guard, not the primary termination signal; stop_reason is.
- Checking assistant text content to decide whether to continue — always use stop_reason.
The loop continues when stop_reason === "tool_use" and terminates when stop_reason === "end_turn". These are the only two stop reasons you need to handle in basic agentic loops.
Hub-and-Spoke Multi-Agent Architecture
In a multi-agent system, a coordinator agent manages all communication between specialist subagents. All inter-agent communication routes through the coordinator — subagents never communicate directly with each other.
Subagents do NOT automatically inherit the coordinator's conversation history or memory. Every piece of context the subagent needs must be explicitly included in the Task prompt. This is one of the most tested concepts.
Coordinator Responsibilities
- Task decomposition — breaking the goal into subtasks based on query complexity
- Delegation — deciding which subagent handles which subtask
- Result aggregation — combining subagent outputs into a coherent final result
- Error handling & recovery — deciding what to do when subagents fail
If the coordinator decomposes a broad topic too narrowly, the entire system produces incomplete coverage — even if every subagent executes perfectly. Example: decomposing "AI in creative industries" into only visual arts subtasks omits music, film, writing. The subagents can't cover what they were never assigned.
Iterative Refinement Loop
After synthesis, the coordinator can evaluate output quality, identify gaps, re-delegate targeted queries to search/analysis agents, and re-invoke synthesis. This iterative refinement loop is the production-grade pattern for research systems.
Spawning Subagents with the Task Tool
The Task tool is the mechanism for spawning subagents in the Claude Agent SDK. The coordinator's allowedTools list must include "Task" for it to spawn subagents — without this, subagent invocation fails.
Parallel vs Sequential Subagent Execution
✓ Parallel (Correct)
Emit multiple Task tool calls in a single coordinator response. The SDK executes them simultaneously.
✗ Sequential (Slower)
Emit Task tool calls in separate turns. Each subagent waits for the previous to finish before starting.
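On the client side, the concurrency can be sketched with asyncio; here `run_subagent` is a hypothetical stand-in for an actual Task-tool invocation, not an SDK function:

```python
import asyncio

async def run_subagent(name: str, prompt: str) -> str:
    # Placeholder for a real Task-tool invocation; simulates work.
    await asyncio.sleep(0.05)
    return f"{name}: completed"

async def execute_task_calls(task_calls: list[dict]) -> list[str]:
    # All Task calls emitted in a single coordinator response run concurrently.
    return await asyncio.gather(
        *(run_subagent(c["name"], c["prompt"]) for c in task_calls)
    )

results = asyncio.run(execute_task_calls([
    {"name": "search-agent", "prompt": "Find sources on AI in music"},
    {"name": "doc-agent", "prompt": "Analyze the uploaded whitepaper"},
]))
```

Sequential execution would instead await each subagent before issuing the next call, so total latency is the sum rather than the maximum of the subagent durations.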
Passing Context Explicitly
When passing context between agents, use structured data formats that separate content from metadata. Don't pass raw conversation history — extract and pass only what the subagent needs, with source attribution preserved.
```python
import json

# Example: coordinator passing structured context to a synthesis agent.
# web_results items include: claim, source_url, date, excerpt
# doc_results items include: claim, document_name, page, excerpt
synthesis_prompt = f"""
You are a synthesis agent. Combine the following research findings into a report.

## Web Search Results
{json.dumps(web_results, indent=2)}

## Document Analysis Results
{json.dumps(doc_results, indent=2)}

Preserve all source attributions in your output.
"""
task_result = await task_tool.invoke(prompt=synthesis_prompt)  # inside an async coordinator
```
fork_session creates independent branches from a shared analysis baseline — used to explore divergent approaches (e.g., comparing two refactoring strategies). This is different from spawning parallel subagents for coordination. Changes in one fork do not affect the other.
Programmatic Enforcement vs Prompt-Based Guidance
This is a key exam distinction: when you need guaranteed compliance (deterministic), you need programmatic enforcement. Prompt instructions have a non-zero failure rate.
| Approach | Reliability | Use When |
|---|---|---|
| Programmatic prerequisite (blocks tool call if precondition not met) | Deterministic — 100% | Financial operations, identity verification, critical business rules |
| Prompt instruction ("always call X before Y") | Probabilistic — ~98% or less | Soft guidelines where occasional deviation is acceptable |
| Few-shot examples showing the correct order | Probabilistic — improved but not guaranteed | Training patterns, improving consistency, not enforcing compliance |
Whenever a question mentions "deterministic compliance," "financial operations," "identity verification before X," or "guaranteed ordering" — the answer is programmatic enforcement, not prompt-based.
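A minimal sketch of such a gate (the tool name, session-state key, and `execute_tool` registry are illustrative). Because the precondition is checked in code, the model cannot talk its way past it:

```python
def execute_tool(name, tool_input):
    # Stand-in tool registry for the sketch.
    tools = {"issue_refund": lambda i: f"refunded ${i['amount']}"}
    return tools[name](tool_input)

class PrerequisiteError(Exception):
    pass

def guarded_execute(name, tool_input, session_state):
    # Deterministic gate: blocks the call itself, unlike a prompt
    # instruction the model might occasionally ignore.
    if name == "issue_refund" and not session_state.get("identity_verified"):
        raise PrerequisiteError("issue_refund blocked: identity not verified")
    return execute_tool(name, tool_input)
```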
Structured Handoff Protocols
When escalating to a human agent mid-process, include a structured handoff summary containing: customer ID, root cause analysis, what was attempted, refund amount (if applicable), recommended next action. Human agents receiving escalations lack access to the full conversation transcript.
Hook Patterns
PostToolUse Hook
Intercepts tool results AFTER execution, BEFORE the model processes them.
Use for: normalizing heterogeneous data formats (Unix timestamps → ISO 8601, numeric status codes → labels, inconsistent field names).
Tool Call Interception Hook
Intercepts outgoing tool calls BEFORE they execute.
Use for: enforcing compliance rules (blocking refunds above $500 threshold, redirecting to human escalation workflow).
PostToolUse = result normalization (model receives clean data). Tool call interception = blocking/redirecting outgoing calls before execution. If a question asks about "blocking a refund exceeding $500," that's interception (pre), NOT PostToolUse.
Hooks provide deterministic guarantees. Prompt instructions provide probabilistic compliance. For business rules that require 100% enforcement, always use hooks.
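A tool call interception hook might look like this sketch; the hook signature and return convention are illustrative, not the SDK's actual API:

```python
def intercept_tool_call(name: str, tool_input: dict):
    """Runs BEFORE the tool executes. Returns (allowed, replacement_result):
    if allowed is False, the call is blocked and replacement_result is
    returned to the model instead."""
    if name == "issue_refund" and tool_input.get("amount", 0) > 500:
        return False, {
            "isError": True,
            "errorCategory": "business",
            "isRetryable": False,
            "message": "Refund exceeds the $500 limit; routed to human escalation.",
        }
    return True, None
```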
Prompt Chaining vs Dynamic Decomposition
| Pattern | Structure | Best For |
|---|---|---|
| Prompt Chaining | Fixed sequential pipeline — known steps, predictable structure | Multi-aspect code reviews, structured analysis with predictable sub-tasks |
| Dynamic Adaptive Decomposition | Subtasks generated based on intermediate findings | Open-ended investigation, "add tests to legacy codebase" where scope evolves |
Multi-Pass Code Review Pattern
For large code reviews (14+ files): split into per-file local passes (catch local bugs, style issues) then a separate cross-file integration pass (catch data flow issues, contradictions between files). This prevents attention dilution.
Prompt chaining is poor for tasks where the right next step depends on what was discovered in the current step. Example: "investigate a production bug" — you don't know whether to check logs, code, or database until each step reveals new information. Use dynamic decomposition here.
Session Management
--resume <session-name>
Continues a specific prior conversation with all its history intact. Use when the prior context is mostly still valid and you want to pick up where you left off.
Important: If code has changed since the session, tell the resumed session about the specific files that changed so it re-analyzes them rather than relying on stale observations.
fork_session
Creates independent branches from a shared analysis baseline. Use to explore divergent approaches without contaminating each other (e.g., "compare approach A vs approach B from the same starting analysis").
When to Start Fresh vs Resume
| Scenario | Strategy |
|---|---|
| Prior context still valid, continuing investigation | Use --resume |
| Prior tool results are stale (code was modified) | Start fresh, inject structured summary of still-valid findings |
| Need to explore two different approaches from same baseline | Use fork_session |
| Prior session crashed mid-execution | Start fresh, load coordinator manifest of completed agent states |
For multi-agent systems, each agent exports its state to a known location (a manifest). On restart, the coordinator loads the manifest and injects each agent's last known state into their initial prompt, skipping already-completed work.
Tool Design & MCP Integration
18% of the exam.
Tool Descriptions Are Everything
Tool descriptions are the primary mechanism LLMs use for tool selection. This is the most important principle of tool design. Minimal descriptions lead to unreliable selection, especially when similar tools exist.
What a Good Tool Description Includes
- The tool's purpose and what it returns
- Expected input formats and examples
- Edge cases and boundary conditions it handles
- When to use this tool vs a similar alternative (explicit differentiation)
- When not to use it
Two tools with near-identical descriptions ("Retrieves customer information" / "Retrieves order details") will cause the model to pick randomly. This is the root cause of most tool selection bugs — fix descriptions before adding routing layers.
Splitting vs Consolidating Tools
A generic analyze_document tool should be split into purpose-specific tools with clear input/output contracts when the use cases are distinct: extract_data_points, summarize_content, verify_claim_against_source. Each has unambiguous selection criteria.
System prompt wording can create unintended tool associations. If your system prompt says "always check the customer record first" and you have a tool called check_customer, the model may call that tool for queries where a different tool is more appropriate. Review system prompts for keyword-sensitive instructions that might override well-written tool descriptions.
Structured Error Responses
Generic errors prevent intelligent recovery. The agent needs to know what kind of failure occurred to decide whether to retry, escalate, continue with partial results, or explain the issue to the user.
| Error Category | Example | Agent Response |
|---|---|---|
| Transient | Timeout, service temporarily unavailable | Retry with backoff (isRetryable: true) |
| Validation | Invalid input format, missing required field | Fix input and retry |
| Business/Policy | Refund exceeds policy limit, unauthorized action | Explain to user, don't retry (isRetryable: false) |
| Permission | Insufficient access, authentication failure | Escalate or request credentials |
```
# Good structured error response
{
    "isError": true,
    "errorCategory": "business",
    "isRetryable": false,
    "message": "Refund amount $650 exceeds the $500 policy limit for automated refunds.",
    "customerMessage": "This refund requires supervisor approval due to the amount."
}

# Bad generic error — agent can't determine recovery strategy
{
    "error": "Operation failed"
}
```
A search returning {"results": []} is ambiguous — it could be a successful query with no matches (valid) or a failed query being silently suppressed (access failure). These require completely different agent responses. Always signal this distinction explicitly: use "queryExecuted": true, "matchCount": 0 for valid empty results, and isError: true with a category for failures.
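A sketch of a tool wrapper that keeps the two cases distinguishable (the `backend` callable is a stand-in for the real data source):

```python
def search(query, backend):
    try:
        rows = backend(query)
    except TimeoutError as exc:
        # Access failure: signal explicitly instead of returning [].
        return {
            "isError": True,
            "errorCategory": "transient",
            "isRetryable": True,
            "message": str(exc),
        }
    # Valid empty result: the query ran; zero matches is a real answer.
    return {"queryExecuted": True, "matchCount": len(rows), "results": rows}
```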
Error Propagation Strategy
- Subagents handle transient failures locally with retry logic
- Propagate to coordinator only what cannot be resolved locally
- When propagating, include: failure type, what was attempted, partial results, alternative approaches tried
Scoped Tool Access
Giving an agent access to too many tools (e.g., 18 tools instead of 4–5) degrades tool selection reliability. Each additional tool increases decision complexity. Agents with tools outside their specialization tend to misuse them.
Each agent should receive only the tools needed for its role. A synthesis agent shouldn't have web search tools — that's the search agent's job. A scoped verify_fact tool for the synthesis agent handles the 85% common case, while complex verifications route through the coordinator to the search agent.
tool_choice Options
| Value | Behavior | Use Case |
|---|---|---|
| `"auto"` | Model decides whether to call a tool or return text | General conversational agents |
| `"any"` | Model MUST call a tool (can choose which) | Guarantee structured output when tool list may vary |
| `{"type":"tool","name":"X"}` | Model MUST call the specific named tool | Force a specific extraction step first (e.g., extract_metadata before enrichment) |
Setting tool_choice: "any" with an empty tools array will cause an API error. The model must have at least one tool available to call. Also: "any" guarantees A tool call, not a SPECIFIC tool — for specific tools use forced selection.
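The three forms side by side, as an illustrative request dict (the model id is a placeholder and the tool definition is minimal):

```python
tools = [{
    "name": "extract_metadata",
    "description": "Extract title, author, and date from a document.",
    "input_schema": {
        "type": "object",
        "required": ["document"],
        "properties": {"document": {"type": "string"}},
    },
}]

# The three tool_choice forms from the table above:
auto_choice = {"type": "auto"}                                 # model decides
any_choice = {"type": "any"}                                   # must call SOME tool
forced_choice = {"type": "tool", "name": "extract_metadata"}   # must call this one

request = {
    "model": "claude-sonnet-4-5",   # placeholder model id
    "max_tokens": 1024,
    "tools": tools,                 # must be non-empty for "any" or forced
    "tool_choice": forced_choice,
}
```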
MCP Server Scoping
.mcp.json (Project-Scoped)
Stored in the project root. Version-controlled and shared with all team members who clone the repository. For shared team tooling (e.g., Jira integration, shared database).
~/.claude.json (User-Scoped)
Stored in the user's home directory. Personal and NOT shared via version control. For personal/experimental MCP servers or servers with personal credentials.
Environment Variable Expansion
Use ${GITHUB_TOKEN} syntax in .mcp.json for credential management. This keeps secrets out of version control while allowing the configuration to be shared.
```
// .mcp.json — project-scoped, version controlled
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@anthropic/mcp-server-github"],
      "env": {
        "GITHUB_TOKEN": "${GITHUB_TOKEN}" // Expanded from env at runtime
      }
    }
  }
}
```
MCP Resources vs MCP Tools
MCP Resources
Expose content catalogs — issue summaries, documentation hierarchies, database schemas. Give agents visibility into available data without requiring exploratory tool calls. Like a manifest or catalog.
MCP Tools
Execute actions — search, query, write, update. Tools perform operations and return results. They're invoked when the agent needs to do something, not just discover what's available.
Built-in Tool Selection
| Tool | Purpose | When to Use |
|---|---|---|
| Grep | Search content inside files | Finding function names, error messages, import statements across codebase |
| Glob | Find files by path patterns | Finding all test files (`**/*.test.tsx`), finding files by extension |
| Read | Read full file contents | Understanding a specific file's full implementation |
| Write | Write/overwrite full file | Creating new files; fallback when Edit fails |
| Edit | Targeted modification by unique text matching | Making specific changes to known sections of existing files |
| Bash | Execute shell commands | Running tests, builds, or operations that need a shell |
When Edit fails because the anchor text appears multiple times in the file (non-unique match), the fallback is: Read the full file, make the change in memory, then Write the entire file. Do not use Bash/sed as a workaround.
Don't read all files upfront. Start with Grep to find entry points, then use Read to follow imports and trace flows. This builds understanding incrementally without exhausting context.
Claude Code Configuration & Workflows
20% of the exam.
CLAUDE.md Hierarchy
| Level | Location | Scope | Shared via VCS? |
|---|---|---|---|
| User | `~/.claude/CLAUDE.md` | Applies to all sessions for that user only | No — personal |
| Project | `.claude/CLAUDE.md` or root `CLAUDE.md` | Applies to all sessions in the project for all team members | Yes — committed |
| Directory | Subdirectory `CLAUDE.md` files | Applies when working in that specific directory | Yes — committed |
A new team member not receiving project standards is typically because those standards were put in ~/.claude/CLAUDE.md (user-level) instead of the project-level file. User-level settings are NOT shared via version control.
@import for Modular Organization
Use @import syntax to reference external files and keep CLAUDE.md modular. Each package's CLAUDE.md can selectively import only the standards files relevant to it.
.claude/rules/ as Alternative
For large CLAUDE.md files (400+ lines), split into focused topic-specific files in .claude/rules/: testing.md, api-conventions.md, deployment.md. Each file can be path-scoped via YAML frontmatter (see Task 3.3).
Slash Commands: Project vs User Scope
.claude/commands/
Project-scoped. Committed to version control. Available to every developer who clones the repo. Use for team-wide workflows like /review, /test.
~/.claude/commands/
User-scoped. Personal. NOT shared via version control. Use for personal productivity shortcuts.
Skills in .claude/skills/
Skills are more powerful than slash commands — they support SKILL.md frontmatter configuration:
```
---
description: Analyzes codebase architecture and produces a dependency map
context: fork                      # Run in isolated sub-agent context
allowed-tools: Read, Grep, Glob    # Only these tools permitted
argument-hint: "path to analyze"   # Prompt user if invoked without args
---
# Your skill prompt/instructions here...
```
Frontmatter Options
| Option | Effect |
|---|---|
| `context: fork` | Runs skill in isolated sub-agent context. Verbose output doesn't pollute main conversation. Summary/result returned to main session. |
| `allowed-tools: [list]` | Restricts which tools the skill can use. Prevents destructive operations (e.g., only allow Read, not Write or Bash). |
| `argument-hint` | Prompts the developer for required parameters when the skill is invoked without arguments. |
CLAUDE.md = always-loaded universal standards. Skills = on-demand invocation for task-specific workflows. Use CLAUDE.md for things that should always be in context; use skills for things invoked explicitly when needed.
To create a personal variant of a team skill without affecting teammates: create it in ~/.claude/skills/ with a different name. The original team skill in .claude/skills/ is unaffected.
Path-Scoped Rules
Files in .claude/rules/ with YAML frontmatter paths fields load their rules only when editing files that match the glob pattern. This reduces irrelevant context and token usage.
```
---
paths: ["**/*.test.tsx", "**/*.spec.ts"]
---
# Testing Conventions
- Use React Testing Library, not Enzyme
- Prefer userEvent over fireEvent for interactions
- Each test file should have a describe block matching the component name
```
Directory CLAUDE.md is directory-bound — it only applies within that specific directory. Path-specific rules with glob patterns can apply conventions to files spread across the entire codebase (e.g., all test files, regardless of where they live). For cross-cutting concerns, always prefer path-specific rules.
Example pattern: test files are spread throughout the codebase next to the components they test. A rule with paths: ["**/*.test.tsx"] catches all of them, while a directory CLAUDE.md in /tests/ would miss the ones in /src/components/.
Plan Mode vs Direct Execution
| Use Plan Mode When... | Use Direct Execution When... |
|---|---|
| Task involves large-scale changes (45+ files) | Simple, well-scoped change (single file bug fix) |
| Multiple valid architectural approaches exist | Clear stack trace pointing to exact problem |
| Architectural decisions need to be made | Adding a single validation check to one function |
| Multi-file modifications with dependencies | Adding a date validation conditional |
| Library migrations affecting many files | Renaming a variable in a single file |
A powerful pattern: use plan mode for investigation and design, then switch to direct execution for implementation. Example: plan mode to design the library migration approach, direct execution to implement the planned changes.
The Explore Subagent
Use the Explore subagent for verbose discovery phases (reading many files, mapping dependencies) to prevent context window exhaustion during multi-phase tasks. The main agent coordinates at a high level while Explore handles the detail work.
Iterative Refinement Techniques
Concrete I/O Examples Over Prose
When prose descriptions produce inconsistent results, provide 2–3 concrete input/output examples. Models interpret examples more reliably than abstract descriptions. "Transform X to Y" is vague; showing an actual X with its corresponding Y is unambiguous.
Test-Driven Iteration
Write test suites first covering expected behavior, edge cases, and performance requirements. Share test failures with Claude to guide progressive improvement. Each failing test communicates precisely what's wrong.
The Interview Pattern
Have Claude ask questions before implementing in unfamiliar domains. This surfaces considerations you may not have anticipated (e.g., cache invalidation strategies, failure modes, edge cases) before they become costly rework.
Interacting vs Independent Issues
Interacting Issues → Batch
When multiple problems interact with each other (fixing A affects B), send all of them in a single detailed message so Claude can address them holistically.
Independent Issues → Sequential
When problems are independent, fix them one at a time. Each fix is cleaner and easier to review.
CI/CD Integration
Non-Interactive Mode: The -p Flag
The -p (or --print) flag runs Claude Code in non-interactive mode: processes the prompt, outputs to stdout, exits without waiting for user input. Required for all CI/CD pipelines.
```
# CI pipeline example
claude -p "Analyze this pull request for security issues" \
  --output-format json \
  --json-schema review-schema.json
```
CLAUDE_HEADLESS=true and --batch are NOT real Claude Code flags. The correct flag is -p / --print. This is a common exam distractor.
Structured Output in CI
Use --output-format json with --json-schema to produce machine-parseable structured findings that can be posted as inline PR comments automatically.
Session Context Isolation for Review
The same Claude session that generated code is less effective at reviewing that code — it retains its reasoning context and is less likely to question its own decisions. Use an independent review instance.
CLAUDE.md in CI Context
Claude Code in CI reads the project's CLAUDE.md automatically, giving it project context (testing standards, fixture conventions, review criteria) without needing to pass it explicitly in every prompt.
Prompt Engineering & Structured Output
20% of the exam.
Explicit Criteria Over Vague Instructions
✗ Vague (Doesn't Help)
- "Be conservative"
- "Only report high-confidence findings"
- "Focus on important issues"
- "Use your best judgment"
✓ Explicit (Works)
- "Flag ONLY when claimed behavior contradicts actual code behavior"
- "Skip style preferences and minor formatting. Report: bugs, security, incorrect docs"
- "Flag severity 'critical' when code would cause data loss or security breach"
High false positive rates in one category undermine developer trust across all categories — developers start dismissing real findings because they've been burned by noise. Temporarily disabling high-false-positive categories while improving prompts for them is better than letting them contaminate the signal.
Few-Shot Prompting
When detailed instructions alone produce inconsistent results, few-shot examples are the most effective intervention. They work by demonstrating the desired pattern, including reasoning for why a choice was made.
What Makes Good Few-Shot Examples
- Target ambiguous cases — don't waste examples on obvious cases; show how to handle the tricky ones
- Include reasoning — show WHY tool A was chosen over tool B, not just that it was
- Show the full output format — demonstrate location, severity, message, suggested fix
- Show both positive and negative cases — what to flag AND what to skip (to reduce false positives)
- 2–4 targeted examples is typically sufficient; more adds token overhead without proportional benefit
Few-shot examples enable the model to generalize the underlying reasoning to novel cases — it's not just memorizing which inputs map to which outputs. A model that sees 3 examples of ambiguous tool selection can correctly handle a 4th novel case it has never seen.
For document extraction where source documents vary widely in structure (inline citations vs bibliographies, narrative descriptions vs structured tables), few-shot examples demonstrating correct extraction from varied formats significantly reduce null/empty extraction errors. The model learns to handle structural variety, not just a single document format.
tool_use for Guaranteed Structured Output
Using tool_use with a JSON schema is the most reliable approach for guaranteed schema-compliant output. It eliminates JSON syntax errors.
tool_use eliminates syntax errors (malformed JSON, missing braces). It does NOT eliminate semantic errors: values in wrong fields, totals that don't sum correctly, wrong data types within a valid schema. Semantic validation requires additional logic (e.g., Pydantic validators, cross-field checks).
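A sketch of post-hoc semantic checks over a schema-valid extraction (field names follow the invoice schema pattern used in this section; the 0.01 tolerance is illustrative):

```python
def validate_invoice_semantics(invoice: dict) -> list[str]:
    """Cross-field checks that JSON schema validation cannot express."""
    errors = []
    # Does the stated total actually match the sum of line items?
    line_sum = sum(item["amount"] for item in invoice.get("line_items", []))
    stated = invoice.get("stated_total")
    if stated is not None and abs(line_sum - stated) > 0.01:
        errors.append(f"line items sum to {line_sum} but stated_total is {stated}")
    # Escape-hatch enum values need their detail field filled in.
    if invoice.get("currency") == "other" and not invoice.get("currency_detail"):
        errors.append("currency is 'other' but currency_detail is missing")
    return errors
```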
Schema Design Principles
- Nullable fields for information that may not exist in source documents — prevents model from fabricating values to satisfy required constraints
- "other" + detail string patterns for extensible enum categorization — avoids forced misclassification of novel values
- "unclear" enum value for ambiguous cases — better than hallucinating a confident wrong answer
- Required vs optional — only mark fields required when the information will always be present
```
{
  "name": "extract_invoice",
  "input_schema": {
    "type": "object",
    "required": ["invoice_number", "vendor"],
    "properties": {
      "invoice_number": {"type": "string"},
      "vendor": {"type": "string"},
      "total_amount": {"type": ["number", "null"]},      // nullable
      "currency": {
        "type": "string",
        "enum": ["USD", "EUR", "AUD", "other"]           // escape hatch
      },
      "currency_detail": {"type": ["string", "null"]}    // detail for "other"
    }
  }
}
```
Validation-Retry Loops
When a validation error occurs, append the specific error to the prompt and retry. The model can self-correct when it has the error information. But retry has limits.
Retry WILL Help
- Format mismatches (date in wrong format)
- Values placed in wrong fields
- Structural output errors
- Wrong units used
Retry WON'T Help
- Required information simply absent from the source document
- Source document is in an external file not provided to the model
- Truly ambiguous information with no further clues
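The loop itself is simple. A sketch, with `call_model` and `validate` as caller-supplied stand-ins for the API call and the validation logic:

```python
def extract_with_retry(call_model, validate, prompt, max_retries=3):
    """Append the specific validation error and retry; the model can
    self-correct only when it can see what was wrong."""
    errors = []
    for _ in range(max_retries):
        output = call_model(prompt)
        errors = validate(output)
        if not errors:
            return output
        prompt += f"\n\nYour previous output failed validation: {errors}. Fix and retry."
    raise ValueError(f"still failing after {max_retries} attempts: {errors}")
```

Cap retries: if the information is simply absent from the source, every retry burns tokens without converging.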
Feedback Loop Design
Add a detected_pattern field to structured findings. When developers dismiss findings, log which patterns triggered dismissals. This enables systematic analysis of false positive patterns and targeted prompt improvement.
For financial documents, extract both calculated_total (sum of line items) and stated_total (what the document claims). Flag discrepancies with a conflict_detected boolean. This catches semantic errors that schema validation alone cannot.
Message Batches API
| Property | Value |
|---|---|
| Cost savings | 50% vs synchronous API |
| Processing window | Up to 24 hours (no guaranteed SLA) |
| Multi-turn tool calling | Not supported within a single batch request |
| Request correlation | custom_id field maps responses to requests |
Good for batch: latency-tolerant, non-blocking workloads — overnight reports, weekly audits, nightly test generation, async enrichment pipelines.
Bad for batch: any workflow where something or someone is waiting — pre-merge checks, real-time chatbots, fraud detection.
If you have a 30-hour SLA and the batch API takes up to 24 hours: submit batches at least every 6 hours (30 - 24 = 6-hour submission window). The 24-hour processing time is a maximum, not a guarantee.
Handling Batch Failures
Use custom_id fields to identify which specific requests failed. Resubmit only failed documents with appropriate modifications (e.g., chunking documents that exceeded context limits). Don't resubmit the entire batch.
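A sketch of the triage step; the batch result shape here is a simplification of the real API response, not its exact schema:

```python
def collect_retry_batch(batch_results: list[dict], documents: dict) -> list[dict]:
    """Resubmit only the failed requests, keyed by custom_id."""
    retry = []
    for result in batch_results:
        if result["result"]["type"] != "succeeded":
            doc = documents[result["custom_id"]]
            # (here you might also chunk oversized documents before resubmitting)
            retry.append({"custom_id": result["custom_id"], "document": doc})
    return retry
```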
Self-Review Limitations
A model that generated code retains its reasoning context from generation. This makes it less likely to question its own decisions during review — it already "knows" why it wrote things the way it did.
Even with extended thinking enabled, the model operates with the same generation reasoning context. The bias is about context retention, not thinking depth. An independent instance with NO prior reasoning context is more effective.
Multi-Pass Review Architecture
Attention dilution is not a context window problem — it's a quality problem. A model analyzing 14 files in one pass gives superficial attention to middle files regardless of how large the context window is. Per-file passes guarantee consistent depth.
Context Management & Reliability
15% of the exam.
Context Window Challenges
The "Lost in the Middle" Effect
Models reliably process information at the beginning and end of long inputs but may miss findings positioned in the middle. This is a fundamental attention characteristic, not a bug.
Place key findings summaries at the beginning of aggregated inputs. Use explicit section headers for middle content. Don't bury critical information in the middle of long tool result dumps.
Progressive Summarization Risks
Summarization loses precision. These specific data types get condensed to vague references:
- Numerical values ("$127.43 refund" → "a refund issue")
- Dates and timestamps
- Order/case/reference numbers
- Customer-stated specific expectations
Solution: Extract transactional facts into a persistent "case facts" block included in every prompt, outside the summarized history.
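A sketch of prompt assembly with the case-facts block kept verbatim outside the summarized history (section labels are illustrative):

```python
def build_prompt(case_facts: dict, summarized_history: str, user_msg: str) -> str:
    # Transactional facts live outside the summarized history, so repeated
    # summarization can never blur "$127.43" into "a refund issue".
    facts = "\n".join(f"- {k}: {v}" for k, v in case_facts.items())
    return (
        f"## Case Facts (authoritative, do not paraphrase)\n{facts}\n\n"
        f"## Conversation Summary\n{summarized_history}\n\n"
        f"## Current Message\n{user_msg}"
    )
```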
Verbose Tool Output Trimming
Tool results accumulate in context and consume tokens disproportionately to their relevance. A customer order lookup returning 40 fields when only 5 are needed wastes 35 fields of context budget on every subsequent prompt.
Escalation Triggers
| Trigger | Correct Behavior |
|---|---|
| Customer explicitly requests a human agent | Escalate immediately. Do NOT investigate first or offer to resolve. Honor the request instantly. |
| Policy gap (policy is silent on the specific request) | Escalate. Even if the case seems simple, the agent shouldn't extrapolate policy by analogy. |
| Unable to make meaningful progress after N attempts | Escalate. Don't loop indefinitely. |
| Customer appears frustrated (negative sentiment) | Do NOT automatically escalate. Sentiment is an unreliable proxy for case complexity. |
| Self-reported confidence score below threshold | Do NOT use as sole trigger. Self-reported confidence is poorly calibrated. |
Negative sentiment doesn't correlate with case complexity or resolution difficulty. An angry customer with a simple return request should be resolved, not escalated. A politely-worded case requiring a policy exception should be escalated. Escalate on case characteristics, not emotional signals.
Multiple Identity Matches
When a lookup returns multiple customers matching the provided name, ask for an additional identifier (email, order number, account ID) rather than heuristically selecting one. Guessing risks modifying the wrong account.
Error Propagation Anti-Patterns
✗ Silent Suppression
Catching a timeout and returning {"results": []} with no error signal. The coordinator thinks the search succeeded with no results and proceeds accordingly — incomplete research, no recovery.
✗ Workflow Termination
Propagating a single subagent failure as a fatal error that terminates the entire workflow. Loses all results from successful subagents unnecessarily.
Subagents should: (1) attempt local recovery for transient failures; (2) if unresolvable, return structured error context including failure type, what was attempted, partial results obtained, and alternative approaches considered; (3) coordinator uses this to make intelligent recovery decisions.
Access Failure vs Valid Empty Result
This distinction must be explicit in your error responses. An empty result set can mean two completely different things requiring opposite responses:
- Valid empty result: Query executed successfully, no matching records exist. No retry needed.
- Access failure: Query couldn't execute (timeout, permission error). Retry decision needed.
Context Degradation in Extended Sessions
Context degradation is a gradual failure mode: the model starts giving inconsistent answers, referencing "typical patterns" instead of specific classes discovered earlier in the session, or contradicting its own prior findings.
Strategies
- Scratchpad files — agents maintain files recording key findings. Reference these files (not memory) for subsequent questions, so findings persist across context boundaries.
- Subagent delegation — spawn Explore subagents to handle verbose exploration. Main agent coordinates at high level without filling its context with raw file contents.
- Phase summarization — before spawning next-phase agents, summarize findings from the current phase and inject the summary into new agent prompts.
- /compact — reduces context usage during extended exploration sessions when the context fills with verbose discovery output.
Each agent exports its state to a known location. The coordinator maintains a manifest tracking which agents have completed which tasks. On restart, the coordinator loads the manifest and resumes from the last known state, injecting completed agent findings into new agent prompts.
Human Review Routing
A system with 97% overall accuracy may be 99.9% accurate on invoices but 58% accurate on handwritten forms. Automating based on overall accuracy removes human review from the failing segment. Always analyze accuracy by document type AND field segment before reducing human review.
Confidence Calibration
- Field-level confidence scores (not document-level) allow granular routing — route only the specific fields with low confidence to human review, not the entire document
- Calibrate thresholds using labeled validation sets — not the model's own self-reported confidence, which is poorly calibrated out of the box
- Stratified random sampling of high-confidence extractions for ongoing error rate measurement and novel pattern detection
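Field-level routing can be sketched as follows; the threshold values and the `(value, confidence)` tuple shape are illustrative:

```python
def route_fields(extraction: dict, thresholds: dict, default: float = 0.90):
    """Send only low-confidence fields to human review; auto-accept the rest."""
    auto, review = {}, {}
    for field, (value, confidence) in extraction.items():
        # Per-field thresholds, calibrated on a labeled validation set.
        if confidence >= thresholds.get(field, default):
            auto[field] = value
        else:
            review[field] = value
    return auto, review
```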
Information Provenance
Source attribution gets lost during summarization steps. Once claims are separated from their sources, attribution cannot be recovered. The solution: require structured claim-source mappings from subagents, and preserve them through all synthesis steps.
```
# Required subagent output structure
{
  "findings": [
    {
      "claim": "AI adoption in creative industries grew 34% in 2023",
      "source_url": "https://...",
      "document_name": "TechTrends2024.pdf",
      "page": 12,
      "excerpt": "Our survey of 500 creative professionals...",
      "publication_date": "2024-01-15"
    }
  ]
}
```
Handling Conflicting Sources
When two credible sources report different statistics, the synthesis agent should include BOTH values with their source attributions and explicitly note the conflict. The coordinator (or human) decides how to reconcile. Publication dates are often the explanation — what looks like a contradiction may be temporal data from different years.
Temporal Data
Require publication_date and data_collection_date in structured outputs. Without dates, a conflict between "AI adoption: 43%" (2019) and "AI adoption: 67%" (2023) looks like contradictory data but is actually consistent temporal data.