Domain 1

Agentic Architecture & Orchestration

27% of exam

The heaviest domain. Focuses on how Claude-powered agents are designed to operate autonomously: the core agentic loop, multi-agent coordination patterns, context passing, workflow enforcement, and session management.

Task 1.1 — Design and implement agentic loops for autonomous task execution

The Agentic Loop

An agentic loop is the mechanism that lets Claude autonomously execute multi-step tasks by iteratively calling tools and observing results. The loop continues until Claude decides it has finished.

1. Send the request to Claude with the tools list and conversation history.
2. Receive the response and inspect stop_reason.
3. If stop_reason == "tool_use": execute the requested tools, append the results to the history, and go to step 1.
4. If stop_reason == "end_turn": present the final response to the user; the loop ends.

# Minimal agentic loop (Python-style pseudocode; assumes client, model, tools, and execute_tool exist)
def run_agent(user_input):
    messages = [{"role": "user", "content": user_input}]

    while True:
        response = client.messages.create(
            model=model, max_tokens=4096, tools=tools, messages=messages
        )

        if response.stop_reason == "end_turn":
            # Done: return the assistant's text (assumes the first content block is text)
            return response.content[0].text

        if response.stop_reason == "tool_use":
            # Execute the requested tools and append their results
            messages.append({"role": "assistant", "content": response.content})
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "user", "content": tool_results})
Anti-Patterns — These Will Fail You on the Exam
  • Parsing text for completion signals — checking if the assistant said "Done!" or "Task complete" is unreliable. Claude may say these mid-task.
  • Fixed iteration caps as primary stopping mechanism — a hard limit of "max 10 iterations" is a safety guard, not the primary termination signal. stop_reason is.
  • Checking assistant text content to decide whether to continue — always use stop_reason.
Key Distinction

The loop continues when stop_reason == "tool_use" and terminates when stop_reason == "end_turn". These are the only two stop reasons you need to handle in basic agentic loops.

Task 1.2 — Orchestrate multi-agent systems with coordinator-subagent patterns

Hub-and-Spoke Multi-Agent Architecture

In a multi-agent system, a coordinator agent manages all communication between specialist subagents. All inter-agent communication routes through the coordinator — subagents never communicate directly with each other.

Critical: Subagents Have Isolated Context

Subagents do NOT automatically inherit the coordinator's conversation history or memory. Every piece of context the subagent needs must be explicitly included in the Task prompt. This is one of the most tested concepts.

Coordinator Responsibilities

  • Task decomposition — breaking the goal into subtasks based on query complexity
  • Delegation — deciding which subagent handles which subtask
  • Result aggregation — combining subagent outputs into a coherent final result
  • Error handling & recovery — deciding what to do when subagents fail
Narrow Decomposition Risk

If the coordinator decomposes a broad topic too narrowly, the entire system produces incomplete coverage — even if every subagent executes perfectly. Example: decomposing "AI in creative industries" into only visual arts subtasks omits music, film, writing. The subagents can't cover what they were never assigned.

Iterative Refinement Loop

After synthesis, the coordinator can evaluate output quality, identify gaps, re-delegate targeted queries to search/analysis agents, and re-invoke synthesis. This iterative refinement loop is the production-grade pattern for research systems.
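
A rough coordinator-side sketch of that loop; all helper names (synthesize, evaluate_coverage, delegate_search) are hypothetical stand-ins for coordinator prompts and subagent calls:

# Sketch: coordinator-side iterative refinement (all helper names are hypothetical)
report = synthesize(findings)
for _ in range(MAX_REFINEMENT_ROUNDS):
    gaps = evaluate_coverage(report, original_goal)           # coordinator critiques its own output
    if not gaps:
        break
    new_findings = [delegate_search(gap) for gap in gaps]     # targeted re-delegation to search/analysis agents
    findings.extend(new_findings)
    report = synthesize(findings)                             # re-invoke synthesis with the gap-filling results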

Task 1.3 — Configure subagent invocation, context passing, and spawning

Spawning Subagents with the Task Tool

The Task tool is the mechanism for spawning subagents in the Claude Agent SDK. The coordinator's allowedTools list must include "Task" for it to spawn subagents — without this, subagent invocation fails.

Parallel vs Sequential Subagent Execution

✓ Parallel (Correct)

Emit multiple Task tool calls in a single coordinator response. The SDK executes them simultaneously.

✗ Sequential (Slower)

Emit Task tool calls in separate turns. Each subagent waits for the previous to finish before starting.
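
The SDK handles concurrent execution for you; this asyncio sketch only illustrates why one response containing several Task blocks is faster than separate turns (spawn_subagent is a hypothetical helper):

# Sketch: executing all Task tool_use blocks from one coordinator turn concurrently
import asyncio

async def run_parallel_tasks(response):
    task_blocks = [b for b in response.content if b.type == "tool_use" and b.name == "Task"]
    # All subagents start at the same time instead of waiting on one another
    results = await asyncio.gather(*(spawn_subagent(b.input) for b in task_blocks))
    return dict(zip((b.id for b in task_blocks), results))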

Passing Context Explicitly

When passing context between agents, use structured data formats that separate content from metadata. Don't pass raw conversation history — extract and pass only what the subagent needs, with source attribution preserved.

# Example: Coordinator passing structured context to synthesis agent
import json

# web_results / doc_results are structured findings gathered by other subagents
# (each entry carries claim, source_url or document_name, date/page, and excerpt)
synthesis_prompt = f"""
You are a synthesis agent. Combine the following research findings into a report.

## Web Search Results
{json.dumps(web_results, indent=2)}

## Document Analysis Results
{json.dumps(doc_results, indent=2)}

Preserve all source attributions in your output.
"""
task_result = await task_tool.invoke(prompt=synthesis_prompt)
Edge Case: Fork-Based Session Management

fork_session creates independent branches from a shared analysis baseline — used to explore divergent approaches (e.g., comparing two refactoring strategies). This is different from spawning parallel subagents for coordination. Changes in one fork do not affect the other.

Task 1.4 — Implement multi-step workflows with enforcement and handoff patterns

Programmatic Enforcement vs Prompt-Based Guidance

This is a key exam distinction: when you need guaranteed compliance (deterministic), you need programmatic enforcement. Prompt instructions have a non-zero failure rate.

Approach | Reliability | Use When
Programmatic prerequisite (blocks tool call if precondition not met) | Deterministic (100%) | Financial operations, identity verification, critical business rules
Prompt instruction ("always call X before Y") | Probabilistic (~98% or less) | Soft guidelines where occasional deviation is acceptable
Few-shot examples showing the correct order | Probabilistic (improved but not guaranteed) | Training patterns, improving consistency, not enforcing compliance
Exam Rule

Whenever a question mentions "deterministic compliance," "financial operations," "identity verification before X," or "guaranteed ordering" — the answer is programmatic enforcement, not prompt-based.
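
A minimal sketch of a programmatic prerequisite, assuming a hypothetical tool-execution wrapper, a workflow state dict, and a TOOL_REGISTRY mapping (none of these names come from the SDK):

# Sketch: deterministic gate on a tool call (all names are hypothetical)
def execute_tool(name, tool_input, state):
    if name == "issue_refund" and not state.get("identity_verified"):
        # The call never reaches the refund backend, regardless of what the model asked for
        return {
            "isError": True,
            "errorCategory": "business",
            "isRetryable": False,
            "message": "Identity verification must complete before a refund can be issued."
        }
    return TOOL_REGISTRY[name](**tool_input)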

Structured Handoff Protocols

When escalating to a human agent mid-process, include a structured handoff summary containing: customer ID, root cause analysis, what was attempted, refund amount (if applicable), recommended next action. Human agents receiving escalations lack access to the full conversation transcript.

Task 1.5 — Apply Agent SDK hooks for tool call interception and data normalization

Hook Patterns

PostToolUse Hook

Intercepts tool results AFTER execution, BEFORE the model processes them.

Use for: normalizing heterogeneous data formats (Unix timestamps → ISO 8601, numeric status codes → labels, inconsistent field names).

Tool Call Interception Hook

Intercepts outgoing tool calls BEFORE they execute.

Use for: enforcing compliance rules (blocking refunds above $500 threshold, redirecting to human escalation workflow).

Don't Confuse PostToolUse and PreToolUse

PostToolUse = result normalization (model receives clean data). Tool call interception = blocking/redirecting outgoing calls before execution. If a question asks about "blocking a refund exceeding $500," that's interception (pre), NOT PostToolUse.
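
A sketch of that division of responsibilities, written as plain functions because the exact hook registration syntax varies by SDK version (function names and return fields are illustrative):

# Sketch: plain functions standing in for tool call interception (pre) and PostToolUse (post)
from datetime import datetime, timezone

def intercept_tool_call(tool_name, tool_input):
    # Runs BEFORE execution: block or redirect non-compliant calls
    if tool_name == "issue_refund" and tool_input.get("amount", 0) > 500:
        return {"decision": "block", "reason": "Refunds over $500 route to human escalation."}
    return {"decision": "allow"}

def post_tool_use(tool_name, result):
    # Runs AFTER execution, BEFORE the model sees the result: normalize heterogeneous data
    if "created_at" in result:
        result["created_at"] = datetime.fromtimestamp(
            result["created_at"], tz=timezone.utc
        ).isoformat()  # Unix timestamp -> ISO 8601
    return result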

Hooks vs Prompt Instructions

Hooks provide deterministic guarantees. Prompt instructions provide probabilistic compliance. For business rules that require 100% enforcement, always use hooks.

Task 1.6 — Design task decomposition strategies for complex workflows

Prompt Chaining vs Dynamic Decomposition

Pattern | Structure | Best For
Prompt chaining | Fixed sequential pipeline: known steps, predictable structure | Multi-aspect code reviews, structured analysis with predictable sub-tasks
Dynamic adaptive decomposition | Subtasks generated based on intermediate findings | Open-ended investigation, "add tests to legacy codebase" where scope evolves

Multi-Pass Code Review Pattern

For large code reviews (14+ files): split into per-file local passes (catch local bugs, style issues) then a separate cross-file integration pass (catch data flow issues, contradictions between files). This prevents attention dilution.

Edge Case: When Prompt Chaining Fails

Prompt chaining is poor for tasks where the right next step depends on what was discovered in the current step. Example: "investigate a production bug" — you don't know whether to check logs, code, or database until each step reveals new information. Use dynamic decomposition here.

Task 1.7 — Manage session state, resumption, and forking

Session Management

--resume <session-name>

Continues a specific prior conversation with all its history intact. Use when the prior context is mostly still valid and you want to pick up where you left off.

Important: If code has changed since the session, tell the resumed session about the specific files that changed so it re-analyzes them rather than relying on stale observations.

fork_session

Creates independent branches from a shared analysis baseline. Use to explore divergent approaches without contaminating each other (e.g., "compare approach A vs approach B from the same starting analysis").

When to Start Fresh vs Resume

Scenario | Strategy
Prior context still valid, continuing investigation | Use --resume
Prior tool results are stale (code was modified) | Start fresh, inject structured summary of still-valid findings
Need to explore two different approaches from the same baseline | Use fork_session
Prior session crashed mid-execution | Start fresh, load coordinator manifest of completed agent states
Crash Recovery Pattern

For multi-agent systems, each agent exports its state to a known location (a manifest). On restart, the coordinator loads the manifest and injects each agent's last known state into their initial prompt, skipping already-completed work.
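
One possible manifest shape, purely illustrative (the SDK does not prescribe a format):

# Sketch: coordinator manifest tracking per-agent completion state
{
  "run_id": "research-run-17",
  "agents": {
    "search_agent":    {"status": "completed", "state_file": "state/search_findings.json"},
    "doc_agent":       {"status": "completed", "state_file": "state/doc_findings.json"},
    "synthesis_agent": {"status": "not_started"}
  }
}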

Domain 2

Tool Design & MCP Integration

18% of exam
Task 2.1 — Design effective tool interfaces with clear descriptions and boundaries

Tool Descriptions Are Everything

Tool descriptions are the primary mechanism LLMs use for tool selection. This is the most important principle of tool design. Minimal descriptions lead to unreliable selection, especially when similar tools exist.

What a Good Tool Description Includes

  • The tool's purpose and what it returns
  • Expected input formats and examples
  • Edge cases and boundary conditions it handles
  • When to use this tool vs a similar alternative (explicit differentiation)
  • When not to use it
Ambiguous Descriptions Cause Misrouting

Two tools with near-identical descriptions ("Retrieves customer information" / "Retrieves order details") will cause the model to pick randomly. This is the root cause of most tool selection bugs — fix descriptions before adding routing layers.

Splitting vs Consolidating Tools

A generic analyze_document tool should be split into purpose-specific tools with clear input/output contracts when the use cases are distinct: extract_data_points, summarize_content, verify_claim_against_source. Each has unambiguous selection criteria.
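
A sketch of what a differentiated description could look like for one of the split tools (wording and schema are illustrative):

# Sketch: a purpose-specific tool definition with an explicitly differentiated description
{
  "name": "verify_claim_against_source",
  "description": "Checks whether a specific factual claim is supported by a provided source document. Input: the claim text and the source excerpt. Returns 'supported', 'contradicted', or 'not_found', plus the matching passage. Use this when you already have both a claim and a candidate source. Do NOT use it to find new sources (use search tools) or to summarize a document (use summarize_content).",
  "input_schema": {
    "type": "object",
    "required": ["claim", "source_excerpt"],
    "properties": {
      "claim": {"type": "string"},
      "source_excerpt": {"type": "string"}
    }
  }
}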

Edge Case: System Prompt Interference

System prompt wording can create unintended tool associations. If your system prompt says "always check the customer record first" and you have a tool called check_customer, the model may call that tool for queries where a different tool is more appropriate. Review system prompts for keyword-sensitive instructions that might override well-written tool descriptions.

Task 2.2 — Implement structured error responses for MCP tools

Structured Error Responses

Generic errors prevent intelligent recovery. The agent needs to know what kind of failure occurred to decide whether to retry, escalate, continue with partial results, or explain the issue to the user.

Error Category | Example | Agent Response
Transient | Timeout, service temporarily unavailable | Retry with backoff (isRetryable: true)
Validation | Invalid input format, missing required field | Fix input and retry
Business/Policy | Refund exceeds policy limit, unauthorized action | Explain to user, don't retry (isRetryable: false)
Permission | Insufficient access, authentication failure | Escalate or request credentials
# Good structured error response
{
  "isError": true,
  "errorCategory": "business",
  "isRetryable": false,
  "message": "Refund amount $650 exceeds the $500 policy limit for automated refunds.",
  "customerMessage": "This refund requires supervisor approval due to the amount."
}

# Bad generic error — agent can't determine recovery strategy
{
  "error": "Operation failed"
}
Empty Results vs Access Failures — Must Be Distinguished

A search returning {"results": []} is ambiguous — it could be a successful query with no matches (valid) or a failed query being silently suppressed (access failure). These require completely different agent responses. Always signal this distinction explicitly: use "queryExecuted": true, "matchCount": 0 for valid empty results, and "isError": true with a category for failures.

Error Propagation Strategy

  • Subagents handle transient failures locally with retry logic
  • Propagate to coordinator only what cannot be resolved locally
  • When propagating, include: failure type, what was attempted, partial results, alternative approaches tried
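
A sketch of a propagated error payload carrying that information, following the structured-error style above (field names are illustrative):

# Sketch: what a subagent might hand back to the coordinator when local recovery fails
{
  "isError": true,
  "errorCategory": "transient",
  "isRetryable": true,
  "attempted": "web_search('AI adoption in film production'), 3 retries with exponential backoff",
  "partialResults": [{"claim": "...", "source_url": "..."}],
  "alternativesTried": ["narrowed query", "secondary search index"],
  "message": "Search service timed out after 3 retries."
}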
Task 2.3 — Distribute tools appropriately across agents and configure tool choice

Scoped Tool Access

Giving an agent access to too many tools (e.g., 18 tools instead of 4–5) degrades tool selection reliability. Each additional tool increases decision complexity. Agents with tools outside their specialization tend to misuse them.

Principle of Least Privilege for Tools

Each agent should receive only the tools needed for its role. A synthesis agent shouldn't have web search tools — that's the search agent's job. A scoped verify_fact tool for the synthesis agent handles the 85% common case, while complex verifications route through the coordinator to the search agent.

tool_choice Options

Value | Behavior | Use Case
"auto" | Model decides whether to call a tool or return text | General conversational agents
"any" | Model MUST call a tool (can choose which) | Guarantee structured output when the tool list may vary
{"type": "tool", "name": "X"} | Model MUST call the specific named tool | Force a specific extraction step first (e.g., extract_metadata before enrichment)
Edge Case: "any" Requires at Least One Tool

Setting tool_choice: "any" with an empty tools array will cause an API error. The model must have at least one tool available to call. Also: "any" guarantees A tool call, not a SPECIFIC tool — for specific tools use forced selection.
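
A minimal call sketch showing forced selection (the tool name is illustrative); swap in {"type": "auto"} or {"type": "any"} for the other modes:

# Sketch: forcing a specific tool
response = client.messages.create(
    model=model,
    max_tokens=1024,
    tools=tools,  # must contain at least one tool when using "any" or forced selection
    tool_choice={"type": "tool", "name": "extract_metadata"},
    messages=messages,
)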

Task 2.4 — Integrate MCP servers into Claude Code and agent workflows

MCP Server Scoping

.mcp.json (Project-Scoped)

Stored in the project root. Version-controlled and shared with all team members who clone the repository. For shared team tooling (e.g., Jira integration, shared database).

~/.claude.json (User-Scoped)

Stored in the user's home directory. Personal and NOT shared via version control. For personal/experimental MCP servers or servers with personal credentials.

Environment Variable Expansion

Use ${GITHUB_TOKEN} syntax in .mcp.json for credential management. This keeps secrets out of version control while allowing the configuration to be shared.

// .mcp.json — project-scoped, version controlled
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@anthropic/mcp-server-github"],
      "env": {
        "GITHUB_TOKEN": "${GITHUB_TOKEN}"   // Expanded from env at runtime
      }
    }
  }
}

MCP Resources vs MCP Tools

MCP Resources

Expose content catalogs — issue summaries, documentation hierarchies, database schemas. Give agents visibility into available data without requiring exploratory tool calls. Like a manifest or catalog.

MCP Tools

Execute actions — search, query, write, update. Tools perform operations and return results. They're invoked when the agent needs to do something, not just discover what's available.

Task 2.5 — Select and apply built-in tools (Read, Write, Edit, Bash, Grep, Glob) effectively

Built-in Tool Selection

Tool | Purpose | When to Use
Grep | Search content inside files | Finding function names, error messages, import statements across the codebase
Glob | Find files by path patterns | Finding all test files (**/*.test.tsx), finding files by extension
Read | Read full file contents | Understanding a specific file's full implementation
Write | Write/overwrite a full file | Creating new files; fallback when Edit fails
Edit | Targeted modification by unique text matching | Making specific changes to known sections of existing files
Bash | Execute shell commands | Running tests, builds, or operations that need a shell
Edit → Read + Write Fallback

When Edit fails because the anchor text appears multiple times in the file (non-unique match), the fallback is: Read the full file, make the change in memory, then Write the entire file. Do not use Bash/sed as a workaround.

Incremental Codebase Exploration Pattern

Don't read all files upfront. Start with Grep to find entry points, then use Read to follow imports and trace flows. This builds understanding incrementally without exhausting context.

Domain 3

Claude Code Configuration & Workflows

20% of exam
Task 3.1 — Configure CLAUDE.md files with appropriate hierarchy, scoping, and modular organization

CLAUDE.md Hierarchy

Level | Location | Scope | Shared via VCS?
User | ~/.claude/CLAUDE.md | Applies to all sessions for that user only | No (personal)
Project | .claude/CLAUDE.md or root CLAUDE.md | Applies to all sessions in the project for all team members | Yes (committed)
Directory | Subdirectory CLAUDE.md files | Applies when working in that specific directory | Yes (committed)
Common Misconfiguration

A new team member not receiving project standards is typically because those standards were put in ~/.claude/CLAUDE.md (user-level) instead of the project-level file. User-level settings are NOT shared via version control.

@import for Modular Organization

Use @import syntax to reference external files and keep CLAUDE.md modular. Each package's CLAUDE.md can selectively import only the standards files relevant to it.
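
A sketch of how a package-level CLAUDE.md might import only the relevant rule files (paths are illustrative; check the current import syntax in the Claude Code docs):

<!-- packages/api/CLAUDE.md (sketch) -->
@../../.claude/rules/api-conventions.md
@../../.claude/rules/testing.md

Package-specific notes for the API service go below the imports.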

.claude/rules/ as Alternative

For large CLAUDE.md files (400+ lines), split into focused topic-specific files in .claude/rules/: testing.md, api-conventions.md, deployment.md. Each file can be path-scoped via YAML frontmatter (see Task 3.3).

Task 3.2 — Create and configure custom slash commands and skills

Slash Commands: Project vs User Scope

.claude/commands/

Project-scoped. Committed to version control. Available to every developer who clones the repo. Use for team-wide workflows like /review, /test.

~/.claude/commands/

User-scoped. Personal. NOT shared via version control. Use for personal productivity shortcuts.

Skills in .claude/skills/

Skills are more powerful than slash commands — they support SKILL.md frontmatter configuration:

---
description: Analyzes codebase architecture and produces a dependency map
context: fork                    # Run in isolated sub-agent context
allowed-tools: Read, Grep, Glob  # Only these tools permitted
argument-hint: "path to analyze" # Prompt user if invoked without args
---

# Your skill prompt/instructions here...

Frontmatter Options

Option | Effect
context: fork | Runs the skill in an isolated sub-agent context; verbose output doesn't pollute the main conversation, and a summary/result is returned to the main session.
allowed-tools: [list] | Restricts which tools the skill can use. Prevents destructive operations (e.g., only allow Read, not Write or Bash).
argument-hint | Prompts the developer for required parameters when the skill is invoked without arguments.
Skills vs CLAUDE.md

CLAUDE.md = always-loaded universal standards. Skills = on-demand invocation for task-specific workflows. Use CLAUDE.md for things that should always be in context; use skills for things invoked explicitly when needed.

Edge Case: Personal Skill Variants

To create a personal variant of a team skill without affecting teammates: create it in ~/.claude/skills/ with a different name. The original team skill in .claude/skills/ is unaffected.

Task 3.3 — Apply path-specific rules for conditional convention loading

Path-Scoped Rules

Files in .claude/rules/ with YAML frontmatter paths fields load their rules only when editing files that match the glob pattern. This reduces irrelevant context and token usage.

---
paths: ["**/*.test.tsx", "**/*.spec.ts"]
---

# Testing Conventions
- Use React Testing Library, not Enzyme
- Prefer userEvent over fireEvent for interactions
- Each test file should have a describe block matching the component name
Path-Specific Rules vs Directory CLAUDE.md

Directory CLAUDE.md is directory-bound — it only applies within that specific directory. Path-specific rules with glob patterns can apply conventions to files spread across the entire codebase (e.g., all test files, regardless of where they live). For cross-cutting concerns, always prefer path-specific rules.

Example pattern: test files are spread throughout the codebase next to the components they test. A rule with paths: ["**/*.test.tsx"] catches all of them, while a directory CLAUDE.md in /tests/ would miss the ones in /src/components/.

Task 3.4 — Determine when to use plan mode vs direct execution

Plan Mode vs Direct Execution

Use Plan Mode When... | Use Direct Execution When...
Task involves large-scale changes (45+ files) | Simple, well-scoped change (single-file bug fix)
Multiple valid architectural approaches exist | Clear stack trace pointing to the exact problem
Architectural decisions need to be made | Adding a single validation check to one function
Multi-file modifications with dependencies | Adding a date validation conditional
Library migrations affecting many files | Renaming a variable in a single file
Combining Both Modes

A powerful pattern: use plan mode for investigation and design, then switch to direct execution for implementation. Example: plan mode to design the library migration approach, direct execution to implement the planned changes.

The Explore Subagent

Use the Explore subagent for verbose discovery phases (reading many files, mapping dependencies) to prevent context window exhaustion during multi-phase tasks. The main agent coordinates at a high level while Explore handles the detail work.

Task 3.5 — Apply iterative refinement techniques for progressive improvement

Iterative Refinement Techniques

Concrete I/O Examples Over Prose

When prose descriptions produce inconsistent results, provide 2–3 concrete input/output examples. Models interpret examples more reliably than abstract descriptions. "Transform X to Y" is vague; showing an actual X with its corresponding Y is unambiguous.

Test-Driven Iteration

Write test suites first covering expected behavior, edge cases, and performance requirements. Share test failures with Claude to guide progressive improvement. Each failing test communicates precisely what's wrong.

The Interview Pattern

Have Claude ask questions before implementing in unfamiliar domains. This surfaces considerations you may not have anticipated (e.g., cache invalidation strategies, failure modes, edge cases) before they become costly rework.

Interacting vs Independent Issues

Interacting Issues → Batch

When multiple problems interact with each other (fixing A affects B), send all of them in a single detailed message so Claude can address them holistically.

Independent Issues → Sequential

When problems are independent, fix them one at a time. Each fix is cleaner and easier to review.

Task 3.6 — Integrate Claude Code into CI/CD pipelines

CI/CD Integration

Non-Interactive Mode: The -p Flag

The -p (or --print) flag runs Claude Code in non-interactive mode: processes the prompt, outputs to stdout, exits without waiting for user input. Required for all CI/CD pipelines.

# CI pipeline example
claude -p "Analyze this pull request for security issues" \
  --output-format json \
  --json-schema review-schema.json
Non-Existent Flags

CLAUDE_HEADLESS=true and --batch are NOT real Claude Code flags. The correct flag is -p / --print. This is a common exam distractor.

Structured Output in CI

Use --output-format json with --json-schema to produce machine-parseable structured findings that can be posted as inline PR comments automatically.

Session Context Isolation for Review

The same Claude session that generated code is less effective at reviewing that code — it retains its reasoning context and is less likely to question its own decisions. Use an independent review instance.

CLAUDE.md in CI Context

Claude Code in CI reads the project's CLAUDE.md automatically, giving it project context (testing standards, fixture conventions, review criteria) without needing to pass it explicitly in every prompt.

Domain 4

Prompt Engineering & Structured Output

20% of exam
Task 4.1 — Design prompts with explicit criteria to improve precision and reduce false positives

Explicit Criteria Over Vague Instructions

✗ Vague (Doesn't Help)

  • "Be conservative"
  • "Only report high-confidence findings"
  • "Focus on important issues"
  • "Use your best judgment"

✓ Explicit (Works)

  • "Flag ONLY when claimed behavior contradicts actual code behavior"
  • "Skip style preferences and minor formatting. Report: bugs, security, incorrect docs"
  • "Flag severity 'critical' when code would cause data loss or security breach"
False Positives Erode Trust

High false positive rates in one category undermine developer trust across all categories — developers start dismissing real findings because they've been burned by noise. Temporarily disabling high-false-positive categories while improving prompts for them is better than letting them contaminate the signal.

Task 4.2 — Apply few-shot prompting to improve output consistency and quality

Few-Shot Prompting

When detailed instructions alone produce inconsistent results, few-shot examples are the most effective intervention. They work by demonstrating the desired pattern, including reasoning for why a choice was made.

What Makes Good Few-Shot Examples

  • Target ambiguous cases — don't waste examples on obvious cases; show how to handle the tricky ones
  • Include reasoning — show WHY tool A was chosen over tool B, not just that it was
  • Show the full output format — demonstrate location, severity, message, suggested fix
  • Show both positive and negative cases — what to flag AND what to skip (to reduce false positives)
  • 2–4 targeted examples is typically sufficient; more adds token overhead without proportional benefit
Generalization, Not Pattern Matching

Few-shot examples enable the model to generalize the underlying reasoning to novel cases — it's not just memorizing which inputs map to which outputs. A model that sees 3 examples of ambiguous tool selection can correctly handle a 4th novel case it has never seen.

Edge Case: Few-shot for Extraction Variety

For document extraction where source documents vary widely in structure (inline citations vs bibliographies, narrative descriptions vs structured tables), few-shot examples demonstrating correct extraction from varied formats significantly reduce null/empty extraction errors. The model learns to handle structural variety, not just a single document format.

Task 4.3 — Enforce structured output using tool use and JSON schemas

tool_use for Guaranteed Structured Output

Using tool_use with a JSON schema is the most reliable approach for guaranteed schema-compliant output. It eliminates JSON syntax errors.

What tool_use Does NOT Eliminate

tool_use eliminates syntax errors (malformed JSON, missing braces). It does NOT eliminate semantic errors: values in wrong fields, totals that don't sum correctly, wrong data types within a valid schema. Semantic validation requires additional logic (e.g., Pydantic validators, cross-field checks).
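
One way to layer semantic checks on top of schema-valid output, sketched with Pydantic v2 (field names are illustrative):

# Sketch: cross-field semantic validation that a JSON schema cannot express
from pydantic import BaseModel, model_validator

class Invoice(BaseModel):
    invoice_number: str
    line_item_amounts: list[float]
    stated_total: float | None = None

    @model_validator(mode="after")
    def totals_must_agree(self):
        # Reject output whose line items don't sum to the stated total
        if self.stated_total is not None:
            calculated = round(sum(self.line_item_amounts), 2)
            if abs(calculated - self.stated_total) > 0.01:
                raise ValueError(
                    f"stated_total {self.stated_total} does not match line-item sum {calculated}"
                )
        return self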

Schema Design Principles

  • Nullable fields for information that may not exist in source documents — prevents model from fabricating values to satisfy required constraints
  • "other" + detail string patterns for extensible enum categorization — avoids forced misclassification of novel values
  • "unclear" enum value for ambiguous cases — better than hallucinating a confident wrong answer
  • Required vs optional — only mark fields required when the information will always be present
{
  "name": "extract_invoice",
  "input_schema": {
    "type": "object",
    "required": ["invoice_number", "vendor"],
    "properties": {
      "invoice_number": {"type": "string"},
      "vendor": {"type": "string"},
      "total_amount": {"type": ["number", "null"]},  // nullable
      "currency": {
        "type": "string",
        "enum": ["USD", "EUR", "AUD", "other"],       // escape hatch
      },
      "currency_detail": {"type": ["string", "null"]} // detail for "other"
    }
  }
}
Task 4.4 — Implement validation, retry, and feedback loops for extraction quality

Validation-Retry Loops

When a validation error occurs, append the specific error to the prompt and retry. The model can self-correct when it has the error information. But retry has limits.

Retry WILL Help

  • Format mismatches (date in wrong format)
  • Values placed in wrong fields
  • Structural output errors
  • Wrong units used

Retry WON'T Help

  • Required information simply absent from the source document
  • Source document is in an external file not provided to the model
  • Truly ambiguous information with no further clues
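
A minimal sketch of the validation-retry loop described above, assuming illustrative call_model() and validate() helpers:

# Sketch: feed the specific validation error back into the prompt and retry a bounded number of times
MAX_RETRIES = 3

def extract_with_retry(document_text):
    prompt = f"Extract the invoice fields from:\n{document_text}"
    for _ in range(MAX_RETRIES):
        raw = call_model(prompt)      # returns structured output (illustrative helper)
        try:
            return validate(raw)      # raises ValueError with a specific message on failure
        except ValueError as err:
            # The concrete error lets the model self-correct on the next attempt
            prompt += f"\n\nYour previous output failed validation: {err}\nCorrect it and try again."
    raise RuntimeError("Validation still failing after retries; route to human review.")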

Feedback Loop Design

Add a detected_pattern field to structured findings. When developers dismiss findings, log which patterns triggered dismissals. This enables systematic analysis of false positive patterns and targeted prompt improvement.

Self-Correction Validation Schema Pattern

For financial documents, extract both calculated_total (sum of line items) and stated_total (what the document claims). Flag discrepancies with a conflict_detected boolean. This catches semantic errors that schema validation alone cannot.
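
The relevant slice of such a schema might look like this (descriptions are illustrative):

# Sketch: fields that force the model to cross-check its own extraction
{
  "calculated_total":  {"type": "number", "description": "Sum of all extracted line-item amounts"},
  "stated_total":      {"type": ["number", "null"], "description": "Total as printed on the document"},
  "conflict_detected": {"type": "boolean", "description": "true when calculated_total and stated_total disagree"}
}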

Task 4.5 — Design efficient batch processing strategies

Message Batches API

Property | Value
Cost savings | 50% vs the synchronous API
Processing window | Up to 24 hours (no guaranteed SLA)
Multi-turn tool calling | Not supported within a single batch request
Request correlation | custom_id field maps responses to requests
Batch API Appropriateness Test

Good for batch: latency-tolerant, non-blocking workloads — overnight reports, weekly audits, nightly test generation, async enrichment pipelines.
Bad for batch: any workflow where something or someone is waiting — pre-merge checks, real-time chatbots, fraud detection.

SLA Calculation

If you have a 30-hour SLA and the batch API takes up to 24 hours: submit batches at least every 6 hours (30 - 24 = 6-hour submission window). The 24-hour processing time is a maximum, not a guarantee.

Handling Batch Failures

Use custom_id fields to identify which specific requests failed. Resubmit only failed documents with appropriate modifications (e.g., chunking documents that exceeded context limits). Don't resubmit the entire batch.
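
A sketch of submitting a batch and correlating failures by custom_id; exact result field names may differ by SDK version, so treat this as illustrative:

# Sketch: one request per document, resubmit only the failures
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{doc.id}",
            "params": {
                "model": model,
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": f"Summarize:\n{doc.text}"}],
            },
        }
        for doc in documents
    ]
)

# Later, once the batch has finished processing (up to 24 hours):
failed_ids = []
for entry in client.messages.batches.results(batch.id):
    if entry.result.type != "succeeded":
        failed_ids.append(entry.custom_id)  # resubmit only these, chunking oversized documents first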

Task 4.6 — Design multi-instance and multi-pass review architectures

Self-Review Limitations

A model that generated code retains its reasoning context from generation. This makes it less likely to question its own decisions during review — it already "knows" why it wrote things the way it did.

Extended Thinking Doesn't Fix Self-Review Bias

Even with extended thinking enabled, the model operates with the same generation reasoning context. The bias is about context retention, not thinking depth. An independent instance with NO prior reasoning context is more effective.

Multi-Pass Review Architecture

1. Per-file local passes: each file reviewed independently for local issues (bugs, style, security)
2. Cross-file integration pass: examines data flow between files, checks for contradictions and inconsistencies
3. (Optional) Confidence-reporting pass: model outputs a confidence score alongside each finding for routing to human review
Why Not Just Use a Bigger Context Window?

Attention dilution is not a context window problem — it's a quality problem. A model analyzing 14 files in one pass gives superficial attention to middle files regardless of how large the context window is. Per-file passes guarantee consistent depth.

Domain 5

Context Management & Reliability

15% of exam
Task 5.1 — Manage conversation context to preserve critical information across long interactions

Context Window Challenges

The "Lost in the Middle" Effect

Models reliably process information at the beginning and end of long inputs but may miss findings positioned in the middle. This is a fundamental attention characteristic, not a bug.

Mitigation: Position Key Findings at Start or End

Place key findings summaries at the beginning of aggregated inputs. Use explicit section headers for middle content. Don't bury critical information in the middle of long tool result dumps.

Progressive Summarization Risks

Summarization loses precision. These specific data types get condensed to vague references:

  • Numerical values ("$127.43 refund" → "a refund issue")
  • Dates and timestamps
  • Order/case/reference numbers
  • Customer-stated specific expectations

Solution: Extract transactional facts into a persistent "case facts" block included in every prompt, outside the summarized history.
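
A sketch of that pattern with illustrative values; the block is rebuilt into every prompt rather than living inside the summarized history:

# Sketch: persistent case-facts block kept outside the summarization path
CASE_FACTS = """## Case Facts (verbatim, never summarized)
- Customer ID: C-48213 (illustrative)
- Order number: ORD-2024-009173 (illustrative)
- Refund amount requested: $127.43
- Customer expectation: replacement shipped by Friday
"""

prompt = CASE_FACTS + "\n" + summarized_history + "\n" + latest_user_message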

Verbose Tool Output Trimming

Tool results accumulate in context and consume tokens disproportionately to their relevance. A customer order lookup returning 40 fields when only 5 are needed wastes 35 fields of context budget on every subsequent prompt.

Task 5.2 — Design effective escalation and ambiguity resolution patterns

Escalation Triggers

Trigger | Correct Behavior
Customer explicitly requests a human agent | Escalate immediately. Do NOT investigate first or offer to resolve. Honor the request instantly.
Policy gap (policy is silent on the specific request) | Escalate. Even if the case seems simple, the agent shouldn't extrapolate policy by analogy.
Unable to make meaningful progress after N attempts | Escalate. Don't loop indefinitely.
Customer appears frustrated (negative sentiment) | Do NOT automatically escalate. Sentiment is an unreliable proxy for case complexity.
Self-reported confidence score below threshold | Do NOT use as the sole trigger. Self-reported confidence is poorly calibrated.
Sentiment-Based Escalation is an Anti-Pattern

Negative sentiment doesn't correlate with case complexity or resolution difficulty. An angry customer with a simple return request should be resolved, not escalated. A politely-worded case requiring a policy exception should be escalated. Escalate on case characteristics, not emotional signals.

Multiple Identity Matches

When a lookup returns multiple customers matching the provided name, ask for an additional identifier (email, order number, account ID) rather than heuristically selecting one. Guessing risks modifying the wrong account.

Task 5.3 — Implement error propagation strategies across multi-agent systems

Error Propagation Anti-Patterns

✗ Silent Suppression

Catching a timeout and returning {"results": []} with no error signal. The coordinator thinks the search succeeded with no results and proceeds accordingly — incomplete research, no recovery.

✗ Workflow Termination

Propagating a single subagent failure as a fatal error that terminates the entire workflow. Loses all results from successful subagents unnecessarily.

Correct Pattern: Structured Error Context + Partial Results

Subagents should: (1) attempt local recovery for transient failures; (2) if unresolvable, return structured error context including failure type, what was attempted, partial results obtained, and alternative approaches considered; (3) coordinator uses this to make intelligent recovery decisions.

Access Failure vs Valid Empty Result

This distinction must be explicit in your error responses. An empty result set can mean two completely different things requiring opposite responses:

  • Valid empty result: Query executed successfully, no matching records exist. No retry needed.
  • Access failure: Query couldn't execute (timeout, permission error). Retry decision needed.
Task 5.4 — Manage context effectively in large codebase exploration

Context Degradation in Extended Sessions

Context degradation is a gradual failure mode: the model starts giving inconsistent answers, referencing "typical patterns" instead of specific classes discovered earlier in the session, or contradicting its own prior findings.

Strategies

  • Scratchpad files — agents maintain files recording key findings. Reference these files (not memory) for subsequent questions, so findings persist across context boundaries.
  • Subagent delegation — spawn Explore subagents to handle verbose exploration. Main agent coordinates at high level without filling its context with raw file contents.
  • Phase summarization — before spawning next-phase agents, summarize findings from the current phase and inject the summary into new agent prompts.
  • /compact — reduces context usage during extended exploration sessions when context fills with verbose discovery output.
Structured State Exports for Crash Recovery

Each agent exports its state to a known location. The coordinator maintains a manifest tracking which agents have completed which tasks. On restart, the coordinator loads the manifest and resumes from the last known state, injecting completed agent findings into new agent prompts.

Task 5.5 — Design human review workflows and confidence calibration

Human Review Routing

Aggregate Metrics Mask Segment Failures

A system with 97% overall accuracy may be 99.9% accurate on invoices but 58% accurate on handwritten forms. Automating based on overall accuracy removes human review from the failing segment. Always analyze accuracy by document type AND field segment before reducing human review.

Confidence Calibration

  • Field-level confidence scores (not document-level) allow granular routing — route only the specific fields with low confidence to human review, not the entire document
  • Calibrate thresholds using labeled validation sets — not the model's own self-reported confidence, which is poorly calibrated out of the box
  • Stratified random sampling of high-confidence extractions for ongoing error rate measurement and novel pattern detection
Task 5.6 — Preserve information provenance and handle uncertainty in multi-source synthesis

Information Provenance

Source attribution gets lost during summarization steps. Once claims are separated from their sources, attribution cannot be recovered. The solution: require structured claim-source mappings from subagents, and preserve them through all synthesis steps.

# Required subagent output structure
{
  "findings": [
    {
      "claim": "AI adoption in creative industries grew 34% in 2023",
      "source_url": "https://...",
      "document_name": "TechTrends2024.pdf",
      "page": 12,
      "excerpt": "Our survey of 500 creative professionals...",
      "publication_date": "2024-01-15"
    }
  ]
}

Handling Conflicting Sources

Annotate Conflicts, Don't Resolve Them

When two credible sources report different statistics, the synthesis agent should include BOTH values with their source attributions and explicitly note the conflict. The coordinator (or human) decides how to reconcile. Publication dates are often the explanation — what looks like a contradiction may be temporal data from different years.

Temporal Data

Require publication_date and data_collection_date in structured outputs. Without dates, a conflict between "AI adoption: 43%" (2019) and "AI adoption: 67%" (2023) looks like contradictory data but is actually consistent temporal data.