Internals: What Frameworks Are Actually Doing

Every major multi-agent framework — LangChain's AgentExecutor, CrewAI's Crew.kickoff(), LangGraph's graph traversal, and the OpenAI Agents SDK's Runner.run() — wraps the same fundamental mechanics. This page strips the abstractions away to show you exactly what runs on the wire, how state moves between turns, and what the real costs of orchestration are.

1. The Agent Loop

The Core While-Loop

Every agent framework is a thin wrapper around the same structure:

LLM call → check for tool calls → execute tools → inject results → repeat

The loop terminates when: (a) the model produces a response with no tool calls, (b) a designated "finish" sentinel is triggered, or (c) a maximum iteration count is exceeded.

Here is the canonical minimal implementation using the raw OpenAI API — no framework, no magic:

agent_loop.py

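import json
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY env var

def get_weather(location: str) -> str:
    """Placeholder tool for illustration; swap in a real implementation."""
    return json.dumps({"location": location, "temperature": 18, "unit": "celsius"})

# Name → callable registry used to dispatch tool calls
available_functions = {"get_weather": get_weather}

# Tool schemas advertised to the model (Chat Completions format)
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Retrieves current weather for the given location.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

def agent_loop(user_message: str, max_iterations: int = 10) -> str:
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user",   "content": user_message},
    ]
    for _ in range(max_iterations):
        response = client.chat.completions.create(
            model="gpt-4o-mini", messages=messages, tools=tools
        )
        msg = response.choices[0].message
        messages.append(msg)                        # (1) record assistant turn

        if not msg.tool_calls:                      # (2) no tools → done
            return msg.content

        for tc in msg.tool_calls:                   # (3) execute every call
            fn   = available_functions[tc.function.name]
            args = json.loads(tc.function.arguments)  # arguments arrive as a JSON string
            result = fn(**args)
            messages.append({                       # (4) inject result
                "role": "tool",
                "tool_call_id": tc.id,
                # tool message content must be a string
                "content": result if isinstance(result, str) else json.dumps(result),
            })
    return "Max iterations reached."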

Source: DEV Community — Build an AI Agent Loop in 50 Lines of Python, Victor Dibia — The Agent Execution Loop

What each step does:

LLM call — send the full messages list to the API and receive a response.
Tool detection — inspect response.choices[0].message.tool_calls; if empty, the model has finished reasoning and produced a final answer.
Tool execution — dispatch each tool call to the corresponding Python function, parsing the JSON-encoded argument string.
Result injection — append a role: "tool" message back into messages so the model can see what the tool returned.
Repeat — the loop starts over from step 1 with the extended messages list.

The Key Insight

The entire conversation — system prompt, user query, all assistant messages with tool calls, all tool result messages — is replayed from scratch on every single API call. There is no server-side memory by default. Each call to client.chat.completions.create() is stateless; the framework feeds back the accumulating messages list every time. Source: Peter Roelants — ReAct + OpenAI Function Calling
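The replay behaviour can be observed without touching the network. The sketch below swaps the real API for a stub: `FakeModel`, its scripted replies, and `run_loop` are hypothetical stand-ins introduced only for illustration, not part of any SDK. The stub records how many messages it receives on each call, which shows that every call carries the whole, growing list.

```python
from typing import Any


class FakeModel:
    """Hypothetical stand-in for an LLM API: scripted replies, no network."""

    def __init__(self, replies: list[dict[str, Any]]):
        self._replies = list(replies)
        self.messages_seen: list[int] = []  # payload size on each call

    def create(self, messages: list[dict[str, Any]]) -> dict[str, Any]:
        # The "server" keeps no conversation state: it only sees what is sent.
        self.messages_seen.append(len(messages))
        return self._replies.pop(0)


def run_loop(model: FakeModel, user_message: str, max_iterations: int = 10) -> str:
    messages: list[dict[str, Any]] = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user_message},
    ]
    for _ in range(max_iterations):
        reply = model.create(messages)      # full history goes out every time
        messages.append(reply)
        if not reply.get("tool_calls"):     # no tool calls: final answer
            return reply["content"]
        for call in reply["tool_calls"]:    # pretend to run each tool
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": "18C"}
            )
    return "Max iterations reached."


model = FakeModel([
    {"role": "assistant", "content": None, "tool_calls": [{"id": "call_1"}]},
    {"role": "assistant", "content": "It is 18C in Paris."},
])
answer = run_loop(model, "Weather in Paris?")
print(answer, model.messages_seen)
```

The first call carries two messages (system and user); the second carries four, because the assistant's tool request and the tool result were appended in between.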

What Frameworks Hide

The OpenAI Agents SDK Runner.run() adds session memory management (SQLiteSession), guardrail tripwires, max-turn enforcement, and lifecycle hooks — but the core loop is identical. Here is the full table of what each framework wraps:

Framework	What Wraps the Loop	Termination Signal
LangChain	`AgentExecutor`	No tool calls in response
CrewAI	`Crew.kickoff()`	Task `expected_output` produced
OpenAI Agents SDK	`Runner.run()`	`output_type` matched or no tool calls
LangGraph	Graph traversal + node execution	`END` node reached

Sources: DEV Community, Victor Dibia, Peter Roelants

2. Tool Calling Mechanics

What Actually Gets Sent Over the Wire

Request: the `tools` array entry

Every API call that involves tools includes a tools array. In the Responses API, each entry looks like this (OpenAI Function Calling Documentation):

Tool definition in request

{
  "type": "function",
  "name": "get_weather",
  "description": "Retrieves current weather for the given location.",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {
        "type": "string",
        "description": "City and country e.g. Bogotá, Colombia"
      },
      "units": {
        "type": "string",
        "enum": ["celsius", "fahrenheit"],
        "description": "Units the temperature will be returned in."
      }
    },
    "required": ["location", "units"],
    "additionalProperties": false
  },
  "strict": true
}

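The strict: true flag (Structured Outputs mode) constrains the model's arguments to match the schema exactly, with no missing or extra keys. Strict mode in turn requires the schema to set additionalProperties: false and to list every property in required, as the example does.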
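The entry above uses the flat shape of the Responses API. The Chat Completions endpoint used by the agent loop in section 1 expects the same fields nested under a `function` key. A minimal sketch of the conversion follows; the helper name `to_chat_completions_tool` is hypothetical and only function tools are handled.

```python
from typing import Any


def to_chat_completions_tool(tool: dict[str, Any]) -> dict[str, Any]:
    """Nest a flat, Responses-style function tool under a "function" key."""
    if tool.get("type") != "function":
        raise ValueError("this sketch only handles function tools")
    # Everything except "type" moves inside the nested object.
    body = {key: value for key, value in tool.items() if key != "type"}
    return {"type": "function", "function": body}


flat_tool = {
    "type": "function",
    "name": "get_weather",
    "description": "Retrieves current weather for the given location.",
    "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"],
        "additionalProperties": False,
    },
    "strict": True,
}

nested_tool = to_chat_completions_tool(flat_tool)
print(nested_tool["function"]["name"])
```

Keeping one canonical definition and converting at the call site avoids maintaining two copies of each schema when a codebase talks to both endpoints.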

Response: the `function_call` object in `output`

When the model decides to call a tool, the response's output array contains (OpenAI Function Calling Documentation):

Function call in response output

[
  {
    "id": "fc_12345xyz",
    "call_id": "call_12345xyz",
    "type": "function_call",
    "name": "get_weather",
    "arguments": "{\"location\":\"Paris, France\",\"units\":\"celsius\"}"
  }
]

Common Bug

arguments is a JSON-encoded string, not a nested object. You must parse it before you can access the values: json.loads(item.arguments) for a Responses API function_call item, or json.loads(tool_call.function.arguments) for a Chat Completions tool call. Forgetting this is one of the most common errors when working with the raw API.

Result injection: `function_call_output`

After executing the tool, the result is injected back as (OpenAI Function Calling Documentation):

Tool result injection payload

{
  "type": "function_call_output",
  "call_id": "call_12345xyz",
  "output": "{\"temperature\": 18, \"unit\": \"celsius\"}"
}

The call_id must match the call_id from the function call response, creating a paired request/response that the model uses to track which results correspond to which calls. For the older Chat Completions API (/v1/chat/completions), the equivalent is {"role": "tool", "tool_call_id": "call_123", "content": "..."}.
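The parse, execute, and pair steps can be sketched on plain dictionaries shaped like the payloads above. `REGISTRY`, `run_function_calls`, and the stubbed `get_weather` are hypothetical names for illustration; a real client would read these fields from the SDK's response objects instead.

```python
import json
from typing import Any, Callable


def get_weather(location: str, units: str) -> dict[str, Any]:
    # Hypothetical stub; a real tool would query a weather service.
    return {"temperature": 18, "unit": units}


REGISTRY: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def run_function_calls(output_items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Turn function_call items into matching function_call_output items."""
    results = []
    for item in output_items:
        if item.get("type") != "function_call":
            continue                              # skip non-tool output items
        args = json.loads(item["arguments"])      # JSON string -> dict
        value = REGISTRY[item["name"]](**args)
        results.append({
            "type": "function_call_output",
            "call_id": item["call_id"],           # pairs result with its call
            "output": json.dumps(value),          # sent back as a string
        })
    return results


response_output = [{
    "id": "fc_12345xyz",
    "call_id": "call_12345xyz",
    "type": "function_call",
    "name": "get_weather",
    "arguments": "{\"location\":\"Paris, France\",\"units\":\"celsius\"}",
}]
tool_outputs = run_function_calls(response_output)
print(tool_outputs)
```

Note that the pairing key is `call_id`, not the item's own `id`.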

Python Function → JSON Schema Conversion

Frameworks use three main approaches to auto-generate JSON schemas from Python code, saving you from writing the tool definition JSON by hand.

Type hints + docstrings is the lightest approach. Libraries like annotated-docs use typing.Annotated for field descriptions and typing.Literal for enums. An as_json_schema(func) utility introspects the type annotations and the function's docstring to produce the schema. This works well for simple tools but requires discipline in keeping docstrings accurate.
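To show what such introspection involves, here is a minimal standard-library sketch. It is not the implementation of any particular library: `function_to_schema` and `_type_to_schema` are hypothetical helpers, and only primitives, string-valued `Literal` enums, and `Annotated[T, "description"]` are supported.

```python
import inspect
from typing import Annotated, Any, Literal, get_args, get_origin, get_type_hints

_PRIMITIVES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def _type_to_schema(tp: Any) -> dict[str, Any]:
    description = None
    if get_origin(tp) is Annotated:           # Annotated[T, "description"]
        tp, *meta = get_args(tp)
        description = next((m for m in meta if isinstance(m, str)), None)
    if get_origin(tp) is Literal:             # Literal["a", "b"] -> enum
        schema: dict[str, Any] = {"type": "string", "enum": list(get_args(tp))}
    elif tp in _PRIMITIVES:
        schema = {"type": _PRIMITIVES[tp]}
    else:
        raise TypeError(f"unsupported annotation in this sketch: {tp!r}")
    if description:
        schema["description"] = description
    return schema


def function_to_schema(func) -> dict[str, Any]:
    hints = get_type_hints(func, include_extras=True)  # keep Annotated metadata
    hints.pop("return", None)
    params = inspect.signature(func).parameters
    return {
        "type": "function",
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {
            "type": "object",
            "properties": {n: _type_to_schema(t) for n, t in hints.items()},
            # Parameters without a default value are required.
            "required": [n for n in hints
                         if params[n].default is inspect.Parameter.empty],
            "additionalProperties": False,
        },
    }


def get_weather(
    location: Annotated[str, "City and country e.g. Bogotá, Colombia"],
    units: Literal["celsius", "fahrenheit"],
) -> str:
    """Retrieves current weather for the given location."""
    return f"18 degrees {units} in {location}"


schema = function_to_schema(get_weather)
print(schema["parameters"]["required"])
```

The description comes straight from the docstring, which is why a stale docstring turns into a stale tool definition.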

Pydantic model_json_schema() is the most common production approach. Define your tool's parameters as a Pydantic BaseModel and call .model_json_schema() to get the full schema automatically (Pydantic JSON Schema Documentation):

Pydantic model → JSON schema

import openai
from pydantic import BaseModel
from typing import Literal

class GetCurrentWeather(BaseModel):
    """Get the current weather in a given location."""
    location: str
    unit: Literal["celsius", "fahrenheit"] | None = None  # `X | None` needs Python 3.10+

# Produces the full JSON schema automatically
schema = GetCurrentWeather.model_json_schema()

# OpenAI convenience wrapper: builds the tool definition from the model
tool_def = openai.pydantic_function_tool(GetCurrentWeather)

Pydantic handles nested objects, optional fields, enums, and field-level description annotations via Field(description="..."). The OpenAI Community recommends this approach for any tool with more than a couple of parameters.

LangChain @tool decorator uses the function's docstring as the description and inspects type hints to build the parameters schema automatically. The schema generation is compatible with OpenAI's expected format and the decorator also handles result serialization.

The `tool_choice` Parameter

Controls model behavior when tools are available (OpenAI Function Calling Documentation):

tool_choice options

"tool_choice": "auto"      // model decides whether to call a tool
"tool_choice": "required"  // must call at least one tool
"tool_choice": {"type": "function", "name": "get_weather"}  // force a specific tool

"required" is useful for structured extraction workflows where you always want the model to populate a schema. Forcing a specific tool is useful for single-purpose agents.

3. State and Memory Serialization

The three dominant patterns reflect fundamentally different answers to: "What does an agent need to remember between turns?"

(a) Full Message History Replay: OpenAI API / Agents SDK

Mechanism: Every API call replays the entire conversation history from scratch. The messages list is the state. There is no server-side memory (OpenAI Agents SDK Sessions Documentation).

Message history as state

messages = [
    {"role": "system",    "content": "..."},
    {"role": "user",      "content": "..."},                        # Turn 1: user request
    {"role": "assistant", "content": "...", "tool_calls": [...]},   # Turn 1: model requests a tool
    {"role": "tool",      "tool_call_id": "...", "content": "..."}, # Turn 1: tool result
    {"role": "assistant", "content": "..."},                        # Turn 1: final answer
    {"role": "user",      "content": "..."},                        # Turn 2: follow-up
]
# The whole list is sent again on every call
response = client.chat.completions.create(
    model="gpt-4o", messages=messages, tools=tools
)

The OpenAI Agents SDK builds on this with session objects such as SQLiteSession and InMemorySession (both implementing SessionABC). When session=session is passed to Runner.run(), the SDK loads the existing history before the run and appends the new turn's items afterwards. Context trimming is layered on top of this: the OpenAI Cookbook shows a TrimmingSession that keeps only the most recent turns (OpenAI Cookbook: Context Engineering with Session Memory).
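The idea behind turn-based trimming can be sketched on a plain message list. This is not the SDK's or the cookbook's code: `trim_to_last_turns` is a hypothetical helper, and it assumes a turn starts at each user message.

```python
from typing import Any


def trim_to_last_turns(
    messages: list[dict[str, Any]], max_turns: int
) -> list[dict[str, Any]]:
    """Keep system messages plus everything from the last `max_turns` user messages on."""
    if max_turns < 1:
        raise ValueError("max_turns must be >= 1")
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    user_idx = [i for i, m in enumerate(rest) if m["role"] == "user"]
    if len(user_idx) <= max_turns:
        return system + rest            # nothing to trim yet
    # Cut at a user message so a tool call is never separated from its result.
    start = user_idx[-max_turns]
    return system + rest[start:]


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Weather in Paris?"},
    {"role": "assistant", "content": None, "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "18C"},
    {"role": "assistant", "content": "It is 18C in Paris."},
    {"role": "user", "content": "And in Rome?"},
    {"role": "assistant", "content": "I would need to check."},
]
trimmed = trim_to_last_turns(history, max_turns=1)
print([m["role"] for m in trimmed])
```

Cutting on turn boundaries matters because the API rejects a tool message whose originating assistant tool call has been dropped from the history.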

Tradeoffs:

Factor	Assessment
Token cost	Per-call input grows linearly with conversation length; because each turn pays again for all prior tokens, the cumulative cost over a conversation grows roughly quadratically
Consistency	Perfect — model sees the entire history, no information loss
Implementation complexity	Low — just manage a Python list
Context window pressure	Hits limit after ~50–200 turns depending on tool output sizes
Cross-session persistence	Requires external storage (SQLite, Redis)
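A short worked example makes the token-cost row concrete. It assumes, purely for illustration, that every turn adds a constant number of tokens to the history; `replay_token_costs` is a hypothetical helper.

```python
def replay_token_costs(
    turns: int, tokens_per_turn: int, system_tokens: int = 0
) -> tuple[list[int], int]:
    """Input tokens per call, and their cumulative sum, under full replay.

    Assumes each turn adds a constant `tokens_per_turn` to the history.
    """
    per_call = [system_tokens + tokens_per_turn * t for t in range(1, turns + 1)]
    return per_call, sum(per_call)


per_call_10, total_10 = replay_token_costs(turns=10, tokens_per_turn=500)
per_call_20, total_20 = replay_token_costs(turns=20, tokens_per_turn=500)
print(per_call_10[-1], total_10)   # last call size, cumulative input
print(per_call_20[-1], total_20)
```

With 500 tokens per turn, the tenth call sends 5,000 tokens and the conversation has consumed 27,500 input tokens in total. Doubling the conversation to 20 turns doubles the last call to 10,000 tokens but raises the cumulative total to 105,000, close to four times as much.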

(b) Typed State Object: LangGraph `StateGraph`

Mechanism: State is a TypedDict (or Pydantic BaseModel) that flows through a directed graph. Each node receives the current state as input and returns a partial state update. Reducer functions (e.g., add_messages) merge updates with the existing state (LangGraph Persistence Documentation):

LangGraph typed state with reducer

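from typing import Annotated, TypedDict
from langgraph.checkpoint.postgres.aio import AsyncPostgresSaver
from langgraph.graph import StateGraph
from langgraph.graph.message import add_messages

class State(TypedDict):
    messages: Annotated[list, add_messages]  # reducer appends; a message with a matching ID is updated
    data_to_save: dict                        # arbitrary non-message state

builder = StateGraph(State)
# ... add nodes and edges to builder here ...

async def run(conn, input_state):
    # conn: an open async Postgres connection; run `await checkpointer.setup()` once to create the tables
    checkpointer = AsyncPostgresSaver(conn)
    # Compile with a checkpointer for persistence
    graph = builder.compile(checkpointer=checkpointer)
    config = {"configurable": {"thread_id": "user_123_session_456"}}
    return await graph.ainvoke(input_state, config=config)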

LangGraph serializes state at every node transition using JsonPlusSerializer (which uses ormsgpack). Checkpoint backends include InMemorySaver, SqliteSaver, and AsyncPostgresSaver. Every checkpoint stores the full state object, the parent checkpoint ID (forming a tree that enables branching and time-travel), and the thread ID for conversation isolation.
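The two ideas in this section, reducer-based merging and parent-linked checkpoints, can be modelled in a few lines of plain Python. This is a toy model, not LangGraph's implementation: `REDUCERS`, `apply_update`, `Checkpoint`, and `step` are hypothetical names, and `operator.add` is a simplification of `add_messages`.

```python
import operator
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

# Toy reducers: how each key merges an update. Keys without one are overwritten.
REDUCERS: dict[str, Callable[[Any, Any], Any]] = {"messages": operator.add}


def apply_update(state: dict[str, Any], update: dict[str, Any]) -> dict[str, Any]:
    """Merge a node's partial update into the state without mutating the input."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        merged[key] = reducer(state[key], value) if reducer and key in state else value
    return merged


@dataclass
class Checkpoint:
    thread_id: str
    state: dict[str, Any]
    parent_id: Optional[str] = None
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


def step(parent: Checkpoint, update: dict[str, Any]) -> Checkpoint:
    """Record the state after one node ran, linked to the checkpoint it started from."""
    return Checkpoint(parent.thread_id, apply_update(parent.state, update), parent.id)


root = Checkpoint("user_123_session_456", {"messages": ["hi"], "data_to_save": {}})
a = step(root, {"messages": ["hello!"]})             # reducer concatenates
b = step(a, {"data_to_save": {"k": 1}})              # no reducer: overwrite
fork = step(a, {"messages": ["alternative reply"]})  # branch from an earlier point
print(b.state, fork.parent_id == a.id)
```

Because `b` and `fork` share the parent `a`, the checkpoints form a tree rather than a list, which is what makes branching and replay from an earlier state possible.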

Serialization Constraint

All values in the State TypedDict must be JSON-serializable (or handled by custom serializer extensions). Pydantic objects, LangChain AIMessage objects, and custom classes require special handling. This is a common source of cryptic errors in production — see LangGraph GitHub Issue #3441.

Tradeoffs:

Factor	Assessment
Token cost	Controllable — non-message data can be stored out-of-band
Consistency	Strong — state transitions are atomic via checkpointer
Implementation complexity	Higher — requires graph design, reducer functions, serializer configuration
Cross-session persistence	Built-in via checkpointer backends
Branching / replay	Supported via checkpoint tree ("time travel" debugging)
Failure recovery	Resume from last checkpoint on crash

(c) Role-Scoped Memory: CrewAI Unified Memory

Mechanism: CrewAI uses a unified Memory class backed by a LanceDB vector database (stored under ./.crewai/memory or $CREWAI_STORAGE_DIR/memory). Memory is LLM-driven on both save and recall (CrewAI Memory Documentation):

On save: The LLM analyzes content to infer scope (a hierarchical path like /agent/researcher), categories, and an importance score.
On recall: A composite scoring formula ranks candidates:

composite = semantic_weight × similarity
          + recency_weight  × decay
          + importance_weight × importance

where:
  similarity = 1 / (1 + vector_distance)      # 0–1
  decay      = 0.5^(age_days / half_life_days)  # exponential decay
  importance = record's importance score at encoding (0–1)
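The formula translates directly into code. In the sketch below the function name, the default weights, and the 30-day half-life are placeholder assumptions chosen for illustration; the source does not state CrewAI's actual defaults.

```python
def composite_score(
    vector_distance: float,
    age_days: float,
    importance: float,
    *,
    semantic_weight: float = 0.5,     # assumed value, not a documented default
    recency_weight: float = 0.3,      # assumed value, not a documented default
    importance_weight: float = 0.2,   # assumed value, not a documented default
    half_life_days: float = 30.0,     # assumed value, not a documented default
) -> float:
    """Rank a memory record by similarity, recency, and stored importance."""
    similarity = 1.0 / (1.0 + vector_distance)      # 1 at distance 0, tends to 0
    decay = 0.5 ** (age_days / half_life_days)      # halves every half-life
    return (
        semantic_weight * similarity
        + recency_weight * decay
        + importance_weight * importance
    )


fresh_exact = composite_score(vector_distance=0.0, age_days=0.0, importance=1.0)
old_loose = composite_score(vector_distance=1.0, age_days=30.0, importance=0.0)
print(round(fresh_exact, 3), round(old_loose, 3))
```

With these weights a fresh, exact, maximally important record scores 1.0, while a record at distance 1 that is one half-life old and has zero importance scores 0.4 (0.5 × 0.5 + 0.3 × 0.5).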

Before each task, the agent recalls relevant context from memory and injects it into the task prompt. After each task, the crew auto-extracts discrete facts from the task output via extract_memories() and stores them.

Two recall depths are available:

Depth	Latency	LLM Calls	Description
`shallow`	~200ms	None	Direct vector search + composite scoring
`deep` (default)	1–3s+	Yes	Multi-step: query analysis, scope selection, parallel search, recursive exploration

Tradeoffs:

Factor	Assessment
Token cost	Selective retrieval — only relevant memories injected; much lower per-turn cost vs. full replay
Consistency	Risk of recall failures — relevant context may not score high enough to surface
Implementation complexity	High — LLM used for both save and recall; LanceDB infrastructure required
Cross-session persistence	Native — LanceDB persists across runs
Scope isolation	Per-agent private scopes possible via `/agent/<name>` hierarchy

Comparison Table

Framework	Storage	Replay Strategy	Token Cost Growth	Consistency Risk
OpenAI API (raw)	In-memory list	Full replay every call	O(n) — linear	Low (full context)
OpenAI Agents SDK	SQLite / Memory session	Trimming + summarization	Bounded with trimming	Medium (trimmed context)
LangGraph	Postgres / SQLite checkpoints	State delta per node	Controlled (non-message state offloaded)	Low (atomic checkpoints)
CrewAI	LanceDB vector store	Semantic recall per task	O(1) per task (fixed recall limit)	Medium (recall may miss context)

Sources: OpenAI Agents SDK Sessions, LangGraph Persistence, CrewAI Memory

4. Handoffs Under the Hood

OpenAI Agents SDK

A handoff in the Agents SDK is a special tool call. The tool name defaults to transfer_to_<agent_name>, generated by Handoff.default_tool_name(). When the model calls this tool, the Runner transfers control to the specified agent (OpenAI Agents SDK Handoffs Documentation).

What crosses the boundary — HandoffInputData:

Field	Content
`input_history`	Full input history before `Runner.run()` started
`pre_handoff_items`	Items generated before the agent turn where handoff was invoked
`new_items`	Items from the current turn, including the handoff call and its output item
`input_items`	Optional override: items forwarded to next agent (for filtering)
`run_context`	Active `RunContextWrapper` at handoff time

By default, the receiving agent sees the entire conversation history including all prior tool calls, tool results, and assistant messages, unless an input_filter is provided.

Typed handoff with on_handoff callback

from pydantic import BaseModel
from agents import Agent, handoff, RunContextWrapper

class EscalationData(BaseModel):
    reason: str

# Called when the handoff fires; receives the validated tool-call arguments
async def on_handoff(ctx: RunContextWrapper[None], input_data: EscalationData):
    print(f"Escalation reason: {input_data.reason}")

# Target agent for the handoff (illustrative name)
escalation_agent = Agent(name="Escalation agent")

handoff_obj = handoff(
    agent=escalation_agent,
    on_handoff=on_handoff,
    input_type=EscalationData,
)

The model generates {"reason": "duplicate_charge"} as the tool call arguments. The SDK validates this JSON, parses it into an EscalationData instance, and passes it to on_handoff. The input_type payload is metadata attached to the handoff decision; it does not replace the next agent's main input.

Reducing Token Costs in Long Chains

Setting RunConfig.nest_handoff_history=True (opt-in beta) collapses the prior transcript into a single assistant summary message wrapped in a <CONVERSATION HISTORY> block, rather than passing the full verbatim history. This meaningfully reduces token costs in long multi-agent chains. Source: OpenAI Agents SDK Handoffs

LangGraph

LangGraph has two handoff mechanisms (LangGraph Handoffs How-To):

Conditional edges are used when routing logic depends only on the current graph state:

Static routing via conditional edges

from langgraph.graph import END

# builder is the StateGraph under construction
builder.add_conditional_edges(
    "supervisor",
    should_continue,   # routing function → returns node name string
    path_map={
        "multiplication_expert": "multiplication_expert",
        "end": END
    }
)

Command objects combine routing and state mutation atomically — useful when the routing decision itself produces state changes:

Dynamic routing via Command

from typing import Literal
from langgraph.graph import MessagesState
from langgraph.types import Command

# `model` and the `transfer_to_multiplication_expert` tool are assumed to be defined elsewhere
def addition_expert(state: MessagesState) -> Command[Literal["multiplication_expert", "__end__"]]:
    ai_msg = model.bind_tools([transfer_to_multiplication_expert]).invoke(state["messages"])
    if ai_msg.tool_calls:
        tool_call_id = ai_msg.tool_calls[-1]["id"]
        # Every tool call needs a matching tool message in the history
        tool_msg = {
            "role": "tool",
            "content": "Successfully transferred",
            "tool_call_id": tool_call_id,
        }
        return Command(
            goto="multiplication_expert",      # routing
            update={"messages": [ai_msg, tool_msg]}  # state mutation
        )
    # No handoff requested: return the answer as a plain state update
    return {"messages": [ai_msg]}

What crosses the handoff boundary: The shared State TypedDict. The receiving node receives the complete, mutated state as input — all messages accumulated up to the handoff point, merged by reducer functions. Use graph=Command.PARENT when a handoff originates inside a subgraph and must navigate to a sibling node in the parent graph.

CrewAI

CrewAI uses a different paradigm: agents don't hand off dynamically. The Crew's Process (sequential or hierarchical) defines the task execution order at configuration time. Task outputs are passed via TaskOutput objects (CrewAI Tasks Documentation):

TaskOutput chaining in sequential process

print(f"Task:   {task1.output.description}")
print(f"Output: {task1.output.raw}")

In sequential mode, each task's output is injected into the next task's context. In hierarchical mode, a manager agent delegates tasks to workers and aggregates results. The raw output text becomes part of the next agent's context window for its task.
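The sequential hand-over can be pictured with a toy pipeline in plain Python. None of this is CrewAI code: `ToyTaskOutput`, `run_sequential`, and the stub workers are hypothetical, and the prompt layout is an assumption made only to show the previous output flowing into the next task.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyTaskOutput:
    description: str
    raw: str


def run_sequential(tasks: list[tuple[str, Callable[[str], str]]]) -> list[ToyTaskOutput]:
    """Run tasks in order, feeding each task's raw output into the next prompt."""
    outputs: list[ToyTaskOutput] = []
    context = ""
    for description, worker in tasks:
        prompt = description
        if context:
            prompt += f"\n\nContext from previous task:\n{context}"
        raw = worker(prompt)
        outputs.append(ToyTaskOutput(description, raw))
        context = raw                  # only the raw text crosses the boundary
    return outputs


def researcher(prompt: str) -> str:    # stub standing in for an agent
    return "Finding: agents replay history."


def writer(prompt: str) -> str:        # stub that echoes what it was given
    return f"Draft based on -> {prompt}"


outputs = run_sequential([
    ("Research agent loops", researcher),
    ("Write a summary", writer),
])
print(outputs[1].raw)
```

The receiving task sees only the text of the previous output, not the earlier agent's tool calls or intermediate reasoning, which is the main contrast with the full-history handoffs above.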

Comparison Table

Framework	Handoff Trigger	Payload to Receiver	History Handling
OpenAI Agents SDK	`transfer_to_<agent>` tool call	Full `HandoffInputData` (history + new items)	Full replay by default; summarized with `nest_handoff_history=True`
LangGraph	`Command(goto=...)` return from node	Full shared `State` TypedDict	All messages merged via reducers
CrewAI	Task completion in sequential / hierarchical Process	`TaskOutput.raw` string	Context window injection per task

Sources: OpenAI Agents SDK Handoffs, LangGraph Handoffs, CrewAI Tasks

5. Why Different Philosophies Exist

These frameworks are not interchangeable with different UX skins. They encode fundamentally different engineering tradeoffs.

Graph-Based: LangGraph

Core tradeoff: Setup complexity in exchange for runtime predictability. Every agent action is a node; every routing decision is an explicit edge. State is a typed object that flows through the graph — nothing implicit, nothing emergent.

Why it exists: Complex production workflows require determinism, auditability, and failure recovery. A graph structure makes control flow visible, replayable, and debuggable. LangGraph adds native LangSmith integration, interrupt() for human-in-the-loop approvals at any node, checkpoint-based resume on crash, and visual graph rendering. The Command object allows combining routing and state mutation atomically — no equivalent exists in conversation-based frameworks (Langfuse — Comparing Open-Source AI Agent Frameworks).

Best for: Compliance-regulated workflows, long-running processes (days/weeks) that must survive crashes, document processing pipelines, multi-step research synthesis, any workflow requiring human approval at specific steps.

Not ideal for: Simple linear tasks (the graph setup overhead is pure waste), teams unfamiliar with directed graph concepts.

Conversational: AutoGen

Core tradeoff: Maximum flexibility and rapid prototyping speed, in exchange for less predictable control flow. Agents are participants in a structured conversation; flow emerges from agent interactions rather than explicit edges.

Why it exists: Iterative refinement maps naturally to a dialogue model — one agent writes code, another executes and critiques it, the first revises. AutoGen originated from Microsoft Research in 2023 specifically for code generation benchmarks. AutoGen v0.4 (2025) rebuilt the runtime on an actor model, enabling horizontal scaling across machines (Galileo — AutoGen vs CrewAI vs LangGraph vs OpenAI).

Best for: Code generation and execution loops, mathematical reasoning with iterative critique, rapid prototyping, customer-facing conversational applications.

Not ideal for: Structured workflows requiring deterministic branching, production systems requiring auditability of control flow.

Role-Based: CrewAI

Core tradeoff: Intuitive cognitive model and fast setup, in exchange for less fine-grained control. Agents are employees with defined roles, goals, and backstories; tasks are assigned based on role. The mental model maps naturally to team structures (DataCamp — CrewAI vs LangGraph vs AutoGen).

Why it exists: Many real-world workflows are team-shaped: a researcher gathers information, a writer drafts, an editor reviews. This intuitive model lowers the conceptual barrier for non-engineering stakeholders and produces prototypes ~40% faster than LangGraph by some benchmarks. The LanceDB-backed semantic memory system allows agents to accumulate knowledge across runs — something no other framework provides natively (Let's Data Science — AI Agent Frameworks 2026).

Best for: Content creation pipelines, customer support triage with specialist routing, structured team-like workflows with clear role boundaries.

Not ideal for: Complex conditional branching that doesn't fit a crew metaphor, workflows requiring deterministic replay or fine-grained execution control.

Lightweight Handoff: OpenAI Agents SDK

Core tradeoff: Minimal opinions and minimal setup, in exchange for building orchestration patterns yourself. The SDK provides Agent, handoff(), Runner, Session, and guardrails — but does not impose graph structure or conversation history management patterns. It is deliberately a set of building blocks (Langfuse — Comparing Open-Source AI Agent Frameworks).

Best for: Simple routing workflows, OpenAI-native apps leveraging built-in tools (web search, code interpreter, file search), teams that want to stay close to the raw API.

Framework Selection Matrix

Requirement	Best Framework	Reason
Complex branching workflows	LangGraph	Explicit edges, conditional routing, `Command` for atomic routing+mutation
Compliance / auditability	LangGraph	State transition logs, checkpoint replay, visual graph
Long-running workflows (days/weeks)	LangGraph	Checkpoint-based persistence and crash resume
Code generation / iterative refinement	AutoGen	Conversation-based write/execute/critique loops
Rapid prototyping	CrewAI or AutoGen	Low setup overhead
Content pipelines (researcher → writer → editor)	CrewAI	Role metaphor fits naturally; semantic memory
Cross-run persistent memory	CrewAI	Native LanceDB persistence
Customer support triage	LangGraph or OpenAI Agents SDK	Predictable routing, reliable handoffs
OpenAI-native integration	OpenAI Agents SDK	Official support, built-in tools

Sources: DataCamp, DEV Community 2026, Galileo AI

6. The Orchestration Tax

The "orchestration tax" is the cumulative penalty paid in latency, cost, and quality degradation when adding more agents to a pipeline. It has four components.

Latency Accumulation

Sequential agent chains accumulate latency additively: every LLM call and tool execution in the chain adds its full duration to the total. A single LLM call completes in roughly 800 milliseconds and achieves 60–70% accuracy on complex tasks. An orchestrator-worker flow with reflection loops can reach 95%+ accuracy, but extends latency to 10–30 seconds, a 12–37× increase (Parloa: Why Bad Agentic AI Latency Costs You Customers and Revenue).

Each agent invocation in a chain adds: one LLM call (800ms–3s), tool execution time (variable), and context assembly overhead. For user-facing applications, this is often the deciding factor against multi-agent architectures.
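Because each sequential step waits for the previous one, the per-invocation components listed above sum across the chain. The sketch below works through one case; `chain_latency` is a hypothetical helper, and every number except the 0.8 s single-call figure is an illustrative assumption rather than a measurement.

```python
def chain_latency(steps: list[tuple[float, float]], overhead_s: float = 0.0) -> float:
    """Total seconds for a sequential chain of (llm_seconds, tool_seconds) steps."""
    return sum(llm + tool + overhead_s for llm, tool in steps)


# One plain LLM call, using the ~800 ms figure from the text.
single = chain_latency([(0.8, 0.0)])

# Assumed pipeline: four worker invocations with tools, then two reflection passes.
pipeline = chain_latency([(2.0, 0.5)] * 4 + [(1.5, 0.0)] * 2, overhead_s=0.1)

print(round(single, 2), round(pipeline, 2), round(pipeline / single, 1))
```

Under these assumptions the pipeline takes 13.6 seconds, 17 times the single call, which falls inside the 10–30 second range quoted above. Steps that can run in parallel contribute only their slowest member, so fanning out independent workers is the main lever for reducing the total.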

Error Propagation

The most important empirical finding on orchestration tax comes from arXiv:2603.04474 — Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Systems (March 2026). The paper seeded a single error into a 4-agent network and measured propagation over 5 rounds:

100% Final Infection Rate

In 5 out of 6 tested frameworks, a single seeded error reaches 100% final infection rate across the entire agent network. Minor inaccuracies "solidify into system-level false consensus through iteration and context reuse."

Framework	Topology	Agents	Final Error Rate
MetaGPT	Chain	4	100.0%
LangGraph	Star	4	100.0%
CrewAI	Star	4	100.0%
AutoGen	Mesh	3	100.0%
Camel	Mesh	3	100.0%
LangChain	Chain	4	89.2%

Hub impact factor: In LangGraph's star topology, hub infection = 100% while leaf infection = 9.7%, producing an impact factor of 10.31× — a hub node is 10× more likely to cause system-wide error than a leaf node. CrewAI's hub impact factor is 6.29× (arXiv:2603.04474).

Consensus inertia: An error introduced at t=6 requires 3.9× more rounds to correct than an error introduced at t=2, because prior agents have already built consensus around the false information.

Defense: A genealogy-based governance layer raises the defense success rate (BICR) from 0.32 (baseline reflection) to 0.89–0.94, at a cost of ~49% more tokens and ~2.1× longer latency in strict mode.

The 45% Rule: Diminishing Returns

Research reported in Towards Data Science establishes the saturation threshold:

Multi-agent coordination provides its highest returns when baseline single-agent performance is below 45%
If the base model already achieves 80%, adding agents may generate more noise than value
Performance increases initially but plateaus — often around 4 agents — after which additional agents contribute marginally

Poorly coordinated "bag of agents" networks can experience up to 17.2× error amplification. Centralized coordination limits amplification to ~4.4×.

Benchmark results vary dramatically by task type:

Benchmark	Task Type	Key Finding
Finance-Agent (2025)	Structured analyst reasoning	Centralized MAS outperforms single agent by +80.8%
BrowseComp-Plus (2025)	Web retrieval, multi-site	Decentralized +9.2%; Centralized +0.2%; Independent −35%
PlanCraft (2024)	Long-horizon Minecraft planning	All MAS variants: −39% to −70%
WorkBench (2024)	Business tool selection	Minimal movement: −11% to +6%

The key variable is task decomposability: MAS works when tasks can be genuinely parallelized. Sequential tasks see no benefit, only coordination overhead.

Cost Compounding

Total MAS cost = Work cost + Coordination cost (Towards Data Science — Why Your Multi-Agent System is Failing):

Work cost        = n × k × (input_tokens + output_tokens per step)
Coordination cost = (r × n + d × n² + p × n² × m) × token_cost

Where n = agents, k = max iterations per agent, r = orchestrator rounds, d = debate rounds, p = peer-communication rounds, m = average peer requests per round.
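The two formulas can be evaluated directly to see where the quadratic term starts to dominate. The source does not pin down units, so this sketch makes an assumption: `token_cost` is read as tokens consumed per coordination message, which puts both results in tokens. `mas_cost` is a hypothetical helper and the input values are illustrative.

```python
def mas_cost(
    n: int,
    k: int,
    tokens_per_step: int,
    r: int = 0,
    d: int = 0,
    p: int = 0,
    m: int = 0,
    token_cost: int = 1,   # assumed: tokens per coordination message
) -> tuple[int, int]:
    """Return (work, coordination) in tokens for n agents."""
    work = n * k * tokens_per_step
    coordination = (r * n + d * n**2 + p * n**2 * m) * token_cost
    return work, coordination


# Centralized: one orchestrator round, no debate or peer traffic.
central = mas_cost(n=4, k=3, tokens_per_step=1000, r=1, token_cost=500)

# Decentralized debate: two debate rounds among the same four agents.
debate_4 = mas_cost(n=4, k=3, tokens_per_step=1000, r=1, d=2, token_cost=500)
debate_8 = mas_cost(n=8, k=3, tokens_per_step=1000, r=1, d=2, token_cost=500)

print(central, debate_4, debate_8)
```

Doubling the agent count from 4 to 8 doubles the work term (12,000 to 24,000 tokens) but raises the coordination term from 18,000 to 68,000 tokens, almost fourfold, because the debate component scales with n².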

The n² term in decentralized topologies means coordination costs grow quadratically with agent count. A 4-agent decentralized debate can easily cost 8–16× more per query than a single agent.

In practice, multi-agent at ~$0.08/request vs. single-agent at ~$0.03/request — a 2.7× cost multiplier — has been documented even for relatively simple orchestration setups (YouTube — Stop Building Multi-Agent Systems).

Context Window Pressure

In full-history-replay architectures, every agent in a chain sees not just its own task context but the accumulated history from all prior agents. In a 4-agent sequential chain where each agent produces 2,000 tokens of reasoning and tool results, the final agent processes ~6,000 tokens of prior context before it starts its own task, and the transcript reaches ~8,000 tokens once it finishes.

For gpt-4o at $5/1M input tokens, this is a material cost. It is also a real accuracy risk: LLMs degrade in long-context retrieval — the "lost in the middle" problem, where information in the middle of a long context window is retrieved less reliably than information at the beginning or end.
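The accumulation is simple arithmetic, worked through below under the assumption that every agent adds the same 2,000 tokens and that nothing is trimmed or summarized between agents. `prior_context_tokens` is a hypothetical helper; the $5 per 1M input tokens rate is the figure quoted above.

```python
def prior_context_tokens(agent_index: int, tokens_per_agent: int) -> int:
    """Upstream tokens inherited by the agent at 1-based position `agent_index`."""
    return (agent_index - 1) * tokens_per_agent


TOKENS_PER_AGENT = 2000
PRICE_PER_MILLION_INPUT = 5.0   # USD, rate quoted in the text

inherited = [prior_context_tokens(i, TOKENS_PER_AGENT) for i in range(1, 5)]
total_inherited = sum(inherited)
replay_cost = total_inherited / 1_000_000 * PRICE_PER_MILLION_INPUT

print(inherited, total_inherited, round(replay_cost, 4))
```

The four agents inherit 0, 2,000, 4,000, and 6,000 tokens respectively, 12,000 tokens of re-read context across the chain, or about $0.06 per request at the quoted rate before any agent's own prompt and output are counted.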

When Multi-Agent Is Worth It

Use multi-agent architectures when: (1) tasks are genuinely decomposable into parallel subtasks, (2) baseline single-agent accuracy is below ~45%, and (3) you can afford the 2–3× cost and 12–37× latency premium. For linear tasks, a single well-prompted agent with good tools almost always wins on cost, speed, and reliability.
