Function Calling Network Patterns
A single-shot LLM call has a simple network shape: one request out, one streamed response back. Function calling turns that into a multi-round conversation between the LLM, the client, and any number of external tools. Each tool invocation is an extra network round trip — to the LLM, then to the tool, then back to the LLM. The throughput and latency profile of an agentic application is determined less by individual call cost and more by how many of these round trips end up on the critical path.
The single-tool round trip
- Client → LLM: user message + tool definitions.
- LLM → Client: "call tool X with arguments Y." No user-visible text.
- Client executes tool X locally or against an external service.
- Client → LLM: full conversation + tool result.
- LLM → Client: final user-visible response (streamed).
Two LLM round trips, one tool execution. Total user-visible TTFT is the sum: LLM prefill + tool time + LLM prefill again + first decode token.
Multi-tool sequential workflow
Some tasks require chained tools — search, then summarize, then translate, then format. Each step depends on the previous. The pattern:
LLM → tool_1 → LLM → tool_2 → LLM → tool_3 → LLM → final response
Four LLM round trips, three tool executions. If each LLM call takes ~1s of TTFT and each tool ~200 ms, that's ~5 seconds before the user sees text. Streaming the final response helps perceived latency but not total time.
Parallel tool calls
When tools are independent (look up the weather and the stock price), the LLM can emit multiple tool calls in a single response. The client executes them in parallel:
LLM → [tool_1, tool_2, tool_3] simultaneously → LLM → response
Three independent tool calls collapse from 4 LLM round trips to 2. The wall-clock saving is large — often the difference between a snappy agent and a sluggish one.
Conversation history grows every round
Each LLM round trip sends the entire conversation history as prefill input. That includes:
- The original system prompt.
- The user message.
- Every prior tool call (request + result).
- The model's intermediate reasoning, if any.
By the third or fourth round, prefill is dominated by prior tool calls and results. Without prompt caching, this is a linear-in-history cost; with prompt caching, the cached portion grows after each round and each new turn is roughly constant-cost.
Streaming tool calls
When the LLM emits a tool call, the API can stream the tool name and argument JSON as the model produces it. The client typically waits for the complete tool call before executing — partial arguments are not actionable. Streaming during the tool-call portion is mostly cosmetic; the visible user-facing pause is from waiting for the tool to execute, then for the next LLM call's prefill.
Some applications show a UI indicator ("Looking up...") during the tool call to give the user feedback. The implementation reads the stream, detects the start of a tool call, shows the indicator, and resumes the visible text once the next streaming-response cycle begins.
Failure modes
- Tool returns an error. Client passes the error back to the LLM, which decides whether to retry, ask for clarification, or give up. Multiple retry rounds inflate latency.
- Tool times out. Client cancels and reports timeout. LLM typically apologizes; total time was wasted.
- LLM emits malformed tool call. Client cannot execute. Strict JSON-schema validation by the LLM API helps; otherwise the client must reject and re-prompt.
- LLM cycles between tools. Without explicit limits, an agentic loop can call tools indefinitely. Always set a maximum round count.
Tool execution placement
Tools can run anywhere:
- In-process (the client itself executes). Fastest, no extra network hop.
- Internal microservice. Adds one RTT but stays in your own infrastructure.
- External API. Adds RTT to the external service plus that service's response time.
- User confirmation required. Latency is dominated by user response time, not network.
Critical-path latency depends on where the slowest tool sits. For latency-sensitive agents, in-process or co-located tools are essential.
Prompt caching in tool-use workflows
Tool-use conversations are unusually friendly to prompt caching because most of the history is stable between rounds. Each new round adds one tool result to the end; everything before is identical. With caching enabled, the cached portion grows each round and only the new tool result is freshly prefilled. Cost per round becomes roughly constant instead of linear in history length.
Designing for fewer rounds
The biggest latency wins come from reducing round count:
- Give the LLM access to richer tools that do more per call (e.g., a search-and-summarize tool instead of separate search + summarize calls).
- Encourage parallel tool calls through prompt design and tool naming.
- Pre-fetch obviously-needed tool results before the first LLM call.
- Cache deterministic tool results across conversations.
Frequently Asked Questions
What is function calling in an LLM API?
Function calling (also called tool use) is a mechanism where the LLM, instead of answering directly, emits a structured request to call an external function with specific arguments. The client receives that request, executes the function, and sends the result back to the LLM in a follow-up message. The LLM then continues generating with the new information. It is the foundation of agentic and integrated LLM applications.
How many network round trips does a tool call take?
Each tool call involves two LLM round trips at minimum: the first call returns a function-call request, then a second call delivers the function result and resumes generation. For multi-tool agentic workflows, this multiplies: N tool calls means N+1 LLM round trips on the critical path. Latency scales linearly with the number of sequential tools.
Can tool calls be parallel?
Yes. Modern LLM APIs let the model emit multiple tool calls in one response. The client executes them in parallel, collects the results, and returns them all in the next LLM message. This collapses what would be N sequential round trips into one — a major latency win when the tool calls are independent.
How does streaming interact with tool calls?
Streaming continues to work, but when the LLM emits a tool call it usually emits the entire call as a structured block, not token-by-token user-facing text. The application detects the tool-call event in the stream, suspends user-visible streaming, performs the call, and resumes streaming after sending the result back. From the user's perspective, the response may show a brief pause while a tool is being called.
What is the cost impact of multi-round tool use?
Each round counts the full conversation history (including prior tool calls and results) as prefill input. Token usage grows quickly with conversation length. Prompt caching helps because the conversation history is mostly identical between rounds. Without caching, a 5-round tool-use conversation can cost 3-5x what a single-round equivalent would, even though the user sees one logical interaction.
Related Guides
RAG Architecture
Retrieval is the most common tool the LLM calls.
Prompt Caching
The single biggest cost optimization for tool-use conversations.
Context Windows and Token Budgets
Why multi-round workflows hit context limits faster than chat.
LLM API Latency
The per-call cost that gets multiplied by round count.
More From This Section
All AI & LLM Networking Guides
LLM API latency, streaming, prompt caching, RAG, and inference architecture.
AI Inference: Edge vs Cloud
How to choose between on-device, edge-network, and centralized cloud inference — covering latency, bandwidth, privacy,…
Batching vs Streaming Tradeoffs
How static, dynamic, and continuous batching affect LLM throughput and per-request latency, and why streaming output is…
Run a Speed Test
Measure download, upload, ping, and jitter in your browser.