RAG Architecture Network Patterns
Retrieval-augmented generation looks simple in tutorials: embed a query, look it up in a vector database, paste the results into an LLM prompt, return the answer. In production, the same pattern unfolds into a multi-service architecture with several serial network hops, a retrieval stage whose latency matters, and an LLM call whose prefill cost scales with how much retrieved context you stuff into the prompt. The network shape of RAG determines both latency and cost.
The minimal RAG flow
- Client sends user query to application server.
- App server calls embedding API to embed the query (~50-200 ms).
- App server queries vector database with the embedding (~10-100 ms for indexed databases).
- App server constructs an LLM prompt containing system instructions + retrieved documents + user query.
- App server calls LLM API; receives streamed response.
- App server forwards the stream to the client.
Five distinct network operations, four of them on the critical path before any token reaches the user. Each hop is an opportunity for added latency, partial failure, or cost.
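The flow above can be sketched as a single request handler. This is a minimal illustration, not a reference implementation: `embed_query`, `vector_search`, and `llm_stream` are hypothetical stand-ins for real network clients, and the per-hop timing dictionary is an assumption added to make the serial structure visible.

```python
import time
from typing import Iterator

# All three backends below are hypothetical stubs standing in for network calls.
def embed_query(text: str) -> list[float]:
    """Stand-in for the embedding API call."""
    return [float(len(word)) for word in text.split()]

def vector_search(embedding: list[float], top_k: int = 3) -> list[str]:
    """Stand-in for the vector database query."""
    corpus = ["doc about latency", "doc about caching", "doc about reranking"]
    return corpus[:top_k]

def llm_stream(prompt: str) -> Iterator[str]:
    """Stand-in for a streaming LLM API call."""
    yield from ["Retrieved", " context", " answer."]

def build_prompt(system: str, docs: list[str], query: str) -> str:
    # Local CPU work only: no network hop here.
    context = "\n".join(f"- {d}" for d in docs)
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {query}"

def answer(query: str, timings: dict[str, float]) -> Iterator[str]:
    """Run the serial hops in order, recording each hop's duration in seconds."""
    def timed(name, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        timings[name] = time.perf_counter() - start
        return result

    embedding = timed("embed", embed_query, query)
    docs = timed("search", vector_search, embedding)
    prompt = timed("prompt", build_prompt, "Answer from the context.", docs, query)
    # Tokens are forwarded to the caller as they arrive from the LLM.
    yield from llm_stream(prompt)
```

Because `answer` is a generator, nothing runs until the caller starts iterating, and every hop before `llm_stream` has to finish before the first token can be yielded.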
Where latency goes
| Step | Typical latency | Dominated by |
|---|---|---|
| Client → app server | 30-100 ms | Network RTT |
| Query embedding | 50-200 ms | Inference + RTT to embedding service |
| Vector search | 10-100 ms | Index latency + RTT to vector DB |
| Prompt construction | < 10 ms | App server CPU |
| LLM TTFT (with retrieval context) | 500 ms-3 s | Prefill of long context |
| LLM streaming TPOT | 10-50 ms per token × output length | Decode + network framing |
The LLM call is by far the largest single component. RAG queries with 2000-5000 retrieved tokens have prefill costs measured in hundreds of milliseconds to a few seconds.
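As a rough check on that claim, the table's ranges can be summed to estimate time to first token. The sketch below assumes the steps run strictly in series, treats "< 10 ms" as a 0-10 ms range, and ignores the time for the first token to travel back to the client; the dictionary keys are illustrative names, not part of any API.

```python
# (low_ms, high_ms) for each step before the first token, taken from the table.
STEPS_MS = {
    "client_to_app": (30, 100),
    "query_embedding": (50, 200),
    "vector_search": (10, 100),
    "prompt_construction": (0, 10),   # assumption: "< 10 ms" read as 0-10 ms
    "llm_ttft": (500, 3000),
}

def time_to_first_token(steps: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the best-case and worst-case latencies of the serial steps."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high

def llm_share(steps: dict[str, tuple[int, int]]) -> tuple[float, float]:
    """Fraction of the total attributable to LLM TTFT at each end of the range."""
    low, high = time_to_first_token(steps)
    return steps["llm_ttft"][0] / low, steps["llm_ttft"][1] / high

low, high = time_to_first_token(STEPS_MS)
print(low, high)  # 590 3410
```

Under these assumptions the LLM's time to first token accounts for roughly 85-88% of the total at both ends of the range.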
Co-location and placement
Putting the embedding service, vector database, and application server in the same region (or even same VPC) cuts the inter-service RTT from tens of milliseconds to single-digit milliseconds. The LLM API is generally an external managed service whose location is fixed; you cannot move it closer.
For latency-critical RAG, choose an LLM provider with regional endpoints near your application and co-locate every other component.
Prompt caching and RAG
Prompt caching can dramatically reduce LLM cost and latency for RAG when:
- The system prompt is stable across calls.
- Retrieved documents are sometimes reused (popular queries, agent loops).
- The prompt prefix (system prompt + retrieved docs) is large relative to the user's specific query.
For workloads where retrieved context turns over completely each query, prompt caching helps less because the cached portion doesn't repeat. See prompt caching for the mechanics.
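One way to reason about this is to measure how much of a prompt repeats, from the start, relative to the previous prompt. The sketch below counts whole segments (system prompt, each document, the query) rather than tokens. That is a simplifying assumption: real provider caches work on token prefixes and have their own rules, and the function names here are hypothetical.

```python
def build_segments(system: str, docs: list[str], query: str) -> list[str]:
    # Stable content first, volatile content last, so a prefix cache
    # can match as much of the prompt as possible.
    return [system, *docs, query]

def shared_prefix_fraction(previous: list[str], current: list[str]) -> float:
    """Fraction of `current` that matches `previous` from the start."""
    shared = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        shared += 1
    return shared / len(current) if current else 0.0

first = build_segments("sys", ["d1", "d2"], "q1")
same_docs = build_segments("sys", ["d1", "d2"], "q2")
new_docs = build_segments("sys", ["d3", "d1"], "q2")

print(shared_prefix_fraction(first, same_docs))  # 0.75
print(shared_prefix_fraction(first, new_docs))   # 0.25
```

When the same documents come back in the same order, only the final query segment differs. When the first retrieved document changes, everything after the system prompt stops matching, even though `d1` appears in both prompts.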
Vector database architecture choices
| Architecture | Pros | Cons |
|---|---|---|
| Managed cloud vector DB | No ops; scales horizontally | Network cost; outbound data egress; per-query pricing |
| Self-hosted (Qdrant, Weaviate, Milvus, etc.) | Full control; predictable cost | Ops overhead; capacity planning |
| Embedded library (FAISS, hnswlib) | Zero network hops | Single-process; no horizontal scaling; restart loses warm caches |
| Postgres + pgvector | Same DB as your relational data | Performance ceiling lower than dedicated stores at scale |
For small-to-medium corpora (under ~10M vectors), embedded or Postgres-based options can outperform dedicated services on latency because they remove a network hop entirely.
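To illustrate what "removing a network hop" means, here is a minimal in-process index using only the standard library. It is a brute-force cosine search written for illustration; libraries such as FAISS and hnswlib use far more efficient index structures, and the `InProcessIndex` class is a hypothetical name, not their API.

```python
import math

class InProcessIndex:
    """Brute-force cosine similarity search held in application memory."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    @staticmethod
    def _normalize(vector: list[float]) -> list[float]:
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("cannot normalize a zero vector")
        return [x / norm for x in vector]

    def add(self, doc_id: str, vector: list[float]) -> None:
        # Store unit vectors so search reduces to a dot product.
        self._items.append((doc_id, self._normalize(vector)))

    def search(self, query: list[float], top_k: int = 3) -> list[str]:
        q = self._normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), doc_id)
            for doc_id, vec in self._items
        ]
        scored.sort(reverse=True)  # highest similarity first
        return [doc_id for _, doc_id in scored[:top_k]]

index = InProcessIndex()
index.add("a", [1.0, 0.0])
index.add("b", [0.0, 1.0])
index.add("c", [1.0, 1.0])
print(index.search([1.0, 0.1], top_k=2))  # ['a', 'c']
```

The search is a plain function call inside the application process, so its latency is bounded by CPU and memory rather than by a round trip. The cost is that the index lives and dies with that process.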
Hybrid retrieval: semantic + keyword
Pure vector search misses some queries that keyword search handles trivially (exact identifiers, code symbols, rare names). Hybrid retrieval combines:
- Dense retrieval — vector similarity (embedding-based).
- Sparse retrieval — keyword scoring (BM25 or TF-IDF).
- Reciprocal Rank Fusion or weighted combination to merge the two.
From the network shape perspective, hybrid retrieval often adds a second backend query (the keyword index) but the two can run in parallel, so the wall-clock cost is the slower of the two.
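A minimal sketch of that parallel shape, with Reciprocal Rank Fusion as the merge step. The two search functions are hypothetical stubs that simulate backend round trips with `asyncio.sleep`; the constant `k = 60` is the value commonly used for RRF, chosen here as an assumption rather than a tuned setting.

```python
import asyncio

async def dense_search(query: str) -> list[str]:
    await asyncio.sleep(0.02)  # simulated vector DB round trip
    return ["doc_a", "doc_b", "doc_c"]

async def sparse_search(query: str) -> list[str]:
    await asyncio.sleep(0.01)  # simulated keyword index round trip
    return ["doc_c", "doc_a", "doc_d"]

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each document as the sum of 1 / (k + rank) over all rankings."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

async def hybrid_search(query: str) -> list[str]:
    # Both backends are queried concurrently, so wall-clock time is
    # roughly that of the slower one rather than the sum of both.
    dense, sparse = await asyncio.gather(dense_search(query), sparse_search(query))
    return reciprocal_rank_fusion([dense, sparse])

print(asyncio.run(hybrid_search("exact identifier lookup")))
# ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear in both rankings (`doc_a`, `doc_c`) rise above documents that appear in only one, which is the behavior hybrid retrieval relies on.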
Reranking
A two-stage retrieval pattern: a cheap first pass retrieves the top K candidates (say, 100 documents), then a more expensive reranker model scores those K documents against the query and keeps the top N (say, 5) to send to the LLM. Reranking adds another network hop but improves precision substantially; RAG quality often improves more from better retrieval than from a stronger LLM.
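The two stages can be sketched as follows. Both scoring functions are toy stand-ins: the first pass is a simple term match and the "reranker" is a Jaccard overlap, used here only to show the shape of the pipeline. A real reranker would be a model call, typically over the network.

```python
def first_pass(query: str, corpus: list[str], k: int) -> list[str]:
    """Cheap recall-oriented stage: keep documents sharing any query term."""
    terms = set(query.lower().split())
    hits = [doc for doc in corpus if terms & set(doc.lower().split())]
    return hits[:k]

def rerank_score(query: str, doc: str) -> float:
    """Toy stand-in for a reranker model: Jaccard overlap of terms."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0

def retrieve(query: str, corpus: list[str], k: int = 100, n: int = 5) -> list[str]:
    candidates = first_pass(query, corpus, k)
    # The expensive scorer only sees the K candidates, never the full corpus.
    ranked = sorted(candidates, key=lambda doc: rerank_score(query, doc), reverse=True)
    return ranked[:n]

corpus = [
    "vector search latency",
    "latency of llm prefill with long context",
    "prefill latency",
    "cooking recipes",
]
print(retrieve("prefill latency", corpus, k=100, n=2))
# ['prefill latency', 'latency of llm prefill with long context']
```

The design point is the asymmetry: K bounds how much work the reranker does, while N bounds how many tokens reach the LLM prompt.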
Streaming a response while retrieving more
For complex queries, the LLM may decide it needs additional retrieval after generating part of the response (agentic RAG). The pattern:
- LLM streams response.
- LLM emits a tool-call indicating it needs to retrieve more.
- App server pauses the stream, performs retrieval, resumes the LLM with the new context.
- LLM continues streaming.
From a network perspective this looks like multiple LLM call rounds with retrieval between them. The user sees a streamed response that pauses briefly when retrieval happens. For more on the tool-calling shape see function calling network patterns.
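The round structure described above can be sketched as a loop. Everything here is hypothetical: `fake_llm` stands in for a streaming model that requests retrieval once, the event dictionaries are an invented format rather than any provider's wire protocol, and `max_rounds` is an assumed safety limit.

```python
from typing import Iterator

def fake_llm(context: list[str]) -> Iterator[dict]:
    """Hypothetical model: asks for retrieval once, then answers."""
    if not context:
        yield {"type": "text", "text": "Let me look that up. "}
        yield {"type": "tool_call", "query": "rag latency"}
    else:
        yield {"type": "text", "text": f"Found: {context[-1]}"}

def retrieve(query: str) -> str:
    """Stand-in for the embedding + vector search hops."""
    return f"notes on {query}"

def run(max_rounds: int = 3) -> Iterator[str]:
    context: list[str] = []
    for _ in range(max_rounds):
        tool_query = None
        for event in fake_llm(context):
            if event["type"] == "text":
                yield event["text"]          # forwarded to the user immediately
            else:
                tool_query = event["query"]  # the stream pauses after this round
        if tool_query is None:
            return                           # the model finished without a tool call
        context.append(retrieve(tool_query)) # retrieval happens between LLM rounds

print("".join(run()))  # Let me look that up. Found: notes on rag latency
```

Each pass through the outer loop is a separate LLM call, so every additional retrieval round adds another prefill to the latency the user experiences.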
Failure modes
- Embedding API failure — graceful degradation: cache last successful embedding, or fall back to keyword search.
- Vector DB unreachable — fall back to keyword search or a cached corpus.
- LLM timeout — retry with shorter context, or return a graceful "I need more time" response.
- Partial stream interruption — clients need to handle dropped streams and either restart or recover gracefully.
RAG systems have more failure points than direct LLM calls. Plan retries, timeouts, and fallback paths for each hop.
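As one example of a per-hop timeout with a fallback path, the sketch below wraps the vector search in `asyncio.wait_for` and falls back to keyword search when it times out or the connection fails. Both search functions are hypothetical stubs, and the 50 ms timeout is an arbitrary illustrative value.

```python
import asyncio

async def vector_search(query: str) -> list[str]:
    await asyncio.sleep(1.0)  # simulate an unresponsive vector DB
    return ["dense hit"]

async def keyword_search(query: str) -> list[str]:
    return ["keyword hit"]

async def retrieve_with_fallback(
    query: str, timeout_s: float = 0.05
) -> tuple[list[str], str]:
    """Return (documents, source) so callers can log which path served them."""
    try:
        docs = await asyncio.wait_for(vector_search(query), timeout=timeout_s)
        return docs, "vector"
    except (asyncio.TimeoutError, ConnectionError):
        # Degrade to the keyword index instead of failing the whole request.
        return await keyword_search(query), "keyword"

print(asyncio.run(retrieve_with_fallback("rag failure modes")))
# (['keyword hit'], 'keyword')
```

Returning the source alongside the documents makes it possible to track how often the fallback path is taken, which is usually the first signal that a backend is degrading.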
Frequently Asked Questions
What is RAG?
Retrieval-augmented generation is a pattern where an LLM's prompt is dynamically populated with relevant retrieved documents at query time. The retriever (typically a vector search over embedded documents) finds passages relevant to the user's question; the generator (the LLM) writes a response grounded in those passages. The combination gives the LLM access to information that wasn't in its training data without retraining.
What are the network hops in a RAG query?
A typical RAG query involves: client to application server, application server to embedding API (to embed the query), embedding API back, application server to vector database (to find similar documents), vector database back with top-k results, application server to LLM API (to generate the response with retrieved context), LLM API back as a stream. Depending on whether you count each request and its response separately, that is five to seven serial network operations on the critical path.
Where is RAG latency concentrated?
Most of it is in the LLM generation step, especially prefill, which takes longer because the prompt includes retrieved documents. Embedding the query is fast (roughly 50-200 ms including the round trip to the embedding service). Vector search is fast (often under 100 ms for indexed databases). The LLM call dominates everything else because retrieved context can add thousands of tokens of prefill before any output token is generated.
How does prompt caching help RAG?
When retrieved documents are reused across queries (popular questions, shared sessions), prompt caching can skip the prefill cost for the cached document context. Combined with a stable system prompt, the cached portion can be most of the input, leaving only the user's specific question to be prefilled fresh.
Should the vector database be near the LLM or near the application?
Near the application. The retrieval result (a few thousand tokens of text) travels from vector DB to application, then from application into the LLM prompt. Co-locating the vector DB and application server reduces the retrieval hop. The LLM API is typically a managed external service whose location is fixed and out of your control.
Related Guides
Embedding API Networking
The embedding stage that produces RAG's query and document vectors.
Prompt Caching
The biggest single cost lever in RAG workloads.
Function Calling Patterns
Multi-round LLM patterns that interact with RAG.
Context Windows and Token Budgets
How much retrieved context can you fit before quality degrades.