RAG Architecture Network Patterns

Q: What is RAG?

Retrieval-augmented generation is a pattern where an LLM's prompt is dynamically populated with relevant retrieved documents at query time. The retriever (typically a vector search over embedded documents) finds passages relevant to the user's question; the generator (the LLM) writes a response grounded in those passages. The combination gives the LLM access to information that wasn't in its training data without retraining.

Q: What are the network hops in a RAG query?

A typical RAG query involves: client to application server, application server to embedding API (to embed the query), embedding API back, application server to vector database (to find similar documents), vector database back with top-k results, application server to LLM API (to generate the response with retrieved context), LLM API back as a stream. Counted as one-way legs that is seven; counted as request/response round trips it is four. Either way, they run serially on the critical path.

Q: Where is RAG latency concentrated?

Most of it is in the LLM generation step, especially prefill, which takes longer because the prompt includes retrieved documents. Embedding the query is fast (roughly 50-200 ms including the round trip to the embedding service). Vector search is fast (often under 100 ms for indexed databases). The LLM call dominates because retrieved context can add thousands of tokens of prefill before any output token is generated.

Q: How does prompt caching help RAG?

When retrieved documents are reused across queries (popular questions, shared sessions), prompt caching can skip the prefill cost for the cached document context. Combined with a stable system prompt, the cached portion can be most of the input, leaving only the user's specific question to be prefilled fresh.

Q: Should the vector database be near the LLM or near the application?

Near the application. The retrieval result (a few thousand tokens of text) travels from vector DB to application, then from application into the LLM prompt. Co-locating the vector DB and application server reduces the retrieval hop. The LLM API is typically a managed external service whose location is fixed and out of your control.

Retrieval-augmented generation looks simple in tutorials: embed a query, look it up in a vector database, paste the results into an LLM prompt, return the answer. In production, the same pattern unfolds into a multi-service architecture with several serial network hops, a retrieval stage whose latency matters, and an LLM call whose prefill cost scales with how much retrieved context you stuff into the prompt. The network shape of RAG determines both latency and cost.

The minimal RAG flow

Client sends user query to application server.
App server calls embedding API to embed the query (~50-200 ms).
App server queries vector database with the embedding (~10-100 ms for indexed databases).
App server constructs an LLM prompt containing system instructions + retrieved documents + user query.
App server calls LLM API; receives streamed response.
App server forwards the stream to the client.

Five of these six steps are network operations (prompt construction is local to the app server), and four of them must complete in series before the first token can be streamed back to the user. Each hop is an opportunity for added latency, partial failure, or cost.
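The flow above can be sketched as a short pipeline that records how long each step takes. This is only an illustration: embed_query, vector_search, and call_llm are hypothetical local stand-ins for the three remote services, not any real client library.

```python
import time
from typing import Callable, Dict, List, Tuple


def timed(label: str, timings: Dict[str, float], fn: Callable, *args):
    """Run fn(*args) and record its wall-clock duration in milliseconds."""
    start = time.perf_counter()
    result = fn(*args)
    timings[label] = (time.perf_counter() - start) * 1000.0
    return result


# Hypothetical stand-ins: in production each of these is a network call.
def embed_query(query: str) -> List[float]:
    return [float(len(query)), 1.0]


def vector_search(embedding: List[float], k: int) -> List[str]:
    corpus = ["doc about caching", "doc about latency", "doc about reranking"]
    return corpus[:k]


def call_llm(prompt: str) -> str:
    return f"answer based on {prompt.count('[doc]')} documents"


def build_prompt(system: str, docs: List[str], query: str) -> str:
    """Local step: system instructions + retrieved documents + user query."""
    doc_block = "\n".join(f"[doc] {d}" for d in docs)
    return f"{system}\n\n{doc_block}\n\nQuestion: {query}"


def answer(query: str, k: int = 2) -> Tuple[str, Dict[str, float]]:
    """Run the serial steps in order and return the reply plus per-step timings."""
    timings: Dict[str, float] = {}
    embedding = timed("embed", timings, embed_query, query)
    docs = timed("search", timings, vector_search, embedding, k)
    prompt = timed("prompt", timings, build_prompt,
                   "Answer from the documents.", docs, query)
    reply = timed("llm", timings, call_llm, prompt)
    return reply, timings
```

Recording a timing per step, rather than one total, is what makes it possible to see which hop is responsible when the end-to-end latency moves.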

Where latency goes

Step	Typical latency	Dominated by
Client → app server	30-100 ms	Network RTT
Query embedding	50-200 ms	Inference + RTT to embedding service
Vector search	10-100 ms	Index latency + RTT to vector DB
Prompt construction	< 10 ms	App server CPU
LLM TTFT (with retrieval context)	500 ms-3 s	Prefill of long context
LLM streaming TPOT	10-50 ms per token × output length	Decode + network framing

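The LLM call is by far the largest single component. RAG queries with 2000-5000 retrieved tokens have prefill costs, and therefore time to first token (TTFT), measured in hundreds of milliseconds to a few seconds. Each output token then adds its own time per output token (TPOT).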
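As a rough worked example, the ranges in the table can be summed to see how much of the time to first token the LLM accounts for. The sketch treats each row as a single serial cost and reads "< 10 ms" as 0-10 ms; both are simplifying assumptions.

```python
from typing import Dict, Tuple

# (low_ms, high_ms) per step, copied from the table above.
# Assumption: "< 10 ms" for prompt construction is taken as 0-10 ms.
STEPS: Dict[str, Tuple[int, int]] = {
    "client_to_app": (30, 100),
    "query_embedding": (50, 200),
    "vector_search": (10, 100),
    "prompt_construction": (0, 10),
    "llm_ttft": (500, 3000),
}


def time_to_first_token(steps: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum the best-case and worst-case latency of every serial step."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


def llm_share(steps: Dict[str, Tuple[int, int]]) -> Tuple[float, float]:
    """Fraction of time to first token spent in the LLM call."""
    low, high = time_to_first_token(steps)
    return steps["llm_ttft"][0] / low, steps["llm_ttft"][1] / high
```

With the table's figures this gives 590 ms to 3410 ms to first token, with the LLM call accounting for roughly 85-88% of it at either end of the range.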

Co-location and placement

Putting the embedding service, vector database, and application server in the same region (or even the same VPC) cuts the inter-service RTT from tens of milliseconds to single-digit milliseconds. The LLM API is generally an external managed service whose location is fixed; you cannot move it closer, but you can choose which provider and which regional endpoint you call.

For latency-critical RAG, choose an LLM provider with regional endpoints near your application and co-locate every other component with the application.

Prompt caching and RAG

Prompt caching can dramatically reduce LLM cost and latency for RAG when:

The system prompt is stable across calls.
Retrieved documents are sometimes reused (popular queries, agent loops).
The prompt prefix (system prompt + retrieved docs) is large relative to the user's specific query.

For workloads where retrieved context turns over completely each query, prompt caching helps less because the cached portion doesn't repeat. See prompt caching for the mechanics.
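The sketch below shows one way to lay out a prompt so that the reusable part comes first. It assumes the provider's cache matches on an exact prompt prefix, which is how prompt caches commonly work; the function names and the document-ID convention are hypothetical, and character counts stand in for token counts.

```python
import hashlib
from typing import List, Tuple


def build_prompt_parts(system: str, docs: List[Tuple[str, str]],
                       query: str) -> Tuple[str, str]:
    """Return (cacheable_prefix, fresh_suffix).

    docs is a list of (doc_id, text). Sorting by doc_id makes the prefix
    byte-identical whenever the same set of documents is retrieved,
    regardless of the order the vector DB returned them in.
    """
    ordered = sorted(docs, key=lambda d: d[0])
    doc_block = "\n\n".join(f"[{doc_id}]\n{text}" for doc_id, text in ordered)
    prefix = f"{system}\n\n{doc_block}\n\n"
    suffix = f"Question: {query}"  # the only part that changes per query
    return prefix, suffix


def prefix_key(prefix: str) -> str:
    """Stable fingerprint of the prefix, useful for measuring reuse in logs."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


def cacheable_fraction(prefix: str, suffix: str) -> float:
    """Share of the prompt that could be served from cache (by characters)."""
    return len(prefix) / (len(prefix) + len(suffix))
```

Sorting by document ID trades relevance ordering for a stable prefix, so it is a design choice rather than a default: if the order of retrieved documents matters for answer quality, keep the retriever's order and accept fewer cache hits.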

Vector database architecture choices

Architecture	Pros	Cons
Managed cloud vector DB	No ops; scales horizontally	Network cost; outbound data egress; per-query pricing
Self-hosted (Qdrant, Weaviate, Milvus, etc.)	Full control; predictable cost	Ops overhead; capacity planning
Embedded library (FAISS, hnswlib)	Zero network hops	Single-process; no horizontal scaling; restart loses warm caches
Postgres + pgvector	Same DB as your relational data	Performance ceiling lower than dedicated stores at scale

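For small-to-medium corpora (under ~10M vectors), embedded or Postgres-based options can outperform dedicated services on latency: an embedded library removes the retrieval network hop entirely, and pgvector reuses the database the application already talks to instead of adding a separate service.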
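To make the "zero network hops" point concrete, here is a minimal in-process search in pure Python. It is a brute-force cosine scan, not the approximate nearest-neighbour index that FAISS or hnswlib would build, and the dictionary-based index is an assumption made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], index: Dict[str, List[float]],
          k: int) -> List[Tuple[str, float]]:
    """Score every vector in the in-memory index and return the k best."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Retrieval here is an ordinary function call inside the application process, which is the latency advantage. The cost is the one listed in the table: the index lives in one process and has to be rebuilt or reloaded after a restart.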

Hybrid retrieval: semantic + keyword

Pure vector search misses some queries that keyword search handles trivially (exact identifiers, code symbols, rare names). Hybrid retrieval combines:

Dense retrieval — vector similarity (embedding-based).
Sparse retrieval — keyword scoring (BM25 or TF-IDF).
Reciprocal Rank Fusion or weighted combination to merge the two.

From the network shape perspective, hybrid retrieval often adds a second backend query (the keyword index) but the two can run in parallel, so the wall-clock cost is the slower of the two.
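A minimal sketch of the parallel fan-out followed by Reciprocal Rank Fusion. The dense and sparse arguments are hypothetical callables standing in for the two backends, and k = 60 is a commonly used smoothing constant rather than a requirement.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank))."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc_id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(query: str,
                  dense: Callable[[str], List[str]],
                  sparse: Callable[[str], List[str]]) -> List[str]:
    """Query both backends concurrently, then fuse the two rankings."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        dense_future = pool.submit(dense, query)
        sparse_future = pool.submit(sparse, query)
        return reciprocal_rank_fusion(
            [dense_future.result(), sparse_future.result()]
        )
```

Because both queries are submitted before either result is awaited, the wall-clock time is roughly that of the slower backend plus the in-memory merge.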

Reranking

A two-stage retrieval pattern: cheap first-pass retrieves top-K (say, 100 documents), then a more expensive reranker model scores those K documents against the query and picks the top N (say, 5) to send to the LLM. Reranking adds another network hop but improves precision substantially — RAG quality often improves more from better retrieval than from a stronger LLM.
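The two-stage shape can be sketched as follows. The first_pass and score callables are hypothetical, and overlap_score is a toy word-overlap scorer standing in for a real reranker model, which would normally be one batched remote call.

```python
from typing import Callable, List


def retrieve_then_rerank(query: str,
                         first_pass: Callable[[str, int], List[str]],
                         score: Callable[[str, str], float],
                         k: int = 100, n: int = 5) -> List[str]:
    """Fetch k cheap candidates, rescore each against the query, keep the top n."""
    candidates = first_pass(query, k)
    scored = [(score(query, doc), doc) for doc in candidates]
    # Stable sort: candidates with equal scores keep their first-pass order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:n]]


def overlap_score(query: str, doc: str) -> float:
    """Toy scorer: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)
```

Only the n documents that survive reranking reach the LLM prompt, so the extra hop also trims the prefill cost of the call that follows it.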

Streaming a response while retrieving more

For complex queries, the LLM may decide it needs additional retrieval after generating part of the response (agentic RAG). The pattern:

LLM streams response.
LLM emits a tool-call indicating it needs to retrieve more.
App server pauses the stream, performs retrieval, resumes the LLM with the new context.
LLM continues streaming.

From a network perspective this looks like multiple LLM call rounds with retrieval between them. The user sees a streamed response that pauses briefly when retrieval happens. For more on the tool-calling shape see function calling network patterns.
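The loop below sketches those rounds under stated assumptions: llm_round is a hypothetical callable that yields either text events or a tool-call event, and the event dictionaries are an invented shape for illustration, not any provider's wire format.

```python
from typing import Callable, Dict, Iterator, List


def run_agentic_rag(question: str,
                    llm_round: Callable[[List[str]], Iterator[Dict[str, str]]],
                    retrieve: Callable[[str], List[str]],
                    max_rounds: int = 3) -> Iterator[str]:
    """Yield text chunks to the client, pausing for retrieval between rounds."""
    context: List[str] = [f"Question: {question}"]
    for _ in range(max_rounds):
        tool_query = None
        for event in llm_round(context):
            if event["type"] == "text":
                yield event["text"]  # forwarded to the client immediately
                context.append("Assistant: " + event["text"])
            elif event["type"] == "tool_call":
                tool_query = event["query"]  # the visible stream pauses here
                break
        if tool_query is None:
            return  # the model finished without asking for more context
        for doc in retrieve(tool_query):
            context.append("Retrieved: " + doc)
```

The max_rounds cap bounds both latency and cost: every extra round is another full LLM call whose prompt now includes everything retrieved so far.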

Failure modes

Embedding API failure: degrade gracefully by serving a cached embedding when the same query has been embedded before, or fall back to keyword search.
Vector DB unreachable: fall back to keyword search or a cached corpus.
LLM timeout: retry with shorter context, or return a graceful "I need more time" response.
Partial stream interruption: clients need to detect dropped streams and either restart the request or recover gracefully with the partial output.

RAG systems have more failure points than direct LLM calls. Plan retries, timeouts, and fallback paths for each hop.
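A retry-then-fallback chain for the retrieval hop might look like the sketch below. It assumes each callable enforces its own timeout (for example through its client library's timeout setting) and raises an exception on failure; the names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def retrieve_with_fallbacks(
    paths: Sequence[Tuple[str, Callable[[], List[str]]]],
    retries: int = 1,
) -> Tuple[str, List[str]]:
    """Try each retrieval path in order, retrying before moving to the next.

    Returns (name_of_path_that_worked, documents). Raises RuntimeError only
    when every path has failed on every attempt.
    """
    errors: List[str] = []
    for name, fetch in paths:
        for attempt in range(1, retries + 2):
            try:
                return name, fetch()
            except Exception as exc:  # timeouts surface here as exceptions
                errors.append(f"{name} attempt {attempt}: {exc!r}")
    raise RuntimeError("all retrieval paths failed: " + "; ".join(errors))
```

Returning the name of the path that succeeded lets the application log how often it is running degraded, which is otherwise invisible once fallbacks are in place.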

Related Guides

Embedding API Networking

The embedding stage that produces RAG's query and document vectors.

Prompt Caching

The biggest single cost lever in RAG workloads.

Function Calling Patterns

Multi-round LLM patterns that interact with RAG.

Context Windows and Token Budgets

How much retrieved context you can fit before quality degrades.