Inference Server Architecture
An LLM inference server takes incoming HTTP requests, packs them onto a fixed pool of GPUs, runs the model forward passes, and streams output tokens back over the network. Behind that simple description sits a scheduler, a batcher, a KV cache manager with paged memory, optional tensor and pipeline parallelism across multiple GPUs, and a streaming network layer. Modern open-source servers (vLLM, TGI, TensorRT-LLM, SGLang) all converge on a similar architecture; understanding the pieces is the foundation for capacity planning, performance tuning, and self-hosting.
The component layers
| Layer | Role |
|---|---|
| Network frontend | HTTP/gRPC endpoint, auth, rate limiting, request parsing, streaming response |
| Scheduler | Decides which queued and active requests run on the next GPU iteration |
| Batcher | Packs the chosen requests into a single forward pass |
| KV cache manager | Allocates and reuses KV cache pages across requests |
| Inference engine | Runs the forward pass on GPU(s); applies attention optimizations |
| Tokenizer / detokenizer | Converts text to tokens on input, tokens to text on output |
| Output streamer | Sends each generated token back through the network layer to the client |
The request lifecycle
1. Client sends a JSON HTTP request to the server.
2. Network frontend authenticates, parses, and hands the request to the scheduler queue.
3. Scheduler picks the next batch (including this new request) for the upcoming GPU iteration.
4. KV cache manager allocates pages for the new request's prefill.
5. Inference engine runs one forward pass: prefill for new requests, one decode step for in-flight requests.
6. Detokenizer converts the new tokens to text.
7. Output streamer pushes the text via SSE/WebSocket to the client.
8. Steps 3–7 repeat until the request finishes (max tokens, stop sequence, or end-of-sequence token).
9. KV cache manager reclaims the request's pages for the next batch.
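The exit check for that loop covers the three finish conditions named in the lifecycle. Below is a minimal sketch of how it might look; the function name, the return strings, and the `EOS_TOKEN_ID` value are all hypothetical, and real servers differ in how they match stop sequences across token boundaries.

```python
from typing import Optional, Sequence

# Hypothetical end-of-sequence token id; the real value depends on the tokenizer.
EOS_TOKEN_ID = 2


def finish_reason(
    token_ids: Sequence[int],
    text: str,
    max_tokens: int,
    stop_sequences: Sequence[str] = (),
) -> Optional[str]:
    """Return why generation should stop, or None to keep decoding."""
    # The model emitted its end-of-sequence token.
    if token_ids and token_ids[-1] == EOS_TOKEN_ID:
        return "eos"
    # The detokenized output contains a caller-supplied stop sequence.
    if any(stop in text for stop in stop_sequences):
        return "stop_sequence"
    # The request hit its generation length limit.
    if len(token_ids) >= max_tokens:
        return "max_tokens"
    return None
```

A server would run a check like this once per request after each decode step, and release the request's KV cache pages as soon as it returns a non-`None` value.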
Paged attention in detail
Before paged attention, KV caches for each request were contiguous arrays sized to a worst-case max sequence length. If you allocated 32K tokens of cache per request but most requests were 2K, you wasted 30K tokens of cache per request. Across a batch of 32 requests, that is about 960K tokens' worth of GPU memory reserved but never used.
Paged attention treats the KV cache like virtual memory: allocate small fixed-size pages (e.g., 16 tokens each) on demand. A request that only needs 500 tokens of cache uses ~32 pages; a request that grows to 32K tokens uses ~2000. Pages are reusable instantly when a request finishes. Net effect: much higher effective batch sizes and throughput.
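The page arithmetic and the allocate-on-demand behavior can be sketched with a toy free-list allocator. This is an illustration only, not any server's actual implementation: `PagePool`, `grow`, and `release` are hypothetical names, and the sketch tracks page ids rather than real GPU memory.

```python
import math

PAGE_SIZE = 16  # tokens per page, matching the example above


def pages_needed(num_tokens: int, page_size: int = PAGE_SIZE) -> int:
    """Number of fixed-size pages required to hold num_tokens of KV cache."""
    return math.ceil(num_tokens / page_size)


class PagePool:
    """Toy free-list allocator over a fixed number of KV cache pages."""

    def __init__(self, total_pages: int) -> None:
        self.free = list(range(total_pages))
        self.tables = {}  # request id -> list of page ids owned by that request

    def grow(self, request_id: str, num_tokens: int) -> bool:
        """Ensure the request owns enough pages for num_tokens.

        Returns False (and allocates nothing) if the pool cannot cover it.
        """
        table = self.tables.get(request_id, [])
        missing = pages_needed(num_tokens) - len(table)
        if missing > len(self.free):
            return False
        for _ in range(max(missing, 0)):
            table.append(self.free.pop())
        self.tables[request_id] = table
        return True

    def release(self, request_id: str) -> None:
        """Return all of a finished request's pages to the free list."""
        self.free.extend(self.tables.pop(request_id, []))
```

With these definitions, `pages_needed(500)` is 32 and `pages_needed(32000)` is 2000, matching the figures above. Calling `grow` again as a request's sequence lengthens only allocates the extra pages, and `release` makes them available to the next request immediately.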
The scheduler's job
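During decode, a GPU iteration takes roughly the same amount of time whether the batch is nearly empty or nearly full, because the step is dominated by reading the model weights from GPU memory rather than by compute. The scheduler's goal is to make every iteration as full as possible without violating memory limits. Strategies: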
- Continuous batching: always run; let new requests join an in-flight batch as soon as memory allows.
- Chunked prefill: if a request has a 10K-token prompt and a decode batch has spare capacity, prefill a chunk of the prompt instead of doing the full 10K prefill in one shot. Keeps decode throughput steady.
- Prioritization: long requests can starve short ones; SLA-aware schedulers cap how long any request waits in queue.
- Preemption: rare but supported in some servers — pause a low-priority request to free memory for a high-priority one, then resume.
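Continuous batching and chunked prefill can be combined in one per-iteration planning step. The sketch below assumes a single per-iteration token budget (`token_budget`, a hypothetical parameter) and ignores KV memory checks, priorities, and preemption; `Request` and `plan_iteration` are illustrative names, not a real server's API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens already processed

    @property
    def decoding(self) -> bool:
        """True once the whole prompt has been prefilled."""
        return self.prefilled >= self.prompt_len


def plan_iteration(active: list, token_budget: int) -> dict:
    """Return {request id: tokens to process in the next forward pass}.

    Requests in decode get one token each; whatever budget is left goes to
    prefill chunks, in arrival order.
    """
    plan = {}
    budget = token_budget
    # Decode steps first, so in-flight requests keep streaming.
    for req in active:
        if req.decoding and budget > 0:
            plan[req.rid] = 1
            budget -= 1
    # Spend the remaining budget on partial prefills.
    for req in active:
        if not req.decoding and budget > 0:
            chunk = min(req.prompt_len - req.prefilled, budget)
            plan[req.rid] = chunk
            budget -= chunk
    return plan
```

For example, with two requests already decoding, one new 10K-token prompt, and a budget of 512 tokens, the plan gives each decoding request one token and the new request a 510-token prefill chunk. The rest of its prompt is processed over later iterations.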
Multi-GPU layouts
A model that doesn't fit on one GPU has options:
| Strategy | What it does | Communication |
|---|---|---|
| Tensor parallelism | Splits weight matrices across GPUs in the same layer | All-reduce after every attention and FFN; chatty |
| Pipeline parallelism | Different layers run on different GPUs | One activation transfer per stage boundary; less chatty but adds pipeline bubbles |
| Expert parallelism (MoE) | Different experts on different GPUs; routing across | All-to-all on routing decisions |
| Data parallelism | Same model on multiple GPUs serving different requests | None at inference; just load balancing |
Most production servers use tensor parallelism within a node (NVLink between GPUs in the same machine) and data parallelism across nodes (different replicas serving different traffic). Pipeline parallelism is more common in training than inference.
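To make the tensor-parallel row of the table concrete, here is a small sketch of a row-parallel matrix multiply, one common way to split a weight matrix. Plain Python lists stand in for GPU tensors, each loop iteration stands in for one GPU, and the final elementwise sum plays the role of the all-reduce. The function names are hypothetical, and the shard count is assumed to divide the input dimension evenly.

```python
def matvec(x, w):
    """Compute x @ w for a vector x (length k) and a matrix w (k rows, n columns)."""
    n = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n)]


def row_parallel_matvec(x, w, num_shards):
    """Split w by rows (and x to match), compute partial outputs, then sum them."""
    step = len(x) // num_shards  # assumes len(x) is divisible by num_shards
    partials = []
    for shard in range(num_shards):
        lo, hi = shard * step, (shard + 1) * step
        # Each "GPU" holds only its slice of the weights and of the input.
        partials.append(matvec(x[lo:hi], w[lo:hi]))
    # All-reduce: every shard's partial output is summed elementwise.
    return [sum(values) for values in zip(*partials)]
```

Each shard stores only `1 / num_shards` of the weights, which is what lets a too-large model fit, but no shard has a usable output until the sum completes. That dependency is why the all-reduce sits on the critical path of every layer.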
Streaming on the wire
The output streamer typically uses SSE: each generated token (or group of tokens) becomes one SSE event. The client reads events as they arrive. The server-side concern is backpressure: if the client is slow to read, the server's send buffer fills, and generation for that request eventually has to stall. Mature servers handle this by either dropping the request after a timeout or by keeping a bounded buffer per stream, so a slow client can fall behind only up to that bound.
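A minimal sketch of SSE framing with a bounded per-stream buffer follows. The `data:` line terminated by a blank line is the standard SSE wire format; everything else here is an assumption made for illustration, including the `{"text": ...}` payload shape, the `BoundedStream` class, and the policy of refusing new tokens once the buffer is full.

```python
import json
from collections import deque


def sse_event(payload: dict) -> str:
    """Frame one JSON payload as a Server-Sent Events message."""
    # json.dumps emits a single line, so one "data:" field is enough.
    return f"data: {json.dumps(payload)}\n\n"


class BoundedStream:
    """Per-client send buffer with a hard cap on pending events."""

    def __init__(self, max_pending: int) -> None:
        self.pending = deque()
        self.max_pending = max_pending

    def offer(self, token_text: str) -> bool:
        """Queue a token; return False when the client is too far behind."""
        if len(self.pending) >= self.max_pending:
            return False
        self.pending.append(sse_event({"text": token_text}))
        return True

    def drain(self) -> str:
        """Return everything queued, ready to be written to the socket."""
        out = "".join(self.pending)
        self.pending.clear()
        return out
```

When `offer` returns `False`, the caller has to choose a policy: pause that request, start a timeout, or cancel it and free its KV cache pages.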
Memory bookkeeping
GPU memory contains:
- Model weights (largest single allocation; fixed per replica).
- KV cache pages (variable; grows with active requests).
- Activation buffers (small but non-zero).
- Workspace memory for kernels (CUDA scratch space).
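The KV cache budget is whatever's left after weights and overhead. A 70B model needs ~140 GB for weights at FP16 (2 bytes per parameter), so it does not fit on a single 80 GB GPU at that precision. Quantized to 4 bits, the weights take roughly 35 to 40 GB, leaving ~35 GB for KV after overhead. That bounds the maximum sum of active request lengths. Long-context workloads consume KV budget fast.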
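How a ~35 GB KV budget translates into tokens depends on the model's shape. The sketch below uses the usual per-token formula (keys plus values, at every layer, for every KV head). The example dimensions of 80 layers, 8 KV heads, and head size 128 are an assumption chosen to resemble a 70B-class model with grouped-query attention; substitute the real configuration of the model being served.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies across all layers.

    The leading 2 accounts for storing both a key and a value vector.
    bytes_per_value=2 corresponds to FP16/BF16 cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(kv_budget_bytes, bytes_per_token):
    """Upper bound on the summed length of all active requests."""
    return kv_budget_bytes // bytes_per_token


# Assumed model shape (hypothetical 70B-class configuration).
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
capacity = max_cached_tokens(35 * 10**9, per_token)
```

Under these assumptions each token costs 320 KiB of cache, so the budget covers a little over 100K tokens in total. That could be about three concurrent 32K-token requests or about fifty 2K-token requests, which is why long-context traffic cuts the achievable batch size so sharply.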
Self-hosted vs managed
Managed inference APIs (Anthropic, OpenAI, etc.) hide the inference server architecture behind a simple HTTP interface. Self-hosting means running an inference server (vLLM, TGI, etc.) yourself on your own GPU hardware. The tradeoff:
- Cost: at sustained high volume, self-hosted is often cheaper per token because the GPUs stay busy. At low volume, managed wins because you don't pay for idle GPUs.
- Latency: self-hosted can be lower if you co-locate near your users. Managed APIs can add network overhead when the provider's nearest region is far from your users.
- Operations: self-hosted requires capacity planning, monitoring, deployment automation, and on-call. Managed eliminates that burden.
- Model flexibility: self-hosted lets you run any open-weights model. Managed limits you to the provider's catalog.
See self-hosted LLM inference networking for the network-side detail of running your own.
Frequently Asked Questions
What does an inference server do?
It accepts inference requests over the network, queues them, schedules them onto GPU batches, manages KV cache memory, runs the model forward passes, and streams output tokens back to clients. The architecture has to balance latency (each request should respond quickly), throughput (GPU should stay near full utilization), and memory (KV caches must fit).
What is paged attention?
Paged attention manages KV cache memory in small fixed-size pages rather than allocating a single contiguous block per request. This avoids reserving worst-case memory that goes unused when request lengths vary widely (waste is limited to the last, partially filled page of each request), and allows pages from completed requests to be reused immediately. It is the technique behind vLLM's high throughput and is now standard across modern inference servers.
How does the scheduler decide what to batch?
At each iteration, the scheduler looks at currently active requests (those mid-decode) and queued new requests (those waiting to start prefill). It picks a set that fits in GPU memory and runs them together. Modern schedulers can mix prefill and decode in the same iteration (chunked prefill), which keeps the GPU busy with whatever work is available.
What is tensor parallelism?
Tensor parallelism splits a model's weight matrices across multiple GPUs so that one inference forward pass uses all of them. Each GPU computes its slice and exchanges intermediate values with the others. It is used when a model is too large for a single GPU's memory; the cost is inter-GPU communication on every layer.
How does the network layer work?
The network layer sits in front of the inference engine. It accepts HTTP or gRPC requests, parses the input, hands the request to the scheduler, and streams output back as the engine produces tokens. SSE is the most common streaming protocol for client-facing APIs. The network layer also handles authentication, rate limiting, request logging, and protocol translation between user-facing APIs and internal inference engine interfaces.