Inference Server Architecture

An LLM inference server takes incoming HTTP requests, packs them onto a fixed pool of GPUs, runs the model forward passes, and streams output tokens back over the network. Behind that simple description sits a scheduler, a batcher, a KV cache manager with paged memory, optional tensor and pipeline parallelism across multiple GPUs, and a streaming network layer. Modern open-source servers (vLLM, TGI, TensorRT-LLM, SGLang) all converge on a similar architecture; understanding the pieces is the foundation for capacity planning, performance tuning, and self-hosting.

The component layers

Layer	Role
Network frontend	HTTP/gRPC endpoint, auth, rate limiting, request parsing, streaming response
Scheduler	Decides which queued and active requests run on the next GPU iteration
Batcher	Packs the chosen requests into a single forward pass
KV cache manager	Allocates and reuses KV cache pages across requests
Inference engine	Runs the forward pass on GPU(s); applies attention optimizations
Tokenizer / detokenizer	Converts text to tokens on input, tokens to text on output
Output streamer	Sends each generated token back through the network layer to the client

The request lifecycle

1. Client sends a JSON HTTP request to the server.
2. Network frontend authenticates, parses, and hands the request to the scheduler queue.
3. Scheduler picks the next batch (including this new request) for the upcoming GPU iteration.
4. KV cache manager allocates pages for the new request's prefill.
5. Inference engine runs one forward pass: prefill for new requests, one decode step for in-flight requests.
6. Detokenizer converts the new tokens to text.
7. Output streamer pushes the text via SSE/WebSocket to the client.
8. Steps 3–7 repeat until the request finishes (max tokens, stop sequence, or end-of-sequence token).
9. KV cache manager reclaims the request's pages for the next batch.
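
The lifecycle above can be sketched as a toy serving loop. This is a minimal illustration, not how any particular server is implemented: `Request`, `fake_forward`, `serve`, and the `<eos>` marker are hypothetical names, and the "model" simply echoes the prompt's words in reverse so the loop is runnable without a GPU.

```python
from collections import deque
from dataclasses import dataclass, field

EOS = "<eos>"  # hypothetical end-of-sequence marker


@dataclass
class Request:
    rid: int
    prompt: str
    max_tokens: int
    output: list = field(default_factory=list)
    done: bool = False


def fake_forward(req):
    """Stand-in for the model: echoes the prompt's words in reverse, then EOS."""
    words = req.prompt.split()
    step = len(req.output)
    return words[-1 - step] if step < len(words) else EOS


def serve(requests, max_batch=2):
    queue = deque(requests)          # step 2: frontend enqueues parsed requests
    active, streamed = [], []
    while queue or active:
        # Step 3: the scheduler fills free batch slots before every iteration.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())   # step 4 would allocate KV pages here
        # Step 5: one iteration produces one token per active request.
        for req in active:
            token = fake_forward(req)
            if token != EOS:
                req.output.append(token)
                streamed.append((req.rid, token))   # steps 6-7: detokenize, stream
            if token == EOS or len(req.output) >= req.max_tokens:
                req.done = True
        # Step 9: finished requests leave the batch, freeing their slot.
        active = [r for r in active if not r.done]
    return streamed
```

With three requests and `max_batch=2`, the third request joins the batch in the iteration right after the second one finishes, which is the behavior the scheduler section describes as continuous batching.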

Paged attention in detail

Before paged attention, the KV cache for each request was a contiguous array sized to a worst-case maximum sequence length. If the server reserved 32K tokens of cache per request but most requests used only 2K, about 30K tokens of reserved capacity sat idle per request. Across a batch of 32 requests, that is roughly 960K tokens' worth of KV memory reserved but never used.

Paged attention treats the KV cache like virtual memory: allocate small fixed-size pages (e.g., 16 tokens each) on demand. A request that only needs 500 tokens of cache uses ~32 pages; a request that grows to 32K tokens uses ~2000. Pages are reusable instantly when a request finishes. Net effect: much higher effective batch sizes and throughput.
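
The page arithmetic and on-demand allocation can be sketched as follows. This is a toy block table under simple assumptions, not vLLM's actual data structures: `PagedKVCache`, `append_tokens`, and `release` are hypothetical names, and physical pages are represented only by their integer indices.

```python
import math

PAGE_TOKENS = 16  # page size from the example above


def pages_needed(num_tokens, page_tokens=PAGE_TOKENS):
    """Number of fixed-size pages required to hold num_tokens of KV cache."""
    return math.ceil(num_tokens / page_tokens)


class PagedKVCache:
    """Toy block table: maps a request id to its list of physical page numbers."""

    def __init__(self, total_pages):
        self.free = list(range(total_pages))
        self.tables = {}    # request id -> list of page numbers
        self.lengths = {}   # request id -> tokens currently cached

    def append_tokens(self, rid, n):
        """Grow a request by n tokens, allocating pages only when needed."""
        table = self.tables.get(rid, [])
        new_len = self.lengths.get(rid, 0) + n
        extra = pages_needed(new_len) - len(table)
        if extra > len(self.free):
            raise MemoryError("out of KV pages")
        table.extend(self.free.pop() for _ in range(extra))
        self.tables[rid] = table
        self.lengths[rid] = new_len

    def release(self, rid):
        """Return a finished request's pages to the free list immediately."""
        self.free.extend(self.tables.pop(rid, []))
        self.lengths.pop(rid, None)
```

A 500-token request takes 32 pages here, and growing it to 512 tokens takes no additional pages because the last page still had room; only the 513th token triggers a new allocation.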

The scheduler's job

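During decode, each GPU iteration is dominated by reading the model weights from memory, so its duration grows only slowly with batch size until the GPU becomes compute-bound. An under-filled iteration is therefore mostly wasted capacity, and the scheduler's goal is to make every iteration as full as possible without violating memory limits. Strategies: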

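Continuous batching: schedule at iteration granularity instead of waiting for a whole batch to finish; new requests join the in-flight batch as soon as memory allows, and finished requests leave immediately.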
Chunked prefill: if a request has a 10K-token prompt and a decode batch has spare capacity, prefill a chunk of the prompt instead of doing the full 10K prefill in one shot. Keeps decode throughput steady.
Prioritization: long requests can starve short ones; SLA-aware schedulers cap how long any request waits in queue.
Preemption: rare but supported in some servers — pause a low-priority request to free memory for a high-priority one, then resume.
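
A minimal sketch of how continuous batching and chunked prefill might combine in a single planning step. It assumes a per-iteration token budget and ignores KV memory limits, priorities, and preemption; `Seq`, `plan_iteration`, and `token_budget` are hypothetical names rather than any server's real API.

```python
from dataclasses import dataclass


@dataclass
class Seq:
    rid: str
    prompt_len: int
    prefilled: int = 0   # prompt tokens already processed

    @property
    def decoding(self):
        return self.prefilled >= self.prompt_len


def plan_iteration(active, waiting, token_budget):
    """Return a list of (rid, kind, num_tokens) for one GPU iteration.

    Decode steps are scheduled first (one token each). Leftover budget goes to
    prefill chunks, so a long prompt never blocks in-flight decodes.
    """
    plan = []
    budget = token_budget
    for seq in active:
        if seq.decoding and budget > 0:
            plan.append((seq.rid, "decode", 1))
            budget -= 1
    # Partially prefilled sequences go first, then newly queued requests.
    for seq in [s for s in active if not s.decoding] + list(waiting):
        if budget == 0:
            break
        chunk = min(budget, seq.prompt_len - seq.prefilled)
        plan.append((seq.rid, "prefill", chunk))
        seq.prefilled += chunk
        budget -= chunk
        if seq in waiting:           # admit the new request into the batch
            waiting.remove(seq)
            active.append(seq)
    return plan
```

With two decoding requests, one queued 10K-token prompt, and a budget of 512 tokens, the plan is two decode steps plus a 510-token prefill chunk; the rest of the prompt is spread over later iterations.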

Multi-GPU layouts

A model that doesn't fit on one GPU has options:

Strategy	What it does	Communication
Tensor parallelism	Splits weight matrices across GPUs in the same layer	All-reduce after every attention and FFN; chatty
Pipeline parallelism	Different layers run on different GPUs	One activation transfer per stage boundary; less chatty but adds pipeline bubbles
Expert parallelism (MoE)	Different experts on different GPUs; routing across	All-to-all on routing decisions
Data parallelism	Same model on multiple GPUs serving different requests	None at inference; just load balancing

Most production servers use tensor parallelism within a node (NVLink between GPUs in the same machine) and data parallelism across nodes (different replicas serving different traffic). Pipeline parallelism is more common in training than inference.
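
To make the tensor-parallel communication pattern concrete, here is a small pure-Python sketch of a feed-forward block split across simulated GPUs. It assumes the common arrangement of splitting the first weight matrix by columns and the second by rows, so that a single sum (the all-reduce) recovers the full output. The function names are illustrative, and real implementations use GPU collectives rather than Python lists.

```python
def matmul(a, b):
    """Plain nested-list matrix multiply."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0, v) for v in row] for row in m]


def ffn(x, w1, w2):
    """Reference single-GPU feed-forward block: relu(x @ w1) @ w2."""
    return matmul(relu(matmul(x, w1)), w2)


def ffn_tensor_parallel(x, w1, w2, world_size):
    hidden = len(w2)                  # inner dimension; must divide evenly
    shard = hidden // world_size
    partials = []
    for rank in range(world_size):    # each loop body stands in for one GPU
        lo, hi = rank * shard, (rank + 1) * shard
        w1_cols = [row[lo:hi] for row in w1]   # column slice of W1
        w2_rows = w2[lo:hi]                    # matching row slice of W2
        partials.append(matmul(relu(matmul(x, w1_cols)), w2_rows))
    # All-reduce: element-wise sum of the per-GPU partial outputs.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
```

The sharded result matches the single-GPU result exactly, and the only cross-GPU step is the final sum. That one collective per block is the per-layer communication cost listed in the table above.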

Streaming on the wire

The output streamer typically uses SSE: each generated token (or group of tokens) becomes one SSE event, and the client reads events as they arrive. The server-side concern is backpressure: if the client is slow to read, the server's send buffer fills, and the engine must eventually stop generating for that stream. Mature servers handle this either by dropping the request after a timeout or by keeping a bounded buffer per stream, so that a slow client falls behind without stalling the rest of the batch.
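
One way to bound the per-stream buffer is a fixed-size queue between the engine and the network writer, with a timeout on the producer side. The sketch below uses only the standard library; `BoundedStream`, `push`, and the default limits are hypothetical, and a real server would also need a consumer task that drains the queue onto the socket.

```python
import asyncio


def sse_event(text):
    """Encode one chunk of text as a Server-Sent Events frame."""
    return "".join(f"data: {line}\n" for line in text.split("\n")) + "\n"


class BoundedStream:
    def __init__(self, max_pending=4):
        self.queue = asyncio.Queue(maxsize=max_pending)
        self.cancelled = False

    async def push(self, text, timeout=1.0):
        """Called from the engine side; returns False if the client is too slow."""
        try:
            await asyncio.wait_for(self.queue.put(sse_event(text)), timeout)
            return True
        except asyncio.TimeoutError:
            self.cancelled = True    # caller should stop generating and free KV pages
            return False


async def demo():
    # No consumer drains the queue, so the third push hits the timeout.
    stream = BoundedStream(max_pending=2)
    results = [await stream.push(t, timeout=0.05) for t in ["Hel", "lo", "!"]]
    return results, stream.cancelled
```

Running `asyncio.run(demo())` shows the first two pushes succeeding and the third timing out, at which point the request would be dropped rather than holding GPU memory for a client that is not reading.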

Memory bookkeeping

GPU memory contains:

Model weights (largest single allocation; fixed per replica).
KV cache pages (variable; grows with active requests).
Activation buffers (small but non-zero).
Workspace memory for kernels (CUDA scratch space).

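The KV cache budget is whatever's left after weights and overhead. A 70B-parameter model needs about 140 GB for weights in FP16 (2 bytes per parameter), which does not fit on a single 80 GB GPU; it has to be split across GPUs or quantized. Quantized to 4 bits, the weights take roughly 35 to 40 GB, leaving ~35 GB for KV after overhead. That budget bounds the sum of all active request lengths. Long-context workloads consume KV budget fast.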
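
The budget translates into a token count once the per-token KV size is known. The sketch below uses the ~35 GB KV budget from the example above; the model shape (80 layers, 8 KV heads, head dimension 128, FP16 cache values) is an illustrative assumption for a 70B-class model with grouped-query attention, not a figure from this guide.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes stored per token across all layers."""
    # The factor of 2 covers the separate key and value tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(kv_budget_bytes, per_token_bytes):
    """Upper bound on the sum of active request lengths, in tokens."""
    return kv_budget_bytes // per_token_bytes


# Assumed model shape (illustrative only).
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
budget = 35 * 10**9                      # ~35 GB left for KV cache
total_tokens = max_cached_tokens(budget, per_token)

print(per_token)             # 327680 bytes, i.e. 320 KiB per token
print(total_tokens)          # 106811 tokens across all active requests
print(total_tokens // 4096)  # 26 concurrent requests at 4K tokens each
```

Under these assumptions the whole GPU holds about 107K tokens of context at once, so a handful of 32K-token requests would consume the entire budget.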

Self-hosted vs managed

Managed inference APIs (Anthropic, OpenAI, etc.) hide the inference server architecture behind a simple HTTP interface. Self-hosting means running an inference server (vLLM, TGI, etc.) yourself on your own GPU hardware. The tradeoff:

Cost: at sustained high volume, self-hosted can be significantly cheaper per token because the GPUs stay busy. At low volume, managed wins because you don't pay for idle GPUs.
Latency: self-hosted can be lower if you co-locate near your users. Managed APIs can add inter-region network overhead when the provider's nearest region is far from your users.
Operations: self-hosted requires capacity planning, monitoring, deployment automation, and on-call. Managed eliminates that burden.
Model flexibility: self-hosted lets you run any open-weights model. Managed limits you to the provider's catalog.
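
The cost tradeoff can be framed as a break-even volume. The sketch below is a deliberately simplified model: all function names and parameters are hypothetical, the inputs must come from your own quotes, and it ignores operations staffing and whether the GPUs can actually sustain the break-even throughput.

```python
def monthly_cost_managed(tokens_per_month, price_per_million_tokens):
    """Pay-per-token cost of a managed API."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def monthly_cost_self_hosted(num_gpus, gpu_hourly_rate, hours=730):
    """Fixed cost of keeping GPUs running all month, busy or idle."""
    return num_gpus * gpu_hourly_rate * hours


def break_even_tokens(num_gpus, gpu_hourly_rate, price_per_million_tokens, hours=730):
    """Monthly token volume at which both options cost the same."""
    fixed = monthly_cost_self_hosted(num_gpus, gpu_hourly_rate, hours)
    return fixed / price_per_million_tokens * 1_000_000
```

Below the break-even volume the managed API is cheaper in this model; above it, the fixed GPU cost is spread over enough tokens for self-hosting to win.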

See self-hosted LLM inference networking for the network-side detail of running your own.

Frequently Asked Questions

What does an inference server do?

It accepts inference requests over the network, queues them, schedules them onto GPU batches, manages KV cache memory, runs the model forward passes, and streams output tokens back to clients. The architecture has to balance latency (each request should respond quickly), throughput (GPU should stay near full utilization), and memory (KV caches must fit).

What is paged attention?

Paged attention manages KV cache memory in small fixed-size pages rather than allocating a single contiguous block per request. This avoids reserving worst-case space per request, removes external fragmentation, and confines internal fragmentation to the last partially filled page of each request, which matters most when request lengths vary widely. Pages from completed requests can be reused immediately. It is the technique behind vLLM's high throughput and is now standard across modern inference servers.

How does the scheduler decide what to batch?

At each iteration, the scheduler looks at currently active requests (those mid-decode) and queued new requests (those waiting to start prefill). It picks a set that fits in GPU memory and runs them together. Modern schedulers can mix prefill and decode in the same iteration (chunked prefill), which keeps the GPU busy with whatever work is available.

What is tensor parallelism?

Tensor parallelism splits a model's weight matrices across multiple GPUs so that one inference forward pass uses all of them. Each GPU computes its slice and exchanges intermediate values with the others. It is used when a model is too large for a single GPU's memory; the cost is inter-GPU communication on every layer.

How does the network layer work?

The network layer sits in front of the inference engine. It accepts HTTP or gRPC requests, parses the input, hands the request to the scheduler, and streams output back as the engine produces tokens. SSE is the most common streaming protocol for client-facing APIs. The network layer also handles authentication, rate limiting, request logging, and protocol translation between user-facing APIs and internal inference engine interfaces.
