Self-Hosted LLM Inference Networking

Running your own LLM inference removes provider rate limits and per-token costs, but introduces a different set of constraints: GPU memory budgets, batch scheduler design, and the network architecture that connects clients to GPUs without becoming the bottleneck. The standard 2026 stack is vLLM or Hugging Face TGI behind a load balancer, with NVLink-connected GPU pods serving 7B-70B parameter models. This guide explains how the inference server actually works internally, what determines throughput, and how to design the network around it.

What an inference server actually does

Production inference servers (vLLM, TGI, TensorRT-LLM, SGLang) all implement roughly the same architecture:

  1. HTTP API. Typically OpenAI-compatible endpoints. Handles auth, parameter validation, formatting.
  2. Request scheduler. Maintains a pool of active sequences and decides which to advance at each decode step.
  3. Batched executor. Submits batched forward passes to the GPU(s).
  4. KV cache manager. Tracks per-sequence attention state in GPU memory.
  5. Streaming output. Pushes generated tokens to the HTTP layer as SSE events.

The big advance over naive batching (running fixed-size batches to completion) is continuous batching — adding and removing sequences from the running batch on every step. This is the single biggest throughput optimization in modern inference servers.

Continuous batching: why throughput scales

Without continuous batching, an inference server runs a batch until every sequence in it completes. If you batch 8 requests and 7 of them generate 100 tokens but one generates 2000 tokens, the 7 short sequences finish after 100 decode steps and their slots sit empty for the remaining 1,900 steps. For 95% of the batch's lifetime only 1 of the 8 slots is doing useful work.

Continuous batching evicts completed sequences and admits new ones at every decode step. The batch composition changes continuously as work flows through. When a short sequence finishes after 50 tokens, a fresh request takes its place immediately. GPU utilization stays near maximum.

Throughput gain vs static batching: typically 2-10x depending on workload variance. For mixed workloads (chat, completion, agent loops with varying lengths), continuous batching is essential.
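The utilization gap can be illustrated with a toy simulation. The sketch below is not how vLLM or TGI schedule internally; it assumes every active sequence emits exactly one token per decode step, a fixed number of batch slots, FIFO admission, and a hypothetical workload, and it simply counts the decode steps needed to finish every request.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Decode steps when each fixed batch runs until its longest sequence ends."""
    steps = 0
    for start in range(0, len(lengths), slots):
        steps += max(lengths[start:start + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Decode steps when a finished sequence's slot is refilled immediately."""
    waiting = deque(lengths)   # output lengths of requests not yet admitted
    running = []               # tokens still to generate per active sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next step.
        while waiting and len(running) < slots:
            running.append(waiting.popleft())
        # One decode step: each active sequence emits a token; finished ones leave.
        running = [remaining - 1 for remaining in running if remaining > 1]
        steps += 1
    return steps


# Hypothetical workload: four groups of 7 short requests and 1 long request.
WORKLOAD = ([100] * 7 + [2000]) * 4
```

On this workload with 8 slots, the static schedule needs 8,000 decode steps and the continuous one 2,400, a gain of roughly 3.3x for the same total number of generated tokens. Real gains depend on the length distribution and on prefill cost, which this sketch ignores.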

The KV cache and why it dominates memory budgets

During autoregressive generation, each new token attends to all previous tokens. Recomputing the keys and values of every previous token at each step would repeat the same work over and over, and the attention cost of a single step would grow quadratically with sequence length. Instead, the model caches per-token key and value tensors (the KV cache), so each step computes keys and values only for the new token and attends against the cached tensors.

This keeps the cost of each decode step linear in sequence length, at the price of memory that grows with every token: each active sequence needs KV cache for every token it has accumulated, across every layer. For Llama 3 70B at 16-bit precision:

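per_token_kv_size = 2 (K + V) × num_kv_heads × head_dim × num_layers × 2 bytes
                  = 2 × 8 × 128 × 80 × 2
                  = 327,680 bytes
                  ≈ 320 KB per token in total across all layers

Llama 3 70B uses grouped-query attention with 8 KV heads of dimension 128. With full multi-head attention the per-layer width would be hidden_size = 8192 and the figure would be about 2.6 MB per token, 8x larger.

50K tokens × 320 KB ≈ 16 GB just for one sequence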

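The GPU has to hold the model weights plus this KV cache. At 16-bit precision the 70B weights are 140 GB, which does not fit in an H100's 80 GB of VRAM; quantized to FP8 they drop to roughly 70 GB, which fits on one card but leaves only about 10 GB for KV cache. That is why 70B models are normally sharded across several GPUs. Depending on how many GPUs share the weights, 70B at FP8 leaves room for perhaps 30-50 concurrent sequences at moderate context length, or only 1-2 sequences at a 200K-token context (about 64 GB of KV cache each). Once the model fits, the KV cache dominates the memory budget.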
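The sizing arithmetic can be packaged as a small calculator. This is a sketch under stated assumptions: it uses the grouped-query form of the formula (KV heads × head dimension), which is what yields the 320 KB figure for Llama 3 70B, the shape values are taken from that model's published configuration, and the class and function names are illustrative rather than part of any inference server's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVShape:
    """Model dimensions that determine KV cache size (names are illustrative)."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    bytes_per_element: int = 2  # 16-bit KV cache

    def bytes_per_token(self) -> int:
        # The leading 2 covers the key tensor and the value tensor.
        return (2 * self.num_kv_heads * self.head_dim
                * self.num_layers * self.bytes_per_element)


def sequence_kv_bytes(shape: KVShape, context_tokens: int) -> int:
    """KV cache needed by one sequence holding `context_tokens` tokens."""
    return shape.bytes_per_token() * context_tokens


def max_cached_tokens(shape: KVShape, kv_budget_bytes: int) -> int:
    """Total tokens, summed over all active sequences, that fit in the budget."""
    return kv_budget_bytes // shape.bytes_per_token()


# Llama 3 70B: 80 layers, 8 KV heads (grouped-query attention), head dim 128.
LLAMA3_70B = KVShape(num_layers=80, num_kv_heads=8, head_dim=128)
```

For example, `sequence_kv_bytes(LLAMA3_70B, 50_000)` is about 16.4 GB, and a hypothetical 10 GB KV budget (`max_cached_tokens(LLAMA3_70B, 10 * 10**9)`) holds roughly 30,500 tokens in total, shared by every active sequence.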

Paged attention: how vLLM increased throughput 5x

Classical KV cache implementations allocate a contiguous block of memory per sequence sized for its maximum possible context length. Result: massive fragmentation. A sequence generating 100 tokens with a 200K max reserved is wasting 99.95% of its allocation. Across many sequences, this fragmentation typically leaves 60-80% of KV-cache memory unused.

Paged attention (introduced by vLLM in 2023) borrows the OS virtual memory pattern. KV cache is divided into fixed-size blocks (typically 16 tokens each). Sequences allocate blocks on demand as they grow. A page table maps logical token positions to physical blocks. When a sequence completes, its blocks are returned to the free pool and reallocated to other sequences.

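Result: almost all of the previously wasted memory becomes usable. Reclaiming 60-80% of the KV cache means roughly 2.5-5x more concurrent sequences in the same VRAM, with minimal overhead from the page table indirection.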
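The block-allocation idea can be sketched in a few lines. This is a toy allocator for illustration only, not vLLM's implementation: it tracks block ids rather than GPU tensors, and the class and method names are hypothetical.

```python
class PagedKVAllocator:
    """Toy paged allocator: each sequence maps to a list of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> physical block ids, in logical order
        self.seq_lens = {}      # seq_id -> number of tokens stored so far

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, allocating a block only when needed."""
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("KV cache full: the scheduler must wait or preempt")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def physical_slot(self, seq_id: str, position: int) -> tuple[int, int]:
        """Translate a logical token position into (physical block, offset)."""
        if not 0 <= position < self.seq_lens[seq_id]:
            raise IndexError("position is outside the sequence")
        block = self.block_tables[seq_id][position // self.block_size]
        return block, position % self.block_size

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)
```

A sequence that has generated 17 tokens holds two 16-token blocks, so at most one partially filled block per sequence is wasted, regardless of the maximum context length the server advertises.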

Why memory bandwidth, not compute, is the bottleneck

Modern GPUs (H100, MI300X) have far more compute than memory bandwidth can feed. For autoregressive decode, the bottleneck is reading the model weights from HBM for every step. A 70B model at 16-bit precision is 140 GB; reading that once per token means:

tokens_per_second ≈ memory_bandwidth / model_size
                  ≈ 3,000 GB/s / 140 GB
                  ≈ 21 tokens/second per sequence

That is the theoretical per-sequence ceiling at H100 memory bandwidth for 70B at FP16. (The 140 GB of weights do not fit on one 80 GB H100, so in practice they are sharded across GPUs, each reading its own share.) Batching amortizes the weight transfer across multiple sequences: the weights are loaded once per step and used for all sequences in the batch. With 32 sequences in the batch, throughput becomes 32 × 21 = 672 tokens/second, but the per-sequence rate is still ~21 tokens/second.
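The same estimate as a function, for experimenting with other bandwidths and model sizes. This is a first-order sketch: it assumes decode is purely memory-bandwidth bound and that each step reads the full weights once. The optional `kv_gb_per_sequence` term is an added assumption, not part of the formula above, that also charges each step for reading every sequence's KV cache; it is the reason the per-sequence rate falls as batches and contexts grow.

```python
def decode_rate(bandwidth_gb_s: float, weights_gb: float,
                batch_size: int = 1, kv_gb_per_sequence: float = 0.0):
    """Return (per_sequence, aggregate) tokens/second under a bandwidth bound."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Bytes moved from HBM per decode step: weights once, plus each KV cache.
    gb_read_per_step = weights_gb + batch_size * kv_gb_per_sequence
    steps_per_second = bandwidth_gb_s / gb_read_per_step
    # Every sequence in the batch gains one token per step.
    return steps_per_second, steps_per_second * batch_size
```

With the figures above, `decode_rate(3000, 140, batch_size=32)` gives about 21.4 tokens/second per sequence and about 686 in aggregate (the 672 in the text comes from rounding to 21 first). Setting `kv_gb_per_sequence=1.0` for the same batch lowers the per-sequence rate to about 17.4.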

This is why specialty hardware (Groq, Cerebras, SambaNova) achieves much lower TPOT (time per output token), that is, much faster per-sequence generation: they hold the model in SRAM rather than HBM, eliminating the memory transfer bottleneck. The trade-off is dramatically lower capacity per chip.

vLLM vs TGI vs other inference servers

Server | Strength | Weakness
vLLM | Highest throughput; original paged attention; broad model support | Operational complexity; rapid version churn
Hugging Face TGI | Production-tested; good observability; Hugging Face integration | Slightly lower throughput than vLLM in 2026
TensorRT-LLM | Highest performance on NVIDIA hardware; FP8 support | NVIDIA-only; complex to optimize per model
SGLang | Best for complex agent flows; RadixAttention for prefix sharing | Newer; smaller community
llama.cpp / Ollama | CPU and consumer GPU; easy to run | Lower throughput; not built for multi-tenant
LMDeploy | Optimized for Chinese-trained models (Qwen, DeepSeek) | Smaller ecosystem outside Chinese AI community

Default choice for production: vLLM if you want the highest throughput; TGI if you want production-tested stability with good logs; TensorRT-LLM if you have the engineering capacity to tune per model.

The network architecture in front of inference

A typical production setup:

Clients
   ↓
Load Balancer (L7, with WebSocket / SSE support)
   ↓
Auth / rate-limit layer
   ↓
Inference router (model-aware routing)
   ↓ ↓ ↓ ↓ ↓
[Inference pod] [Inference pod] [Inference pod] ...
(each: vLLM + 1-8 GPUs)

Key design considerations:

Load balancer must support streaming

The LB sits between clients and inference servers, forwarding SSE streams. Configure it to:

  • Disable response buffering (proxy_buffering off in nginx).
  • Set generous idle timeouts (300s+) since long generations may pause.
  • Use HTTP/2 between LB and backend for multiplexed connections.

Sticky routing for multi-turn conversations

If your inference servers cache KV state (vLLM's prefix caching, SGLang's RadixAttention), routing the same conversation back to the same server preserves cache hits. Use cookie-based or hash-based sticky routing.

Health checks must verify GPU readiness

A simple HTTP /health endpoint that returns 200 is not enough — the GPU may be OOM or the model may have crashed mid-load. Health checks should verify the inference engine is responding to a small test prompt.
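A deep health check of this kind might look like the sketch below. It assumes the server exposes an OpenAI-compatible `/v1/completions` endpoint that accepts `model`, `prompt`, and `max_tokens`; the base URL, model name, and the `opener` parameter (injected so the check can be exercised without a live server) are illustrative choices, not part of any server's API.

```python
import json
import urllib.error
import urllib.request


def engine_is_ready(base_url: str, model: str, timeout_s: float = 5.0,
                    opener=urllib.request.urlopen) -> bool:
    """Deep health check: pass only if the engine actually generates a token."""
    payload = json.dumps({"model": model, "prompt": "ping", "max_tokens": 1})
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/completions",  # assumed OpenAI-compatible path
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(request, timeout=timeout_s) as response:
            body = json.loads(response.read())
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed JSON.
        return False
    choices = body.get("choices") if isinstance(body, dict) else None
    if not isinstance(choices, list) or not choices or not isinstance(choices[0], dict):
        return False
    return isinstance(choices[0].get("text"), str)
```

Because the probe runs a real forward pass, it is worth calling it less frequently than a plain liveness check and keeping `max_tokens` at 1 so it does not compete with user traffic.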

Graceful shutdown

When draining an inference pod for updates, the LB must stop sending new requests while existing streams complete. Without proper drain handling, in-flight streams get cut mid-response.
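On the pod side, drain handling comes down to three things: refuse new work, report not-ready so the LB stops routing to the pod, and wait for in-flight streams to finish. The sketch below shows one way to track that state; the class and its method names are hypothetical and would be wired into whatever HTTP layer and shutdown signal handler the pod already uses.

```python
import threading


class DrainGate:
    """Tracks in-flight streams so a pod can stop admitting work and exit when idle."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_enter(self) -> bool:
        """Call when a request arrives; returns False once draining has begun."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def leave(self) -> None:
        """Call when a stream ends, whether it succeeded or failed."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def is_ready(self) -> bool:
        """Readiness probe result: False tells the LB to stop routing here."""
        with self._cond:
            return not self._draining

    def drain(self, timeout_s: float) -> bool:
        """Stop admitting work and wait for in-flight streams; True if all finished."""
        with self._cond:
            self._draining = True
            return self._cond.wait_for(lambda: self._in_flight == 0,
                                       timeout=timeout_s)
```

The drain timeout should exceed the longest generation you expect to serve; if `drain` returns False, the remaining streams will be cut when the process exits.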

Tensor parallelism vs pipeline parallelism vs data parallelism

Large models (70B+) typically do not fit on a single GPU. Three parallelism strategies, often combined:

Tensor parallelism (TP)

Split each layer's matrices across GPUs. Every layer requires an all-reduce communication between GPUs after the matrix multiply. Requires fast inter-GPU bandwidth — NVLink intra-node, InfiniBand inter-node. Latency-friendly: each token still processes through one node.

Pipeline parallelism (PP)

Split the model by layer across GPUs (or nodes): layers 1-20 on node A, 21-40 on node B, and so on. Each token passes through the pipeline sequentially. This needs less inter-GPU bandwidth but adds latency (one pipeline hop per layer group) and requires careful microbatch scheduling to keep all stages busy.

Data parallelism (DP)

Run multiple copies of the model on independent GPUs/nodes. Each handles different requests. No communication required between replicas (other than initial weight loading). Best for throughput when the model fits on the available GPUs.

Typical 70B production setup: TP-8 within a node (one model spans 8 GPUs with NVLink) + DP across many such nodes. Pipeline parallelism is rarely used for inference; it is primarily a training optimization.
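A quick way to compare TP/DP layouts on one node is to work out how much VRAM each leaves for KV cache. The helper below is a rough sketch with a hypothetical name: it assumes weights are split evenly across the TP group and ignores activations, CUDA context, and framework overhead, all of which reduce the real budget.

```python
def plan_node(weights_gb: float, gpu_vram_gb: float,
              gpus_per_node: int, tp_degree: int) -> dict:
    """Estimate replicas per node and KV budget per replica for a TP degree."""
    if gpus_per_node % tp_degree != 0:
        raise ValueError("tp_degree must divide gpus_per_node")
    weights_per_gpu = weights_gb / tp_degree  # each GPU holds one shard
    if weights_per_gpu >= gpu_vram_gb:
        raise ValueError("weight shard does not fit in one GPU")
    return {
        # Independent model copies on the node (data parallelism).
        "replicas": gpus_per_node // tp_degree,
        "weights_per_gpu_gb": weights_per_gpu,
        # VRAM left for KV cache, summed over the GPUs of one replica.
        "kv_budget_per_replica_gb": (gpu_vram_gb - weights_per_gpu) * tp_degree,
    }
```

For a 140 GB model on 8 GPUs with 80 GB each, TP-8 gives one replica with about 500 GB of KV budget, while TP-2 gives four replicas with only about 20 GB each. TP-1 is rejected because the weights do not fit on one GPU.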

Prefix caching: the network design implication

vLLM and SGLang both support prefix caching — the inference server retains KV cache for previously-seen prompt prefixes and reuses them across requests. This is the self-hosted version of Anthropic and OpenAI's prompt caching.

For this to work, requests with the same prefix must land on the same inference server (otherwise the cache hit is lost). Routing strategies:

  • Hash-based: Hash a stable prefix identifier (system prompt hash, tenant ID) and route deterministically.
  • Sticky session: Use a session cookie set on first request.
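  • Cross-server replication: Cache state is replicated or shared across servers, so any server can serve any prefix. Note that SGLang's RadixAttention on its own is a per-server prefix cache; sharing it across servers requires an additional layer on top.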

The trade-off: hash-based routing risks load imbalance if some prefixes are much more popular than others; cross-server replication has higher infrastructure cost but better load distribution.
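Hash-based routing can be sketched as follows. The example uses rendezvous (highest-random-weight) hashing, which is one reasonable choice rather than something the servers above prescribe: each prefix key maps to the same server on every request, and adding or removing a server only remaps the keys that belonged to it. The server names and the way the key is built from tenant ID and system prompt are illustrative assumptions.

```python
import hashlib


def make_prefix_key(tenant_id: str, system_prompt: str) -> str:
    """Build a stable routing key from the parts of the prompt that repeat."""
    prompt_hash = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    return f"{tenant_id}:{prompt_hash}"


def route_by_prefix(prefix_key: str, servers: list[str]) -> str:
    """Rendezvous hashing: choose the server with the highest score for this key."""
    if not servers:
        raise ValueError("no servers available")

    def score(server: str) -> int:
        digest = hashlib.sha256(f"{server}|{prefix_key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(servers, key=score)
```

This does nothing about the load-imbalance risk: a very popular prefix still lands on a single server. A common mitigation is to fall back to the second-highest-scoring server when the first is overloaded, at the cost of warming a second cache.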

When self-hosting is the right call

Self-hosted inference makes economic sense when:

  • Sustained high throughput. Roughly >5M tokens/hour. Below that, the fixed cost of GPU rental dominates.
  • Custom or open-weight model. First-party provider APIs host only their own models. To run Llama, Mistral, Qwen, DeepSeek, or your own fine-tuned model on infrastructure you control, you need to self-host.
  • Compliance constraints. Data residency or regulated industries that cannot send data to third-party providers.
  • Predictable latency requirements. Provider APIs have variable latency under load; self-hosted gives you predictable performance subject to your own capacity.

Self-hosted is the wrong call when traffic is bursty (you pay for idle GPUs), small (provider API per-token cost is cheaper than fixed GPU cost), or requires top-tier frontier models that are not available open-weight.

Frequently Asked Questions

What is continuous batching and why does it matter?

Continuous batching adds and removes sequences from the running batch at every decode step, rather than running fixed batches to completion. When one sequence finishes, a new request fills its slot immediately. This dramatically improves GPU utilization compared to static batching, where the slots of short requests sit idle until the longest request in the batch finishes. vLLM and TGI both implement continuous batching; throughput typically improves 2-10x over static batching.

How much KV cache memory does a model need?

KV cache memory scales with concurrent sequences × per-sequence context length × number of layers × KV width per layer (number of KV heads × head dimension) × 2 (K and V) × bytes per element. For Llama 3 70B at 16-bit precision, each token of KV cache is roughly 320 KB across all layers. A 200K-token context for one sequence is about 64 GB, which is most of a single H100's 80 GB before any model weights are loaded. This is why paged attention exists and why combinations of batch size and context length are tightly bounded by VRAM.

What is paged attention?

Paged attention (introduced by vLLM) splits the KV cache into fixed-size blocks managed like operating-system memory pages. Sequences allocate blocks dynamically as they grow rather than reserving worst-case contiguous memory upfront. This eliminates the 60-80% memory fragmentation common in classical implementations, increasing the number of sequences that fit in a given amount of VRAM. The trade-off is added bookkeeping overhead, which is negligible compared to the throughput gain.

Why are inference servers memory-bandwidth bound rather than compute bound?

Autoregressive decode generates one token at a time. Each step reads all of the model weights from GPU memory to compute the next token. For a 70B model at 16-bit precision, that is 140 GB of weights read per token. Even at the 3 TB/s memory bandwidth of an H100, that tops out around 20 tokens/second per sequence, and the GPU's compute units sit idle while waiting for memory. Batching helps amortize the memory transfer across multiple sequences.

Do I need NVLink or InfiniBand for multi-GPU inference?

For tensor parallelism, yes: NVLink within a node (or InfiniBand across nodes) is critical. Tensor parallelism splits each layer across GPUs and requires fast all-reduce communication at every layer; PCIe 5.0 is generally too slow to be practical for 70B+ models. For pipeline parallelism (splitting the model by layer) or data parallelism (different requests on different GPUs), inter-GPU bandwidth matters less, although pipeline parallelism still benefits from a fast interconnect. Production deployments of large models almost always use 8x H100 with NVLink as the unit.
