Streaming LLM Responses: SSE vs WebSocket

Every major LLM API offers a streaming endpoint, and most chat interfaces use it by default. The choice of transport matters less than people think for hosted APIs, where providers already picked Server-Sent Events (SSE) and you adapt, but it matters a great deal if you are building a self-hosted gateway, a multi-tenant LLM proxy, or any service that aggregates LLM streams. This guide covers the three viable transports (SSE, WebSocket, HTTP long polling), why SSE has become the default, and the engineering pitfalls that show up at scale.

The three streaming transports

Transport                  | Direction            | Underlying       | Best for
Server-Sent Events (SSE)   | Server → client only | HTTP/1.1, /2, /3 | LLM streaming, real-time feeds
WebSocket                  | Bidirectional        | Upgraded HTTP    | Chat with concurrent input, games
HTTP long polling          | Request/response     | HTTP             | Legacy fallback only

Why SSE is the standard for LLM APIs

Both OpenAI and Anthropic chose SSE for their streaming APIs. The reasons:

  1. Unidirectional fits the LLM pattern. The client sends one request and consumes a stream of generated tokens. There is no need for the client to send additional data after the request — bidirectionality is wasted capability.
  2. SSE is plain HTTP. Every proxy, load balancer, CDN, corporate firewall, and observability tool supports HTTP. SSE works through any infrastructure that supports streaming HTTP responses. WebSocket requires the HTTP-to-WebSocket upgrade handshake, which many enterprise proxies break.
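  3. Built-in reconnection. The browser's EventSource API reconnects automatically after a drop and, if the server tagged events with id: fields, sends the last id it saw in the Last-Event-ID header. A server that keeps recent events can use that header to resume the stream without losing tokens.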
  4. HTTP/2 multiplexing. Multiple SSE streams share a single underlying TCP connection on HTTP/2, eliminating the historical "6 connections per origin" limit that drove some adopters to WebSocket in the past.
  5. Simpler debugging. An SSE stream is human-readable in curl, HAR captures, and tcpdump (for unencrypted HTTP/1.1 traffic). WebSocket frames add a binary header and mask client-to-server payloads, so they need special tooling to decode.
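Reconnection is automatic only on the client; resuming requires the server to remember what it already sent. The following is a minimal sketch of that server-side piece, assuming events carry increasing integer ids and single-line payloads. The ReplayBuffer class and its method names are hypothetical and not part of any provider API.

```javascript
// Hypothetical in-memory replay buffer for a single stream.
class ReplayBuffer {
  constructor(maxEvents = 1000) {
    this.maxEvents = maxEvents;
    this.events = []; // [{ id: number, data: string }]
    this.nextId = 1;
  }

  // Record an event and return its SSE wire form. The id: field is what
  // EventSource echoes back as Last-Event-ID after a reconnect.
  // Assumes `data` contains no newline characters.
  append(data) {
    const event = { id: this.nextId++, data };
    this.events.push(event);
    if (this.events.length > this.maxEvents) this.events.shift();
    return `id: ${event.id}\ndata: ${data}\n\n`;
  }

  // Events the client has not seen yet, given the Last-Event-ID header value
  // (undefined or unparsable values replay everything still buffered).
  since(lastEventId) {
    const last = Number.parseInt(lastEventId ?? '0', 10) || 0;
    return this.events.filter((e) => e.id > last);
  }
}
```

On reconnect, a handler would read the Last-Event-ID request header, write the events returned by since() first, and then continue with live output.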

What an SSE stream actually looks like

An SSE response is a plain HTTP response with Content-Type: text/event-stream and the body formatted as a sequence of events:

HTTP/2 200
content-type: text/event-stream
cache-control: no-cache

event: message_start
data: {"type":"message_start","message":{"id":"msg_01ABC","model":"claude-opus-4"}}

event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":", world"}}

event: message_stop
data: {"type":"message_stop"}

Each event is a block of lines terminated by a blank line. The event: field gives the event type; data: carries the payload. A payload can be split across multiple data: lines, which the client joins with newline characters.
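The field rules above are small enough to sketch as a parser for a single event block. This is an illustrative implementation, not a provider's client code; the function name parseSSE is an assumption, and the retry: field is ignored for brevity.

```javascript
// Parse one SSE event block (the text between two blank lines).
function parseSSE(block) {
  const event = { event: 'message', data: '', id: undefined };
  const dataLines = [];
  for (const line of block.split(/\r\n|\r|\n/)) {
    if (line === '' || line.startsWith(':')) continue; // blank line or comment
    const colon = line.indexOf(':');
    const field = colon === -1 ? line : line.slice(0, colon);
    let value = colon === -1 ? '' : line.slice(colon + 1);
    if (value.startsWith(' ')) value = value.slice(1); // strip one leading space
    if (field === 'data') dataLines.push(value);
    else if (field === 'event') event.event = value;
    else if (field === 'id') event.id = value;
  }
  // Multiple data: lines are joined with a newline.
  event.data = dataLines.join('\n');
  return event;
}
```

For JSON payloads such as the ones shown above, the caller would run JSON.parse on the returned data string.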

OpenAI's Chat Completions format is similar but uses a simpler structure: each event is just data: <json>, and the stream ends with a data: [DONE] sentinel.

The three buffering pitfalls

SSE breaks silently when any layer in the response path buffers the entire response before forwarding. The three usual culprits:

Web server buffering

nginx, a common reverse proxy in front of LLM gateways, buffers proxied responses by default (proxy_buffering is on, with buffers of 4 or 8 KB each depending on the platform). The user experience: nothing visible for several seconds, then the response appears in large bursts or all at once. Fix:

location /api/stream {
    proxy_buffering off;
    proxy_cache off;
    proxy_read_timeout 600s;
    proxy_pass http://upstream;
}

Add X-Accel-Buffering: no as a response header from the upstream service to disable nginx buffering even if the location block does not configure it.

CDN buffering

CDNs (Cloudflare, CloudFront, Fastly) sometimes buffer entire responses for cacheability or to enable bot-detection scanning. SSE responses must be marked uncacheable AND streaming-aware. Best practices:

  • Send Cache-Control: no-cache, no-transform.
  • Set the SSE path as uncacheable in the CDN configuration.
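  • Prevent on-the-fly compression. Compressors hold input until they have enough to emit a block, which can delay small events. The no-transform directive above asks intermediaries not to re-encode the body. Some deployments also send Content-Encoding: identity, although identity is not a standard Content-Encoding value, so verify that your CDN honors it.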
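The header advice from the web server and CDN sections can be kept in one place so that every streaming endpoint sends the same set. A minimal sketch, in which the helper name sseHeaders is an assumption:

```javascript
// Response headers for an SSE endpoint, collected from the advice above.
function sseHeaders() {
  return {
    'Content-Type': 'text/event-stream',
    // no-cache: do not serve from cache; no-transform: do not re-encode the body
    'Cache-Control': 'no-cache, no-transform',
    // Disables nginx response buffering for this response
    'X-Accel-Buffering': 'no',
  };
}
```

In a Node.js handler this would be used as res.writeHead(200, sseHeaders()) before the first event is written.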

Browser buffering

Some browsers (older Chrome and Safari versions) buffer the first 1-2 KB of any HTTP response while sniffing content type. For an SSE stream this delays the first event by several hundred ms. Fix: send a 2 KB padding comment as the first chunk:

: padding to defeat browser sniff buffer xxxxxxxxxx...
<blank line>

Lines starting with : are SSE comments — ignored by the client but flush the buffer.
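Generating the padding is a one-liner. The sketch below assumes a hypothetical helper named paddingComment and uses spaces as filler, since the content of a comment line is ignored by the client:

```javascript
// Build an SSE comment of at least `minBytes` bytes to push past sniff buffers.
function paddingComment(minBytes = 2048) {
  // ':' starts a comment line; the trailing blank line closes the block.
  // A block with no data: lines dispatches no event on the client.
  return ':' + ' '.repeat(minBytes) + '\n\n';
}
```

The server would write paddingComment() once, immediately after the response headers and before the first real event.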

Keepalive: the silent connection killer

Long-running SSE streams can sit idle for tens of seconds at a time when the LLM is between tokens (or between phases like tool use). Intermediate proxies and load balancers close connections that carry no traffic, typically after 60-120 seconds without bytes. The connection drops mid-stream and the user sees a truncated response.

Fix: emit a comment line every 15-30 seconds during gaps:

: keepalive 1716643200
<blank line>

The client ignores comments, but the bytes reset the idle timers along the path. Some providers send periodic events for this purpose (Anthropic's stream includes ping events); self-hosted gateways must implement it themselves.
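One way a gateway might implement this is an idle timer that restarts whenever a real event is written. The sketch below is an assumption-laden illustration: createKeepalive, touch, and stop are hypothetical names, and the timers parameter exists only so the timer can be replaced in tests.

```javascript
// Emit an SSE comment whenever `intervalMs` passes without real traffic.
// `write` is any function that sends a string to the client.
function createKeepalive(
  write,
  intervalMs = 15000,
  timers = {
    setTimeout: (fn, ms) => setTimeout(fn, ms),
    clearTimeout: (handle) => clearTimeout(handle),
  },
) {
  let handle = null;
  const schedule = () => {
    handle = timers.setTimeout(() => {
      // Comment line plus blank line; clients ignore it.
      write(`: keepalive ${Math.floor(Date.now() / 1000)}\n\n`);
      schedule();
    }, intervalMs);
  };
  schedule();
  return {
    // Call after every real event so the idle timer restarts.
    touch() {
      timers.clearTimeout(handle);
      schedule();
    },
    // Call when the stream ends or the client disconnects.
    stop() {
      timers.clearTimeout(handle);
      handle = null;
    },
  };
}
```

A handler would create the keepalive after sending headers, call touch() after each forwarded event, and call stop() when the response closes.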

When WebSocket is actually the right choice

SSE is wrong for a few scenarios. WebSocket wins when:

  • Interactive, mid-stream user input. If the user can interrupt the LLM mid-generation with a follow-up message, you need a bidirectional channel. On an SSE connection the client can only close the stream; new data has to travel on a separate HTTP request.
  • Multi-turn conversation in one persistent connection. A WebSocket can carry many request-response pairs over one connection. SSE requires a new HTTP request per generation, although keep-alive and HTTP/2 connection reuse usually avoid a fresh TLS handshake.
  • Voice or video accompanying text. If the LLM stream is one channel among several (audio, video, control), a WebSocket carrying binary frames is more efficient than juggling parallel SSE + HTTP requests.
  • Strict ordering across messages. Frames on a single WebSocket connection arrive in the order they were sent, a guarantee that separate HTTP requests cannot give. Exactly-once delivery still requires application-level sequence numbers and acknowledgements.

For the chatbot pattern most LLM applications follow, none of these apply, and SSE is the simpler choice.

Cancellation: a cost-control feature

An LLM stream costs money for every token generated, even if no client is reading the output. The user closes the browser tab — without proper cancellation, the LLM keeps generating tokens server-side and you keep paying for them.

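SSE cancellation works because closing the underlying HTTP connection (or resetting the HTTP/2 stream) is a strong signal. Providers typically detect the closed socket within a few hundred ms and stop generation. Client side:

// payload is the request body object built elsewhere in the application.
const controller = new AbortController();
const response = await fetch('/api/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
    signal: controller.signal,
});

// later, to cancel (pending reads reject with an AbortError):
controller.abort();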

WebSocket cancellation requires either an application-level cancel message or the close handshake with the server, which is slower and more complex.
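On the server side of an SSE endpoint, the disconnect has to be turned into a cancellation of whatever is producing tokens. A minimal sketch for a Node.js handler, assuming res behaves like an http.ServerResponse (it emits 'close' and exposes writableEnded); the helper name abortOnClientClose is hypothetical.

```javascript
// Returns an AbortController that fires when the client goes away early.
function abortOnClientClose(res) {
  const controller = new AbortController();
  res.on('close', () => {
    // 'close' also fires after a normal end; only abort unfinished responses.
    if (!res.writableEnded) controller.abort();
  });
  return controller;
}
```

The returned controller's signal can be passed to the upstream call, for example fetch(upstreamUrl, { signal: controller.signal }), so that a closed browser tab also closes the upstream request.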

Multi-tenant LLM gateway patterns

If you are building a service that proxies LLM streams from upstream providers (Anthropic, OpenAI, Bedrock) to your users, several considerations apply:

  • Forward the SSE format unchanged. Re-serializing events adds latency and risks introducing bugs. Pass bytes through.
  • Add your own keepalive layer. Even if the upstream sends keepalives, intermediate proxies and load balancers in your own stack may still time out idle connections.
  • Track cancellation downstream. When a client disconnects, propagate by closing the upstream connection. Otherwise you pay for orphaned generations.
  • Handle upstream errors mid-stream. Once SSE has started, you cannot change the HTTP status code. Errors must be sent as an error event within the stream.
  • Avoid HTTP/1.1 connection exhaustion. A high-concurrency gateway needs HTTP/2 connections to upstream providers to multiplex many streams over fewer TCP connections.
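For the mid-stream error case in the list above, the gateway needs to serialize an error as one more event. The sketch below is illustrative: formatSseEvent and upstreamErrorEvent are hypothetical helpers, and the payload shape, including the upstream_error type string, is an assumption rather than any provider's schema.

```javascript
// Serialize one event. JSON.stringify without indentation emits no raw
// newlines, so the payload always fits on a single data: line.
function formatSseEvent(eventName, payload) {
  return `event: ${eventName}\ndata: ${JSON.stringify(payload)}\n\n`;
}

// Wrap an upstream failure as an in-stream error event, since the HTTP
// status code has already been sent by the time it occurs.
function upstreamErrorEvent(err) {
  return formatSseEvent('error', {
    type: 'error',
    error: { type: 'upstream_error', message: String(err.message ?? err) },
  });
}
```

After writing the error event the gateway would end the response, and the client treats an error event as terminal for that generation.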

The fetch() streaming alternative to EventSource

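The browser's EventSource API is convenient but has limits: it only supports GET requests, cannot set custom request headers, and expects a fixed event format. Modern LLM gateways often use POST with custom auth headers, so fetch() with a ReadableStream is more flexible:

// payload is the request body object; handleEvent and parseSSE are application-defined.
const response = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
});
if (!response.ok) throw new Error(`HTTP ${response.status}`);
const reader = response.body.getReader();
const decoder = new TextDecoder();

let buffer = '';
while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters split across chunks intact
    buffer += decoder.decode(value, { stream: true });
    // the SSE format also allows CRLF line endings; normalize them
    buffer = buffer.replace(/\r\n/g, '\n');
    let eventEnd;
    // a blank line ends an event; anything after the last one stays buffered
    while ((eventEnd = buffer.indexOf('\n\n')) >= 0) {
        const event = buffer.slice(0, eventEnd);
        buffer = buffer.slice(eventEnd + 2);
        handleEvent(parseSSE(event));
    }
}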

Many production LLM clients take this approach. The trade-off: you parse SSE yourself, including handling events that are split across chunk boundaries.
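The chunk-boundary handling is easier to test when it is pulled out of the read loop. A minimal sketch, assuming text chunks that have already been decoded; createEventSplitter is a hypothetical name, and bare carriage-return line endings are not handled.

```javascript
// Returns a push(chunk) function that calls onEvent once per complete
// event block, no matter how the network split the bytes.
function createEventSplitter(onEvent) {
  let buffer = '';
  return (chunk) => {
    // Normalize after appending so a CRLF split across two chunks is caught.
    buffer = (buffer + chunk).replace(/\r\n/g, '\n');
    let end;
    while ((end = buffer.indexOf('\n\n')) >= 0) {
      const block = buffer.slice(0, end);
      buffer = buffer.slice(end + 2);
      if (block !== '') onEvent(block);
    }
  };
}
```

Inside the loop above, the body would reduce to push(decoder.decode(value, { stream: true })), with the parsing and dispatch done in the onEvent callback.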

Frequently Asked Questions

Why do OpenAI and Anthropic use SSE instead of WebSocket?

SSE is the right tool for unidirectional server-to-client streaming. LLM responses are inherently unidirectional — the client sends one request, the server streams the response. WebSocket adds bidirectional complexity (frames, ping/pong, separate close handshake) without any benefit for this pattern. SSE is also plain HTTP, so it passes through every proxy, CDN, and corporate firewall without special configuration. WebSocket frequently breaks behind corporate proxies that do not support the upgrade handshake.

How do I detect a disconnect during streaming?

For SSE, the EventSource API fires an error event (handled via onerror) when the connection drops; on the server side, the request socket closes. For HTTP streaming via fetch(), the ReadableStream's read() either throws or returns done: true before the final event has arrived. Best practice: send periodic keepalive comments (lines starting with :) every 15-30 seconds so intermediate proxies do not idle out the connection, and treat a missed keepalive as a probable disconnect.

What buffering issues affect LLM streaming?

Three layers can buffer by default: web servers (nginx buffers proxied responses, using 4-8 KB buffers), CDNs (Cloudflare and CloudFront may buffer entire responses for cacheability), and browsers (some buffer the first 1-2 KB of a streaming response to detect content type). For nginx, set proxy_buffering off on streaming endpoints. For CDNs, configure the streaming path as uncacheable. For browsers, send 2+ KB of padding in the first chunk if first-token rendering is slow.

Can I cancel an LLM stream from the client?

Yes — closing the SSE connection or aborting the fetch() with an AbortController stops the server from sending more chunks. Most providers (OpenAI, Anthropic) detect the closed socket within a few hundred ms and cancel the inference, releasing GPU capacity and stopping the token meter. This matters for cost — without proper cancellation, abandoned client connections continue billing for tokens the user will never see.

Does SSE work with HTTP/2 and HTTP/3?

Yes. SSE works over any HTTP version. On HTTP/2 and HTTP/3 each stream is multiplexed onto a shared connection, so the number of concurrent SSE streams is bounded by the server's maximum-concurrent-streams setting (commonly 100 or more) rather than by the browser's HTTP/1.1 limit of about 6 connections per origin. This makes SSE particularly attractive on modern HTTP versions; the historical "WebSocket scales better" argument no longer applies.
