Embedding API Networking
Embedding APIs share a category with LLM completion APIs — same providers, similar billing models — but they have a very different networking profile. Each call is small, latency is dominated by network rather than inference, and the response is a fixed-size array of floats instead of a token stream. The optimization patterns that matter for completions (streaming, prompt caching, careful prompt design) are different from the patterns that matter for embeddings (batching, dimension choice, quantization).
What an embedding endpoint does
Input: a string (or array of strings). Output: a fixed-dimensional vector per input. The vector is the result of running the input through an encoder-style transformer and pooling the output (typically a learned [CLS] token or mean pooling). Inputs with similar meaning produce vectors that are close together in cosine-similarity space.
The most common downstream use is vector search: store many document embeddings, embed a query at search time, find documents whose vectors are closest to the query vector.
Bandwidth math
| Dim | float32 size | float16 size | int8 size | 1M docs at float32 |
|---|---|---|---|---|
| 384 | 1.5 KB | 768 B | 384 B | 1.5 GB |
| 768 | 3 KB | 1.5 KB | 768 B | 3 GB |
| 1024 | 4 KB | 2 KB | 1 KB | 4 GB |
| 1536 | 6 KB | 3 KB | 1.5 KB | 6 GB |
| 3072 | 12 KB | 6 KB | 3 KB | 12 GB |
For high-volume indexing pipelines, the network cost of moving embeddings can exceed the inference cost. A pipeline embedding 100 million documents at 1536 dimensions in float32 moves roughly 600 GB of vector data alone, before counting the inputs sent to the API. These figures are raw binary sizes; a JSON array of decimal floats is larger on the wire before compression.
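The arithmetic behind the table can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the sizes are raw binary sizes that ignore JSON encoding and index overhead. The table rounds to convenient units, so the exact byte counts differ slightly from its figures.

```python
# Bytes per stored component for each numeric format in the table.
BYTES_PER_COMPONENT = {"float32": 4, "float16": 2, "int8": 1}


def vector_bytes(dims: int, dtype: str = "float32") -> int:
    """Raw size in bytes of one embedding vector."""
    return dims * BYTES_PER_COMPONENT[dtype]


def corpus_bytes(num_docs: int, dims: int, dtype: str = "float32") -> int:
    """Raw size in bytes of the vectors for a whole corpus."""
    return num_docs * vector_bytes(dims, dtype)


# Print per-vector and per-million-document sizes for the table's dimensions.
for dims in (384, 768, 1024, 1536, 3072):
    per_vector = vector_bytes(dims)
    per_million_gb = corpus_bytes(1_000_000, dims) / 1e9
    print(f"{dims} dims: {per_vector} bytes per vector, {per_million_gb:.2f} GB per 1M docs")
```

Swapping the `dtype` argument shows how much a pipeline saves by moving or storing float16 or int8 instead of float32.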
Batching is non-negotiable
Embedding APIs accept arrays of inputs per request. A batched request returns an array of vectors. The per-input overhead of HTTP, TLS, and request parsing is amortized across the batch. Practical implications:
- 1 input per request, 1000 documents: 1000 HTTP requests, dominated by network overhead.
- 50 inputs per request, 1000 documents: 20 HTTP requests, dominated by inference time.
- 500 inputs per request, 1000 documents: 2 HTTP requests, server-batching gains saturate.
The right batch size depends on the API's stated limits and the latency requirements. For background indexing, batch as much as the API allows. For interactive search where one query needs one embedding, batching doesn't apply.
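The request counts above follow from simple chunking. The sketch below assumes the caller already knows the API's maximum batch size; `batches` and `request_count` are illustrative helper names, not part of any provider SDK.

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def request_count(num_items: int, batch_size: int) -> int:
    """Number of HTTP requests needed to embed num_items inputs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return -(-num_items // batch_size)  # ceiling division


documents = [f"document {i}" for i in range(1000)]
for size in (1, 50, 500):
    print(size, "inputs per request ->", request_count(len(documents), size), "requests")
```

Each list yielded by `batches` would become the `input` array of one request. Real APIs often also cap total tokens per request, which this sketch does not model.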
Dimension and quality tradeoffs
Higher dimensions generally produce more accurate retrieval but cost more storage, bandwidth, and search time. The right dimension depends on the corpus size and latency targets:
- Under 100K documents — even 1536 dims is fine; storage and search latency are small.
- 1M to 10M documents — consider 768 or 1024 dims to control storage and ANN index size.
- Over 100M documents — quantization (binary or scalar) plus reduced dimensions; the storage cost dominates.
Matryoshka models help here: train once, store full-dim, query at lower-dim with no re-embedding cost.
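Querying at a lower dimension amounts to keeping a prefix of the stored vector. The sketch below assumes the vectors come from a model trained with Matryoshka representation learning; truncating vectors from other models generally does not preserve quality. It also re-normalizes the prefix, which matters when the search code assumes unit-length vectors. The function name is illustrative.

```python
import math
from typing import List, Sequence


def truncate_and_normalize(vector: Sequence[float], dims: int) -> List[float]:
    """Keep the first dims components and rescale them to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = list(vector[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0]  # toy stand-in for a full-dimension embedding
short = truncate_and_normalize(full, 2)
print(short)
```

Both the stored document vectors and the query vector need the same truncation length for their similarity scores to be comparable.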
Quantization on the wire
Most embedding APIs return float32 by default. Many also offer quantized variants:
- float16 / bfloat16 — half the bytes, minor accuracy impact.
- int8 — quarter the bytes, larger accuracy impact, useful for storage.
- binary — one bit per dimension, 32x smaller than float32, suitable for first-stage retrieval.
Quantizing on your side after the API call is also an option for storage and internal bandwidth: vectors are stored quantized and rehydrated to float for similarity math at query time.
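Client-side quantization can be sketched in pure Python. The scheme below is one simple choice among several, not a specific provider's format: symmetric per-vector int8 scaling, plus sign-based binary packing. All function names are illustrative.

```python
from typing import List, Sequence, Tuple


def quantize_int8(vector: Sequence[float]) -> Tuple[bytes, float]:
    """Symmetric scalar quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    max_abs = max((abs(x) for x in vector), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vector]
    # Store each signed code as one byte (two's complement).
    return bytes(c & 0xFF for c in codes), scale


def dequantize_int8(data: bytes, scale: float) -> List[float]:
    """Rehydrate int8 codes to approximate float values."""
    return [(b - 256 if b > 127 else b) * scale for b in data]


def quantize_binary(vector: Sequence[float]) -> bytes:
    """One bit per dimension: 1 if the component is positive, else 0."""
    out = bytearray((len(vector) + 7) // 8)
    for i, x in enumerate(vector):
        if x > 0:
            out[i // 8] |= 1 << (7 - i % 8)
    return bytes(out)


vector = [0.5, -1.0, 0.25, 0.0]
packed, scale = quantize_int8(vector)
print(len(packed), "bytes as int8,", len(quantize_binary(vector)), "byte as binary")
```

The int8 form needs the per-vector `scale` stored alongside the bytes. The binary form keeps only signs, which is why it suits a first retrieval stage followed by re-scoring with higher-precision vectors.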
The request shape
A canonical embedding request:
    POST /v1/embeddings
    Content-Type: application/json

    {
      "model": "embedding-model-v1",
      "input": ["text 1", "text 2", "..."],
      "encoding_format": "float"
    }
The response:
    {
      "data": [
        {"index": 0, "embedding": [0.123, -0.456, ...]},
        {"index": 1, "embedding": [0.789, -0.012, ...]}
      ],
      "usage": {"prompt_tokens": 42, "total_tokens": 42}
    }
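A client can build the body and read the response with the standard `json` module. The sketch below makes no network call and assumes the field names shown above. It also assumes that `index` gives each vector's position in the `input` array, so it sorts on that field instead of trusting response order. The helper names are illustrative.

```python
import json
from typing import List


def build_request_body(model: str, inputs: List[str]) -> str:
    """Serialize a request body in the shape shown above."""
    return json.dumps({"model": model, "input": inputs, "encoding_format": "float"})


def vectors_in_input_order(response_text: str, expected: int) -> List[List[float]]:
    """Parse a response and return embeddings ordered by their index field."""
    payload = json.loads(response_text)
    items = sorted(payload["data"], key=lambda item: item["index"])
    if len(items) != expected:
        raise ValueError(f"expected {expected} embeddings, got {len(items)}")
    return [item["embedding"] for item in items]


body = build_request_body("embedding-model-v1", ["text 1", "text 2"])
sample_response = json.dumps({
    "data": [
        {"index": 1, "embedding": [0.789, -0.012]},
        {"index": 0, "embedding": [0.123, -0.456]},
    ],
    "usage": {"prompt_tokens": 42, "total_tokens": 42},
})
print(vectors_in_input_order(sample_response, expected=2))
```

Checking the count against the number of inputs catches partial responses before vectors get attached to the wrong documents.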
Caching embeddings
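For a fixed model and version, embeddings are effectively deterministic: the same input produces the same vector, though some hosted services may return small floating-point differences between calls. This makes client-side caching straightforward: hash the input together with the model name (and the output dimension, if it is configurable), look up the cached vector, and skip the API call on a hit. For high-volume pipelines processing largely duplicate or slowly changing content, cache hit rates of 70-95% are achievable, and every hit is an API call that is never sent.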
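The hash-and-look-up pattern can be sketched with an in-memory dictionary. The `embed_batch` callable is a stand-in for whatever function calls the embedding API, and the class and function names are illustrative. A production cache would more likely use a persistent store.

```python
import hashlib
from typing import Callable, Dict, List


def cache_key(model: str, text: str) -> str:
    """Hash the model name together with the input text."""
    digest = hashlib.sha256()
    digest.update(model.encode("utf-8"))
    digest.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    digest.update(text.encode("utf-8"))
    return digest.hexdigest()


class EmbeddingCache:
    """In-memory cache that sends only unseen inputs to the embedding call."""

    def __init__(self, model: str, embed_batch: Callable[[List[str]], List[List[float]]]):
        self.model = model
        self.embed_batch = embed_batch
        self.store: Dict[str, List[float]] = {}

    def get_many(self, texts: List[str]) -> List[List[float]]:
        keys = [cache_key(self.model, text) for text in texts]
        # Collect unique misses, preserving first-seen order.
        missing: Dict[str, str] = {}
        for key, text in zip(keys, texts):
            if key not in self.store and key not in missing:
                missing[key] = text
        if missing:
            vectors = self.embed_batch(list(missing.values()))
            for key, vector in zip(missing, vectors):
                self.store[key] = vector
        return [self.store[key] for key in keys]


def fake_embed(texts: List[str]) -> List[List[float]]:
    """Placeholder for a real API call; returns one toy vector per input."""
    return [[float(len(text))] for text in texts]


cache = EmbeddingCache("embedding-model-v1", fake_embed)
print(cache.get_many(["alpha", "beta", "alpha"]))
```

Including the model name in the key means that switching models never serves stale vectors. Duplicates inside a single call are also collapsed, so the misses go out as one batched request.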
Latency profile vs LLM completions
| Operation | Server compute | Network cost | Typical end-to-end |
|---|---|---|---|
| Single embedding | 10-50 ms | RTT + small response | 50-200 ms |
| Batch of 100 embeddings | 50-200 ms | RTT + larger response | 100-400 ms |
| LLM completion (short) | 500 ms-5 s | RTT + streamed tokens | 1-10 s |
Embeddings are an order of magnitude faster per call than LLM completions. The network is a much bigger share of total time, which is why caching, batching, and keeping the embedding service network-close to the application matter.
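The claim about the network's share of total time can be made concrete with a one-line model. The sketch treats end-to-end time as network time plus server compute. The 80 ms network figure in the usage lines is a hypothetical value chosen for illustration, while the compute figures fall within the ranges in the table.

```python
def network_share(network_ms: float, compute_ms: float) -> float:
    """Fraction of end-to-end time spent on the network."""
    total = network_ms + compute_ms
    if total <= 0:
        raise ValueError("total time must be positive")
    return network_ms / total


# Hypothetical 80 ms of network time in both cases.
embedding_share = network_share(80, 30)      # single embedding, 30 ms compute
completion_share = network_share(80, 2000)   # short completion, 2 s compute
print(f"embedding: {embedding_share:.0%}, completion: {completion_share:.0%}")
```

Under these assumptions the network accounts for about 73% of an embedding call and about 4% of a short completion, which is why moving the embedding service closer to the application pays off more than the same move would for completions.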
Frequently Asked Questions
What is an embedding API?
An embedding API converts text (or images, audio, code) into a fixed-length vector of floating-point numbers that represents the input in a high-dimensional semantic space. Texts with similar meaning produce vectors that are close together by cosine similarity. The API takes a string, returns an array of floats — typically 384, 768, 1024, 1536, or 3072 dimensions depending on the model.
How much bandwidth does an embedding response use?
Each embedding is dimensions × 4 bytes for float32, or × 2 bytes for float16, or × 1 byte for int8 quantized. A 1536-dim float32 embedding is 6 KB. A batch of 100 such embeddings is 600 KB. For high-volume indexing pipelines this adds up — bandwidth often becomes the bottleneck, not inference compute.
How does embedding latency compare with LLM completion latency?
Embedding inference is generally much faster than LLM inference because there is no autoregressive decode: the input goes through the model once, and the output is the embedding. Server-side embedding latency is typically 10-100 ms per request, depending on input length and model size. As a result, most of the end-to-end latency seen by the client is network time, not compute.
Should I batch embedding requests?
Almost always yes. Embedding APIs accept arrays of inputs in a single request and return arrays of vectors. Batching dramatically reduces per-item network overhead and lets the server pack multiple inputs through the model in one forward pass. Typical batch sizes are 50-1000 inputs depending on input length and API limits.
What is Matryoshka embedding and why does it matter for bandwidth?
Matryoshka representation learning trains embedding models so that the first N dimensions of a full embedding are themselves a usable lower-dimensional embedding. You can request a 3072-dim embedding, store the full vector, and use the first 256 dims for fast retrieval without re-training or re-embedding. Truncated vectors should be re-normalized before cosine or dot-product comparison. For network and storage costs at scale, the ability to truncate with only a modest quality loss is significant.