AI Inference: Edge vs Cloud
Inference can run anywhere: on a phone, on a laptop, at a nearby edge POP, in a regional data center, or on a centralized GPU cluster on the other side of the country. The right placement depends on model size, latency target, input modality, privacy requirements, and cost shape. No single answer fits every workload, but there is a coherent framework for choosing.
The placement spectrum
| Tier | Where | Typical model size | Latency to user |
|---|---|---|---|
| On-device | Phone, laptop, IoT device | 100M-8B params (quantized) | 0 ms network; ~10 ms-1 s inference |
| Local network | Home server, on-prem GPU | Up to ~70B params | 1-10 ms network + inference |
| CDN edge | POP near user | Small to medium models | 10-30 ms network + inference |
| Regional cloud | Same-continent data center | Any size | 30-100 ms network + inference |
| Centralized cloud | Hyperscaler GPU cluster | Frontier-scale models | 50-200 ms network + inference |
What drives the choice
Five axes:
- Model capability needed. The largest models (hundreds of billions of parameters) only run on centralized GPU clusters. Phones can run small distilled models; quality is task-dependent.
- Latency budget. Voice assistants need <100ms wake response. Real-time translation needs <200ms. Chat is fine at 500ms-2s TTFT. Tighter budgets push inference closer to the user.
- Privacy / data sensitivity. If inputs cannot leave the device (medical, financial, certain enterprise), on-device is mandatory.
- Input data size. Small text: cheap to send anywhere. Video frames at 30 fps: expensive to send; do inference locally.
- Cost structure. Per-call cloud APIs are expensive at scale; per-device or per-region capacity flips that for high-volume workloads.
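The axes above can be folded into a first-pass decision rule. The following is a minimal sketch, not a definitive policy: the size limits mirror the placement table, while `HEAVY_INPUT_MB_PER_S`, `TIGHT_LATENCY_MS`, the `Workload` fields, and `choose_tier` itself are hypothetical names and thresholds chosen for illustration. The cost axis is left out because it depends on volume rather than on a single request.

```python
from dataclasses import dataclass

# Size limits mirror the placement table; the other cut-offs are assumptions.
ON_DEVICE_MAX_PARAMS_B = 8      # billions of parameters, quantized
LOCAL_MAX_PARAMS_B = 70
HEAVY_INPUT_MB_PER_S = 1.0      # assumed threshold for "expensive to send"
TIGHT_LATENCY_MS = 100          # assumed threshold for "must be near the user"


@dataclass
class Workload:
    params_b: float             # model size, billions of parameters
    latency_budget_ms: float    # end-to-end response budget
    must_stay_on_device: bool   # privacy: inputs may not leave the device
    input_mb_per_s: float       # sustained input data rate


def choose_tier(w: Workload) -> str:
    fits_device = w.params_b <= ON_DEVICE_MAX_PARAMS_B
    fits_local = w.params_b <= LOCAL_MAX_PARAMS_B

    # Hard constraint first: privacy leaves exactly one candidate tier.
    if w.must_stay_on_device:
        return "on-device" if fits_device else "infeasible"

    # Heavy inputs or tight latency pull inference toward the user.
    wants_near = (
        w.input_mb_per_s >= HEAVY_INPUT_MB_PER_S
        or w.latency_budget_ms < TIGHT_LATENCY_MS
    )
    if wants_near:
        if fits_device:
            return "on-device"
        if fits_local:
            return "local network"

    # Otherwise model size decides between the cloud tiers.
    return "regional cloud" if fits_local else "centralized cloud"
```

For example, `choose_tier(Workload(3, 80, False, 0.01))` returns `"on-device"` because the latency budget is tight and the model fits, while a 400B-parameter chat workload with a 2 s budget lands on `"centralized cloud"`.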
The bandwidth equation
For text, bandwidth is trivial. For multimodal workloads:
- A 720p video frame is ~300 KB when compressed as a standalone image. At 30 fps that's 9 MB/s (about 72 Mbit/s) per camera.
- An hour of HD video is several GB.
- Round-trip-time to a regional cloud is 30-100 ms; to a centralized cloud is 50-200 ms.
For real-time video AI (security cameras, retail analytics, vehicle perception), sending raw streams to the cloud is bandwidth-prohibitive. Edge inference produces a small structured output (object boxes, classifications) that travels cheaply.
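The gap between shipping frames and shipping results can be estimated with a few lines of arithmetic. This sketch uses the frame size and frame rate from the list above; `DETECTION_BYTES` is an assumed size for a per-frame structured result (a handful of boxes and labels), not a measured figure.

```python
FRAME_KB = 300          # compressed 720p frame, from the text
FPS = 30
DETECTION_BYTES = 200   # assumption: small per-frame result (boxes + labels)


def raw_stream_mb_per_s(frame_kb: float, fps: float) -> float:
    """Upstream rate when every frame is sent to the cloud (1 MB = 1000 KB)."""
    return frame_kb * fps / 1000


def result_stream_mb_per_s(result_bytes: float, fps: float) -> float:
    """Upstream rate when only the inference result is sent."""
    return result_bytes * fps / 1_000_000


raw = raw_stream_mb_per_s(FRAME_KB, FPS)                 # 9 MB/s per camera
edge = result_stream_mb_per_s(DETECTION_BYTES, FPS)      # a few KB/s
print(f"raw: {raw:.1f} MB/s ({raw * 8:.0f} Mbit/s)")
print(f"edge results: {edge * 1000:.1f} KB/s")
print(f"reduction: about {raw / edge:.0f}x")
```

Under these assumptions the structured output is roughly three orders of magnitude smaller than the raw stream, which is why the result travels cheaply even when the frames do not.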
On-device LLM constraints
Modern phones with dedicated NPUs can run quantized small LLMs at acceptable speeds. The constraints:
- RAM. A 4-bit-quantized 7B model needs ~3.5 GB for the weights alone (7 billion parameters at half a byte each), plus KV cache and runtime overhead. Phones with 8 GB RAM can host one while leaving room for the OS. Below that, only smaller models fit.
- Memory bandwidth. Decode is memory-bandwidth-bound. Phone memory bandwidth is 50-100 GB/s; data center HBM is 3000+ GB/s. Per-token decode of the same model is therefore roughly 30-60x slower on a phone.
- Battery. Sustained inference is power-hungry. Continuous LLM streaming drains a battery quickly.
- Thermal. Phones throttle when hot. Heavy inference for minutes degrades throughput.
For short bursts (a few seconds of generation), on-device is fine. For long-running agentic workloads, the phone is the wrong host.
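The RAM and memory-bandwidth constraints can be turned into a back-of-envelope estimate. The sketch below assumes single-stream decode in which every generated token reads all weights once, and it ignores KV cache traffic and compute limits, so the result is a ceiling rather than a prediction. The function names are illustrative.

```python
def weights_gb(params_billions: float, bits_per_param: float) -> float:
    """Weight footprint in GB: parameters times bytes per parameter."""
    return params_billions * bits_per_param / 8


def decode_ceiling_tokens_per_s(model_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each token streams all weights once."""
    return bandwidth_gb_s / model_gb


model = weights_gb(7, 4)    # 3.5 GB for a 4-bit 7B model
for name, bandwidth in [("phone (low)", 50), ("phone (high)", 100), ("HBM", 3000)]:
    ceiling = decode_ceiling_tokens_per_s(model, bandwidth)
    print(f"{name}: at most ~{ceiling:.0f} tokens/s")
```

With the bandwidth figures from the list above, the phone ceiling comes out in the low tens of tokens per second, against several hundred for data center HBM. Real throughput is lower once thermal throttling and other overheads apply.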
The hybrid pattern
The most common production architecture is hybrid: small tasks on device, large tasks in cloud. Examples:
- Wake-word and intent classification on device. Full conversation in cloud.
- Real-time camera object detection on edge. Cloud summarization for analytics.
- Local first-pass query understanding. Cloud retrieval + generation for the answer.
- On-device draft model. Cloud verifies via speculative decoding.
The engineering challenge is choreographing the data flow: caching, fallback when one tier is unavailable, and handling partial results.
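One piece of that choreography, routing with fallback, might look like the following sketch. Everything here is hypothetical: `hybrid_answer`, the `local` and `cloud` callables, and the use of prompt length as a stand-in for "small task" are assumptions for illustration. A production router would more likely use an intent classifier or a task type.

```python
import asyncio
from typing import Awaitable, Callable, Tuple

# Hypothetical model interface: takes a prompt, returns generated text.
Model = Callable[[str], Awaitable[str]]


async def hybrid_answer(
    prompt: str,
    local: Model,
    cloud: Model,
    cloud_timeout_s: float = 2.0,
    local_max_chars: int = 200,   # assumed proxy for "small task"
) -> Tuple[str, str]:
    """Return (source, text), preferring the cheapest tier that can serve."""
    # Small tasks stay on the device and never touch the network.
    if len(prompt) <= local_max_chars:
        return "device", await local(prompt)

    # Large tasks go to the cloud, bounded by a timeout.
    try:
        text = await asyncio.wait_for(cloud(prompt), timeout=cloud_timeout_s)
        return "cloud", text
    except (asyncio.TimeoutError, ConnectionError):
        # Degraded fallback: a weaker local answer beats no answer.
        return "device-fallback", await local(prompt)
```

Returning the source alongside the text lets the caller decide how to present a degraded answer, for instance by marking it as provisional until connectivity returns.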
Edge model deployment
For edge and on-device inference, getting the model weights onto the hardware is itself a network problem. Approaches:
- Bundle in app binary. Simple, but it bloats the download and every model update requires an app re-release.
- Download on first run. Smaller initial binary; one-time bandwidth hit per device.
- Differential updates. Model patches, not full re-downloads, when weights change.
- Streaming layer-by-layer load. Begin inference before the whole model downloads. Rare in practice.
For CDN-edge deployment, models are stored at the edge POP itself, replicated across POPs by the CDN's distribution mechanism.
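Differential updates can be sketched with a chunk manifest: the device hashes its local weight file in fixed-size chunks, compares against the manifest of the new version, and fetches only the chunks that differ. This is one possible scheme, assumed here for illustration. The chunk size and function names are hypothetical, and fixed-offset chunking only saves bandwidth when part of the file is unchanged, for example when only some layers were retrained.

```python
import hashlib
from typing import List

CHUNK_SIZE = 4 * 1024 * 1024   # assumed 4 MiB chunks


def chunk_hashes(blob: bytes, chunk_size: int = CHUNK_SIZE) -> List[str]:
    """SHA-256 digest of each fixed-size chunk, in file order."""
    return [
        hashlib.sha256(blob[i:i + chunk_size]).hexdigest()
        for i in range(0, len(blob), chunk_size)
    ]


def chunks_to_fetch(local: List[str], remote: List[str]) -> List[int]:
    """Indices of remote chunks that are new or differ from the local copy."""
    return [
        i for i, digest in enumerate(remote)
        if i >= len(local) or local[i] != digest
    ]


# Toy example with 10-byte chunks: the first chunk is unchanged.
old = b"a" * 10 + b"b" * 10
new = b"a" * 10 + b"c" * 10 + b"d" * 5
print(chunks_to_fetch(chunk_hashes(old, 10), chunk_hashes(new, 10)))  # [1, 2]
```

The same manifest doubles as an integrity check for the download-on-first-run approach, since each fetched chunk can be verified against its expected digest before it is written.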
Cost model differences
| Placement | Cost shape |
|---|---|
| On-device | Zero marginal per-call cost. App size and battery are the cost. |
| Local network | Capital cost of hardware, amortized; near-zero per-call. |
| Edge POP | Per-POP capacity rented from CDN; often per-request pricing. |
| Cloud API | Per-token or per-call pricing; scales linearly with usage. |
For high-volume workloads, the unit economics of on-device or self-hosted often beat per-call APIs. For variable or low-volume workloads, per-call cloud APIs are cheaper because they amortize the capacity across many customers.
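The crossover point between the two cost shapes is a simple break-even calculation. The sketch below is illustrative only: the function name is hypothetical and the dollar figures in the example are made-up assumptions, not quoted prices.

```python
import math


def break_even_requests(
    fixed_monthly_cost: float,
    api_price_per_request: float,
    self_hosted_cost_per_request: float = 0.0,
) -> float:
    """Monthly request volume above which self-hosting is cheaper than an API."""
    saving_per_request = api_price_per_request - self_hosted_cost_per_request
    if saving_per_request <= 0:
        return math.inf   # the API is never more expensive per request
    return fixed_monthly_cost / saving_per_request


# Assumed figures: $2,000/month amortized hardware vs. $0.002 per API call.
volume = break_even_requests(2000, 0.002)
print(f"break-even at about {volume:,.0f} requests/month")
```

Below the break-even volume the fixed cost dominates and the per-call API is cheaper; above it, every additional request widens the gap in favor of owned capacity.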
Offline capability
If the application must function without internet, on-device is mandatory. Offline use cases include in-vehicle assistants where cellular coverage is spotty, field workers in remote areas, secure facilities, and applications where users frequently lose connectivity.
Frequently Asked Questions
What is the difference between edge and cloud inference?
Cloud inference runs models in centralized data centers: large GPUs, big models, no per-user hardware. Edge inference runs models closer to users: on the device itself (phone, laptop), at a nearby edge POP, or in a regional data center. The tradeoffs are model size (cloud can run much larger models), latency (edge is closer), bandwidth (edge avoids sending data to the cloud), and cost structure (cloud is per-call, edge is per-device or per-region capacity).
What models can run on a phone?
As of the mid-2020s, models in the 1B-8B parameter range run on flagship phones using quantization (int4 or int8). Apple, Google, and Samsung have integrated dedicated NPUs that accelerate this. Models above 10B parameters generally need more RAM and bandwidth than current phones provide. Specialized small models for narrow tasks (intent classification, simple summarization) work well; general-purpose chat is constrained.
When does on-device inference make sense?
When latency matters (sub-100ms targets), when data should not leave the device (privacy-sensitive inputs), when the application must work offline, or when per-call cloud costs would dominate at scale. Voice assistants doing wake-word and intent detection, on-device translation, photo organization, and predictive text are common cases.
What is edge AI in the CDN sense?
CDN-style edge AI runs inference at the CDN's POPs, close to users and far from origin data centers. It is used for low-latency tasks such as content classification, personalization, or running small models on per-request data. It is distinct from on-device edge: the CDN edge is server infrastructure, just placed near users instead of centralized.
How does network bandwidth affect the choice?
For text-only LLM workloads, network bandwidth is rarely the bottleneck because token streams are small. For multimodal inputs (images, video, audio), sending data to cloud inference can be the dominant cost. Edge inference handles the data locally and sends only the inference result, which is much smaller. Real-time video AI (object detection, gesture recognition) typically belongs at the edge for this reason.