AI Inference: Edge vs Cloud

Q: What is the difference between edge and cloud inference?

Cloud inference runs models in centralized data centers: large GPUs, big models, no per-user hardware. Edge inference runs models closer to users: on the device itself (phone, laptop), at a nearby edge POP, or in a regional data center. The tradeoffs are model size (cloud can run much larger models), latency (edge is closer), bandwidth (edge avoids sending data to the cloud), and cost structure (cloud is priced per call, edge per device or per unit of regional capacity).

Q: What models can run on a phone?

As of the mid-2020s, models in the 1B-8B parameter range run on flagship phones using quantization (int4 or int8). Apple, Google, and Samsung have integrated dedicated NPUs that accelerate this. Models above 10B parameters generally need more RAM and bandwidth than current phones provide. Specialized small models for narrow tasks (intent classification, simple summarization) work well; general-purpose chat is constrained.

Q: When does on-device inference make sense?

On-device inference makes sense when latency matters (sub-100 ms targets), when data should not leave the device (privacy-sensitive inputs), when the application must work offline, or when per-call cloud costs would dominate at scale. Common cases are voice assistants doing wake-word and intent detection, on-device translation, photo organization, and predictive text.

Q: What is edge AI in the CDN sense?

CDN-style edge AI runs inference at the CDN's points of presence (POPs), which sit close to users and far from origin data centers. It is used for low-latency tasks such as content classification, personalization, or running small models on per-request data. It is distinct from on-device edge: the CDN edge is server infrastructure, just placed near users instead of centralized.

Q: How does network bandwidth affect the choice?

For text-only LLM workloads, network bandwidth is rarely the bottleneck — token streams are small. For multimodal inputs (images, video, audio), sending data to cloud inference can be the dominant cost. Edge inference handles the data locally and sends only the inference result, which is much smaller. Real-time video AI (object detection, gesture recognition) typically belongs at the edge for this reason.

Inference can happen anywhere: on a phone, on a laptop, at a nearby edge POP, in a regional data center, or on a centralized GPU cluster on the other side of the country. The right placement depends on model size, latency target, input modality, privacy requirements, and cost shape. There is no single answer, but there is a coherent framework for choosing.

The placement spectrum

Tier	Where	Typical model size	Latency to user
On-device	Phone, laptop, IoT device	100M-8B params (quantized)	0 ms network; ~10 ms to 1 s inference
Local network	Home server, on-prem GPU	Up to ~70B params	1-10 ms network + inference
CDN edge	POP near user	Small to medium models	10-30 ms network + inference
Regional cloud	Same-continent data center	Any size	30-100 ms network + inference
Centralized cloud	Hyperscaler GPU cluster	Frontier-scale models	50-200 ms network + inference

What drives the choice

Five axes:

Model capability needed. The largest models (hundreds of billions of parameters) only run on centralized GPU clusters. Phones can run small distilled models; quality is task-dependent.
Latency budget. Voice assistants need a wake response under 100 ms. Real-time translation needs under 200 ms. Chat is fine at 500 ms to 2 s time to first token (TTFT). Tighter budgets push inference closer to the user.
Privacy / data sensitivity. If inputs cannot leave the device (medical, financial, certain enterprise), on-device is mandatory.
Input data size. Small text: cheap to send anywhere. Video frames at 30 fps: expensive to send; do inference locally.
Cost structure. Per-call cloud APIs are expensive at scale; per-device or per-region capacity is a mostly fixed cost, which works out cheaper for high-volume workloads.
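
The hard constraints among these axes (model size, network latency, privacy, offline use) can be applied as a simple filter over the placement tiers. The sketch below is only an illustration: the tier ceilings are rough readings of the placement table, the 13B ceiling for the CDN edge is an assumption standing in for "small to medium", and the Workload fields and candidate_tiers function are hypothetical names. Cost and input size are not modeled.

```python
from dataclasses import dataclass

# (tier name, max model size in billions of params, minimum network RTT in ms)
# Values are illustrative readings of the placement table, not measurements.
TIERS = [
    ("on-device", 8, 0),
    ("local-network", 70, 1),
    ("cdn-edge", 13, 10),  # assumption: "small to medium" taken as <= 13B
    ("regional-cloud", float("inf"), 30),
    ("centralized-cloud", float("inf"), 50),
]


@dataclass
class Workload:
    model_params_b: float       # model size the task needs, billions of params
    network_budget_ms: float    # part of the latency budget left for network RTT
    data_must_stay_local: bool  # privacy: inputs may not leave the device
    needs_offline: bool         # must work without connectivity


def candidate_tiers(w: Workload) -> list[str]:
    """Return the tiers that satisfy every hard constraint, nearest first."""
    result = []
    for name, max_params_b, min_rtt_ms in TIERS:
        if w.model_params_b > max_params_b:
            continue  # model does not fit at this tier
        if min_rtt_ms > w.network_budget_ms:
            continue  # network alone would blow the latency budget
        if (w.data_must_stay_local or w.needs_offline) and name != "on-device":
            continue  # privacy or offline requirements force on-device
        result.append(name)
    return result


if __name__ == "__main__":
    print(candidate_tiers(Workload(3, 5, True, False)))
    print(candidate_tiers(Workload(400, 150, False, False)))
```

An empty result means the requirements conflict (for example, a 70B model that must stay on a phone), which is usually the signal to shrink the model or relax a constraint.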

The bandwidth equation

For text, bandwidth is trivial. For multimodal workloads:

A 720p video frame is ~300 KB compressed. At 30 fps that's 9 MB/s (about 72 Mbit/s) per camera.
An hour of HD video is several GB.
Round-trip time to a regional cloud is 30-100 ms; to a centralized cloud, 50-200 ms.

For real-time video AI (security cameras, retail analytics, vehicle perception), sending raw streams to the cloud is bandwidth-prohibitive. Edge inference produces a small structured output (object boxes, classifications) that travels cheaply.
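
To make the gap concrete, the arithmetic can be worked through with the frame size and frame rate quoted above. The detection payload is an assumption made for illustration: 10 objects per frame, each encoded as four float32 box coordinates, an int32 class id, and a float32 score (24 bytes).

```python
FRAME_KB = 300             # ~300 KB per compressed 720p frame (from the text)
FPS = 30                   # frames per second (from the text)
DETECTIONS_PER_FRAME = 10  # assumption for illustration
BYTES_PER_DETECTION = 24   # assumption: 4 x float32 box + int32 class + float32 score


def raw_stream_bytes_per_s(frame_kb: int = FRAME_KB, fps: int = FPS) -> int:
    """Upstream bandwidth when every frame is sent to the cloud."""
    return frame_kb * 1000 * fps


def detection_bytes_per_s(
    detections: int = DETECTIONS_PER_FRAME,
    bytes_each: int = BYTES_PER_DETECTION,
    fps: int = FPS,
) -> int:
    """Upstream bandwidth when only per-frame detection results are sent."""
    return detections * bytes_each * fps


if __name__ == "__main__":
    raw = raw_stream_bytes_per_s()
    det = detection_bytes_per_s()
    print(f"raw stream:  {raw / 1e6:.1f} MB/s")
    print(f"detections:  {det / 1e3:.1f} KB/s")
    print(f"reduction:   {raw / det:.0f}x")
```

Under these assumptions the structured output is roughly three orders of magnitude smaller than the raw stream; the exact ratio depends on the codec and on how many objects appear per frame.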

On-device LLM constraints

Modern phones with dedicated NPUs can run quantized small LLMs at acceptable speeds. The constraints:

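RAM. A 4-bit-quantized 7B model is ~3.5 GB of weights (7 billion parameters at half a byte each), before the KV cache and runtime overhead. Phones with 8 GB RAM can host one with reserve for the OS. Below that, only smaller models fit.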
Memory bandwidth. Decode is memory-bandwidth-bound: generating each token requires reading essentially all of the weights from memory. Phone memory bandwidth is 50-100 GB/s; data center HBM is 3000+ GB/s, so per-token decode on a phone is correspondingly slower.
Battery. Sustained inference is power-hungry. Continuous LLM streaming drains a battery quickly.
Thermal. Phones throttle when hot. Heavy inference for minutes degrades throughput.

For short bursts (a few seconds of generation), on-device is fine. For long-running agentic workloads, the phone is the wrong host.
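
The RAM and memory-bandwidth constraints above lend themselves to a back-of-the-envelope estimate. The sketch below assumes single-stream decode in which each token reads all weights once, and it ignores the KV cache, activations, and compute limits, so the result is an optimistic ceiling rather than a prediction. The function names are hypothetical.

```python
def weight_size_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight footprint in GB (weights only)."""
    return params_billions * bits_per_weight / 8


def max_decode_tokens_per_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s / weights_gb


if __name__ == "__main__":
    weights = weight_size_gb(7, 4)  # 7B model at 4 bits per weight
    print(f"weights: {weights:.1f} GB")
    for label, bw in [("phone, low", 50), ("phone, high", 100), ("HBM", 3000)]:
        ceiling = max_decode_tokens_per_s(bw, weights)
        print(f"{label:12s} {bw:5d} GB/s -> at most {ceiling:.0f} tokens/s")
```

With the bandwidth figures quoted above, a 4-bit 7B model tops out in the low tens of tokens per second on a phone, against several hundred on data center HBM.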

The hybrid pattern

The most common production architecture is hybrid: small tasks on device, large tasks in cloud. Examples:

Wake-word and intent classification on device. Full conversation in cloud.
Real-time camera object detection on edge. Cloud summarization for analytics.
Local first-pass query understanding. Cloud retrieval + generation for the answer.
On-device draft model. Cloud verifies via speculative decoding.

The engineering challenge is choreographing the data flow: caching, fallback, and partial results.
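
A minimal sketch of the first pattern (intent classification on device, everything else in the cloud) shows where caching and fallback fit. Everything here is hypothetical: the keyword classifier stands in for a small on-device model, cloud_call stands in for whatever client the application uses, and the cache is a plain dictionary.

```python
from typing import Callable

LOCAL_INTENTS = {"set_timer", "toggle_light"}  # hypothetical on-device intents


def classify_intent_on_device(text: str) -> str:
    """Stand-in for a small on-device classifier (keyword rules for the demo)."""
    lowered = text.lower()
    if "timer" in lowered:
        return "set_timer"
    if "light" in lowered:
        return "toggle_light"
    return "open_question"


def handle(
    text: str,
    cloud_call: Callable[[str], str],
    cache: dict[str, str],
) -> tuple[str, str]:
    """Route a request; returns (where it was served from, response)."""
    intent = classify_intent_on_device(text)
    if intent in LOCAL_INTENTS:
        return ("device", f"handled:{intent}")  # never leaves the device
    if text in cache:
        return ("cache", cache[text])  # reuse an earlier cloud answer
    try:
        answer = cloud_call(text)
    except ConnectionError:
        # Degrade gracefully instead of failing the whole interaction.
        return ("fallback", "I can't reach the server right now.")
    cache[text] = answer
    return ("cloud", answer)


if __name__ == "__main__":
    cache: dict[str, str] = {}
    print(handle("Set a timer for 5 minutes", lambda t: "unused", cache))
    print(handle("Why is the sky blue?", lambda t: "Rayleigh scattering.", cache))
```

Checking the cache before the network call means a repeated question is still answered when connectivity drops; a production system would also need cache expiry and streaming of partial results, which this sketch leaves out.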

Edge model deployment

For edge / on-device, getting the model there is a network problem. Approaches:

Bundle in app binary. Simple, but it bloats the download and every model update requires an app re-release.
Download on first run. Smaller initial binary; one-time bandwidth hit per device.
Differential updates. Model patches, not full re-downloads, when weights change.
Streaming layer-by-layer load. Begin inference before the whole model downloads. Rare in practice.

For CDN-edge deployment, models are stored at the edge POP itself, replicated across POPs by the CDN's distribution mechanism.
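
One way to picture differential updates is chunk-level hashing: the device compares hashes of its local model file against a manifest for the new version and fetches only the chunks that differ. The sketch below is an assumed scheme for illustration, not a description of any particular app store or CDN mechanism; it uses fixed-offset chunks and only pays off when an update leaves most chunks byte-identical (for example, when only some layers change).

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB chunks; an arbitrary choice for the sketch


def chunk_hashes(blob: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """SHA-256 of each fixed-size chunk of a model file."""
    return [
        hashlib.sha256(blob[i:i + chunk_size]).hexdigest()
        for i in range(0, len(blob), chunk_size)
    ]


def chunks_to_fetch(local: list[str], remote: list[str]) -> list[int]:
    """Indices of chunks that are new or whose hash changed in the new version."""
    return [
        i for i, h in enumerate(remote)
        if i >= len(local) or local[i] != h
    ]


if __name__ == "__main__":
    # Tiny byte strings stand in for weight files; chunk size 4 for readability.
    old_model = b"AAAABBBBCCCC"
    new_model = b"AAAAXXXXCCCCDDDD"
    stale = chunks_to_fetch(chunk_hashes(old_model, 4), chunk_hashes(new_model, 4))
    print("chunks to download:", stale)
```

The same per-chunk hashes can double as an integrity check after download, which matters when a first-run download is interrupted partway through.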

Cost model differences

Placement	Cost shape
On-device	Zero marginal per-call cost. App size and battery are the cost.
Local network	Capital cost of hardware, amortized; near-zero per-call.
Edge POP	Per-POP capacity rented from CDN; often per-request pricing.
Cloud API	Per-token or per-call pricing; scales linearly with usage.

For high-volume workloads, on-device or self-hosted inference often has better unit economics than per-call APIs. For variable or low-volume workloads, per-call cloud APIs are cheaper because the provider amortizes capacity across many customers.
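
The crossover between the two cost shapes can be estimated with a simple break-even calculation. All prices below are made-up placeholders chosen for round numbers, not quotes from any provider; the point is the shape of the comparison.

```python
def monthly_self_hosted_cost(
    hardware_cost: float, amortization_months: int, monthly_opex: float
) -> float:
    """Fixed monthly cost: amortized hardware plus power, hosting, and upkeep."""
    return hardware_cost / amortization_months + monthly_opex


def monthly_api_cost(calls_per_month: float, price_per_call: float) -> float:
    """Variable monthly cost of a per-call API; scales linearly with usage."""
    return calls_per_month * price_per_call


def break_even_calls(fixed_monthly_cost: float, price_per_call: float) -> float:
    """Monthly call volume above which self-hosting is cheaper than the API."""
    return fixed_monthly_cost / price_per_call


if __name__ == "__main__":
    # Hypothetical inputs: $36,000 server over 36 months, $500/month opex,
    # and an API priced at $0.002 per call.
    fixed = monthly_self_hosted_cost(36_000, 36, 500)
    threshold = break_even_calls(fixed, 0.002)
    print(f"self-hosted: ${fixed:,.0f}/month regardless of volume")
    print(f"break-even:  {threshold:,.0f} calls/month")
    print(f"API at 100k calls: ${monthly_api_cost(100_000, 0.002):,.0f}/month")
```

Below the break-even volume the API is cheaper; above it the fixed-cost option wins, provided the hardware can actually serve that volume.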

Offline capability

If the application must function without internet, on-device is mandatory. Offline use cases include in-vehicle assistants where cellular coverage is spotty, field workers in remote areas, secure facilities, and applications where users frequently lose connectivity.
