What Is VoIP?

VoIP — Voice over Internet Protocol — digitises your voice, compresses it with a codec, and sends it as IP packets across a network. It powers consumer apps like WhatsApp and FaceTime, business phone systems, and the modern telecom backbone.

How VoIP Works

When you speak into a VoIP-enabled device, an analogue-to-digital converter (ADC) samples your voice thousands of times per second — G.711 samples at 8,000 Hz, while wideband codecs like Opus sample at up to 48,000 Hz. Each sample is quantised into a digital value. A codec then compresses batches of these samples into small audio frames, typically 20 milliseconds long, and hands them to the RTP layer for transmission.
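The framing arithmetic above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names are made up for this example, and the 20 ms frame length is the typical value mentioned in the text rather than a fixed rule.

```python
def samples_per_frame(sample_rate_hz: int, frame_ms: int = 20) -> int:
    """Number of audio samples the codec receives for one frame."""
    return sample_rate_hz * frame_ms // 1000


def frames_per_second(frame_ms: int = 20) -> int:
    """Number of frames (and therefore RTP packets) produced each second."""
    return 1000 // frame_ms


if __name__ == "__main__":
    for name, rate in [("G.711 at 8 kHz", 8_000), ("Opus at 48 kHz", 48_000)]:
        print(f"{name}: {samples_per_frame(rate)} samples per 20 ms frame, "
              f"{frames_per_second()} frames per second")
```

At 8,000 Hz a 20 ms frame holds 160 samples; at 48,000 Hz it holds 960. Either way the sender emits 50 frames per second.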

RTP (Real-time Transport Protocol) wraps each audio frame in a packet containing a sequence number, timestamp, and synchronisation source (SSRC) identifier, then sends it via UDP. On the receiving end, a jitter buffer collects arriving packets, reorders any that arrived out of sequence, and feeds them to the codec at a steady rate. The codec decompresses each frame, a digital-to-analogue converter (DAC) reconstructs the waveform, and the listener hears your voice. On a well-configured network, the entire pipeline adds only tens of milliseconds of end-to-end delay.
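To make the packet layout concrete, here is a minimal sketch of the 12-byte fixed RTP header using Python's standard struct module. The function names are hypothetical. The sketch assumes no padding, no header extension, and no contributing sources, and it uses payload type 0 (PCMU, that is G.711 μ-law) purely as an example value.

```python
import struct

RTP_VERSION = 2
RTP_HEADER_FORMAT = "!BBHII"  # network byte order, 12 bytes in total


def pack_rtp_header(seq: int, timestamp: int, ssrc: int,
                    payload_type: int = 0, marker: bool = False) -> bytes:
    """Build a fixed RTP header (no padding, extension, or CSRC entries)."""
    byte0 = RTP_VERSION << 6
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    return struct.pack(RTP_HEADER_FORMAT, byte0, byte1,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def unpack_rtp_header(packet: bytes) -> dict:
    """Parse the first 12 bytes of an RTP packet into named fields."""
    byte0, byte1, seq, timestamp, ssrc = struct.unpack(RTP_HEADER_FORMAT, packet[:12])
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


if __name__ == "__main__":
    # Second packet of an 8 kHz stream: the timestamp advances by 160 per 20 ms frame.
    header = pack_rtp_header(seq=2, timestamp=320, ssrc=0x1234ABCD)
    print(len(header), unpack_rtp_header(header))
```

The audio frame produced by the codec is appended directly after these 12 bytes, and the whole packet is handed to a UDP socket.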

Key VoIP Codecs

A codec (coder-decoder) determines how much bandwidth a call consumes and how good it sounds. G.711 is the original PSTN-quality codec. It samples at 8 kHz and encodes each sample in 8 bits using logarithmic companding (μ-law or A-law), with no further compression. It produces excellent, natural-sounding voice at 64 kbps, but its bandwidth consumption makes it expensive over constrained links. G.729 uses conjugate-structure algebraic code-excited linear prediction (CS-ACELP) to achieve acceptable quality at just 8 kbps, an 8:1 reduction relative to G.711. It is widely used in business VoIP systems and SIP trunking where many simultaneous calls share a fixed bandwidth allocation.
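G.711's 64 kbps figure follows from 8,000 samples per second at 8 bits each. The 8-bit values come from logarithmic companding, which keeps quiet speech intelligible despite the small word size. The sketch below uses the continuous μ-law curve with μ = 255. It is an approximation for illustration, not the bit-exact segmented encoding that G.711 specifies, and the function names are invented for this example.

```python
import math

MU = 255  # companding parameter used by mu-law


def mulaw_compress(x: float) -> float:
    """Map a linear sample in [-1, 1] onto the logarithmic mu-law curve."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress: recover the linear sample."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)


if __name__ == "__main__":
    print("G.711 bitrate:", 8_000 * 8, "bps")  # 64000 bps
    for x in (0.01, 0.1, 0.5, 1.0):
        y = mulaw_compress(x)
        print(f"linear {x:>4} -> companded {y:.3f} -> expanded {mulaw_expand(y):.3f}")
```

A sample at 1% of full scale lands above 20% of the companded range, so quiet sounds get a larger share of the 256 available code values than loud ones.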

Opus is the modern successor, developed by contributors including Xiph.Org, Mozilla, and Skype, and standardised by the IETF as RFC 6716. It is a mandatory-to-implement codec for WebRTC and is used by WhatsApp, Discord, and Zoom. Opus is adaptive: it can adjust its bitrate between 6 kbps and 510 kbps based on network conditions and content. For voice, it typically operates at 16–40 kbps in wideband mode (16 kHz sampling), delivering higher clarity than G.711 at a fraction of the bandwidth. Opus also handles music and high-fidelity audio far better than codecs designed only for speech.

RTP and RTCP: Media Transport and Quality Monitoring

RTP (Real-time Transport Protocol) is the protocol that carries the actual voice data. It runs over UDP rather than TCP because TCP's retransmissions would arrive too late to be useful in a real-time conversation: a retransmitted packet that arrives 200ms late cannot be played at the right point in the audio stream. RTP's sequence numbers allow the receiver to detect gaps and reorder packets, and the timestamps allow it to maintain timing for playback.
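Gap detection on sequence numbers has one subtlety: the RTP sequence number is a 16-bit field, so it wraps from 65535 back to 0. A receiver can handle this with modular arithmetic, as in the following sketch. The helper names are hypothetical, and the approach assumes packets are never displaced by more than half the sequence space.

```python
def seq_delta(prev: int, curr: int) -> int:
    """Signed distance from prev to curr in 16-bit modular arithmetic."""
    return ((curr - prev + 0x8000) & 0xFFFF) - 0x8000


def missing_between(prev: int, curr: int) -> list:
    """Sequence numbers skipped between two packets received in that order."""
    delta = seq_delta(prev, curr)
    if delta <= 1:
        return []  # consecutive, duplicate, or out-of-order arrival
    return [(prev + i) & 0xFFFF for i in range(1, delta)]


if __name__ == "__main__":
    print(seq_delta(65535, 0))        # 1: a normal step across the wrap
    print(seq_delta(10, 8))           # -2: packet 8 arrived after packet 10
    print(missing_between(65534, 1))  # [65535, 0]
```

A negative delta indicates reordering rather than loss, which is the case the jitter buffer resolves by holding packets briefly before playout.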

RTCP (RTP Control Protocol) runs alongside RTP, conventionally on the next higher port, and carries quality statistics in both directions. RTCP Sender Reports include the number of packets and bytes sent and a mapping between wall-clock time and the RTP timestamp. RTCP Receiver Reports include the fraction of packets lost, the cumulative packet loss count, interarrival jitter, and timing fields from which the sender can estimate round-trip delay. These statistics feed into monitoring dashboards and allow VoIP systems to adapt when quality degrades, for example by switching to a lower-bitrate codec or routing around a congested path.
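The interarrival jitter value in a Receiver Report is a running estimate defined in RFC 3550: for each packet, the receiver compares its transit time with that of the previous packet and moves the estimate one sixteenth of the way towards the absolute difference. A minimal sketch, with hypothetical function names and with arrival times already expressed in RTP timestamp units, might look like this:

```python
def update_jitter(jitter: float, prev_transit: float, transit: float) -> float:
    """One step of the RFC 3550 estimator: J += (|D| - J) / 16."""
    d = abs(transit - prev_transit)
    return jitter + (d - jitter) / 16.0


def interarrival_jitter(arrivals, rtp_timestamps) -> float:
    """Run the estimator over a sequence of packets (same clock units for both lists)."""
    jitter = 0.0
    prev_transit = None
    for arrival, ts in zip(arrivals, rtp_timestamps):
        transit = arrival - ts
        if prev_transit is not None:
            jitter = update_jitter(jitter, prev_transit, transit)
        prev_transit = transit
    return jitter


if __name__ == "__main__":
    # Third packet arrives 80 timestamp units (10 ms at 8 kHz) later than expected.
    print(interarrival_jitter([0, 160, 400], [0, 160, 320]))  # 5.0
```

The division by 16 acts as a smoothing filter, so a single late packet nudges the reported jitter rather than dominating it.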

Why Latency and Jitter Matter More Than Bandwidth

A single G.729 call uses only about 24 kbps at the IP layer (20 ms packets with IP, UDP, and RTP headers), or roughly 32 kbps once Ethernet framing is included. Even a slow ADSL connection can handle dozens of simultaneous calls by raw bandwidth alone. What degrades VoIP quality is not bandwidth shortage but timing irregularities. The ITU-T G.114 recommendation specifies that one-way mouth-to-ear delay should remain under 150ms for most applications and below 400ms as an absolute maximum. Beyond 150ms, speakers begin to talk over each other, sensing that the other party is not responding promptly.
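Per-call bandwidth can be worked out from the codec bitrate, the packet interval, and the header sizes. The sketch below assumes IPv4 (20 bytes), UDP (8 bytes), and RTP (12 bytes) headers, 20 ms packets, and an optional link-layer overhead of 18 bytes for untagged Ethernet without preamble. These assumptions are illustrative. Published per-call figures differ depending on which overheads are counted, which is why rounded values such as 32 kbps for G.729 appear in many guides.

```python
IP_UDP_RTP_HEADER_BYTES = 20 + 8 + 12  # IPv4 + UDP + RTP fixed header
ETHERNET_OVERHEAD_BYTES = 18           # assumed: 14-byte header + 4-byte FCS, no VLAN tag


def call_bandwidth_kbps(codec_kbps: float, frame_ms: int = 20,
                        l2_overhead_bytes: int = 0) -> float:
    """One-way bandwidth of a single call, in kbps, including packet headers."""
    payload_bytes = codec_kbps * frame_ms / 8  # kbps * ms = bits per packet
    packets_per_second = 1000 / frame_ms
    packet_bytes = payload_bytes + IP_UDP_RTP_HEADER_BYTES + l2_overhead_bytes
    return packet_bytes * 8 * packets_per_second / 1000


if __name__ == "__main__":
    for name, rate in [("G.729", 8), ("G.711", 64)]:
        ip_only = call_bandwidth_kbps(rate)
        on_wire = call_bandwidth_kbps(rate, l2_overhead_bytes=ETHERNET_OVERHEAD_BYTES)
        print(f"{name}: {ip_only:.1f} kbps at the IP layer, {on_wire:.1f} kbps on Ethernet")
```

For G.729 the headers are twice the size of the 20-byte payload, so shortening the packet interval to reduce delay raises the bandwidth cost sharply.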

Jitter — variation in packet arrival times — is addressed by the jitter buffer, which adds a configurable delay to smooth out arrival irregularities. A larger buffer tolerates more jitter but adds latency. A smaller buffer minimises latency but drops packets that arrive outside its window. Tuning the jitter buffer is a key part of VoIP deployment: too small and quality suffers from frequent packet loss; too large and conversations feel unresponsive. Adaptive jitter buffers measure current network conditions in real time and adjust their depth dynamically.
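The trade-off can be seen in a toy fixed-depth jitter buffer. This is a simplified sketch rather than a production design: the class and parameter names are invented, sequence-number wrap-around is ignored, and the depth is fixed rather than adaptive. Playout starts only after `depth_frames` packets have been buffered, which is the latency the buffer adds.

```python
import heapq


class JitterBuffer:
    """Reorders packets by sequence number and releases one frame per tick."""

    def __init__(self, depth_frames: int = 3):
        self.depth = depth_frames
        self.heap = []        # (seq, frame) pairs, smallest seq first
        self.next_seq = None  # next sequence number due for playout

    def push(self, seq: int, frame: bytes) -> bool:
        """Store an arriving packet; returns False if it arrived too late."""
        if self.next_seq is not None and seq < self.next_seq:
            return False
        heapq.heappush(self.heap, (seq, frame))
        return True

    def pop(self):
        """Called once per frame interval; returns a frame, or None for a gap."""
        if self.next_seq is None:
            if len(self.heap) < self.depth:
                return None  # still pre-filling
            self.next_seq = self.heap[0][0]
        seq = self.next_seq
        self.next_seq += 1
        while self.heap and self.heap[0][0] < seq:
            heapq.heappop(self.heap)  # drop stale duplicates
        if self.heap and self.heap[0][0] == seq:
            return heapq.heappop(self.heap)[1]
        return None  # packet missing at its playout time


if __name__ == "__main__":
    buf = JitterBuffer(depth_frames=2)
    for seq, frame in [(1, b"a"), (3, b"c"), (2, b"b")]:  # 2 and 3 arrive swapped
        buf.push(seq, frame)
    print([buf.pop() for _ in range(4)])  # [b'a', b'b', b'c', None]
```

With `depth_frames=2` and 20 ms frames the buffer adds about 40 ms of delay. Increasing the depth tolerates later arrivals at the cost of more latency.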

MOS Score: Measuring VoIP Quality

The Mean Opinion Score (MOS) is the standard metric for VoIP audio quality. Originally derived from human listener panels rating call quality on a scale from 1 (bad) to 5 (excellent), MOS is now calculated algorithmically using models like E-Model (ITU-T G.107) from measurable network parameters including codec type, packet loss percentage, delay, and jitter. A MOS of 4.0 or above is considered good quality. G.711 on a clean network scores around 4.4. G.729 scores around 3.9. Packet loss above 5% typically drops MOS below 3.5 regardless of codec, which most users perceive as unacceptable quality.
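The E-model first computes a transmission rating factor R on a 0 to 100 scale and then maps it to an estimated MOS. The mapping published in ITU-T G.107 is a cubic polynomial, sketched below. The function name is illustrative, and computing R itself from codec, loss, and delay impairments is outside the scope of this example.

```python
def r_to_mos(r: float) -> float:
    """Convert an E-model R factor to an estimated MOS (ITU-T G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1 + 0.035 * r + r * (r - 60) * (100 - r) * 7e-6


if __name__ == "__main__":
    for r in (50, 70, 80, 93.2):
        print(f"R = {r}: MOS ~ {r_to_mos(r):.2f}")
```

An R of about 93, which is roughly what the model yields for G.711 with no impairments, maps to a MOS of about 4.4, matching the figure quoted above for a clean network.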

VoIP Codecs Compared

| Codec | Bitrate | Quality (MOS) | Bandwidth per call (incl. headers) | Compression | Typical use |
|---|---|---|---|---|---|
| G.711 (PCMU/PCMA) | 64 kbps | ~4.4 | ~100 kbps | None | Enterprise LAN, PSTN gateway |
| G.729 | 8 kbps | ~3.9 | ~32 kbps | 8:1 | SIP trunking, low-bandwidth WAN |
| G.722 | 64 kbps | ~4.5 | ~100 kbps | None (wideband) | HD voice on enterprise phones |
| Opus (narrowband) | 6–16 kbps | ~3.8–4.1 | ~25–40 kbps | Adaptive | Mobile apps on constrained networks |
| Opus (wideband) | 16–40 kbps | ~4.3–4.5 | ~35–65 kbps | Adaptive | WebRTC, WhatsApp, Discord, Zoom |

VoIP vs PSTN

The Public Switched Telephone Network (PSTN) is the global circuit-switched telephone system built over the 20th century. When a PSTN call connects, a dedicated 64 kbps channel is reserved end-to-end for the entire call duration, even during silences. This guaranteed resource allocation makes PSTN calls highly reliable, but it is also wasteful: voice calls contain significant silence. VoIP is packet-switched: packets share network capacity with all other traffic, and silence suppression algorithms stop sending packets during pauses, reclaiming bandwidth.

VoIP calls are cheaper to route — especially internationally — because they traverse the same IP infrastructure as web and email traffic rather than dedicated telephony circuits. HD codecs make VoIP calls sound better than PSTN when the network is clean. The tradeoff is that packet-switched networks require careful quality of service configuration to prevent voice packets from being delayed by large file downloads or video streams on the same link.

Consumer vs Business VoIP

Consumer VoIP encompasses apps like WhatsApp, FaceTime, Google Meet, and Zoom. These applications handle all signalling, codec negotiation, and NAT traversal internally. Users experience them simply as apps that make calls. Business VoIP spans from hosted PBX services (cloud phone systems where the provider manages all infrastructure) to enterprise UCaaS (Unified Communications as a Service) platforms that combine voice, video, messaging, and presence into one system.

On-premises business deployments run a PBX (Private Branch Exchange) — traditionally hardware, now typically software like Asterisk or FreePBX — that connects desk phones via SIP and bridges to the PSTN through SIP trunks. SIP is the signalling protocol that sets up and tears down these VoIP calls (covered in the companion What Is SIP guide). UCaaS providers like RingCentral, Vonage, and Microsoft Teams Phone handle the PBX in the cloud, requiring only internet-connected phones or softphone apps at each location.

Frequently Asked Questions

How much internet speed do I need for VoIP?

A single VoIP call requires between roughly 8 kbps and 100 kbps of bandwidth, depending on the codec and on how much packet overhead is counted. G.729 has a codec bitrate of 8 kbps and uses up to about 32 kbps per call once packet headers are included, while G.711 has a codec bitrate of 64 kbps and uses roughly 80–100 kbps with headers. Opus in wideband mode typically has a codec bitrate of 16–40 kbps. For a small business with 10 simultaneous calls using G.729, you need roughly 320 kbps dedicated to voice, well within any modern broadband connection. Bandwidth is rarely the limiting factor. Latency, jitter, and packet loss matter far more: a 100 Mbps connection with 200ms one-way delay will sound worse than a 1 Mbps connection with 50ms delay.

Why does VoIP quality degrade on slow connections?

On slow connections, VoIP packets compete with other traffic for bandwidth. When the link is saturated, packets queue behind large downloads or uploads, introducing variable delay (jitter). Packets that arrive too late to be played out are discarded, which sounds like dropouts or clipping. The jitter buffer absorbs variability by holding incoming packets briefly before playing them, but a larger buffer means more added latency. On very slow links, quality of service (QoS) configuration — which prioritises RTP audio packets over bulk data transfers — is essential for acceptable call quality.
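One common building block for this kind of prioritisation is marking voice packets with a DiffServ code point so that routers configured for QoS can recognise them. The sketch below is illustrative only: the helper names are made up, DSCP 46 (Expedited Forwarding) is the value conventionally used for voice, and whether the operating system honours the request or the network acts on the marking depends entirely on local configuration.

```python
import socket

DSCP_EF = 46  # Expedited Forwarding, the class conventionally used for voice


def dscp_to_tos(dscp: int) -> int:
    """Shift a 6-bit DSCP value into the upper six bits of the IP TOS byte."""
    return (dscp & 0x3F) << 2


def mark_voice_socket(sock: socket.socket, dscp: int = DSCP_EF) -> bool:
    """Ask the OS to mark outgoing IPv4 packets; returns False if it refuses."""
    try:
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
        return True
    except (OSError, AttributeError):
        return False
```

An application would call `mark_voice_socket` on the UDP socket it uses for RTP before sending audio. The marking has no effect by itself: the routers on the constrained link still need a queueing policy that serves marked packets first.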

What is a VoIP codec?

A codec (coder-decoder) is an algorithm that compresses and decompresses audio for transmission. VoIP codecs sample your voice, apply compression, and produce a stream of small audio frames, typically 20 milliseconds each. On the receiving end, the codec decompresses the frames and hands them to the speaker. Different codecs make different tradeoffs between bitrate (bandwidth consumed), audio quality (expressed as a MOS from 1 to 5), and computational complexity. G.711 produces PSTN quality at 64 kbps using only simple companding, with no further compression. G.729 achieves acceptable quality at just 8 kbps using aggressive compression. Opus can adapt its bitrate between 6 kbps and 510 kbps and is the standard for WebRTC.

Is VoIP the same as a regular phone call?

Not exactly. Traditional phone calls travel over the Public Switched Telephone Network (PSTN), a circuit-switched network that reserves a dedicated 64 kbps channel for the entire duration of each call. VoIP is packet-switched: voice is broken into small IP packets that share network capacity with all other traffic. The calling experience sounds similar or better (HD codecs exceed PSTN quality), but the underlying infrastructure is entirely different. Most modern telephone calls — including those made on mobile networks — pass through VoIP infrastructure for part of their journey, so the distinction is increasingly invisible to end users.

How does VoIP handle packet loss?

VoIP uses several techniques to mitigate packet loss. Packet loss concealment (PLC) algorithms detect missing packets and fill the gap with synthesised audio — repeating the last known audio frame or interpolating between frames — so brief losses of 1–3% are often inaudible. Forward error correction (FEC) sends redundant data so the receiver can reconstruct lost packets without retransmission. For losses above 5%, quality degrades noticeably regardless of mitigation. Unlike TCP-based file transfers, VoIP uses UDP and does not retransmit lost packets — retransmission would arrive too late to be useful in a real-time conversation.
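The simplest form of packet loss concealment, repeating the last good frame at reduced volume, can be sketched in a few lines. This is a deliberately naive illustration with invented names and parameters. Production codecs use more elaborate techniques, such as pitch-based waveform synthesis, to avoid audible artefacts.

```python
def conceal_losses(frames, frame_len: int = 160, decay: float = 0.5):
    """Replace each lost frame (None) with an attenuated copy of the previous output.

    Consecutive losses fade progressively towards silence, because each
    substitute is derived from the already attenuated frame before it.
    """
    output = []
    previous = [0] * frame_len  # silence if the very first frame is lost
    for frame in frames:
        if frame is None:
            frame = [int(sample * decay) for sample in previous]
        output.append(frame)
        previous = frame
    return output


if __name__ == "__main__":
    received = [[100, -100], None, None, [8, 8]]  # two-sample frames for brevity
    print(conceal_losses(received, frame_len=2))
    # [[100, -100], [50, -50], [25, -25], [8, 8]]
```

Fading the repeated frame is what keeps a short burst of loss from sounding like a buzzing, looped fragment of speech.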

What is jitter and why does it affect VoIP?

Jitter is the variation in packet arrival times. In a perfect network, RTP audio packets would arrive at precise 20ms intervals. In reality, network congestion and routing changes cause some packets to arrive early and others late, so the gaps between arrivals vary. VoIP endpoints use a jitter buffer to smooth this variation: packets are held briefly and played out at a steady rate. A small jitter buffer (20–40ms) adds little latency but cannot absorb large variations. A large buffer (80–150ms) handles heavy jitter but adds delay that makes conversations feel unnatural. ITU-T G.114 recommends keeping one-way mouth-to-ear delay under 150ms for acceptable conversational quality.
