SIP vs RTP: Signalling and Media Are Separate
One of the most important concepts in VoIP architecture is that call signalling and media transport are handled by entirely different protocols. SIP is responsible for signalling: it sends the messages that say "I want to call you," "I am ringing," "I accept," and "I am hanging up." RTP (Real-time Transport Protocol) is responsible for actually carrying the encoded audio and video frames between the two endpoints once the call is established.
This separation is intentional. Signalling messages are relatively infrequent and can tolerate modest delays. Media packets are small, extremely frequent (50 packets per second for a typical 20 ms frame interval), and require low, consistent latency. Because the two jobs use different protocols (SIP typically over UDP, TCP, or TLS; RTP almost always over UDP), each can be optimised independently. A SIP message might traverse a proxy chain across the internet; the RTP stream it sets up might flow directly between the two endpoints on the shortest available path.
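The packet-rate figure above can be checked with a little arithmetic. The following is a minimal sketch, assuming a constant-bitrate codec (G.711 at 64 kbit/s is used as the example) and IPv4/UDP/RTP headers with no extensions; the function name `rtp_stream_stats` is made up for illustration.

```python
# Per-packet header overhead, assuming IPv4 (20) + UDP (8) + RTP (12) bytes
# with no header extensions; link-layer framing is ignored.
HEADER_BYTES = 20 + 8 + 12


def rtp_stream_stats(frame_ms, codec_bitrate_bps):
    """Return (packets per second, payload bytes per packet, IP-level bits
    per second) for one direction of a constant-bitrate RTP stream."""
    packets_per_second = 1000 / frame_ms
    payload_bytes = codec_bitrate_bps * frame_ms / 8000
    ip_bitrate_bps = (payload_bytes + HEADER_BYTES) * 8 * packets_per_second
    return packets_per_second, payload_bytes, ip_bitrate_bps


# G.711 encodes audio at 64 kbit/s; 20 ms frames give 50 packets per second.
pps, payload, bitrate = rtp_stream_stats(frame_ms=20, codec_bitrate_bps=64_000)
print(pps, payload, bitrate)  # 50.0 160.0 80000.0
```

Under these assumptions a 64 kbit/s codec costs about 80 kbit/s per direction on the wire, which is why the per-packet header overhead matters when sizing links for many simultaneous calls.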
SIP Message Structure
SIP messages are text-based and deliberately modelled on HTTP. They fall into two categories: requests and responses. SIP requests have a method name, a Request-URI identifying the target, and a SIP version on the first line — for example, INVITE sip:alice@example.com SIP/2.0. SIP responses have a three-digit status code and a reason phrase, mirroring HTTP's structure: SIP/2.0 200 OK.
The core SIP methods are: INVITE — initiates a call session; ACK — confirms receipt of a final response to an INVITE; BYE — terminates an established session; CANCEL — cancels a pending INVITE before it is answered; REGISTER — a SIP device registers its current location with a registrar server; OPTIONS — queries a server or endpoint for its capabilities without initiating a call. Each request and response carries headers (From, To, Call-ID, CSeq, Via, Contact, Content-Type) that route, identify, and sequence the messages.
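Because SIP is text-based, the request/response distinction and the headers can be pulled apart with ordinary string handling. The sketch below is illustrative only: `parse_sip_message` is a hypothetical helper, and it ignores compact header names, folded header lines, and repeated headers (a real message often carries several Via headers). The branch, tag, and Call-ID values are made-up sample data.

```python
def parse_sip_message(raw):
    """Split a SIP message into start-line fields, headers, and body.

    Simplification: a repeated header overwrites the earlier value.
    """
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    start = lines[0]
    if start.startswith("SIP/"):
        # Response: SIP-Version Status-Code Reason-Phrase
        version, code, reason = start.split(" ", 2)
        msg = {"kind": "response", "version": version,
               "status": int(code), "reason": reason}
    else:
        # Request: Method Request-URI SIP-Version
        method, uri, version = start.split(" ", 2)
        msg = {"kind": "request", "method": method,
               "uri": uri, "version": version}
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    msg["headers"] = headers
    msg["body"] = body
    return msg


invite = (
    "INVITE sip:alice@example.com SIP/2.0\r\n"
    "Via: SIP/2.0/UDP client.example.org;branch=z9hG4bK776asdhds\r\n"
    "From: <sip:bob@example.org>;tag=1928301774\r\n"
    "To: <sip:alice@example.com>\r\n"
    "Call-ID: a84b4c76e66710\r\n"
    "CSeq: 314159 INVITE\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)
request = parse_sip_message(invite)
response = parse_sip_message("SIP/2.0 200 OK\r\n\r\n")
print(request["method"], request["uri"])      # INVITE sip:alice@example.com
print(response["status"], response["reason"])  # 200 OK
```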
SIP URIs and Addressing
SIP uses Uniform Resource Identifiers to address users and services, following the format sip:user@domain or sip:user@host:port. A SIP URI looks similar to an email address, for example sip:alice@company.com, and typically resolves through DNS SRV records to find the IP address and port of the responsible SIP server. The secure variant, sips:user@domain, signals that TLS must protect the signalling on each hop. A SIP URI can also carry a phone number in E.164 format in its user part, enabling interoperability with the PSTN: sip:+14155552671@pstn-gateway.example.com. The separate tel: URI scheme (tel:+14155552671) identifies a phone number on its own, without naming a SIP domain.
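The URI format described above can be split into its parts with a few string operations. This is a rough sketch, not a complete parser: `parse_sip_uri` is a hypothetical helper that assumes no password, no escaped characters, and no IPv6 literal in the host part, and it discards any `?header` section.

```python
def parse_sip_uri(uri):
    """Split a sip: or sips: URI into scheme, user, host, port, and parameters."""
    scheme, sep, rest = uri.partition(":")
    if not sep or scheme.lower() not in ("sip", "sips"):
        raise ValueError(f"not a SIP URI: {uri!r}")
    rest, _, _ = rest.partition("?")       # drop any ?headers part
    hostpart, *params = rest.split(";")    # ;transport=tcp and similar
    user, at, hostport = hostpart.rpartition("@")
    host, colon, port = hostport.partition(":")
    return {
        "scheme": scheme.lower(),
        "user": user if at else None,
        "host": host,
        "port": int(port) if colon else None,
        "params": params,
    }


print(parse_sip_uri("sip:alice@company.com"))
print(parse_sip_uri("sips:bob@host.example.com:5061;transport=tcp"))
print(parse_sip_uri("sip:+14155552671@pstn-gateway.example.com")["user"])
```

When the port is absent, as in the first example, a client would normally fall back to DNS lookups or the default port for the chosen transport rather than treat the URI as incomplete.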
SIP Network Components
A SIP deployment consists of several distinct functional roles. The User Agent Client (UAC) is the entity that sends SIP requests: your SIP phone or softphone app when you initiate a call. The User Agent Server (UAS) receives and responds to SIP requests: the other party's phone. These roles apply per transaction, so a single device acts as both over the course of a call: it is the UAC for requests it sends and the UAS for requests it receives. The SIP Proxy routes requests from a UAC towards the appropriate UAS, similar to an email MTA. The Registrar accepts REGISTER requests and maintains a location database mapping SIP addresses to current IP addresses. The Redirect Server responds to requests with the address of the next hop rather than forwarding them. The Back-to-Back User Agent (B2BUA) acts as both a UAS and a UAC simultaneously, terminating one SIP dialog and creating another; IP PBX systems and Session Border Controllers work this way.
A Complete SIP Call Flow
Understanding a SIP call requires tracing each message. Alice's phone sends an INVITE to the SIP proxy, containing an SDP offer listing Alice's codecs and the IP/port where she wants to receive RTP. The proxy returns 100 Trying immediately — a provisional response confirming receipt. The proxy forwards the INVITE to Bob's phone, which returns 180 Ringing, causing Alice's phone to play a ringback tone. When Bob picks up, his phone sends 200 OK with an SDP answer selecting a mutually supported codec and specifying Bob's RTP receive address. Alice's phone sends ACK to confirm, completing the three-way handshake. RTP media now flows directly between Alice and Bob's endpoints, completely bypassing the SIP proxy. When Alice hangs up, her phone sends BYE, Bob's phone responds 200 OK, and the session is terminated.
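The sequence above can be modelled from Alice's side as a small state machine. This is only an illustrative sketch: the state names and the `run_call` helper are invented here and are not the transaction states defined by the SIP specification, and error responses, CANCEL, and retransmissions are left out.

```python
# (current state, event) -> next state, seen from the caller's phone.
TRANSITIONS = {
    ("idle", "send INVITE"): "calling",
    ("calling", "recv 100"): "proceeding",
    ("calling", "recv 180"): "ringing",
    ("proceeding", "recv 180"): "ringing",
    ("proceeding", "recv 200"): "answered",
    ("ringing", "recv 200"): "answered",
    ("answered", "send ACK"): "established",   # RTP flows from here on
    ("established", "send BYE"): "terminating",
    ("terminating", "recv 200"): "terminated",
}


def run_call(events, state="idle"):
    """Apply events in order and return the final state.

    Raises ValueError if an event is not valid in the current state.
    """
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"unexpected {event!r} in state {state!r}")
        state = TRANSITIONS[key]
    return state


alice_events = ["send INVITE", "recv 100", "recv 180", "recv 200",
                "send ACK", "send BYE", "recv 200"]
print(run_call(alice_events))      # terminated
print(run_call(alice_events[:5]))  # established
```

Keeping the transitions in a table makes it easy to see which messages are legal at each point, for instance that a BYE only makes sense once the ACK has completed call setup.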
SDP: Negotiating Media Parameters
The Session Description Protocol (SDP) is the format used inside SIP message bodies to describe what media the caller wants to exchange. An SDP body is a sequence of lines in the form type=value. The critical fields are: c= (connection data — the IP address for media), m= (media description — "audio 49170 RTP/AVP 0 8 97" means audio on port 49170 using RTP, preferring codec payload types 0=G.711 PCMU, 8=G.711 PCMA, 97=a dynamically assigned codec), and a=rtpmap:97 opus/48000/2 mapping payload type 97 to Opus.
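To make the c=, m=, and a=rtpmap fields concrete, here is a small sketch that reads them out of an SDP body. It assumes RTP audio or video streams with numeric payload types and a plain port number; `parse_sdp` is a hypothetical helper, and the address and origin values in the sample are placeholders.

```python
def parse_sdp(sdp):
    """Extract the connection address and per-media details from an SDP body."""
    session = {"connection": None, "media": []}
    for line in sdp.splitlines():
        if len(line) < 2 or line[1] != "=":
            continue  # not a type=value line
        kind, value = line[0], line[2:].strip()
        if kind == "c":
            address = value.split()[-1]  # "IN IP4 203.0.113.10" -> address
            if session["media"]:
                session["media"][-1]["address"] = address  # media-level c=
            else:
                session["connection"] = address            # session-level c=
        elif kind == "m":
            media, port, proto, *formats = value.split()
            session["media"].append({
                "type": media, "port": int(port), "proto": proto,
                "payload_types": [int(f) for f in formats],
                "rtpmap": {}, "address": None,
            })
        elif kind == "a" and value.startswith("rtpmap:") and session["media"]:
            pt, encoding = value[len("rtpmap:"):].split(None, 1)
            session["media"][-1]["rtpmap"][int(pt)] = encoding
    return session


offer = "\r\n".join([
    "v=0",
    "o=alice 2890844526 2890844526 IN IP4 203.0.113.10",
    "s=-",
    "c=IN IP4 203.0.113.10",
    "t=0 0",
    "m=audio 49170 RTP/AVP 0 8 97",
    "a=rtpmap:0 PCMU/8000",
    "a=rtpmap:8 PCMA/8000",
    "a=rtpmap:97 opus/48000/2",
]) + "\r\n"

parsed = parse_sdp(offer)
audio = parsed["media"][0]
print(parsed["connection"], audio["port"], audio["payload_types"])
print(audio["rtpmap"][97])  # opus/48000/2
```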
The SDP offer/answer model works like a negotiation: the caller lists all codecs it supports in preference order; the callee replies with the ones it also supports, often narrowing the list to a single codec. Both parties then use a codec from the answer for the call. This same mechanism handles video streams, data channels, and mid-call modifications. For example, when a call is put on hold, a re-INVITE is sent with a=sendonly in the SDP to pause the held party's RTP stream.
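The selection step on the callee's side can be sketched in a few lines. This assumes the simplest policy, taking the first offered codec that the callee also supports; real endpoints may answer with several codecs or apply their own preference order. `pick_codec` and the sample codec lists are illustrative only.

```python
def pick_codec(offered, supported):
    """Return the first offered (payload_type, name) pair the callee supports.

    offered:   list of (payload_type, encoding_name), in the caller's
               preference order.
    supported: set of lowercase encoding names the callee can handle.
    Returns None when there is no codec in common.
    """
    for payload_type, name in offered:
        if name.lower() in supported:
            return payload_type, name
    return None


offered = [(0, "PCMU"), (8, "PCMA"), (97, "opus")]
print(pick_codec(offered, {"pcma", "opus"}))  # (8, 'PCMA')
print(pick_codec(offered, {"g729"}))          # None
```

Walking the offer in order respects the caller's preferences; a callee that would rather use Opus whenever it is available would iterate over its own list instead.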
SIP Trunking
A SIP trunk is a logical connection between a business PBX and a carrier's SIP infrastructure, replacing physical ISDN or analogue telephone lines with an IP-based connection. The carrier provides a pool of telephone numbers (Direct Inward Dialling numbers), routes inbound calls to the business PBX via SIP INVITE, and accepts outbound calls from the PBX to reach the PSTN. Unlike physical lines where capacity is fixed by the number of installed pairs or channels, SIP trunk capacity is a software configuration — adding ten more simultaneous call channels takes seconds and no physical installation.
Businesses use Session Border Controllers (SBCs) at the edge of their network to handle SIP trunk connections. The SBC performs NAT traversal, topology hiding (concealing internal PBX addresses from the carrier), codec transcoding (converting between internal G.711 and carrier-preferred G.729), and security functions like rate limiting and denial-of-service protection against SIP floods.
SIP vs H.323 vs WebRTC
| Feature | SIP | H.323 | WebRTC |
|---|---|---|---|
| Protocol type | Text-based (HTTP-like) | Binary (ASN.1) | Browser API + protocol suite |
| Signalling transport | UDP / TCP / TLS | TCP | WebSocket (application-defined) |
| Media protocol | RTP / SRTP | RTP | DTLS-SRTP |
| NAT traversal | STUN / TURN / ICE / SBC | H.460 / ALG | ICE (built-in) |
| Browser native | No (via SIP.js/WebSocket) | No | Yes |
| Typical deployment | Enterprise PBX, SIP trunking, UCaaS | Legacy video conferencing | Browser calling, Meet, Teams, Zoom |
NAT Traversal Challenges
NAT (Network Address Translation) is the mechanism that allows many devices to share a single public IP address. It is ubiquitous in home and office networks and is one of the most significant sources of complexity in SIP deployments. The problem is that SIP embeds the device's IP address and RTP port directly in SDP message bodies and in Via and Contact headers. A device behind NAT reports its private IP (such as 192.168.1.50) in these fields. The remote party or SIP proxy cannot route RTP media to that address because it is not reachable from the internet.
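The symptom described above is easy to spot in a capture: the SDP connection line carries an RFC 1918 private address. The sketch below flags such lines. It assumes IPv4 `c=IN IP4` lines only, and the helper names are made up for illustration.

```python
import ipaddress

# The three private IPv4 ranges reserved by RFC 1918.
RFC1918_NETWORKS = [ipaddress.ip_network(n)
                    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]


def is_rfc1918(address):
    """True if the address falls in one of the RFC 1918 private ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in RFC1918_NETWORKS)


def private_media_addresses(sdp):
    """Return the private addresses advertised in c=IN IP4 lines of an SDP body."""
    flagged = []
    for line in sdp.splitlines():
        if line.startswith("c=IN IP4 "):
            address = line.split()[-1]
            if is_rfc1918(address):
                flagged.append(address)
    return flagged


sdp = "v=0\r\nc=IN IP4 192.168.1.50\r\nm=audio 49170 RTP/AVP 0\r\n"
print(private_media_addresses(sdp))  # ['192.168.1.50']
```

A non-empty result on the public side of a NAT usually means the far end will try to send RTP to an address it cannot reach.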
The standard solution stack involves STUN (Session Traversal Utilities for NAT), which lets a device discover its public IP address and NAT-mapped port by querying a STUN server. TURN (Traversal Using Relays around NAT) provides a media relay server for cases where direct NAT traversal is impossible. ICE (Interactive Connectivity Establishment) systematically gathers multiple candidate address/port pairs and tries them in priority order to find the path that works. Session Border Controllers at the network edge can also rewrite SDP addresses to substitute the correct public addresses, handling NAT traversal centrally rather than on each endpoint.
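The "priority order" that ICE uses comes from a formula in the ICE specification (RFC 8445): priority = 2^24 × type preference + 2^8 × local preference + (256 − component ID). The sketch below applies it with the specification's recommended type preferences, assuming a single network interface (local preference 65535) and the RTP component (ID 1). The candidate addresses are placeholders, and a real ICE agent goes on to rank candidate pairs, which is not shown here.

```python
# Recommended type preferences: direct host addresses rank highest,
# relayed (TURN) addresses lowest.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(cand_type, local_preference=65535, component_id=1):
    """Candidate priority per the ICE formula."""
    return ((TYPE_PREFERENCE[cand_type] << 24)
            + (local_preference << 8)
            + (256 - component_id))


# (type, address, port): relay from TURN, srflx from STUN, host from the NIC.
candidates = [
    ("relay", "198.51.100.20", 50000),
    ("srflx", "203.0.113.10", 49170),
    ("host", "192.168.1.50", 49170),
]
ordered = sorted(candidates, key=lambda c: candidate_priority(c[0]), reverse=True)
for cand_type, address, port in ordered:
    print(candidate_priority(cand_type), cand_type, address, port)
```

With these values the host candidate is tried first and the TURN relay last, which matches the intent of using a relay only when more direct paths fail.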
Frequently Asked Questions
What is the difference between SIP and VoIP?
VoIP is a broad category describing any voice communication carried over IP networks. SIP is one specific protocol used within VoIP systems to handle signalling — setting up, modifying, and terminating sessions. A VoIP call typically involves two separate protocols: SIP negotiates the call parameters (who is calling whom, which codecs to use, what IP addresses and ports to send audio to), and RTP carries the actual voice packets once the call is established. You can have VoIP without SIP (using H.323 or proprietary protocols), but SIP is by far the most widely deployed signalling protocol for VoIP today.
What port does SIP use?
SIP uses port 5060 for unencrypted signalling over UDP or TCP, and port 5061 for encrypted SIP over TLS (sometimes called SIPS). The SIP specification allows either UDP or TCP as the transport, and modern deployments increasingly use TCP or TLS for reliability and security. Firewalls must permit SIP signalling on these ports and also permit the dynamic RTP media ports (typically in the range 10000–20000 UDP) that are negotiated via SDP during call setup. Failing to open the RTP port range is one of the most common causes of one-way audio in SIP deployments.
What is a SIP trunk?
A SIP trunk is a virtual connection between a business phone system (PBX) and the Public Switched Telephone Network (PSTN) delivered over an IP connection rather than physical telephone lines. The SIP trunk provider routes SIP signalling and RTP media between the business PBX and the global telephone network. Businesses replace expensive ISDN PRI circuits with SIP trunks, which are cheaper, more flexible (capacity can be scaled up or down instantly), and support direct inward dialling (DID) to individual extensions. Most SIP trunk providers charge per channel (simultaneous call) or per minute rather than per physical line.
Why does SIP have NAT traversal problems?
SIP embeds IP addresses and port numbers inside the SDP message body and SIP headers. When a SIP client sits behind a NAT router, it reports its private IP address (e.g. 192.168.1.10) in these fields. The remote party or proxy server cannot reach that private address from the internet. The SIP signalling may traverse the NAT via the router's stateful tracking, but the RTP media stream is sent to the private IP address listed in SDP — which is unreachable. Solutions include STUN (discovers the public IP/port), TURN (relays media through a public server), ICE (tries multiple candidate paths), and Session Border Controllers (SBCs) that rewrite SDP addresses at the network edge.
What is the difference between SIP and WebRTC?
SIP is a text-based signalling protocol designed for VoIP systems, typically used by desk phones, softphones, and PBX systems. WebRTC is a browser API and protocol suite (using ICE, DTLS-SRTP, and SCTP) designed for real-time communication directly from web browsers without plugins. WebRTC handles its own signalling at the application layer — there is no mandated signalling protocol, though SIP over WebSocket and proprietary JSON signalling are both common. SIP deployments are common in enterprise telephony and carrier networks; WebRTC is the foundation for browser-based calling in platforms like Google Meet, Teams, and Zoom. Session Border Controllers often bridge between WebRTC endpoints and SIP infrastructure.
What is SDP in SIP?
SDP stands for Session Description Protocol (RFC 8866, which obsoletes RFC 4566). It is not a standalone protocol but a text format carried inside SIP message bodies to describe the media parameters of a session. An SDP body specifies the codec list in preference order, the IP address and UDP port where the caller wants to receive RTP audio, the media type (audio, video, application), timing information, and optional attributes like the direction of the stream (sendrecv, sendonly, recvonly). During a SIP INVITE, the caller includes an SDP offer. The callee's 200 OK includes an SDP answer choosing from the offered codecs. This offer/answer model is how the two endpoints agree on a common codec before RTP media begins flowing.