Untangling the WebRTC Flow

Dan Norman

22 November 2016

Think of this as a companion guide to the official w3 spec. With better pictures. In plain English.

This is an attempt at a flow-centric, rather than code-centric, description of WebRTC. It’s designed to help you better understand the plumbing by walking you chronologically through the formation of a peer-to-peer WebRTC connection, placing each of the various WebRTC Javascript API and out-of-band signaling calls inside the context of the flow.

Essentially it’s “what I wish I had seen in the WebRTC draft spec, but wasn’t there.”

Incidentally, the w3 spec is still in draft, complete with dubious warnings like, “This example flow needs to be discussed on the list and is likely wrong in many ways.” When you see things like that you get a little nervous. Consider this your sanity check.

If you crave general WebRTC info or specific working code examples, don’t read this. There are plenty of resources on what WebRTC is, its history, and why you would want to use it.

Some extra resources & supportability

If you’re looking for a simple high-level framework for using WebRTC’s features without an explanation of the plumbing, I highly recommend you check out the SimpleWebRTC toolkit written by the folks at &yet, as well as Talky, the demo app they created.

A word on supportability: the standard and even the APIs are still in flux, and while WebRTC is slowly gaining acceptance, as of today, only Chrome, Firefox and Opera support the standard in their stable releases. You can find detailed and up-to-date compatibility information at Is WebRTC Ready Yet.

The magic that is WebRTC

WebRTC is a web standard for Peer to Peer (P2P) voice and video chats (and data) that works natively in the browser (read: no plugins!). Unlike the complex, proprietary voice/video platforms of ages past, the WebRTC standard is intended to be implemented solely with HTML5 and a relatively simple Javascript API.

See why it's awesome?

The most confusing part about WebRTC

When I was first exposed to WebRTC, the most confusing and difficult part was realizing that there are actually three semi-asynchronous flows of logic that you need to follow in order to understand how a connection is made:

  1. The first flow is the WebRTC Javascript callback logic. Essentially, this logic handles all the browser-level handling of WebRTC.
  2. The second flow uses Session Description Protocol (SDP) messaging. This is the signaling logic that happens outside of the WebRTC connection to set up the P2P connections between the two people wanting to chat.
  3. And third, the STUN/TURN server ICE (Interactive Connectivity Establishment) messaging that assists with NAT traversal and media/data relay fallbacks.

I use the fake term “semi”-asynchronous because during the connection sequence, the Javascript callback logic and Signaling logic flows are variously triggered by each other, and fork off one another at different times, so keep that in mind if you’re ever scratching your head. It takes a bit of time to wrap your head around. I also created a gigantic diagram down below that definitely clarifies everything...Hopefully.

Understanding WebRTC’s signaling

Let’s talk a bit more about the SDP and ICE messaging flows (which together make up WebRTC's out-of-band Signaling), as they will come up continually during the rest of the post:

So to recap, SDP and ICE messages are part of Signaling, which involves at least two third party servers in order to broker the connection: a Signaling server, which relays SDP messages between peers, and a STUN server (which we refer to in our diagrams as Alice's ICE and Bob's ICE), which sends a set of ICE messages to the peers (yeah, I know, there's technically a TURN server too if P2P fails, but we'll ignore that for now).

If for some reason you find WebRTC signaling intriguing (or if you're confused), the HTML5Rocks WebRTC infrastructure guide provides some additional info.

Also, we don't cover the case where P2P fails and TURN servers are needed. Again, the HTML5Rocks infrastructure guide will help.

Security

Finally, and most importantly, you are 100% responsible for encryption and security of the SDP and ICE messages sent during the signaling phase. After that, you’re home free, because once the connection is set up, WebRTC connections are encrypted by default (pretty baller, you can see the pdf here).

However, if you fail to make your signaling secure, either by allowing your server to get pwned or by not properly encrypting the signaling channels, you are setting yourself up to get man-in-the-middled. Hard. So secure your stuff, people.

Off to the races

Now that we’ve cleared these things up, it is time to look at the 5,000-foot view of a WebRTC logic flow between Alice and Bob (click the image to see the large version):

Now that’s an eye-opener, no mistake. Let’s break that up a bit, and look at each chunk in turn.

Initialization phase

Alice kicks things off:

There are four steps to initialization (see footnote 1): First is to instantiate your RTCPeerConnection. Second, you need to define your callback functions and assign them to the RTCPeerConnection object. Third, call getUserMedia to get your camera’s video stream. Finally, create the SDP message by invoking onNegotiationNeeded, createOffer and setLocalDescription.

Step 1 · Instantiate your RTCPeerConnection object. This takes a list of STUN servers. STUN servers talk over the ICE protocol. So when you see STUN, think ICE, and vice versa. Google has graciously allowed access to a free STUN server that is incredibly useful during development, but it’s probably a good idea to roll your own in a high volume production app.
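As a rough sketch of Step 1, the setup might look something like the code below. The STUN URL is an assumption (it is Google's commonly used public server; the article doesn't name one), and createPeerConnection is a hypothetical helper, not part of the WebRTC API. The constructor is passed in so the same code can be exercised outside a browser.

```javascript
// Configuration handed to RTCPeerConnection. The STUN URL is an assumption:
// Google's public server, handy for development, not for production traffic.
const rtcConfig = {
  iceServers: [{ urls: 'stun:stun.l.google.com:19302' }]
};

// Hypothetical helper: builds the connection with whichever constructor is
// passed in, falling back to the browser's RTCPeerConnection when it exists.
function createPeerConnection(PeerConnection, config = rtcConfig) {
  const Ctor = PeerConnection ||
    (typeof RTCPeerConnection !== 'undefined' ? RTCPeerConnection : null);
  if (!Ctor) {
    throw new Error('RTCPeerConnection is not available in this environment');
  }
  return new Ctor(config);
}
```

In a browser, calling createPeerConnection() with no arguments would pick up the native constructor.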

Step 2 · Define callbacks. The RTCPeerConnection object you just created needs to be given life! To do this you must create three callback functions (onIceCandidate, onNegotiationNeeded and onAddStream) and assign them to the RTCPeerConnection. Pay attention, because we’ll be using some of these later.

Step 3 · onIceCandidate. This function handles responses made from our STUN server to your browser regarding NAT/Firewall traversal. You will need to create a function that accepts an RTCIceCandidate object. RTCIceCandidate objects look a lot like this:

{
  sdpMLineIndex: 1,
  sdpMid: "video",
  candidate: "candidate:3789462185 2 tcp 1518280447 10.0.0.7 0 typ host tcptype active generation 0"
}

You don’t have to care what this means, but you do have to be able to perfectly communicate this information to Bob who is just sitting around waiting for you to send him a message via your signaling server.

Step 4 · onNegotiationNeeded. If something happens that requires a new session negotiation, this is triggered. In most cases, it's triggered by the “Allow” button, which fires off an event that this function listens for.

Step 5 · onAddStream. This is called when you call setRemoteDescription with the SDP info of your remote peer. It’s where you handle the video stream from your remote buddy. Most folks place this stream in a video element and call it a day, but you can do whatever you want with it, really. Just try not to break the law.
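To make the three callbacks a little more concrete, here is a minimal sketch of how they might be assigned. Note that on the real object the handler properties are spelled in lowercase (onicecandidate, onnegotiationneeded, onaddstream), and that newer browsers favor ontrack over onaddstream. The attachCallbacks helper and the three functions in handlers are hypothetical names made up for this illustration.

```javascript
// Hypothetical helper: wires the three callbacks onto a peer connection.
// `handlers` supplies the app-specific pieces (all names are illustrative):
//   sendToPeer(message)  - relays a message through your signaling server
//   startNegotiation()   - begins the offer/answer exchange
//   showRemoteStream(s)  - displays the remote media stream
function attachCallbacks(pc, handlers) {
  pc.onicecandidate = (event) => {
    // A null candidate marks the end of gathering; there is nothing to send.
    if (event.candidate) {
      handlers.sendToPeer({ type: 'candidate', candidate: event.candidate });
    }
  };
  pc.onnegotiationneeded = () => handlers.startNegotiation();
  pc.onaddstream = (event) => handlers.showRemoteStream(event.stream);
  return pc;
}
```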

Step 6 · Call getUserMedia. getUserMedia is a function that allows you to get your camera’s video stream and do whatever you want with it. It takes three parameters: constraints, a success callback function, and a failure callback function:

Step 7 · constraints. Do you want to allow video or audio? Both? What size should the stream be? Black and white only? Add filters? Constraints help you customize this kind of info. For a full treatment on constraints you may want to put your hazmat suit on, and look at the official w3 spec.

Step 8 · success. If the “allow” button is clicked and all goes according to plan, this function will execute. In here you will handle the media stream from your local camera. All the cool kids add this to a video element to represent yourself in the video call. Probably in the corner, tiny. But maybe you really like to look at yourself and decide it should be a giant full screen monstrosity. Hey, that's your prerogative. Go for it!

Step 9 · failure. Just log the error to console and call it a day.
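Here is a hedged sketch of the getUserMedia call with its three parameters. The function is injected in the callback form described above, so the example doesn't depend on any one browser's spelling of it. requestLocalMedia, showLocalStream and selfView are hypothetical names, and the call to addStream is an assumption about what the success callback does to set negotiation in motion.

```javascript
// Constraints: ask for both audio and video with default settings.
const mediaConstraints = { audio: true, video: true };

// Hypothetical helper. `getUserMedia` is injected in the three-argument
// callback form the article describes; `pc` is the RTCPeerConnection.
function requestLocalMedia(getUserMedia, pc, showLocalStream) {
  getUserMedia(
    mediaConstraints,
    (stream) => {
      showLocalStream(stream); // e.g. attach to a small self-view <video>
      pc.addStream(stream);    // assumption: this is what prompts negotiation
    },
    (error) => console.error('getUserMedia failed:', error)
  );
}

// In a browser, the promise-based API can be adapted to that form:
// const gum = (c, ok, fail) =>
//   navigator.mediaDevices.getUserMedia(c).then(ok, fail);
// requestLocalMedia(gum, pc, (s) => { selfView.srcObject = s; });
```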

Using the callbacks

Remember that onNegotiationNeeded callback function we defined? Of course you do. Well, as soon as Alice clicks “allow,” the onNegotiationNeeded callback is triggered, and we start the session negotiation. This is done by calling createOffer.

RTCPeerConnection’s createOffer method generates an SDP object. Be aware that createOffer takes an “on-success-callback” function that you will need to implement that handles the generated SDP. It also takes an error callback so you can write to console about your abject failure to create this SDP message. createOffer’s success callback will give us access to the generated SDP message. Oooh, yeah! Are you feeling the callback hell yet? Get ready, it’s just getting warmed up.

The first thing we do with our new SDP object is tell our RTCPeerConnection all about the local session description. This is done, unsurprisingly, with setLocalDescription, which also takes a success callback function. In this callback, you will send your SDP info to your remote peer via the third party signaling server you set up. Third party signaling server? Did we talk about that yet, you ask? Yes, yes we did.
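The nesting described above might be sketched roughly like this, using the callback-style signatures this article walks through (current browsers also offer promise-returning versions of both methods). sendOffer, sendToPeer and logError are hypothetical names for this illustration.

```javascript
// Shared error callback: just report the failure.
function logError(error) {
  console.error('Negotiation failed:', error);
}

// Hypothetical helper: create the offer, store it locally, then hand it to
// the signaling layer. `sendToPeer` stands in for your signaling server client.
function sendOffer(pc, sendToPeer) {
  pc.createOffer((offer) => {
    // Tell our own connection about the local session description first...
    pc.setLocalDescription(offer, () => {
      // ...and only then relay it to the remote peer.
      sendToPeer({ type: offer.type, sdp: offer.sdp });
    }, logError);
  }, logError);
}
```

Something like sendOffer(pc, sendToPeer) is what the onNegotiationNeeded callback would end up invoking.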

ICE negotiation phase

As soon as setLocalDescription is called, the ICE negotiation starts rolling. Remember those STUN servers we defined with the RTCPeerConnection in Step 1 way back up at the top? Well, setLocalDescription is going to ask the servers to generate some ICE candidates encapsulated in the lovely RTCIceCandidate object that we looked at earlier. The STUN server is going to generate some ICE candidates and send them back to us.

We handle these ICE candidates using the onIceCandidate callback we defined earlier. onIceCandidate will do nothing more than take the RTCIceCandidate objects generated by the STUN server and send them via our signaling server to the remote peer.
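Since the signaling channel is yours to design, the wire format is too. One possible sketch is to wrap the three candidate fields shown earlier in a small JSON envelope. The "type" field and the encodeCandidate/decodeCandidate names are assumptions made for this example, not anything WebRTC prescribes.

```javascript
// Serialize an RTCIceCandidate-like object for the signaling channel.
// The `type` tag is our own convention so the receiver can route the message.
function encodeCandidate(candidate) {
  return JSON.stringify({
    type: 'candidate',
    sdpMLineIndex: candidate.sdpMLineIndex,
    sdpMid: candidate.sdpMid,
    candidate: candidate.candidate
  });
}

// Recover the plain fields on the receiving side.
function decodeCandidate(text) {
  const message = JSON.parse(text);
  return {
    sdpMLineIndex: message.sdpMLineIndex,
    sdpMid: message.sdpMid,
    candidate: message.candidate
  };
}
```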

At this point, Alice is done setting up, has her webcam on, and is waiting on Bob to send Bob’s SDP and ICE Candidate messages so that she can send Bob her lovely face!

Bob's side of things

So Bob’s browser was just hanging out, sitting around listening for messages from someone. As soon as it receives either an ICE candidate or an SDP message, it kicks off the same initial process we went over when Alice initiated the call.

At this point, the flow forks into two asynchronous paths. One path triggers when Bob receives an ICE candidate. The other, when Bob receives the SDP message. We’ll cover both forks, starting with the ICE Message.
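That fork could be sketched as a small dispatcher on the receiving side. This assumes each signaling message arrives as JSON with a "type" field of "candidate", "offer" or "answer", which is a convention chosen for this example; routeSignalingMessage and the two handler names are hypothetical.

```javascript
// Hypothetical dispatcher: parse an incoming signaling message and send it
// down the ICE path or the SDP path depending on its type.
function routeSignalingMessage(text, handlers) {
  const message = JSON.parse(text);
  switch (message.type) {
    case 'candidate':
      handlers.onCandidate(message);
      break;
    case 'offer':
    case 'answer':
      handlers.onDescription(message);
      break;
    default:
      console.warn('Ignoring unknown signaling message type:', message.type);
  }
}
```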

Bob receives an ICE message

If the type of message is an ICE candidate, Bob is going to need to add this ICE candidate to his RTCPeerConnection with the addIceCandidate method. He does this by first building an RTCIceCandidate object from the message, and then passing this object to addIceCandidate.

That’s it!

Now Bob knows how to get from his computer all the way to Alice’s computer without bouncing off a third party server.
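Bob's ICE branch might look roughly like the sketch below. The candidate constructor is passed in so the snippet stays self-contained; in a browser you would hand it RTCIceCandidate. handleCandidate and the message field layout are assumptions carried over from the candidate object shown earlier.

```javascript
// Hypothetical handler: rebuild the candidate from the signaling message and
// add it to the peer connection. `IceCandidate` is the constructor to use.
function handleCandidate(pc, message, IceCandidate) {
  const candidate = new IceCandidate({
    sdpMLineIndex: message.sdpMLineIndex,
    sdpMid: message.sdpMid,
    candidate: message.candidate
  });
  pc.addIceCandidate(candidate);
  return candidate;
}

// In a browser: handleCandidate(pc, message, RTCIceCandidate);
```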

Bob receives an SDP message

If this message is an SDP message type, Bob will need to let his RTCPeerConnection know all about how Alice’s media is structured, her encryption, and a host of other things encapsulated in the Session Description Protocol. Bob accomplishes this by creating an RTCSessionDescription object and then passing it into the setRemoteDescription method.

setRemoteDescription also takes success and error callback functions. In the success callback, Bob first checks whether the SDP type he received was an “offer.” If it was anything other than an offer, execution stops there. At our current stage in the workflow the SDP type is “offer,” so Bob will need to generate his own SDP information. He does this through the handy createAnswer method.

createAnswer has a success callback that will handle the generated SDP of type “answer.” The success callback will call the setLocalDescription method on Bob’s RTCPeerConnection object. As we saw earlier, setLocalDescription itself has a success callback function whose sole purpose is to communicate this SDP information to Alice via our signaling server.
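Put together, the SDP branch could be sketched as follows, again using the callback-style signatures this article describes. handleDescription and sendToPeer are hypothetical names, and the description constructor is injected (RTCSessionDescription in a browser) so the example stands on its own.

```javascript
// Hypothetical handler for an incoming SDP message. If the remote description
// is an offer, it answers; for anything else (e.g. an answer) it stops there.
function handleDescription(pc, message, SessionDescription, sendToPeer) {
  const onError = (error) => console.error('SDP handling failed:', error);
  const remote = new SessionDescription({ type: message.type, sdp: message.sdp });

  pc.setRemoteDescription(remote, () => {
    if (remote.type !== 'offer') {
      return; // nothing further to do with an answer
    }
    pc.createAnswer((answer) => {
      pc.setLocalDescription(answer, () => {
        // Relay our answer back through the signaling server.
        sendToPeer({ type: answer.type, sdp: answer.sdp });
      }, onError);
    }, onError);
  }, onError);
}
```

Because the non-offer case simply returns, the same function would also cover the moment when Alice later receives Bob's answer.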

Alice receives SDP and ICE messages

We’re very nearly complete with the signaling! All that needs to happen now is for Alice to receive and process any SDP and ICE messages from Bob.

Set remote description (Alice)

When Alice receives an SDP message from Bob, she will use this data to create an RTCSessionDescription object and pass it to setRemoteDescription. Now Alice knows about Bob’s media metadata, as well as the encryption information required for the two to set up a secure connection.

Unlike Bob, however, the SDP message Alice is receiving is of type “answer” so nothing further happens with the SDP data.

ICE negotiation completed

The next message that comes rolling in is of type Candidate. When Alice receives a candidate message she does just as Bob did with her message. She creates an RTCIceCandidate object with the message's data, and then passes it to the addIceCandidate method on her RTCPeerConnection object. Now Alice knows how to reach Bob’s computer from her own, without having to make use of a third party server to relay the data.

Final steps

All this does is allow Bob to see Alice’s lovely face. Alice still can’t see Bob!

Why? Well, this whole process happens within a few hundred milliseconds, and at this point, Bob is probably still moving his mouse towards the “Accept” button to indicate he is willing to allow the browser access to his camera’s video stream. So while Bob can see Alice’s video stream, his video stream has not yet been communicated to Alice.

Once he clicks “Accept,” a negotiationneeded event is fired, which executes the function we defined for the callback onNegotiationNeeded. Since Bob and Alice have the exact same code running in their browsers, the entire process outlined above gets kicked off. The only difference is that now Alice is learning about Bob, whereas before Bob was learning about Alice.

OK, now we're actually done. Victory!

1 A caveat on initialization and getUserMedia: In other implementations, getUserMedia can be called independently of this initialization process. We're assuming an implementation where Alice and Bob have only a single "Call" button. In this implementation, the "Call" button triggers the instantiation of RTCPeerConnection, defines the callback functions including onNegotiationNeeded, and only then requests access to your webcam (by calling getUserMedia). It then awaits the "Allow" event, which triggers onNegotiationNeeded (which then triggers createOffer, and so on). You don't have to do it this way. For example, consider an implementation where your "Turn on Webcam" button is completely separate from the "Call" button. You can call getUserMedia at any time, including before you define your callback functions. If you use this method, you don't need to (and in fact, shouldn't) use onNegotiationNeeded. Instead, the "Call" button would set up RTCPeerConnection and the callbacks, and then would call createOffer directly, with no need for onNegotiationNeeded. However, if you want to use onNegotiationNeeded, you will need to use our implementation and define it before calling getUserMedia.