Nathan Moore (StackPath Director, Principal Software Engineer, DevOps) speaks at Streaming Media East in New York, NY on May 16, 2017 about optimizing video delivery.
In the old days, delivering video was a lot easier. A station would broadcast the content for everybody in the immediate vicinity to tune in. With traditional over-the-air delivery, individual user experience was less of a concern. The main concern was the overall or primary viewer experience. That’s all changed in a world of one-to-one distribution. The end user is contacting a server and negotiating their own connection, which means the experience of one user can be wildly different from the next person’s, even over the same networks and using the same equipment. To address this issue in today’s one-to-one world, delivering video on the web or over-the-top requires an optimized distribution system.
There are some general guidelines and a couple of basic objectives we need to achieve to optimize video:
- We require low latencies.
- We require high bandwidth.
- We require high reliability.
- Most importantly, we require scalability.
Bandwidth and Latencies: One server is not enough.
Your server has some finite amount of bandwidth assigned to it. The more people you add, the less bandwidth each of them gets. It’s basic math. If you have 100 viewers, divide the bandwidth by 100. If you gain 900 more viewers, everyone now has one-thousandth of the total bandwidth instead of one-hundredth. The more viewers you get, the smaller the share of that bandwidth each one can claim.
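The division above can be sketched in a few lines of Python. This is only an illustration: the 10 Gbps server figure and the function name are assumptions made for the example, not numbers from the talk.

```python
def per_viewer_bandwidth_mbps(server_bandwidth_mbps: float, viewers: int) -> float:
    """Evenly split a single server's bandwidth across all viewers."""
    if viewers <= 0:
        raise ValueError("viewers must be a positive integer")
    return server_bandwidth_mbps / viewers


# Hypothetical server with 10 Gbps (10,000 Mbps) of outbound bandwidth.
SERVER_BANDWIDTH_MBPS = 10_000.0

for viewers in (100, 1_000):
    share = per_viewer_bandwidth_mbps(SERVER_BANDWIDTH_MBPS, viewers)
    print(f"{viewers} viewers: {share:.0f} Mbps each")
```

Going from 100 to 1,000 viewers cuts each viewer's share by a factor of ten, which is why a single server stops being enough as the audience grows.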
Bandwidth and latency are our two key metrics. Bandwidth is the size of the pipes, the amount of data that can be sent in a given second. The higher the bandwidth, the more data you can send. The next factor, latency, is the time it takes for information to get from one device to another. Time-to-first-byte is a nice representation of that. The relationship between bandwidth and latency is not as intuitive as you may think. I’ll explain later why latency and bandwidth actually inter-operate, and why they’re essentially two sides of the same performance coin.
In the meantime, consider the following thought experiment. While brainstorming this topic, I considered what devices everyone has at the StackPath office. As it turns out, half of the people in the office have iPhones. If I care about those iPhone users, I need to figure out how to get video to them. iPhones use a video protocol called HLS. So, the next step is determining how the HLS object gets to the iPhone users. We know iPhones work on a 4G network. We also know the problems with 4G networks (e.g., spotty connections and latencies that depend on the location of the user). Let's say we want to deliver a video to tens of thousands of iPhones simultaneously using HLS over a 4G connection.
The problem that I’ll have is one that a lot of providers don’t think about, because of the old broadcast model. With over-the-air, you could have just one central broadcast tower that serves everybody in the local area. That’s no longer the case. The whole wide world can talk to your server, and if it’s very far away from the users, they’re going to have a very bad time with long latency. The lesson here is, one or even a few servers are not enough.
Reliability: Optimize for every layer.
One foundational technology we need to cover is HLS, also known as HTTP Live Streaming. It relies on the Hypertext Transfer Protocol, or HTTP. However, HTTP itself has a further dependency on TCP, the Transmission Control Protocol, in order to transfer the object in the first place. So we end up with multiple layers of different protocols that have to operate and inter-operate correctly. And if even one of those protocols doesn’t do its job properly, the end user will have issues with the video. You have to optimize every layer if you’re determined to deliver a quality product with a high-quality experience.
Let’s start with TCP and work our way back up the stack. The goal of TCP is to maintain network quality, because you can’t have a high-quality experience without high network quality. That job breaks down into flow control and congestion control. Flow control determines the fastest data rate you can negotiate. It also speaks to the reliability of your network: if you’re dropping packets, information has to be retransmitted. TCP does this automatically, which also helps with congestion control. Your bandwidth is divided among a bunch of people, and congestion control ensures that everybody gets a fair share of that bandwidth and prevents any one user from starving everyone else.
The reason why latency and bandwidth actually inter-operate, and why they’re essentially two sides of the same performance coin, is that TCP doesn’t send a constant stream of information. Instead, it takes a chunk of information, which we can call a packet. The packet is sent out, and then the sender waits for a response from whatever device it was trying to talk to. That creates a problem. The longer the latency (the longer the time-to-first-byte, the longer it takes to go between the server and the client), the less bandwidth we get, because we have to wait until that information is acknowledged. What happens if that information is not acknowledged? Then the server has to retransmit, which is terrible in a long-latency environment because it doubles the wait time between sending some information and receiving the acknowledgement of delivery.
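As a rough illustration of that relationship, the sketch below treats TCP as a strict send-then-wait loop: one chunk goes out, and nothing else is sent until the acknowledgement returns. Real TCP keeps a sliding window of many packets in flight, so this is a simplified model, and the function name and the 2 Mb chunk size are assumptions chosen for the example.

```python
def throughput_ceiling_mbps(chunk_megabits: float, round_trip_seconds: float) -> float:
    """Upper bound on throughput when each chunk must be acknowledged
    before the next one is sent: one chunk per round trip."""
    if round_trip_seconds <= 0:
        raise ValueError("round_trip_seconds must be positive")
    return chunk_megabits / round_trip_seconds


# Hypothetical chunk size: 2 megabits sent per round trip.
CHUNK_MEGABITS = 2.0

for rtt in (0.25, 0.5, 2.0):
    ceiling = throughput_ceiling_mbps(CHUNK_MEGABITS, rtt)
    print(f"round trip {rtt} s: at most {ceiling:.1f} Mbps")
```

The amount of data per round trip stays the same in every row; only the waiting time changes, and the achievable bandwidth falls as the round trip grows.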
Moving back to our thought experiment, let’s say you want to stream video from a server to an iPhone. Your video is encoded at 1 megabit per second (1Mbps). Obviously, these numbers are deliberately rounded to make the math super-simple. In the real world, you’ll get slightly different measurements. The iPhone is one second away from the server, so we have one second of latency in each direction: it takes one second for data to get to the iPhone, and one second for the iPhone to talk back to the server. As long as TCP can negotiate a speed of 1Mbps, the iPhone can play the 1Mbps encoded video at its intended rate, and we do that by sending 2Mb at a time. The data takes one second to arrive, and the iPhone’s acknowledgement takes another second to get back to the server. Sending 2Mb every two seconds gives us the 1Mbps that we needed.
But what happens if we have loss? We’re still encoded at 1Mbps. We send 2Mb of data. The iPhone never receives it, so it can't send an acknowledgement. The server waits two seconds, and says, “I know you’re a second away, so it should’ve taken two seconds for me to get that acknowledgement, and I didn’t get it. I better retransmit.” This time, the iPhone gets the data and sends its acknowledgement back, and another two seconds have passed. Now do the math. We just delivered 2Mb in four seconds, which is only half a megabit per second. If you’re watching a video encoded at 1Mbps, what happens? Delay. Stuttering. Buffering. This is why retransmits are so incredibly important.
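The arithmetic of the thought experiment can be checked with a short sketch. It assumes the simplified model used in the text: the server sends one 2Mb chunk, waits exactly one round trip for the acknowledgement, and retransmits the whole chunk if none arrives. The function and parameter names are hypothetical.

```python
def effective_throughput_mbps(
    payload_megabits: float,
    one_way_latency_seconds: float,
    lost_attempts: int = 0,
) -> float:
    """Throughput seen by the viewer when each lost attempt costs one
    full round trip of waiting before the chunk is sent again."""
    if lost_attempts < 0:
        raise ValueError("lost_attempts cannot be negative")
    round_trip = 2 * one_way_latency_seconds
    # One round trip per failed attempt, plus one for the attempt that succeeds.
    total_seconds = round_trip * (lost_attempts + 1)
    return payload_megabits / total_seconds


no_loss = effective_throughput_mbps(2.0, 1.0, lost_attempts=0)
one_loss = effective_throughput_mbps(2.0, 1.0, lost_attempts=1)
print(f"no loss:  {no_loss} Mbps")   # 2 Mb in 2 s
print(f"one loss: {one_loss} Mbps")  # 2 Mb in 4 s
```

With no loss the viewer gets the full 1Mbps; a single lost chunk halves that to 0.5Mbps, below the rate at which the video is encoded.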
The key is to have a variety of encodings available. If you have a smart protocol like HLS, you can negotiate up and down. So if you are in this sort of lossy, variable-latency case (as with the 4G network), you can actually negotiate down to a lower-bandwidth encoded video stream. The quality of the experience isn’t quite as high, but your end viewer can still watch the video without interruption.
Reach and Scalability: Users can be anywhere.
HLS, or HTTP Live Streaming, depends on HTTP, which in turn depends on TCP. Now let’s move back up the stack. When HTTP sends an object, we call it an HTTP object: a discrete thing that is entirely self-contained. What is important is that we can pass this unitary object from server to server to server unchanged, allowing us to chain servers together. If I have a server that’s local to that iPhone, I can use that server to serve the data even if my content server is half the world away. A caching or proxy server gives me a very low-latency connection between the iPhone and the server it talks to, even though my origin data is half a world away. Through a proxy server, a whole new world of performance opens up. I can focus on the connection between the iPhone and my proxy/caching server rather than the connection between the iPhone and the origin server. Through strategically placed proxy servers, I can shorten the physical distance between the user and my server, thus lowering latency and improving performance. Again, the relationship between bandwidth and latency allows us to optimize. With lower latency, you get higher bandwidth, and therefore a better-quality experience.
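A minimal sketch of the caching idea, assuming a purely in-memory cache and a stand-in function for the slow origin request. The `EdgeCache` class, its method names, and the fake origin are all hypothetical; a production cache would also honor HTTP caching headers, expire stale objects, and bound its memory use.

```python
from typing import Callable, Dict


class EdgeCache:
    """Serves HTTP objects from local memory, fetching each from the origin once."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch_from_origin = fetch_from_origin
        self._store: Dict[str, bytes] = {}
        self.origin_requests = 0  # counts the slow, long-distance fetches

    def get(self, path: str) -> bytes:
        if path not in self._store:
            # Cache miss: pay the long round trip to the origin once.
            self.origin_requests += 1
            self._store[path] = self._fetch_from_origin(path)
        # Cache hit (or freshly filled): served unchanged from the nearby server.
        return self._store[path]


def fake_origin(path: str) -> bytes:
    """Stand-in for a request to an origin server half a world away."""
    return f"contents of {path}".encode("utf-8")


edge = EdgeCache(fake_origin)
for _ in range(3):
    edge.get("/video/segment1.ts")
print(edge.origin_requests)  # only the first request reached the origin
```

Because the object is self-contained and passed along unchanged, every viewer after the first is served over the short, low-latency path to the edge server.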
As I’ve mentioned, the video streaming concept, HLS, has an HTTP dependency, which means it has to be able to break down your video stream into discrete chunks. HLS takes a stream and slices and dices it into different chunks. To keep track of them, it also maintains an ordered index in what’s called an M3U8 file (the name is short for MP3 URL, UTF-8 encoded). In short, it’s basically the playlist. The best thing about it is that you can define multiple streams within it. So you can have a very high-bandwidth stream, a medium-bandwidth stream, and a low-bandwidth stream. The client will choose the next best stream in the sequence based on what it thinks is the available bandwidth. If there is a lot of bandwidth, it takes the high-bandwidth encoded stream. If there is little bandwidth, it automatically takes the low-bandwidth encoded stream.
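To make the playlist idea concrete, here is a hedged sketch that parses a small multi-stream playlist and picks a stream for a measured bandwidth. The `#EXTM3U` header and the `#EXT-X-STREAM-INF` tag with its `BANDWIDTH` attribute (in bits per second) come from the HLS playlist format; the sample URIs, the bitrates, and the selection rule are assumptions for illustration. Real players use their own, more elaborate heuristics.

```python
import re
from typing import List, Tuple

# Hypothetical playlist offering three encodings of the same video.
MASTER_PLAYLIST = """\
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=400000,RESOLUTION=416x234
low/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1000000,RESOLUTION=640x360
mid/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=3000000,RESOLUTION=1280x720
high/index.m3u8
"""

# Match BANDWIDTH only at the start of an attribute, not inside AVERAGE-BANDWIDTH.
BANDWIDTH_RE = re.compile(r"(?:^|[:,])BANDWIDTH=(\d+)")


def parse_variants(playlist_text: str) -> List[Tuple[int, str]]:
    """Return (bandwidth_bps, uri) pairs sorted from lowest to highest bandwidth."""
    variants = []
    pending_bandwidth = None
    for raw_line in playlist_text.splitlines():
        line = raw_line.strip()
        if line.startswith("#EXT-X-STREAM-INF:"):
            match = BANDWIDTH_RE.search(line)
            pending_bandwidth = int(match.group(1)) if match else None
        elif line and not line.startswith("#") and pending_bandwidth is not None:
            # The URI line follows the tag that describes it.
            variants.append((pending_bandwidth, line))
            pending_bandwidth = None
    return sorted(variants)


def choose_variant(variants: List[Tuple[int, str]], measured_bps: int) -> Tuple[int, str]:
    """Pick the highest-bandwidth stream that fits; fall back to the lowest."""
    affordable = [v for v in variants if v[0] <= measured_bps]
    return affordable[-1] if affordable else variants[0]


variants = parse_variants(MASTER_PLAYLIST)
for measured in (5_000_000, 1_200_000, 100_000):
    print(measured, "->", choose_variant(variants, measured)[1])
```

When the measured bandwidth drops, the chosen stream steps down to a lower-bandwidth encoding instead of stalling, which is the negotiate-down behavior described above.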
Localized caching proxy servers combined with variable-bandwidth encoded streams allow users to download these individual chunks discretely, which makes for low-latency, high-quality connections to the end user even if your origin server is half a world away. It is all about ensuring a high-quality experience.