Higher video for cell RTC with AV1 and HD

  • At Meta, we help real-time communication (RTC) for billions of individuals by means of our apps, together with Messenger, Instagram, and WhatsApp.
  • We’ve seen important advantages by adopting the AV1 codec for RTC.
  • Right here’s how we’re bettering the RTC video high quality for our apps with instruments just like the AV1 codec, the challenges we face, and the way we mitigate these challenges.

The previous couple of many years have seen great enhancements in cell phone digicam high quality in addition to video high quality for streaming video providers. But when we take a look at real-time communication (RTC) functions, whereas the video high quality additionally has improved over time, it has all the time lagged behind that of digicam high quality. 

After we checked out methods to enhance video high quality for RTC throughout our household of apps, AV1 stood out as the best choice. Meta has more and more adopted the AV1 codec over time as a result of it provides excessive video high quality at bitrates a lot decrease than older codecs. However, as we’ve implemented AV1 for mobile RTC, we’ve additionally needed to handle quite a lot of challenges together with scaling,  bettering video high quality for low-bandwidth customers in addition to high-end networks, CPU and battery utilization, and sustaining high quality stability.

Enhancing video high quality for low-bandwidth networks

This publish goes to deal with peer-to-peer (P2P, or 1:1) calls, which contain two members. 

Individuals who use our services expertise a variety of community circumstances – some have actually nice networks, whereas others are utilizing throttled or low-bandwidth networks.

This chart illustrates what the distribution of bandwidth seems to be like for a few of these calls on Messenger:

Determine 1: Bandwidth distribution of P2P calls on Messenger.

As seen in Determine 1, some calls function in very low-bandwidth circumstances. 

We think about something lower than 300 Kbps to be a low-end community, however we additionally see lots of video calls working at simply 50 Kbps, and even beneath 25 Kbps.

Word that this bandwidth is the share for the video encoder. Whole bandwidth is shared with audio, RTP overhead, signaling overhead, RTX (re-transmissions of packets to deal with misplaced packets)/FEC (ahead error correction)/duplication (packet duplication), and so forth. The massive assumption right here is that the bandwidth estimator is working appropriately and estimating true bitrates. 

There aren’t any common definitions for low, mid, and excessive networks, however for the aim of this weblog publish, lower than 300 Kbps might be thought of as low, 300-800 Kbps as mid, and above 800 Kbps as a excessive, HD-capable, or high-end community.

After we seemed into bettering the video high quality for low-bandwidth customers, there have been few key choices. Migrating to a more moderen codec corresponding to AV1 introduced the best alternative. Different choices corresponding to higher video scalers and region-of-interest encoding supplied incremental enhancements. 

Video scalers

We use WebRTC in most of our apps, however the video scalers shipped with WebRTC don’t have the very best quality video scaling. We’ve been capable of enhance the video scaling high quality considerably by leveraging in-house scalers. 

At low bitrates, we frequently find yourself downscaling the video to encode at ¼ decision (assuming the digicam seize is 640×480 or 1280×720). With our customized scaler implementations, we’ve got seen important enhancements in video high quality. From public exams we noticed features in peak signal-to-noise ratio (PSNR) by 0.75 db on common.

Here’s a snapshot displaying outcomes with the default libyuv scaler (a field filter):

Determine 2.a: Video picture outcomes utilizing WebRTC/libyuv video scaler.

And the outcomes after downscaling with our video scaler:

Determine 2.b: Video picture outcomes utilizing Meta’s video scaler.

Area-of-interest encoding

Figuring out the area of curiosity (ROI) allowed us to optimize by spending extra encoder bitrate within the space that’s most necessary to a viewer (the speaker’s face in a speaking head video, for instance). Most cell units have APIs to find the face area with out using any CPU overhead. As soon as we’ve got discovered the face area we are able to configure the encoder to spend extra bits on this necessary area and fewer on the remainder. The simplest manner to do that was to have some APIs on encoders to configure the quantization parameters (QP) for ROI versus the remainder of the picture. These modifications offered incremental enhancements within the video high quality metrics like PSNR. 

Adopting the AV1 video codec

The video encoder is a key factor in the case of video high quality for RTC. H.264 has been the preferred codec over the past decade, with {hardware} help and most functions supporting it. However it’s a 20-year-old codec. Again in 2018, the Alliance for Open Media (AOMedia) standardized the AV1 video codec. Since then, a number of firms together with Meta, YouTube, and Netflix have deployed it at a large scale for video streaming

At Meta, transferring from H.264 to AV1 led us to our best enhancements in video high quality at low bitrates.

Determine 3: Enhancements over time, transferring from H.262 to AV1 and H.266

Why AV1?

We selected to make use of AV1 partly as a result of it’s royalty-free. Codec licensing (and concurrent charges) was an necessary facet in our decision-making course of. Usually, if an utility makes use of a tool’s {hardware} codec, no further codec licensing prices might be incurred. But when an utility is transport a software program model of the codec,  there’ll most probably be licensing prices to cowl.

However why do we have to use software program codecs despite the fact that most telephones have hardware-supported codecs?

Most cell units have devoted {hardware} for video encoding and decoding. And as of late most cell units help H.264 and even H.265s. However these encoders are designed for frequent use circumstances corresponding to digicam seize, which makes use of a lot larger resolutions, body charges, and bitrates. Most cell gadget {hardware} is at the moment able to encoding 4K 60 FPS in actual time with very low battery utilization, however the outcomes of encoding a 7 FPS, 320×180, 200 Kbps video are sometimes worse than these of software program encoders working on the identical cell gadget. 

The rationale for that’s prioritization of the RTC use case. Most unbiased {hardware} distributors (IHVs) should not conscious of the community circumstances the place RTC calls function; therefore, these {hardware} codecs should not optimized for RTC eventualities, particularly for low bitrates, resolutions, and body charges. So, we leverage software program encoders when working in these low bitrates to supply high-quality video.

And since we are able to’t ship software program codecs and not using a license, AV1 is an excellent choice for RTC.

AV1 for RTC

The largest cause to maneuver to a extra superior video codec is easy: The identical high quality expertise might be delivered with a a lot decrease bitrate, and we are able to ship a a lot higher-quality real-time calling expertise for our customers who’re on bandwidth-constrained networks.

Measuring video high quality is a fancy matter, however a comparatively easy manner to take a look at it’s to make use of the Bjontegaard Delta-Bit Rate (BD-BR) metric. BD-BR compares how a lot bitrate numerous codecs want to supply a sure high quality degree. By producing a number of samples at completely different bitrates, measuring the standard of the produced video offers a rate-distortion (RD) curve, and from the RD curve you possibly can derive the BD-BR (as proven under).

As might be seen in Determine 4, AV1 offered larger high quality for all bitrate ranges in our native exams.

Determine 4: Bitrate distortion comparability chart.

Display screen-encoding instruments

AV1 additionally has a number of key instruments which can be helpful for RTC. Display screen content material high quality is changing into an more and more necessary issue for Meta, with related use circumstances, together with display screen sharing, sport streaming, and VR distant desktop, requiring high-quality encoding. In these areas, AV1 really shines. 

Historically, video encoders aren’t properly suited to advanced content material corresponding to textual content with lots of high-frequency content material, and people are delicate to studying blurry textual content. AV1 has a set of coding instruments—palette mode and intra-block copy—that drastically enhance efficiency for display screen content material. Palette mode is designed in accordance with the remark that the pixel values in a screen-content body normally think about the restricted variety of colour values. Palette mode can symbolize the display screen content material effectively by signaling the colour clusters as an alternative of the quantized transform-domain coefficients. As well as, for typical display screen content material, repetitive patterns can normally be discovered inside the identical image. Intra-block copy facilitates block prediction inside the identical body, in order that the compression effectivity might be improved considerably. That AV1 offers these two instruments on the baseline profile is a large plus.

Reference image resampling: Fewer key frames

One other helpful characteristic is reference image resampling (RPR), which permits decision modifications with out producing a key body. In video compression, a key body is one which’s encoded independently, like a nonetheless picture. It’s the one sort of body that may be decoded with out having one other body as reference. 

For RTC functions, for the reason that bandwidth retains on altering usually, there are frequent decision modifications wanted to adapt to those community modifications. With older codecs like H.264, every of those decision modifications requires a key body that’s a lot bigger in dimension and thus inefficient for RTC apps. Such massive key frames enhance the quantity of information needing to be despatched over the community and lead to larger end-to-end latencies and congestion. 

By utilizing RPR, we are able to keep away from producing any key frames.

Challenges round bettering video high quality for low-bandwidth customers

CPU/battery utilization

AV1 is nice for coding effectivity, however codecs obtain this at the price of larger CPU and battery utilization. Quite a lot of trendy codecs pose these challenges when working real-time functions on cell platforms.

Based mostly on native lab testing, we anticipated a roughly 4 p.c enhance in battery utilization, and we noticed comparable ends in public exams. We used an influence meter to do that native battery measurement.

Although the AV1 encoder itself elevated CPU utilization three-fold when in comparison with H.264 implementation, the general contribution of CPU utilization from the encoder was a small a part of the battery utilization. The cellphone show display screen, networking/radio, and different processes utilizing the CPU contribute considerably to battery utilization, therefore the rise in battery utilization was 5-6 p.c (a big enhance in battery utilization). 

Quite a lot of calls run out of gadget battery, or folks grasp up as soon as their working system signifies a low battery, so growing battery utilization isn’t worthwhile for customers until it offers elevated worth corresponding to video high quality enchancment. Even then it’s a trade-off between video high quality versus battery use.

We use WebRTC and Session Description Protocol (SDP) for codec negotiation, which permits us to barter a number of codecs (e.g., AV1 and H.264) up entrance after which change the codecs with none want for signaling or a handshake through the name. This implies the codec change is seamless, with out customers noticing any glitches or pauses in video.

We created a customized encoder that encapsulates each H.264 and the AV1 encoders. We name it a hybrid encoder. This allowed us to change the codec through the name based mostly on triggers corresponding to CPU utilization, battery degree, or encoding time — and to change to the extra battery-efficient H.264 encoder when wanted. 

Elevated crashes and out of reminiscence errors

Even with out new leaks added, AV1 used extra reminiscence than H.264. Any time further reminiscence is used, apps usually tend to hit out of reminiscence (OOM) crashes or hit OOM sooner due to different leaks or reminiscence calls for on the system from different apps. To mitigate this, we needed to disable AV1 on units with low reminiscence. That is one space for enchancment and for additional optimizing the encoder’s reminiscence utilization.

In-product high quality measurement

To match the standard between H.264 and AV1 utilizing public exams, we wanted a low-complexity metric. Metrics corresponding to encoded bitrates and body charges gained’t present any features as the overall bandwidth obtainable to ship video remains to be the identical, as these are restricted by the community capability, which implies the bitrates and body charges for video won’t change a lot with the change within the codec. We had been utilizing composite metrics that mix the quantization parameter (QP is commonly used as a proxy for video high quality, as this introduces pixel knowledge loss through the encoding course of), resolutions, and body fee, and freezes it to generate video composite metrics, however QP is just not comparable between AV1 and H.264 codecs, and therefore can’t be used.

PSNR is a normal metric, but it surely’s reference-based and therefore doesn’t work for RTC. Non-reference, video-quality metrics are fairly CPU-intensive (e.g., BRISQUE: Blind/Referenceless Picture Spatial High quality Evaluator), although we’re exploring these as properly.

Determine 6: Excessive-level structure for PSNR computation in RTC.

We’ve provide you with a framework for PSNR computation. We first modified the encoder to report distortions attributable to compression (most software program encoders have already got help for this metric). Then we designed a light-weight, scaling-distortion algorithm that estimates the distortion launched by video scaling. This algorithm can mix these scaling distortions with the encoder distortions to supply output PSNR. We developed and verified this algorithm regionally and might be sharing the findings in publications and at educational conferences over the subsequent 12 months. With this light-weight PSNR metric, we noticed 2 db enhancements with AV1 in comparison with H.264.

Challenges round bettering video high quality for high-end networks

As a fast overview: For our functions, excessive bandwidth covers customers for whom bandwidth is bigger than 800 kbps. 

Over time, there have been big enhancements in digicam seize high quality. In consequence, folks’s expectations have gone up, and so they need to see RTC video high quality on par with native digicam seize high quality. 

Based mostly on native testing, we settled on settings leading to video high quality that appears just like that of digicam recordings. We name this HD mode. We discovered that with a video codec like H.264 encoding at 3.5 Mbps and 30 frames per second, 720p decision seemed similar to native digicam recordings. We additionally in contrast 720p to 1080p in subjective high quality exams and located that the distinction is just not noticeable on most units aside from these with a bigger display screen after we performed subjective high quality exams.

Bandwidth estimator enhancements

Enhancing the video high quality for customers who’ve high-end telephones with good CPUs, good batteries, {hardware} codecs, and good community speeds appears trivial. It could look like all it’s a must to do is enhance the utmost bitrate, seize decision, and seize body charges, and customers will ship high-quality video. However, in actuality, it’s not that easy. 

In case you enhance the bitrate, you expose your bandwidth estimation and congestion detection algorithm to hit congestion extra usually, and your algorithm might be examined many extra occasions than if you weren’t utilizing these larger bitrates. 

Determine 7: Instance displaying how utilizing larger bandwidth will increase the situations for congestion.

In case you take a look at the community pipeline in Determine 7, the upper the bitrates you’re utilizing, the extra your algorithm/code might be examined for robustness over the time of the RTC name. Determine 7 reveals how utilizing 1 Mbps hits extra congestion than utilizing 500 Kbps and utilizing 3 Mbps hits extra congestion than 1 Mbps, and so forth. If you’re utilizing bandwidths decrease than the minimal throughput of the community, nonetheless, you gained’t hit congestion in any respect. For instance, see the 500-Kbps name in Determine 7. 

To mitigate these points, we improved congestion detection. For instance, we added customized ISP throttling detection, one thing that was not being caught by the standard delay-based estimator of WebRTC. 

Bandwidth estimator and community resilience comprise a fancy space on their very own, and that is the place RTC merchandise stand out. They’ve their very own customized algorithms that work finest for his or her merchandise and prospects.

Steady high quality

Individuals don’t like oscillations in video high quality. These can occur after we ship high-quality video for a number of seconds after which drop again to low-quality due to congestion. Studying from previous historical past, we added help in  bandwidth estimation to stop these oscillations.

Audio is extra necessary than video for RTC

When community congestion happens, all media packets may very well be misplaced. This causes video freezes and damaged audio, (aka, robotic audio). For RTC, each are dangerous, however audio high quality is extra necessary than video. 

Damaged audio usually fully prevents conversations from taking place, usually inflicting folks to hold up or redial the decision. Damaged video, however, usually ends in much less pleasant conversations, however, relying on the state of affairs, it is also a block for some customers.

At excessive bitrates like 2.5 Mbps and better, you possibly can afford to have three to 5 occasions extra audio packets or duplication with none noticeable degradation to video. When working in these larger bitrates with cellphone connections, we noticed extra of those congestion, packet loss, and ISP throttling points, so we needed to make modifications to our community resiliency algorithms. And since persons are extremely delicate to knowledge utilization on their cell telephones, we disabled excessive bitrates on mobile connections.

When to allow HD?

We used ML-based concentrating on to guess which name ought to be HD-capable. We relied on the community stats from the customers’ earlier calls to foretell if HD ought to be enabled or not.

Battery regressions

We’ve a number of metrics, together with efficiency, networking, and media high quality, to trace the standard of RTC calls. After we ran exams for HD, we observed regressions in battery metrics. What we discovered was that the majority battery regressions don’t come from larger bitrates or decision however from the seize body charges.

To mitigate the regressions, we constructed a mechanism for detecting each caller and callee gadget capabilities, together with gadget mannequin, battery ranges, Wi-Fi or cell utilization, and so forth. To allow high-quality modes, we verify each side of the decision to make sure that they fulfill the necessities and solely then will we allow these high-quality, resource-intensive configurations.

Determine 8: Signaling server setup for turning HD on or off.

What the longer term holds for RTC

{Hardware} producers are acknowledging the numerous advantages of utilizing AV1 for RTC. The brand new Apple iPhone 15 Professional helps AV1’s {hardware} decoder, and the Google Pixel 8 helps AV1 encoding and decoding. {Hardware} codecs are an absolute necessity for high-end community and HD resolutions. Video calling is changing into as ubiquitous as conventional audio calling and we hope that as {hardware} producers acknowledge this shift, there might be extra alternatives for collaboration between RTC app creators and {hardware} producers to optimize encoders for these eventualities. 

On the software program aspect, we’ll proceed to work on optimizing AV1 software program encoders and growing new encoder implementations. We attempt to present one of the best expertise for our customers, however on the identical time we need to let folks have full management over their RTC expertise. We’ll present controls to the customers in order that they will select whether or not they need larger high quality at the price of battery and knowledge utilization, or vice versa.

We additionally plan to work with IHVs to collaborate on {hardware} codec growth to make these codecs usable for RTC eventualities together with low-bandwidth use circumstances. 

We additionally will examine forward-looking options corresponding to video processing to extend the decision and body charges on the receiver’s rendering stack and leveraging AI/ML to enhance bandwidth estimation (BWE) and community resiliency.

Additional, we’re investigating Pixel Codec Avatar applied sciences that may permit us to transmit the mannequin/share as soon as after which ship the geometry/vectors for receiver aspect rendering. This allows video rendering with a lot smaller bandwidth utilization than conventional video codecs for RTC eventualities.