Optimizing RTC bandwidth estimation with machine learning

  • Bandwidth estimation (BWE) and congestion control play an important role in delivering high-quality real-time communication (RTC) across Meta’s family of apps.
  • We’ve adopted a machine learning (ML)-based approach that allows us to solve networking problems holistically across cross-layers such as BWE, network resiliency, and transport.
  • We’re sharing our experiment results from this approach, some of the challenges we encountered during execution, and learnings for new adopters.

Our existing bandwidth estimation (BWE) module at Meta is based on WebRTC’s Google Congestion Controller (GCC). We have made several improvements through parameter tuning, but this has resulted in a more complex system, as shown in Figure 1.

Figure 1: BWE module’s system diagram for congestion control in RTC.

One challenge with the tuned congestion control (CC)/BWE algorithm was that it had multiple parameters and actions that were dependent on network conditions. For example, there was a trade-off between quality and reliability; improving quality for high-bandwidth users often led to reliability regressions for low-bandwidth users, and vice versa, making it challenging to optimize the user experience for different network conditions.

Additionally, we noticed some inefficiencies in improving and maintaining the complex BWE module:

  1. Due to the absence of realistic network conditions during our experimentation process, fine-tuning the parameters for user clients necessitated several attempts.
  2. Even after the rollout, it wasn’t clear if the optimized parameters were still applicable for the targeted network types.
  3. This resulted in complex code logic and branches for engineers to maintain.

To solve these inefficiencies, we developed a machine learning (ML)-based, network-targeting approach that offers a cleaner alternative to hand-tuned rules. This approach also allows us to solve networking problems holistically across cross-layers such as BWE, network resiliency, and transport.

Network characterization

An ML model-based approach leverages time series data to improve bandwidth estimation by using offline parameter tuning for characterized network types.

For an RTC call to be completed, the endpoints must be connected to each other through network devices. The optimal configs that have been tuned offline are stored on the server and can be updated in real time. During the call connection setup, these optimal configs are delivered to the client. During the call, media is transferred directly between the endpoints or through a relay server. Depending on the network signals collected during the call, an ML-based approach characterizes the network into different types and applies the optimal configs for the detected type.
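The config-delivery flow above can be sketched as a simple lookup of offline-tuned parameter sets keyed by the detected network type. This is a minimal illustration, not Meta's actual config system; the names (`NETWORK_CONFIGS`, `select_config`) and the parameter values are hypothetical.

```python
# Hypothetical server-side store of offline-tuned parameter sets, keyed by
# characterized network type. Values are illustrative placeholders.
NETWORK_CONFIGS = {
    "random_loss": {"loss_tolerance_pct": 10, "rampup_factor": 1.5, "fec_ratio": 0.20},
    "bursty_loss": {"loss_tolerance_pct": 2, "rampup_factor": 1.0, "fec_ratio": 0.10},
    "default": {"loss_tolerance_pct": 5, "rampup_factor": 1.0, "fec_ratio": 0.05},
}


def select_config(detected_type: str) -> dict:
    """Return the offline-tuned config for the detected network type,
    falling back to the default when the type is unknown."""
    return NETWORK_CONFIGS.get(detected_type, NETWORK_CONFIGS["default"])
```

Because the configs live on the server rather than in client code, they can be re-tuned and redeployed without shipping a new client build.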

Figure 2 illustrates an example of an RTC call that’s optimized using the ML-based approach.

Figure 2: An example RTC call configuration with optimized parameters delivered from the server and based on the current network type.

Model learning and offline parameter tuning

At a high level, network characterization consists of two main components, as shown in Figure 3. The first component is offline ML model learning, using ML to categorize the network type (random packet loss versus bursty loss). The second component uses offline simulations to tune parameters optimally for the categorized network type.

Figure 3: Offline ML-model learning and parameter tuning.

For model learning, we leverage the time series data (network signals and non-personally identifiable information, see Figure 6, below) from production calls and simulations. Compared to the aggregate metrics logged after the call, time series data captures the time-varying nature of the network and its dynamics. We use FBLearner, our internal AI stack, for the training pipeline and deliver the PyTorch model files on demand to the clients at the beginning of the call.

For offline tuning, we use simulations to run network profiles for the detected types and choose the optimal parameters for the modules based on improvements in technical metrics (such as quality, freeze, etc.).
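The offline tuning step can be sketched as a parameter sweep over simulated network profiles, keeping the configuration that scores best on the technical metrics. This is a simplified sketch under stated assumptions: `simulate` stands in for replaying a network profile offline and returning a scalar metric score, and `tune_parameters` is a hypothetical name, not Meta's tooling.

```python
import itertools


def tune_parameters(simulate, param_grid):
    """Exhaustively sweep a parameter grid against an offline simulation and
    return the configuration with the best technical-metric score.

    `simulate` maps a config dict to a scalar score (higher is better)."""
    best_cfg, best_score = None, float("-inf")
    keys = sorted(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = simulate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

In practice the "simulation" is a full network-profile replay and the score aggregates several metrics (quality, freeze, etc.), but the selection loop is the same shape.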

Model architecture

From our experience, we’ve found that it’s critical to combine time series features with non-time-series features (i.e., metrics derived from the time window) for highly accurate modeling.

To handle both time series and non-time-series data, we’ve designed a model architecture that can process input from both sources.

The time series data passes through a long short-term memory (LSTM) layer that converts the time series input into a one-dimensional vector representation, such as 16×1. The non-time-series, or dense, data passes through a dense layer (i.e., a fully connected layer). The two vectors are then concatenated, to fully represent the network condition in the past, and passed through a fully connected layer again. The final output from the neural network model is the predicted output of the target/task, as shown in Figure 4.
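A minimal PyTorch sketch of this combined architecture follows. The feature counts, hidden size (16, matching the 16×1 vector mentioned above), and class count are illustrative assumptions; Meta's actual model details are not published here.

```python
import torch
import torch.nn as nn


class NetworkCharacterizer(nn.Module):
    """Sketch of the combined architecture: time series features go through
    an LSTM, dense features go through a fully connected layer, and the two
    embeddings are concatenated and classified by a final dense layer."""

    def __init__(self, ts_features=8, dense_features=4, hidden=16, classes=2):
        super().__init__()
        self.lstm = nn.LSTM(ts_features, hidden, batch_first=True)
        self.dense = nn.Linear(dense_features, hidden)
        self.head = nn.Linear(2 * hidden, classes)

    def forward(self, ts, dense):
        # ts: (batch, time, ts_features). The LSTM's final hidden state is
        # the one-dimensional (16x1-style) summary of the time window.
        _, (h_n, _) = self.lstm(ts)
        ts_vec = h_n[-1]                        # (batch, hidden)
        dense_vec = torch.relu(self.dense(dense))
        # Concatenate both views of the network condition, then classify.
        return self.head(torch.cat([ts_vec, dense_vec], dim=1))


model = NetworkCharacterizer()
# Batch of 2 calls, 10-second windows, 8 time series + 4 dense features.
logits = model(torch.randn(2, 10, 8), torch.randn(2, 4))
```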

Figure 4: Combined-model architecture with LSTM and dense layers.

Use case: Random packet loss classification

Let’s consider the use case of categorizing packet loss as either random or congestion-caused. The former loss is due to network components, and the latter is due to limits in queue length (which are delay dependent). Here is the ML task definition:

Given the network conditions in the past N seconds (N = 10), and that the network is currently incurring packet loss, the goal is to characterize the packet loss at the current timestamp as RANDOM or not.

Figure 5 illustrates how we leverage the architecture to achieve that goal:

Figure 5: Model architecture for a random packet loss classification task.

Time series features

We leverage the following time series features gathered from logs:

Figure 6: Time series features used for model training.

BWE optimization

When the ML model detects random packet loss, we perform local optimization on the BWE module by:

  • Increasing the tolerance to random packet loss in the loss-based BWE (holding the bitrate).
  • Increasing the ramp-up speed, depending on the link capacity on high bandwidths.
  • Increasing the network resiliency by sending additional forward-error-correction packets to recover from packet loss.
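The three local optimizations above can be sketched as a small decision function. All names, thresholds, and factor values here are illustrative assumptions, not Meta's actual parameters.

```python
def apply_random_loss_actions(link_capacity_kbps, base_fec_ratio=0.05):
    """Hypothetical local BWE optimization once random (non-congestion)
    packet loss has been detected: hold the bitrate in the loss-based
    estimator, ramp up faster on high-capacity links, and send more
    forward-error-correction (FEC) packets. Thresholds are illustrative."""
    actions = {
        "hold_bitrate_on_loss": True,       # tolerate random loss, keep bitrate
        "rampup_factor": 1.0,               # default ramp-up speed
        "fec_ratio": base_fec_ratio * 2,    # more FEC to recover lost packets
    }
    if link_capacity_kbps >= 1000:          # assumed "high bandwidth" cutoff
        actions["rampup_factor"] = 1.5      # ramp up faster on capable links
    return actions
```

The key point is that these actions are safe precisely because the loss was classified as random: holding the bitrate through congestion-caused loss would instead make congestion worse.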

Network prediction

The network characterization problem discussed in the previous sections focuses on classifying network types based on past information using time series data. For these simple classification tasks, we can achieve this with hand-tuned rules, albeit with some limitations. The real power of leveraging ML for networking, however, comes from using it to predict future network conditions.

We have applied ML to solve congestion-prediction problems for optimizing low-bandwidth users’ experience.

Congestion prediction

From our analysis of production data, we found that low-bandwidth users often incur congestion due to the behavior of the GCC module. By predicting this congestion, we can improve the reliability of these users’ experience. Toward this, we addressed the following problem statement using round-trip time (RTT) and packet loss:

Given the historical time series data from production/simulation (“N” seconds), the goal is to predict packet loss due to congestion, or the congestion itself, in the next “N” seconds; that is, a spike in RTT followed by a packet loss or a further growth in RTT.

Figure 7 shows an example from a simulation where the bandwidth alternates between 500 Kbps and 100 Kbps every 30 seconds. As we lower the bandwidth, the network incurs congestion, and the ML model predictions fire the green spikes even before the delay spikes and packet loss occur. This early prediction of congestion enables faster reactions and thus improves the user experience by preventing video freezes and connection drops.

Figure 7: Simulated network scenario with alternating bandwidth for congestion prediction.

Generating training samples

The main challenge in modeling is generating training samples for a variety of congestion situations. With simulations, it’s harder to capture the different types of congestion that real user clients would encounter in production networks. As a result, we used actual production logs for labeling congestion samples, following the RTT-spikes criteria in the past and future windows according to the following assumptions:

  • Absent past RTT spikes, packet losses in the past and future are independent.
  • Absent past RTT spikes, we can’t predict future RTT spikes or fractional losses (i.e., flosses).

We split the time window into past (four seconds) and future (four seconds) for labeling.
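A labeling rule along these lines can be sketched as follows. The exact thresholds (the spike factor, one sample per second) are assumptions for illustration; only the window split and the spike-then-loss/RTT-growth criteria come from the description above.

```python
def label_congestion(rtt_ms, loss, t, past=4, future=4, spike_factor=2.0):
    """Label the sample at index t (one sample per second, illustrative) as a
    positive congestion example: an RTT spike in the past window followed by
    packet loss or a further RTT rise in the future window. The spike
    threshold (2x the window's minimum RTT) is a hypothetical choice."""
    past_rtt = rtt_ms[t - past:t]
    future_rtt = rtt_ms[t:t + future]
    future_loss = loss[t:t + future]

    spike_in_past = max(past_rtt) >= spike_factor * min(past_rtt)
    loss_ahead = any(future_loss)
    rtt_still_rising = max(future_rtt) > max(past_rtt)
    # Per the assumptions above, without a past RTT spike the future losses
    # are treated as unpredictable, so only spike-preceded samples are positive.
    return spike_in_past and (loss_ahead or rtt_still_rising)
```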

Figure 8: Labeling criteria for congestion prediction.

Model performance

Unlike network characterization, where ground truth is unavailable, here we can obtain ground truth by examining the future time window after it has passed and then comparing it with the prediction made four seconds earlier. With this logging information gathered from real production clients, we compared the performance of offline training to online data from user clients:
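The comparison step reduces to scoring each prediction against the label observed once its four-second future window has elapsed. The sketch below computes precision and recall for such delayed ground truth; the function name and metric choice are illustrative, not necessarily the metrics Meta reports.

```python
def evaluate_predictions(preds, labels):
    """Score congestion predictions against ground truth observed after the
    future window has passed, returning (precision, recall)."""
    tp = sum(p and y for p, y in zip(preds, labels))          # correct alarms
    fp = sum(p and not y for p, y in zip(preds, labels))      # false alarms
    fn = sum((not p) and y for p, y in zip(preds, labels))    # missed congestion
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Running the same scoring on offline training data and on online client logs makes the two numbers directly comparable, which is what Figure 9 summarizes.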

Figure 9: Offline versus online model performance comparison.

Experiment results

Here are some highlights from our deployment of various ML models to improve bandwidth estimation:

Reliability wins for congestion prediction

✅ connection_drop_rate -0.326371 +/- 0.216084
✅ last_minute_quality_regression_v1 -0.421602 +/- 0.206063
✅ last_minute_quality_regression_v2 -0.371398 +/- 0.196064
✅ bad_experience_percentage -0.230152 +/- 0.148308
✅ transport_not_ready_pct -0.437294 +/- 0.400812

✅ peer_video_freeze_percentage -0.749419 +/- 0.180661
✅ peer_video_freeze_percentage_above_500ms -0.438967 +/- 0.212394

Quality and user engagement wins for random packet loss characterization in high bandwidth

✅ peer_video_freeze_percentage -0.379246 +/- 0.124718
✅ peer_video_freeze_percentage_above_500ms -0.541780 +/- 0.141212
✅ peer_neteq_plc_cng_perc -0.242295 +/- 0.137200

✅ total_talk_time 0.154204 +/- 0.148788

Reliability and quality wins for cellular low bandwidth classification

✅ connection_drop_rate -0.195908 +/- 0.127956
✅ last_minute_quality_regression_v1 -0.198618 +/- 0.124958
✅ last_minute_quality_regression_v2 -0.188115 +/- 0.138033

✅ peer_neteq_plc_cng_perc -0.359957 +/- 0.191557
✅ peer_video_freeze_percentage -0.653212 +/- 0.142822

Reliability and quality wins for cellular high bandwidth classification

✅ avg_sender_video_encode_fps 0.152003 +/- 0.046807
✅ avg_sender_video_qp -0.228167 +/- 0.041793
✅ avg_video_quality_score 0.296694 +/- 0.043079
✅ avg_video_sent_bitrate 0.430266 +/- 0.092045

Future plans for applying ML to RTC

From our project execution and experimentation on production clients, we noticed that an ML-based approach is more efficient for targeting, end-to-end monitoring, and updating than traditional hand-tuned rules for networking. However, the efficiency of ML solutions largely depends on data quality and labeling (using simulations or production logs). By applying ML-based solutions to network prediction problems, congestion in particular, we fully leveraged the power of ML.

In the future, we will be consolidating all of the network characterization models into a single model using a multi-task approach, to fix the inefficiency resulting from redundancy in model download, inference, and so on. We will be building a shared representation model for the time series to solve different tasks (e.g., bandwidth classification, packet loss classification, etc.) in network characterization. We will focus on building realistic production network conditions for model training and validation. This will enable us to use ML to identify optimal network actions given the network conditions. We will continue refining our learning-based methods to enhance network performance by considering existing network signals.