Recommending for Long-Term Member Satisfaction at Netflix

By Jiangwei Pan, Gary Tang, Henry Wang, and Justin Basilico

Our mission at Netflix is to entertain the world. Our personalization algorithms play a pivotal role in delivering on this mission for all members by recommending the right shows, movies, and games at the right time. This goal extends beyond immediate engagement: we aim to create an experience that brings lasting enjoyment to our members. Traditional recommender systems often optimize for short-term metrics like clicks or engagement, which may not fully capture long-term satisfaction. We strive to recommend content that not only engages members in the moment but also enhances their long-term satisfaction, which increases the value they get from Netflix and makes them more likely to remain members.

One simple way we can view recommendations is as a contextual bandit problem. When a member visits, that becomes a context for our system, which selects an action of what recommendations to show; the member then provides various types of feedback. These feedback signals can be immediate (skips, plays, thumbs up/down, or adding items to their list) or delayed (completing a show or renewing their subscription). We can define reward functions to reflect the quality of the recommendations from these feedback signals and then train a contextual bandit policy on historical data to maximize the expected reward.
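To make this framing concrete, here is a minimal sketch of how logged interactions might be turned into training data for such a policy. All names and the placeholder reward below are illustrative assumptions on our part, not Netflix's actual system.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Feedback:
    played: bool = False       # immediate signal
    completed: bool = False    # delayed signal
    thumb: int = 0             # +1 thumbs-up, -1 thumbs-down, 0 no rating
    renewed: bool = False      # delayed signal

@dataclass
class LoggedInteraction:
    context: Dict[str, float]  # member/context features at recommendation time
    action: str                # the recommended item
    feedback: Feedback         # feedback signals observed so far

def reward(fb: Feedback) -> float:
    """Placeholder reward over feedback signals; refining it is the subject of this post."""
    return float(fb.played) + float(fb.completed) + max(fb.thumb, 0)

def to_training_data(log: List[LoggedInteraction]) -> List[Tuple[Dict[str, float], str, float]]:
    """(context, action, reward) tuples for off-policy training of the bandit policy."""
    return [(x.context, x.action, reward(x.feedback)) for x in log]
```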

There are many ways a recommendation model can be improved: more informative input features, more data, different architectures, more parameters, and so on. In this post, we focus on a less-discussed aspect, improving the recommender objective by defining a reward function that better reflects long-term member satisfaction.

Member retention might seem like an obvious reward for optimizing long-term satisfaction, because members should stay if they are satisfied. However, it has several drawbacks:

  • Noisy: Retention can be influenced by numerous external factors, such as seasonal trends, marketing campaigns, or personal circumstances unrelated to the service.
  • Low Sensitivity: Retention is only sensitive for members on the verge of canceling their subscription, so it does not capture the full spectrum of member satisfaction.
  • Hard to Attribute: Members might cancel only after a sequence of bad recommendations.
  • Slow to Measure: We only get one signal per account per month.

Due to these challenges, optimizing for retention alone is impractical.

Instead, we can train our bandit policy to optimize a proxy reward function that is highly aligned with long-term member satisfaction while being sensitive to individual recommendations. The proxy reward r(user, item) is a function of the user's interaction with the recommended item. For example, if we recommend “One Piece” and a member plays it, subsequently completes it, and gives it a thumbs-up, a simple proxy reward might be defined as r(user, item) = f(play, complete, thumb).
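As a toy illustration of such an f, the sketch below combines the three signals with hand-picked weights. The weights are our assumption and would in practice be tuned through the reward engineering process described later.

```python
def proxy_reward(play: bool, complete: bool, thumb: int) -> float:
    """Toy proxy reward r(user, item) = f(play, complete, thumb); weights are illustrative only."""
    r = 0.0
    if play:
        r += 1.0      # the member played the recommendation
    if complete:
        r += 2.0      # completion is a stronger satisfaction signal
    if thumb > 0:
        r += 1.5      # explicit thumbs-up
    elif thumb < 0:
        r -= 2.0      # explicit thumbs-down
    return r

# The "One Piece" example above: play -> complete -> thumbs-up
assert proxy_reward(play=True, complete=True, thumb=1) == 4.5
```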

Click-through rate (CTR)

Click-through rate (CTR), or in our case play-through rate, can be viewed as a simple proxy reward where r(user, item) = 1 if the user clicks a recommendation and 0 otherwise. CTR is a common feedback signal that often reflects user preference, and it is a simple yet strong baseline for many recommendation applications. In some cases, such as ads personalization where the click is the target action, CTR may even be a reasonable reward for production models. In general, however, over-optimizing CTR can lead to promoting clickbaity items, which can harm long-term satisfaction.

Beyond CTR

To align the proxy reward function more closely with long-term satisfaction, we need to look beyond simple interactions, consider all types of user actions, and understand their true implications for user satisfaction.

We give a few examples in the Netflix context (a toy sketch after this list shows one way such signals might be weighted):

  • Fast season completion ✅: Completing a season of a recommended TV show in one day is a strong sign of enjoyment and long-term satisfaction.
  • Thumbs-down after completion ❌: Completing a TV show over several weeks followed by a thumbs-down indicates low satisfaction despite the significant time spent.
  • Playing a movie for just 10 minutes ❓: In this case, the user's satisfaction is ambiguous. The brief engagement might indicate that the user decided to abandon the movie, or it could simply mean the user was interrupted and plans to finish the movie later, perhaps the next day.
  • Discovering new genres ✅ ✅: Watching more Korean or game shows after “Squid Game” suggests the user is discovering something new. This discovery is likely even more valuable, since it leads to a variety of engagements in a new area for a member.
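The sketch below folds these qualitative signals into a richer proxy reward. The signal names and weights are entirely illustrative assumptions, meant only to show how such considerations could be encoded.

```python
def richer_proxy_reward(
    completed: bool,
    days_to_complete: float,        # how quickly the show was finished, if completed
    thumb: int,                     # +1 thumbs-up, -1 thumbs-down, 0 no rating
    minutes_played: float,
    is_new_genre_for_member: bool,
) -> float:
    """Illustrative reward reflecting the examples above; all weights are made up."""
    r = 0.0
    if completed:
        r += 2.0
        if days_to_complete <= 1:
            r += 1.0                # fast season completion: strong enjoyment signal
    if thumb > 0:
        r += 1.5
    elif thumb < 0:
        r -= 3.0                    # a thumbs-down outweighs the time spent watching
    if not completed and minutes_played < 15:
        pass                        # ambiguous: could be abandonment or an interruption; stay neutral
    if completed and is_new_genre_for_member:
        r += 1.0                    # discovery bonus for opening up a new area
    return r
```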

Reward engineering is the iterative process of refining the proxy reward function to align it with long-term member satisfaction. It is similar to feature engineering, except that the reward can be derived from data that isn't available at serving time. Reward engineering involves four stages: hypothesis formation, defining a new proxy reward, training a new bandit policy, and A/B testing. Below is a simple example.

User feedback used in the proxy reward function is often delayed or missing. For example, a member may play a recommended show for only a few minutes on the first day and take several weeks to fully complete it, so the completion feedback is delayed. Additionally, some user feedback may never arrive: however much we might wish otherwise, not all members provide a thumbs-up or thumbs-down after completing a show, leaving us uncertain about their level of enjoyment.

We could wait longer to observe feedback, but how long should we wait for delayed feedback before computing the proxy rewards? If we wait too long (e.g., weeks), we miss the opportunity to update the bandit policy with the latest data. In a highly dynamic environment like Netflix, a stale bandit policy can degrade the member experience and is particularly bad at recommending newer items.

Solution: predict missing feedback

We aim to update the bandit policy shortly after making a recommendation while also defining the proxy reward function based on all user feedback, including delayed feedback. Since delayed feedback has not been observed at the time of policy training, we can predict it. The prediction is made for each training example with delayed feedback, using the already observed feedback and other relevant information up to the training time as input features. As a result, the prediction also gets better as time progresses.

The proxy reward is then calculated for each training example using both observed and predicted feedback, and these training examples are used to update the bandit policy.
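A rough sketch of this idea follows. The model interface and the specific formula are assumptions for illustration: a missing thumbs rating is replaced by its predicted expectation before the proxy reward is computed.

```python
from typing import Dict, Optional

class DelayedFeedbackModel:
    """Stand-in for a model that predicts p(final feedback | observed feedback so far)."""
    def predict_thumbs_up_prob(self, observed: Dict[str, float]) -> float:
        # In practice this could be a neural network over short-term play patterns;
        # here it is a hard-coded toy estimate.
        return 0.3 + 0.5 * observed.get("fraction_watched", 0.0)

def reward_with_imputation(observed: Dict[str, float],
                           thumb: Optional[int],
                           model: DelayedFeedbackModel) -> float:
    """Proxy reward computed from observed feedback plus predicted delayed feedback."""
    if thumb is None:      # rating not (yet) observed: impute its expected contribution
        thumb_term = 1.5 * model.predict_thumbs_up_prob(observed)
    else:                  # rating observed: use it directly
        thumb_term = 1.5 if thumb > 0 else (-2.0 if thumb < 0 else 0.0)
    return observed.get("played", 0.0) + 2.0 * observed.get("fraction_watched", 0.0) + thumb_term
```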

But aren’t we still only relying on observed feedback in the proxy reward function? Yes, because delayed feedback is predicted from observed feedback. However, it is simpler to reason about rewards in terms of all feedback directly. For instance, the delayed thumbs-up prediction model may be a complex neural network that takes into account all observed feedback (e.g., short-term play patterns), and it is more straightforward to define the proxy reward as a simple function of the thumbs-up feedback than as a complex function of short-term interaction patterns. The prediction can also be used to adjust for potential biases in how feedback is provided.

The reward engineering diagram is updated with an optional delayed feedback prediction step.

Two types of ML models

It’s worth noting that this approach employs two types of ML models (a rough code sketch follows the list):

  • Delayed Feedback Prediction Models: These models predict p(final feedback | observed feedback). Their predictions are used to define and compute the proxy rewards for bandit policy training examples. As a result, these models are used offline during bandit policy training.
  • Bandit Policy Models: These models are used in the bandit policy π(item | user; r) to generate recommendations online and in real time.
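In code terms, the split might look roughly like the two interfaces below; these names and signatures are hypothetical, intended only to highlight where each model runs.

```python
from typing import Dict, List, Protocol

class FeedbackPredictionModel(Protocol):
    """Offline: predicts p(final feedback | observed feedback) to label bandit training examples."""
    def predict(self, observed_feedback: Dict[str, float]) -> Dict[str, float]:
        ...

class PolicyModel(Protocol):
    """Online: picks an item for a member context; trained offline to maximize the proxy reward."""
    def select(self, context: Dict[str, float], candidates: List[str]) -> str:
        ...
```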

Improved input features or neural network architectures often lead to better offline model metrics (e.g., AUC for classification models). However, when these improved models are subjected to A/B testing, we often observe flat or even negative online metrics, which are what quantify long-term member satisfaction.

This online-offline metric disparity usually occurs when the proxy reward used by the recommendation policy is not fully aligned with long-term member satisfaction. In such cases, a model may achieve higher proxy rewards (offline metrics) yet result in worse long-term member satisfaction (online metrics).

Nevertheless, the model improvement is genuine. One approach to resolving this is to further refine the proxy reward definition so that it aligns better with the improved model. When this tuning results in positive online metrics, the model improvement can be effectively productized. See [1] for more discussion of this challenge.

In this post, we presented an overview of our reward engineering efforts to align Netflix recommendations with long-term member satisfaction. While retention remains our north star, it is not easy to optimize directly, so our efforts focus on defining a proxy reward that is aligned with long-term satisfaction and sensitive to individual recommendations. Finally, we discussed the unique challenge of delayed user feedback at Netflix and proposed an approach that has proven effective for us. Refer to [2] for an earlier overview of the reward innovation efforts at Netflix.

As we continue to improve our recommendations, several open questions remain:

  • Can we learn a proxy reward function automatically by correlating behavior with retention?
  • How long should we wait for delayed feedback before using its predicted value in policy training?
  • How can we leverage Reinforcement Learning to further align the policy with long-term satisfaction?

[1] Deep learning for recommender systems: A Netflix case study. AI Magazine 2021. Harald Steck, Linas Baltrunas, Ehtsham Elahi, Dawen Liang, Yves Raimond, Justin Basilico.

[2] Reward innovation for long-term member satisfaction. RecSys 2023. Gary Tang, Jiangwei Pan, Henry Wang, Justin Basilico.