User Action Sequence Modeling for Pinterest Ads Engagement Modeling | by Pinterest Engineering | Pinterest Engineering Blog | Mar, 2024


Yulin Lei | Senior Machine Learning Engineer; Kaili Zhang | Staff Machine Learning Engineer; Sharare Zahtabian | Machine Learning Engineer II; Randy Carlson | Machine Learning Engineer I; Qifei Shen | Senior Staff Machine Learning Engineer

Pinterest strives to deliver high-quality ads and maintain a positive user experience. The platform aims to show ads that align with the user's interests and intentions, while also providing them with inspiration and discovery. The Ads Engagement Modeling team at Pinterest plays a crucial role in delivering effective advertising campaigns and helping businesses reach their target audience in a meaningful way. The goal of engagement modeling is to show users the most relevant and engaging ads based on their interests and preferences. To deliver a personalized and enjoyable ad experience for its users, the Engagement Modeling team built deep neural network (DNN) models to continuously learn and adapt to user feedback and behavior, ensuring that the ads shown are highly targeted and useful to the user.

Personalized recommendation is key in the ads recommendation system because it can better capture users' interests, connect users with compelling products, and keep them engaged with the platform. To make ads click-through rate (CTR) predictions more personalized, our team has adopted users' real-time behavior histories and applied deep learning algorithms to recommend appropriate ads to users.

In this blog post, we will primarily discuss how we adopted the user sequence features and the follow-up optimizations:

  • Designed the sequence features
  • Leveraged Transformers for sequence modeling
  • Improved serving efficiency with half-precision inference

We will also share how we improved model stability with Resilient Batch Norm.

To help the engagement models learn users' feedback and interests, we developed user sequence features, which include users' real-time and historical engagement events and the related information. We defined sequence features from two main aspects: feature types and feature attributes.

Feature Types: Usually users interact with organic content or promoted Pins, both of which indicate users' intent and interest. Organic Pins reflect users' general interests, while promoted Pins reflect users' interest in sales, products, etc. So we created two user sequence features: one with all engaged Pins, and one with ads only. It turned out that both sequence features yielded sizable gains in offline model performance. We also developed user search sequence features, which are also very informative and helpful, especially for search ads.

Feature Attributes: Besides which sequence features to build, it is also important to decide what to include in the sequence. A sequence of user activity is a popular design choice, and our user sequence is essentially a sequence of user-engaged event representations including timestamps, item representation, ID features, and taxonomy features. At Pinterest, a pre-trained embedding (GraphSage) is commonly used for item representation in many models. We also use it as the item representation in our sequence features.

Once we have the user sequences, in order to develop effective sequence modeling methods, we explore a range of architectures.

Transformer [1]: One widely used approach is the Transformer, which serves as our baseline. We start with a single-layer, single-head Transformer and include position embeddings based on the time delta for each event in the sequence. We find that increasing the number of layers results in improved performance, while increasing the number of heads does not provide additional gains.
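Such a baseline can be sketched in PyTorch as below. The dimensions, bucket boundaries, and module names are illustrative, not production values; the sketch assumes time deltas are bucketized on a log scale before being embedded as positions.

```python
import torch
import torch.nn as nn

class SequenceTransformer(nn.Module):
    """Single-layer, single-head Transformer over a user event sequence,
    with position embeddings derived from each event's time delta."""

    def __init__(self, dim=64, num_buckets=16):
        super().__init__()
        # Log-scale bucket boundaries (seconds) for the time-since-event delta.
        self.register_buffer("boundaries",
                             torch.logspace(0, 7, steps=num_buckets - 1))
        self.time_emb = nn.Embedding(num_buckets, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=1,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, events, time_deltas):
        # events: (batch, seq_len, dim); time_deltas: (batch, seq_len) seconds.
        pos = self.time_emb(torch.bucketize(time_deltas, self.boundaries))
        return self.encoder(events + pos)

events = torch.randn(2, 10, 64)                        # toy event embeddings
deltas = torch.randint(0, 10_000_000, (2, 10)).float()
out = SequenceTransformer()(events, deltas)            # (2, 10, 64)
```

Stacking more layers simply means raising `num_layers`, which is the knob the experiments above varied.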

Figure 1: Transformer Architecture

Feature Connection: We also experiment with different methods for connecting features within each event, such as concatenation and sum. Both approaches prove effective in certain scenarios. The advantage of the sum connection is that it allows us to control the dimensionality of each event, making the computation of self-attention in the Transformer faster when using a small fixed dimension.
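A minimal illustration of the tradeoff, with hypothetical feature names and sizes: concatenation grows the event dimension with every added feature, while the sum connection projects everything to one small fixed dimension first.

```python
import torch
import torch.nn as nn

# Per-event features (names and sizes are illustrative, not production values).
graphsage = torch.randn(2, 10, 256)   # pre-trained item representation
taxonomy  = torch.randn(2, 10, 32)
id_feat   = torch.randn(2, 10, 64)

# Concatenation: event dimension grows with each added feature.
concat_event = torch.cat([graphsage, taxonomy, id_feat], dim=-1)  # dim 352

# Sum: project each feature to a small fixed dim and add, so the cost of
# self-attention stays constant no matter how many features are included.
d = 64
proj_g, proj_t, proj_i = nn.Linear(256, d), nn.Linear(32, d), nn.Linear(64, d)
sum_event = proj_g(graphsage) + proj_t(taxonomy) + proj_i(id_feat)  # dim 64
```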

Additional Feature Interaction: A standard practice when using a Transformer to model a user sequence is to first embed the entire sequence into a vector, then use this vector to interact with other features. However, early-stage feature interaction is important for ranking models. Thus, we introduce additional feature interactions between the entire sequence and the user- and pin-side representations. We calculate the cosine similarity between the additional features and each event and use the results as attributes of the events. We also incorporate the user- and pin-side representations directly into the self-attention calculations.
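The cosine-similarity part can be sketched as follows, with illustrative shapes; the candidate pin-side representation stands in for any of the additional features being interacted with the sequence.

```python
import torch
import torch.nn.functional as F

events = torch.randn(2, 10, 64)      # (batch, seq_len, dim) event embeddings
candidate_pin = torch.randn(2, 64)   # pin-side representation, (batch, dim)

# Cosine similarity of the candidate with every event in the sequence,
# appended as one extra per-event attribute before self-attention runs.
sim = F.cosine_similarity(events, candidate_pin.unsqueeze(1), dim=-1)  # (2, 10)
events_aug = torch.cat([events, sim.unsqueeze(-1)], dim=-1)            # (2, 10, 65)
```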

Sum Pooling: In terms of pooling methods, we experiment with sum pooling, which is traditionally used in user sequence modeling due to its efficiency. We also develop a new approach called interval sum pooling, where we divide the sequence into several intervals and apply sum pooling to each interval. The results are then concatenated to generate the final representation of the sequence. In some scenarios, interval sum pooling outperforms the Transformer baseline.
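Interval sum pooling can be sketched in a few lines. The interval count and shapes are illustrative, and the sketch assumes the sequence length divides evenly into the intervals.

```python
import torch

def interval_sum_pooling(seq, num_intervals):
    """Split the sequence into equal intervals, sum-pool each interval,
    and concatenate the pooled vectors into one sequence representation."""
    batch, seq_len, dim = seq.shape
    intervals = seq.view(batch, num_intervals, seq_len // num_intervals, dim)
    pooled = intervals.sum(dim=2)              # (batch, num_intervals, dim)
    return pooled.reshape(batch, num_intervals * dim)

seq = torch.randn(2, 12, 8)
out = interval_sum_pooling(seq, num_intervals=3)   # (2, 24)
```

With `num_intervals=1` this reduces to plain sum pooling, which is why it keeps the same efficiency profile.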

Figure 2: Sum Pooling

Deep Interest Network (DIN) [2]: Although we also explore DIN, a popular architecture introduced in 2018, we find that it does not surpass the performance of the previously mentioned models.

Long-Short Interest: Recognizing that users' long-term and short-term interests may differ, we model both aspects separately. The full sequence represents long-term interests, while the most recent eight events are considered short-term interests. For the short-term sequences, we apply a lightweight attention mechanism similar to DIN. This allows us to capture users' latest interest changes while still considering their longer-term patterns.
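A lightweight, DIN-style attention over the last eight events might look like the sketch below. Dot-product scoring is our simplification (DIN itself scores each event against the candidate with a small MLP), and all shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def short_term_attention(recent_events, candidate):
    # recent_events: (batch, k, dim); candidate: (batch, dim).
    # Score each recent event by its relevance to the candidate pin...
    scores = torch.einsum("bkd,bd->bk", recent_events, candidate)
    weights = F.softmax(scores, dim=-1)
    # ...and return the relevance-weighted sum as the short-term interest.
    return torch.einsum("bk,bkd->bd", weights, recent_events)

full_seq = torch.randn(2, 100, 64)    # long-term: the full sequence
candidate = torch.randn(2, 64)        # candidate pin representation
short_interest = short_term_attention(full_seq[:, -8:], candidate)  # (2, 64)
```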

Figure 3: Long-Short Interest Module

Overall, by combining different architectures in various online production models, we achieve significant performance improvements in all scenarios.

The new architecture has more modules and larger layers, making it more expensive to serve. While there are many opportunities for optimization, one of the most notable is mixed precision inference.

The GPUs we use for serving have tensor cores. Tensor cores are specialized in one thing: fused matrix multiply and add, but only with certain data types. Our current models use the PyTorch default float32 data type, but tensor cores do not operate on it. To get an inference speedup, we need to use a lower-precision data type, of which PyTorch offers two easy options: float16 and bfloat16. Both of these data types use 16 bits instead of 32 to represent a number, but they make different tradeoffs between range and precision. Float16 has a balanced reduction in both range and precision, while bfloat16 has nearly the same range as float32 but much-reduced precision. We wanted to find which of these data types performs better in our model and make sure that it is stable.

Because both 16-bit types have lower precision, we want to keep as much of our model as possible in float32 so as not to risk prediction quality, while still getting good reductions in inference time. We found that most of the largest layers had room for improvement, while many of the smaller layers did not affect inference time enough to make a difference.

For these larger layers, we tried both data types. The main pitfall of float16 is that, due to the reduced range, it is easy for the model to overflow to "infinity." We found that one of our main layers, the DCNv2 cross layer, was often overflowing during training with float16. This can be mitigated by tuning some hyperparameters (e.g. weight decay), but a slight risk would still remain, and a failure mode of "complete failure, no score predicted" is not ideal.

The main pitfall of bfloat16 is that, due to the reduced precision, the model may make marginally worse predictions. Empirically, we found that our model handles this just fine; there was no reduction in model accuracy. There is also the benefit of a better failure mode: "degraded prediction" is preferable to "no prediction." Based on our results, we selected bfloat16 for the large layers of our model.
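The selective cast can be sketched in PyTorch as below. The layer sizes are illustrative: only the large layer is moved to bfloat16, while the small output head stays in float32, mirroring the keep-most-of-the-model-in-float32 approach described above.

```python
import torch
import torch.nn as nn

# Toy model: one large layer worth casting, one small float32 output head.
model = nn.Sequential(
    nn.Linear(1024, 4096),   # large layer: dominates inference time
    nn.ReLU(),
    nn.Linear(4096, 1),      # small head: negligible cost, keep float32
)
model[0] = model[0].to(torch.bfloat16)   # cast only the large layer

x = torch.randn(2, 1024)
with torch.no_grad():
    h = model[1](model[0](x.to(torch.bfloat16)))  # bfloat16 compute
    out = model[2](h.float())                     # back to float32 for the head
```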

Finally, there was the benchmarking. In offline testing, we found a 30% reduction in model inference time, with the same prediction accuracy. This inference time reduction translated well into production, and we got a significant reduction in infrastructure costs for our models.

Improving the stability and training speed of deep learning models is an important task. To tackle this challenge, Batch Normalization (Batch Norm) has become a popular normalization method used by many practitioners. At Pinterest, we leverage Batch Norm together with other normalization methods like min-max clip, log norm, and layer norm to effectively normalize our input data. However, we have encountered cases where Batch Norm itself can introduce model instability.

Let's take a closer look at the formula for Batch Norm and its underlying process during the forward pass.

Batch Norm has two learnable parameters, namely beta and gamma, together with two non-learnable parameters, the mean moving average and the variance moving average. Here's how the Batch Norm layer operates:

  1. Calculate Mean and Variance: For each activation vector, compute the mean and variance of all the values in the mini-batch.
  2. Normalize: Using the corresponding mean and variance, calculate the normalized value for each activation feature vector.
  3. Scale and Shift: Multiply the normalized values by a factor, gamma, and add a factor, beta, to them.
  4. Moving Average: Maintain an exponential moving average of the mean and variance.
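The four steps above can be written out explicitly. This is a teaching sketch; `torch.nn.BatchNorm1d` is the real implementation.

```python
import torch

def batch_norm_forward(x, gamma, beta, running_mean, running_var,
                       momentum=0.1, eps=1e-5):
    # 1. Per-feature mean and variance over the mini-batch.
    mean = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    # 2. Normalize each activation with the batch statistics.
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # 3. Scale by gamma and shift by beta (the learnable parameters).
    y = gamma * x_hat + beta
    # 4. Exponential moving average of mean and variance (non-learnable).
    running_mean.mul_(1 - momentum).add_(momentum * mean)
    running_var.mul_(1 - momentum).add_(momentum * var)
    return y

torch.manual_seed(0)
x = torch.randn(32, 4)
gamma, beta = torch.ones(4), torch.zeros(4)
running_mean, running_var = torch.zeros(4), torch.ones(4)
y = batch_norm_forward(x, gamma, beta, running_mean, running_var)
```

Note the division in step 2: if `var` for some column is zero (epsilon aside), the normalized values for that column blow up, which is exactly the instability discussed next.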

However, a problem arises when the variance in step 2 becomes extremely small or even zero. In such scenarios, the normalized value, y, becomes abnormally large, leading to a value explosion within the model. Several common causes of this extremely small variance include stale or delayed feature values, feature absence, and distribution shifts with low coverage. To handle these issues, we typically fill zeroes or use default values in the affected scenarios. Consequently, the variance computed in step 1 becomes zero. While increasing the mini-batch size and shuffling at the row level can help mitigate this problem, they do not fully solve it. To overcome the instability caused by Batch Norm, we at Pinterest have developed a solution called Resilient Batch Norm.

Resilient Batch Norm introduces two crucial hyperparameters: minimal_variance and variance_shift_threshold. The forward pass in Resilient Batch Norm follows these steps:

  1. Calculate Mean and Variance for the mini-batch.
  2. Update the Moving Average, with special conditions:
     • If a variance is smaller than the minimal_variance hyperparameter, mask out that column from the running variance update.
     • If a variance's change ratio exceeds the variance_shift_threshold, mask out that column from the running variance update.
     • Proceed to update the remaining running variance and running mean.
  3. Normalize using the running variance and running mean.
  4. Scale and Shift.
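The masked moving-average update can be sketched as below. How the "change ratio" is measured is our interpretation (here, the ratio of batch variance to running variance falling outside a symmetric band), and the default hyperparameter values are illustrative, not Pinterest's production settings.

```python
import torch

def resilient_bn_update(running_mean, running_var, batch_mean, batch_var,
                        minimal_variance=1e-4, variance_shift_threshold=2.0,
                        momentum=0.1):
    """Masked running-statistics update: columns whose batch variance is
    near zero, or whose variance shifted too sharply versus the running
    value, are excluded from this step's update."""
    too_small = batch_var < minimal_variance
    ratio = batch_var / (running_var + 1e-12)
    too_shifted = (ratio > variance_shift_threshold) | \
                  (ratio < 1.0 / variance_shift_threshold)
    mask = ~(too_small | too_shifted)          # columns safe to update
    running_mean[mask] += momentum * (batch_mean[mask] - running_mean[mask])
    running_var[mask] += momentum * (batch_var[mask] - running_var[mask])
    return running_mean, running_var

running_mean, running_var = torch.zeros(3), torch.ones(3)
batch_mean = torch.ones(3)
batch_var = torch.tensor([2.0, 0.0, 100.0])   # col 1: zero variance; col 2: big shift
resilient_bn_update(running_mean, running_var, batch_mean, batch_var)
```

In this toy batch, only column 0 passes both checks and gets updated; the zero-variance and sharply shifted columns keep their previous running variance, so the later normalization step never divides by a collapsed statistic.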

After conducting extensive experiments, we observed no decrease in performance or training speed. By seamlessly replacing Batch Norm with Resilient Batch Norm, our models gain the ability to handle the aforementioned feature problems and similar situations while achieving enhanced stability.

In conclusion, when faced with instability issues due to Batch Norm, adopting Resilient Batch Norm can provide a robust solution and improve the overall efficacy of the models.

In this section, we show some offline and online results for the user action sequence model on different view types (HomeFeed, RelatedPins, Search) and overall. The baseline model is our production model with the DCNv2 [3] architecture and internal training data. It should be noted that a 0.1% offline accuracy improvement in the engagement ranking model is considered significant. Thus, the user action sequence features and modeling techniques improve both online and offline metrics very significantly.

By leveraging real-time user sequence features and employing various modeling techniques such as Transformers, feature interaction, feature connections, and pooling, the engagement model at Pinterest has been able to effectively adapt to users' behavior and feedback, resulting in more personalized and relevant recommendations. The recognition of users' long-term and short-term interests has been instrumental in achieving this objective. To account for both aspects, the comprehensive sequence is utilized to represent long-term interests, while the most recent eight events are employed to capture short-term interests. This approach has significantly improved the model's prediction performance; however, it has come at a considerable cost in terms of the added features and complexity of the models.

To mitigate the impact on serving efficiency and infrastructure costs, we have explored and implemented mixed precision inference techniques, utilizing lower precision (float16, bfloat16). This has effectively improved our serving efficiency while also reducing infrastructure costs. Additionally, we have addressed the challenge of making the model resilient to real-time changes, as we recognized the critical importance of these real-time sequence features. By incorporating a more resilient batch normalization technique, we are able to prevent abnormal value explosions caused by sudden changes in feature coverage or distribution shift.

As a result of these endeavors, Pinterest continues to deliver highly engaging, adaptive, and relevant recommendations that inspire and drive discovery for each unique user.

This work is the result of a collaboration among the conversion modeling team members and across multiple teams at Pinterest.

Engineering Teams:

Ads Ranking: Van Wang, Ke Zeng, Han Sun, Meng Qi

Advanced Technology Group: Yi-Ping Hsu, Pong Eksombatchai, Xiangyi Chen

Ads ML Infra: Shantam Shorewala, Kartik Kapur, Matthew Jin, Yiran Zhao, Dongyong Wang

User Sequence Support: Zefan Fu, Kimmie Hua

Indexing Infra: Kangnan Li, Dumitru Daniliuc

Leadership: Ling Leng, Dongtao Liu, Liangzhe Chen, Haoyang Li, Joey Wang, Shun-ping Chiu, Shu Zhang, Jiajing Xu, Xiaofang Chen, Yang Tang, Behnam Rezaei, Caijie Zhang

[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems 30 (2017).

[2] Zhou, Guorui, et al. "Deep interest network for click-through rate prediction." Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2018.

[3] Wang, Ruoxi, et al. "DCN V2: Improved deep & cross network and practical lessons for web-scale learning to rank systems." Proceedings of the Web Conference 2021. 2021.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.