Handling Online-Offline Discrepancy in Pinterest Ads Ranking System | by Pinterest Engineering | Pinterest Engineering Blog


Jan 18, 2024

Cathy Qian | Senior Machine Learning Engineer, Ads Ranking Conversion Modeling; Aayush Mudgal | Staff Machine Learning Engineer, Ads Ranking Conversion Modeling; Yinrui Li | Machine Learning II, Ads Ranking Conversion Modeling; Jinfeng Zhuang | Senior Staff Machine Learning Engineer, Ads Ranking Conversion Modeling; Shantam Shorewala | Software Engineer II, Ads ML Serving Infra; Yiran Zhao | Staff Software Engineer, Ads ML Serving Infra; Harshal Dahake | Software Engineer, Ads ML Training Infra


At Pinterest, our mission is to bring everyone the inspiration to create a life they love. People often come to Pinterest when they’re considering what to do or buy next. Understanding this evolving user journey while balancing multiple objectives is crucial to bringing the best experience to Pinterest users, and it is supported by multiple recommendation models, each providing real-time inference within a few hundred milliseconds for every request. In particular, our machine learning powered ads ranking systems try to understand users’ engagement and conversion intent and promote the right ads to the right user at the right time. Our engineers are constantly discovering new algorithms and new signals to improve the performance of our machine learning models. A typical development cycle involves offline model training to realize offline model metric gains and then online A/B experiments to quantify online metric movements. However, it’s not uncommon that offline metric gains don’t translate into online business metric wins. In this blog, we will focus on some online and offline discrepancies and development cycle learnings we’ve observed in Pinterest ads conversion models, as well as some of the key platform investments Pinterest has made to minimize such discrepancies.

During our machine learning model iteration, we usually implement and test model improvements offline and then use a candidate model to serve online traffic in A/B experiments to measure its impact on business metrics. However, we observed that offline model performance improvement doesn’t always translate directly into online metric gains. Specifically, such discrepancies unfold into the following scenarios:

  • Bug-free scenario: Our ads ranking system is working bug-free. We see both offline and online metric movements, but the correlation between these movements is not clear. This is where meticulous metric design is needed.
  • Buggy scenario: We see diminished gain or neutral results, or even online loss, for models where we see offline gains, and we suspect something is not working properly in our ads ranking system. This is where investment in ML tooling helps to narrow down the problem quickly.

In bug-free scenarios, we find that it’s challenging to translate offline model performance metric gains to online business metric movements quantitatively. Figure 1 plots the relative online business metric movement against the offline model metric movement for 15 major conversion model iterations in 2023. Ten of the 15 iterations showed consistent directional movement between area under the ROC curve (ROC-AUC) and cost-per-acquisition (CPA). Eight of the experiments showed statistically significant movements. Quantitatively, if we fit a line through all the stat-sig data points, we see a downward trend, with a larger AUC increase generally indicating a larger CPA reduction. However, given a particular AUC change, the CPA predicted from the linear fit would suffer from large variance and thus low confidence. For example, when we see a material ROC-AUC change in our offline model iteration, we’re not 100% confident that we would see a statistically significant CPA reduction in online experiments, let alone the amount of CPA reduction. Such online-offline discrepancy can impact our model development velocity.

Figure 1. Relative online business metric (cost per acquisition, aka CPA) movement versus offline model performance metric (area under the ROC curve, aka ROC-AUC) movement between treatment models and control models. Gray triangles represent non-statistically-significant data and green spheres represent statistically-significant data.

There are several hypotheses that may explain the observed online-offline discrepancy, as elaborated below:

A. The offline model evaluation metrics are not well aligned with the online business metrics.

ROC-AUC is one of the main metrics used for offline model evaluation. A higher ROC-AUC value indicates a higher probability that a true positive event is ranked higher than a true negative event. The higher the ROC-AUC, the better the binary classifier is at separating positive and negative events. However, the predicted conversion probabilities are directly used for ads ranking and bidding, while they are not directly related to ROC-AUC; i.e., we can have decent AUC but poor probability scores and vice versa.
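To make the last point concrete, here is a minimal sketch in plain Python. The labels and scores are made up for illustration, and the "squashed" model is a hypothetical one that keeps the same ordering but pushes every probability toward zero. Because ROC-AUC only depends on the ordering of scores, it stays the same, while log loss (one common way to measure probability quality) gets much worse.

```python
import math

def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def log_loss(labels, probs):
    """Mean negative log-likelihood; it penalizes poorly calibrated probabilities."""
    total = sum(math.log(p) if y == 1 else math.log(1.0 - p)
                for y, p in zip(labels, probs))
    return -total / len(labels)

# Illustrative data only (not from any Pinterest model).
labels = [0, 0, 1, 0, 1, 1]
reasonable = [0.1, 0.4, 0.3, 0.2, 0.7, 0.9]
# Hypothetical miscalibrated model: a monotonic transform keeps the ranking intact.
squashed = [p ** 4 for p in reasonable]

auc_reasonable = roc_auc(labels, reasonable)
auc_squashed = roc_auc(labels, squashed)
loss_reasonable = log_loss(labels, reasonable)
loss_squashed = log_loss(labels, squashed)
```

Both models rank the examples identically, so an AUC-only offline evaluation would not distinguish them, even though their predicted probabilities would lead to very different bids.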

On the other hand, CPA is the guardrail business metric used for online model performance evaluation, and it’s defined as:

CPA = (total revenue) / (total number of conversions)

Here, the total number of conversions Pinterest has is only an approximation. Moreover, revenue calculation involves bidding and a lot of other business logic and thus is also not straightforward. Consequently, the relationship between online CPA reduction and offline AUC gain can hardly be derived using simple mathematical formulas.

Moreover, AUC and CPA are compound metrics, and their fluctuations across various traffic segments can differ, thereby increasing the difficulty of predicting their correlation. For instance, let’s suppose we have two distinct traffic segments: A and B. If the AUC declines slightly in segment A but increases a lot in segment B, and simultaneously the CPA rises in segment A but declines in segment B at a similar scale, the combined effect could potentially result in an overall increase in AUC but a neutral or rising CPA.
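The CPA half of this example can be sketched with a few lines of arithmetic. The revenue and conversion numbers below are hypothetical and chosen only so that the per-segment movements cancel out; real segments would rarely be this symmetric.

```python
def cpa(revenue, conversions):
    """CPA as defined above: total revenue / total number of conversions."""
    return revenue / conversions

def overall_cpa(segments):
    """Aggregate CPA over all segments (sum of revenue over sum of conversions)."""
    total_revenue = sum(revenue for revenue, _ in segments.values())
    total_conversions = sum(conversions for _, conversions in segments.values())
    return cpa(total_revenue, total_conversions)

def relative_change(new, old):
    return (new - old) / old

# Hypothetical (revenue, conversions) per traffic segment.
control = {"A": (1000.0, 100), "B": (1000.0, 100)}
treatment = {"A": (1100.0, 100), "B": (900.0, 100)}

per_segment_change = {
    name: relative_change(cpa(*treatment[name]), cpa(*control[name]))
    for name in control
}
overall_change = relative_change(overall_cpa(treatment), overall_cpa(control))
```

Here CPA rises by about 10% in segment A and falls by about 10% in segment B, so the overall CPA is flat even if the AUC gain in segment B makes the aggregate offline metric look clearly better.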

B. During online experiments, the control model learns from the treatment model traffic and thus minimizes the expected online performance gain.

When we do online experiments, the control model is usually the production model that has been serving real traffic for weeks if not months. When we do offline experiments, both the control and treatment models are trained and evaluated on the traffic that is served by the control model, aka the control traffic. During online experiments, the treatment model is serving real traffic, aka the treatment traffic. An associated concern is the potential for the control model to begin learning from the treatment traffic, thereby reducing the expected online benefits from the treatment model. This poses a significant challenge and requires a precise partitioning of the training and feature pipelines to accurately capture the effect. However, at Pinterest, due to the complexity of this process and the relatively minimal benefit of such bias reduction, we currently don’t take special steps regarding this concern.

C. Complex downstream logic may dilute online gains.

At Pinterest, ensemble model scores are used to calculate the downstream utility scores for bidding purposes, while each model is optimized independently offline. As a result, the optimization dynamics between individual models and the utility scores may not fully align, causing a dilution of the overall online gains. We’re actively working on optimizing the downstream efficiency.

D. Certain unavoidable feature delays in online settings may result in diminished online performance gain.

In feature engineering iterations, we use existing historical data to backfill new features. In online serving, these features are populated on a predetermined schedule. However, our offline backfilling timeline may not be fully aligned with the online feature population timeline.

For example, 7-day aggregated conversion counts are calculated using the past seven days of conversion labels, which are fully available when we do backfilling. At serving time, assuming today’s date is dt, we would use the aggregated conversion counts between dt-7 and dt to generate this feature. However, the feature aggregation pipeline may finish at dt 3 am UTC, making the updated feature available for serving only afterwards. Serving between dt 12 am UTC and 3 am UTC would use a stale version of this feature. Such delay at an hourly granularity is unavoidable because our feature aggregation pipelines need time to get their jobs done. Sensitivity of the treatment model to such feature freshness may result in lower online performance gains than expected.
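The timing gap in this example can be sketched as follows. The 3 am UTC landing time comes from the example above; the function names and the simplification that the aggregate lands at exactly that hour every day are assumptions made for illustration.

```python
from datetime import date, datetime, timedelta, timezone

# Assumption for illustration: the daily aggregation job lands at 03:00 UTC.
PIPELINE_READY_HOUR_UTC = 3

def served_feature_window(request_time):
    """Return (start, end) dates of the 7-day window behind the feature value
    that is actually available when a request arrives (timezone-aware input)."""
    now = request_time.astimezone(timezone.utc)
    end = now.date()
    if now.hour < PIPELINE_READY_HOUR_UTC:
        # Today's aggregate has not landed yet, so serving still sees yesterday's.
        end -= timedelta(days=1)
    return end - timedelta(days=7), end

def is_stale(request_time):
    """True when serving uses an older window than an offline backfill would."""
    _, served_end = served_feature_window(request_time)
    return served_end < request_time.astimezone(timezone.utc).date()

early = datetime(2024, 1, 18, 1, 30, tzinfo=timezone.utc)  # before the job lands
late = datetime(2024, 1, 18, 4, 0, tzinfo=timezone.utc)    # after the job lands
early_window = served_feature_window(early)
late_window = served_feature_window(late)
```

An offline backfill for January 18 would always use the window ending on that date, whereas the request at 01:30 UTC is served with the window ending one day earlier.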

E. We can’t drive business metrics infinitely by optimizing machine learning models.

As our models get better over time, we expect the marginal improvement in business metrics to diminish, and at some point, such gains become too small to be detected in online experiments.

These are all hard challenges without easy mitigations. Fortunately, we’re not the only ones to observe such online-offline discrepancies in large-scale recommendation systems. Peer efforts towards this problem can be summarized into the following two directions:

  1. Leave it as it is, but iterate fast and use offline results as a health check [ref 2]
  2. Use multiple offline metrics instead of a single one and do compound analysis [ref 3]

Currently, the ads ranking team at Pinterest is actively investigating online-offline discrepancies in the bug-free scenario.

Here we’ve summarized the general ads model flow, involving both the offline model training flow and the online serving flow.

Figure 2: High-Level Overview of Online Serving and Offline Model Training Flow

When we don’t see any online gain, or even see online loss, for models where we see offline gain, the first thing we usually do is check whether something is going wrong in our ads ranking system. Practically, anything that may break will break at a certain point, in the online flow, the offline flow, or both. Here we will go over some of the common failure patterns that we have observed and some of the safety measures that we have put in place to debug and alert better.

Data, including features and labels, is critical for training machine learning models. However, various factors can corrupt this data in large-scale recommendation systems. For example: feature distributions may shift over time due to either changes in upstream pipelines or business trends, feature coverage may suddenly drop significantly due to upstream pipeline delays or failures, or labels may be delayed unexpectedly, resulting in false negatives in training data.

To detect such irregularities, we’ve incorporated Feature Stats Validation checks into our model training pipelines, as well as monitoring dashboards built with Apache Superset that capture feature coverage, freshness, and distribution over time. Given that we have thousands of features and not all features are relevant to every model, we’ve also developed feature validation processes that focus on the core features of specific models. This tailored approach reduces unnecessary noise in alerts and enables more efficient response and action.
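As a rough illustration of what a per-model validation of core features could look like, here is a minimal sketch. It is not the Feature Stats Validation implementation itself; the function names, the thresholds, and the choice of coverage and mean as the only statistics are assumptions.

```python
from statistics import mean

def feature_stats(rows, feature):
    """Coverage (share of non-missing values) and mean of one numeric feature."""
    values = [row.get(feature) for row in rows]
    present = [v for v in values if v is not None]
    coverage = len(present) / len(values) if values else 0.0
    return {"coverage": coverage, "mean": mean(present) if present else None}

def validate_features(baseline_rows, current_rows, core_features,
                      max_coverage_drop=0.05, max_relative_mean_shift=0.25):
    """Return alerts for core features whose stats moved beyond the thresholds."""
    alerts = []
    for name in core_features:
        base = feature_stats(baseline_rows, name)
        cur = feature_stats(current_rows, name)
        if base["coverage"] - cur["coverage"] > max_coverage_drop:
            alerts.append(f"{name}: coverage {base['coverage']:.2f} -> {cur['coverage']:.2f}")
        # Skip the shift check when the baseline mean is missing or zero.
        if base["mean"] and cur["mean"] is not None:
            shift = abs(cur["mean"] - base["mean"]) / abs(base["mean"])
            if shift > max_relative_mean_shift:
                alerts.append(f"{name}: mean shifted by {shift:.0%}")
    return alerts

# Hypothetical rows: the feature goes missing in half of the current data.
baseline = [{"conv_7d": 4.0}, {"conv_7d": 6.0}, {"conv_7d": 5.0}, {"conv_7d": 5.0}]
current = [{"conv_7d": 5.0}, {"conv_7d": None}, {}, {"conv_7d": 5.0}]
alerts = validate_features(baseline, current, core_features=["conv_7d"])
```

Passing only the core features of a given model keeps the alert list short, which mirrors the tailored approach described above.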

Moreover, the same feature may have different values at training and serving time due to the asynchronous logging setup. During online serving, the Ads server makes two types of requests to the ML inference server (ranking service in Figure 2). The first is a batch of requests to rank all the ad candidates using different models, and the second is a single request to log features for certain candidates such as auction winners. To fulfill both of these requests, the ML server fetches features either from the local cache (if available) or as pre-computed values from the feature stores. This could lead to a discrepancy between the feature value used for scoring and the feature value that is logged.

The team is working on two solutions to overcome these serving-logging discrepancies. The first is to build a service that ingests and compares the feature values used in the serving and logging requests made to the ML server. This service can emit and highlight any value discrepancies, enabling faster detection of inconsistent data. Further, the team is working on removing the inconsistency by unifying the serving-logging path such that all ranking and logging requests for each candidate use the same cached value for any given feature.

In addition to this, the logic behind feature backfilling could be faulty, leading to potential data leakage issues in offline experiments. Furthermore, there is a possibility that the features extracted from the cache at serving time might not be up to date.
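The core of the comparison between served and logged feature values can be sketched as a simple diff over two feature maps for the same candidate. The function name, the feature names, and the tolerance are hypothetical; a real service would also need to join the two requests by candidate and request ID.

```python
import math

def diff_features(served, logged, rel_tol=1e-6):
    """Compare the feature values used for scoring with those written to the log
    for the same candidate. Returns {feature: (served_value, logged_value)}."""
    mismatches = {}
    for name in served.keys() | logged.keys():
        s, l = served.get(name), logged.get(name)
        both_numeric = isinstance(s, (int, float)) and isinstance(l, (int, float))
        same = math.isclose(s, l, rel_tol=rel_tol) if both_numeric else s == l
        if not same:
            mismatches[name] = (s, l)
    return mismatches

# Hypothetical feature maps captured at scoring time and at logging time.
served = {"conv_7d": 12.0, "ctr_1d": 0.031, "country": "US"}
logged = {"conv_7d": 14.0, "ctr_1d": 0.031, "country": "US"}
mismatches = diff_features(served, logged)
```

A feature that is present on only one side shows up with `None` on the other, which makes missing values in either path visible as well.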

At Pinterest, our efforts to ensure high data consistency between training and serving unfold in the following directions:

  1. In earlier model versions, each modeling use case defined custom user-defined functions (UDFs) in C++ for feature extraction. This could create a mismatch between the training UDF binary and the serving binary, causing discrepancies in the served model. We have since transitioned to a Unified Feature Representation (flattened feature definitions), making ML features a priority in the ML lifecycle. This shift allows most of the feature pre-processing logic to either be pushed to the feature store or managed by the model trainer, thereby reducing such discrepancies.
  2. We do sanity checks on a few days of forward-logged data and backfilled data to make sure they are consistent before proceeding to backfill several months of data.
  3. We maintain a small log of features on the model server to compare with our asynchronous logging pipeline. This allows us to compare serving features with logged features and easily detect any discrepancies due to failed cache or logging logic.

Figure 3: Evolution towards Unified Feature Representation for standardization across different ML use cases

During training, previous model checkpoints are retrieved from S3, and signals are fetched and processed before being fed into the algorithm for continuous model training. When the training job finishes, the trained model checkpoint is saved to the S3 bucket and later fetched during model serving. Possible issues we may encounter in the above process include:

  1. Model checkpoint corruption
  2. S3 access issues due to network failure or other issues
  3. Model training getting stuck due to infrastructure issues, e.g., insufficient GPUs
  4. Model staleness due to delayed or broken model training pipelines
  5. Serving failure due to high traffic load or high serving latency

Similar to the solutions for diagnosing data issues, sufficient monitoring and alerting systems are needed to allow for quick failure detection and diagnosis in model training and serving pipelines. At Pinterest, we’ve built easily scalable tools in our new MLEnv training framework [ref 4] that can now be applied easily across different verticals during offline model training and online serving. Specifically, we’ve developed batch inference capabilities to take any model checkpoint and run inference against logged features. This allows us not only to replay traffic and check for discrepancies, but also to simulate different scenarios, e.g., missing or corrupted features, for model stability checks.
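The idea of replaying logged features while simulating missing values can be illustrated with a toy example. This is not the MLEnv batch inference API: the logistic-regression "checkpoint", the feature names, and the zero-imputation defaults below are all stand-ins chosen to keep the sketch self-contained.

```python
import math

# Hypothetical stand-in for a model checkpoint: logistic regression weights.
WEIGHTS = {"conv_7d": 0.8, "ctr_1d": 1.5}
BIAS = -2.0
DEFAULTS = {"conv_7d": 0.0, "ctr_1d": 0.0}  # assumed imputation for missing values

def predict(features):
    """Score one logged example, imputing a default when a feature is missing."""
    z = BIAS
    for name, weight in WEIGHTS.items():
        value = features.get(name)
        z += weight * (DEFAULTS[name] if value is None else value)
    return 1.0 / (1.0 + math.exp(-z))

def drop_feature(rows, name):
    """Simulate a feature outage by nulling one feature in every logged row."""
    return [{**row, name: None} for row in rows]

def mean_prediction(rows):
    return sum(predict(row) for row in rows) / len(rows)

# Illustrative logged rows only.
logged_rows = [
    {"conv_7d": 1.0, "ctr_1d": 0.2},
    {"conv_7d": 3.0, "ctr_1d": 0.5},
    {"conv_7d": 0.0, "ctr_1d": 0.1},
]
baseline = mean_prediction(logged_rows)
# Relative shift in the mean prediction when each feature goes missing.
sensitivity = {
    name: (mean_prediction(drop_feature(logged_rows, name)) - baseline) / baseline
    for name in WEIGHTS
}
```

Ranking features by this kind of sensitivity gives a first indication of which outages would move predictions the most, and therefore which features deserve the tightest monitoring.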

Beyond the system-related data issues described above, the quality of the labels in conversion prediction contexts is often reliant on the advertisers, given that conversion events occur on their platforms. We’ve seen cases of inaccurate conversion volumes, atypical conversion trends in data, and conversion label loss due to increasing privacy concerns. To mitigate these issues, we emphasize the detection and exclusion of outlier advertisers from both model training and evaluation. We also have a dedicated team working on improving conversion data quality; for example, optimizing the view-through checkout attribution time window for better conversion model training. We’re also heavily investing in model training that is resilient to noisy and missing labels.
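The post does not describe how outlier advertisers are detected. As one possible illustration only, the sketch below flags advertisers whose conversion rate is far from the median, using a modified z-score based on the median absolute deviation. The technique, the threshold of 3.5, and the advertiser data are all assumptions, not the method used at Pinterest.

```python
from statistics import median

def outlier_advertisers(conversion_rates, threshold=3.5):
    """Flag advertisers whose conversion rate has a modified z-score above threshold.
    conversion_rates: {advertiser_id: conversions / clicks}."""
    rates = list(conversion_rates.values())
    med = median(rates)
    mad = median(abs(rate - med) for rate in rates)
    if mad == 0:
        # All rates are (nearly) identical, so nothing can be called an outlier.
        return set()
    return {
        advertiser
        for advertiser, rate in conversion_rates.items()
        if 0.6745 * abs(rate - med) / mad > threshold
    }

# Hypothetical advertisers; "e" reports an implausibly high conversion rate.
rates = {"a": 0.020, "b": 0.022, "c": 0.019, "d": 0.021, "e": 0.450}
flagged = outlier_advertisers(rates)
```

Median-based statistics are used here because a single extreme advertiser would otherwise inflate the mean and standard deviation and hide itself.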

Here, we would like to share a case where we discovered significant treatment model performance deterioration during an online experiment and how we found the root cause.

First, we examined the model training pipeline and found that it was running smoothly. Then we evaluated the performance of the control and treatment models on recent testing data and found that the relative difference between their metrics aligned well with our expectation.

Next, we investigated whether the treatment model generated the same predictions online and offline for the same input. At Pinterest, data used for offline model training are stored on AWS S3, while data used for online model serving are fetched from cache or feature stores. So it’s possible that discrepancies exist between these two data sources due to S3 failure or stale cache. However, our analysis showed that offline and online predictions from the treatment model were consistent at an aggregated level, and thus this possibility was ruled out. Note that Pinterest relies on an asynchronous call to log features used for model serving, so minor online-offline feature discrepancies may occur due to continuous feature updates, but not significant ones.

Subsequently, we opted to delve into the patterns in the abnormal occurrences by assessing the timeline associated with these issues. Our analysis revealed that these problems typically coincided with bouts of peak traffic, impacting our experimental model for a duration of 2–3 minutes. By simulating the features, we found that the root of the issue lay in the elevated queries per second (QPS) during peak traffic periods. This led our feature servers to return ‘Null’ values for requests that could not be completed in time. This underscores the critical need for production models to be sufficiently robust to withstand unpredictable fluctuations and ensure that predictions don’t go awry in such outlier scenarios.
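A timeline analysis of this kind can be approximated by bucketing requests per minute and looking at request volume next to the share of ‘Null’ feature values. The sketch below assumes a hypothetical request log of (timestamp, feature value) pairs and an arbitrary 10% alert threshold.

```python
from collections import defaultdict

def null_rate_by_minute(request_log):
    """request_log: iterable of (epoch_seconds, feature_value_or_None).
    Returns {minute_start_epoch: (request_count, null_rate)}."""
    counts = defaultdict(lambda: [0, 0])  # minute -> [requests, nulls]
    for ts, value in request_log:
        bucket = counts[int(ts) // 60 * 60]
        bucket[0] += 1
        bucket[1] += value is None
    return {minute: (n, nulls / n) for minute, (n, nulls) in sorted(counts.items())}

def suspicious_minutes(stats, max_null_rate=0.1):
    """Minutes whose null rate exceeds the (assumed) alert threshold."""
    return [minute for minute, (_, rate) in stats.items() if rate > max_null_rate]

# Hypothetical log: the second minute is busier and returns Null for half its requests.
request_log = [(0, 1.0), (10, 2.0), (60, None), (61, None), (62, 1.0), (65, 3.0), (120, 1.0)]
stats = null_rate_by_minute(request_log)
peaks = suspicious_minutes(stats)
```

Dividing each request count by 60 gives the average QPS for that minute, so the same table shows whether the null-heavy minutes are also the high-traffic ones.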

Additionally, we undertook initiatives to explore and refine methods to evaluate and boost the robustness of novel model architectures. We gauged their performance in situations characterized by missing features and developed new strategies to preclude the possibility of a data explosion.

In summary, we have explored the what, why, and how of online-offline discrepancies within Pinterest’s large-scale ads ranking systems, in both bug-free and buggy scenarios. We proposed several hypotheses/learnings that have helped us explain the occurrence of discrepancies in a bug-free scenario, with the most important one being the misalignment between offline model evaluation metrics and online business metrics. We have additionally pinpointed potential issues that could arise in such large-scale machine learning systems and shared methods to diagnose such failures effectively. To provide tangible context, we’ve provided a detailed case study on a real-life Pinterest issue and demonstrated how we resolved the incident step by step. It’s our hope that our work contributes some valuable insights towards the resolution of online-offline discrepancies in sizable machine learning applications, which, in turn, could expedite the evolution of future machine learning solutions.

This work is the result of a collaboration between the ads ranking conversion modeling team members and multiple other teams at Pinterest.

Engineering Teams:

  • Ads Ranking: Han Sun, Hongda Shen, Ke Xu, Kungang Li, Matt Meng, Meng Mei, Meng Qi, Qifei Shen, Runze Su, Yiming Wang
  • Ads ML Infra: Haoyang Li, Haoyu He, Joey Wang, Kartik Kapur, Matthew Jin
  • Ads Data Science: Adriaan ten Kate, Lily Liu

Leadership: Behnam Rezaei, Ling Leng, Shu Zhang, Zhifang Liu, Durmus Karatay

  1. Learning and Evaluating Classifiers under Sample Selection Bias
  2. 150 Successful Machine Learning Models | Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
  3. Predictive model performance: offline and online evaluations
  4. MLEnv: Standardizing ML at Pinterest Under One ML Engine to Accelerate Innovation