Supporting Numerous ML Programs : Netflix Tech Weblog

David J. Berg, Romain Cledat, Kayla Seeley, Shashank Srikanth, Chaoying Wang, Darin Yu

Netflix makes use of knowledge science and machine studying throughout all aspects of the corporate, powering a variety of enterprise purposes from our inner infrastructure and content material demand modeling to media understanding. The Machine Studying Platform (MLP) group at Netflix supplies a whole ecosystem of instruments round Metaflow, an open supply machine studying infrastructure framework we began, to empower knowledge scientists and machine studying practitioners to construct and handle a wide range of ML methods.

Since its inception, Metaflow has been designed to offer a human-friendly API for constructing knowledge and ML (and right this moment AI) purposes and deploying them in our manufacturing infrastructure frictionlessly. Whereas human-friendly APIs are pleasant, it’s actually the integrations to our manufacturing methods that give Metaflow its superpowers. With out these integrations, initiatives could be caught on the prototyping stage, or they must be maintained as outliers outdoors the methods maintained by our engineering groups, incurring unsustainable operational overhead.

Given the very various set of ML and AI use instances we help — right this moment we have now a whole lot of Metaflow initiatives deployed internally — we don’t count on all initiatives to comply with the identical path from prototype to manufacturing. As a substitute, we offer a sturdy foundational layer with integrations to our company-wide knowledge, compute, and orchestration platform, in addition to varied paths to deploy purposes to manufacturing easily. On prime of this, groups have constructed their very own domain-specific libraries to help their particular use instances and desires.

On this article, we cowl a number of key integrations that we offer for varied layers of the Metaflow stack at Netflix, as illustrated above. We may also showcase real-life ML initiatives that depend on them, to offer an thought of the breadth of initiatives we help. Word that each one initiatives leverage a number of integrations, however we spotlight them within the context of the mixing that they use most prominently. Importantly, all of the use instances have been engineered by practitioners themselves.

These integrations are carried out by way of Metaflow’s extension mechanism which is publicly out there however topic to alter, and therefore not part of Metaflow’s steady API but. In case you are inquisitive about implementing your individual extensions, get in contact with us on the Metaflow community Slack.

Let’s go over the stack layer by layer, beginning with probably the most foundational integrations.

Our most important knowledge lake is hosted on S3, organized as Apache Iceberg tables. For ETL and different heavy lifting of information, we primarily depend on Apache Spark. Along with Spark, we need to help last-mile knowledge processing in Python, addressing use instances reminiscent of function transformations, batch inference, and coaching. Sometimes, these use instances contain terabytes of information, so we have now to concentrate to efficiency.

To allow quick, scalable, and strong entry to the Netflix knowledge warehouse, we have now developed a Quick Knowledge library for Metaflow, which leverages high-performance elements from the Python knowledge ecosystem:

As depicted within the diagram, the Quick Knowledge library consists of two most important interfaces:

  • The Desk object is chargeable for interacting with the Netflix knowledge warehouse which incorporates parsing Iceberg (or legacy Hive) desk metadata, resolving partitions and Parquet recordsdata for studying. Just lately, we added help for the write path, so tables may be up to date as properly utilizing the library.
  • As soon as we have now found the Parquet recordsdata to be processed, MetaflowDataFrame takes over: it downloads knowledge utilizing Metaflow’s high-throughput S3 shopper on to the method’ reminiscence, which often outperforms reading of local files.

We use Apache Arrow to decode Parquet and to host an in-memory illustration of information. The person can select probably the most appropriate instrument for manipulating knowledge, reminiscent of Pandas or Polars to make use of a dataframe API, or one among our inner C++ libraries for varied high-performance operations. Due to Arrow, knowledge may be accessed by way of these libraries in a zero-copy vogue.

We additionally take note of dependency points: (Py)Arrow is a dependency of many ML and knowledge libraries, so we don’t need our customized C++ extensions to rely on a particular model of Arrow, which may simply result in unresolvable dependency graphs. As a substitute, within the type of nanoarrow, our Quick Knowledge library solely depends on the stable Arrow C data interface, producing a hermetically sealed library with no exterior dependencies.

Instance use case: Content material Information Graph

Our information graph of the leisure world encodes relationships between titles, actors and different attributes of a movie or sequence, supporting all elements of enterprise at Netflix.

A key problem in making a information graph is entity decision. There could also be many alternative representations of barely completely different or conflicting details about a title which should be resolved. That is sometimes completed by way of a pairwise matching process for every entity which turns into non-trivial to do at scale.

This undertaking leverages Quick Knowledge and horizontal scaling with Metaflow’s foreach construct to load massive quantities of title info — roughly a billion pairs — saved within the Netflix Knowledge Warehouse, so the pairs may be matched in parallel throughout many Metaflow duties.

We use metaflow.Desk to resolve all enter shards that are distributed to Metaflow duties that are chargeable for processing terabytes of information collectively. Every job hundreds the information utilizing metaflow.MetaflowDataFrame, performs matching utilizing Pandas, and populates a corresponding shard in an output Desk. Lastly, when all matching is completed and knowledge is written the brand new desk is dedicated so it may be learn by different jobs.

Whereas open-source customers of Metaflow depend on AWS Batch or Kubernetes as the compute backend, we depend on our centralized compute-platform, Titus. Beneath the hood, Titus is powered by Kubernetes, nevertheless it supplies a thick layer of enhancements over off-the-shelf Kubernetes, to make it extra observable, safe, scalable, and cost-efficient.

By concentrating on @titus, Metaflow duties profit from these battle-hardened options out of the field, with no in-depth technical information or engineering required from the ML engineers or knowledge scientist finish. Nevertheless, with the intention to profit from scalable compute, we have to assist the developer to bundle and rehydrate the entire execution atmosphere of a undertaking in a distant pod in a reproducible method (ideally shortly). Particularly, we don’t need to ask builders to handle Docker photos of their very own manually, which shortly ends in extra issues than it solves.

That is why Metaflow provides support for dependency management out of the field. Initially, we supported solely @conda, however based mostly on our work on Portable Execution Environments, open-source Metaflow gained support for @pypi a number of months in the past as properly.

Instance use case: Constructing mannequin explainers

Right here’s an interesting instance of the usefulness of transportable execution environments. For a lot of of our purposes, mannequin explainability issues. Stakeholders like to know why fashions produce a sure output and why their habits adjustments over time.

There are a number of methods to offer explainability to fashions however a method is to coach an explainer mannequin based mostly on every educated mannequin. With out going into the main points of how that is completed precisely, suffice to say that Netflix trains a number of fashions, so we have to prepare a number of explainers too.

Due to Metaflow, we will enable every software to decide on the most effective modeling strategy for his or her use instances. Correspondingly, every software brings its personal bespoke set of dependencies. Coaching an explainer mannequin due to this fact requires:

  1. Entry to the unique mannequin and its coaching atmosphere, and
  2. Dependencies particular to constructing the explainer mannequin.

This poses an attention-grabbing problem in dependency administration: we’d like a higher-order coaching system, “Explainer stream” within the determine under, which is ready to take a full execution atmosphere of one other coaching system as an enter and produce a mannequin based mostly on it.

Explainer stream is event-triggered by an upstream stream, such Mannequin A, B, C flows within the illustration. The build_environment step makes use of the metaflow atmosphere command supplied by our portable environments, to construct an atmosphere that features each the necessities of the enter mannequin in addition to these wanted to construct the explainer mannequin itself.

The constructed atmosphere is given a novel title that depends upon the run identifier (to offer uniqueness) in addition to the mannequin kind. Given this atmosphere, the train_explainer step is then capable of consult with this uniquely named atmosphere and function in an atmosphere that may each entry the enter mannequin in addition to prepare the explainer mannequin. Word that, not like in typical flows utilizing vanilla @conda or @pypi, the transportable environments extension permits customers to additionally fetch these environments immediately at execution time versus at deploy time which due to this fact permits customers to, as on this case, resolve the atmosphere proper earlier than utilizing it within the subsequent step.

If knowledge is the gas of ML and the compute layer is the muscle, then the nerves should be the orchestration layer. We now have talked in regards to the significance of a production-grade workflow orchestrator within the context of Metaflow after we launched help for AWS Step Features years in the past. Since then, open-source Metaflow has gained help for Argo Workflows, a Kubernetes-native orchestrator, in addition to support for Airflow which continues to be broadly utilized by knowledge engineering groups.

Internally, we use a manufacturing workflow orchestrator referred to as Maestro. The Maestro put up shares particulars about how the system helps scalability, high-availability, and usefulness, which give the spine for all of our Metaflow initiatives in manufacturing.

A massively vital element that usually goes missed is event-triggering: it permits a group to combine their Metaflow flows to surrounding methods upstream (e.g. ETL workflows), in addition to downstream (e.g. flows managed by different groups), utilizing a protocol shared by the entire group, as exemplified by the instance use case under.

Instance use case: Content material determination making

Some of the business-critical methods operating on Metaflow helps our content material determination making, that’s, the query of what content material Netflix ought to deliver to the service. We help an enormous scale of over 260M subscribers spanning over 190 nations representing massively various cultures and tastes, all of whom we need to delight with our content material slate. Reflecting the breadth and depth of the problem, the methods and fashions specializing in the query have grown to be very subtle.

We strategy the query from a number of angles however we have now a core set of information pipelines and fashions that present a basis for determination making. For example the complexity of simply the core elements, contemplate this high-level diagram:

On this diagram, grey containers characterize integrations to associate groups downstream and upstream, inexperienced containers are varied ETL pipelines, and blue containers are Metaflow flows. These containers encapsulate a whole lot of superior fashions and complex enterprise logic, dealing with huge quantities of information day by day.

Regardless of its complexity, the system is managed by a comparatively small group of engineers and knowledge scientists autonomously. That is made attainable by a number of key options of Metaflow:

The group has additionally developed their very own domain-specific libraries and configuration administration instruments, which assist them enhance and function the system.

To provide enterprise worth, all our Metaflow initiatives are deployed to work with different manufacturing methods. In lots of instances, the mixing may be through shared tables in our knowledge warehouse. In different instances, it’s extra handy to share the outcomes through a low-latency API.

Notably, not all API-based deployments require real-time analysis, which we cowl within the part under. We now have a lot of business-critical purposes the place some or all predictions may be precomputed, guaranteeing the bottom attainable latency and operationally easy excessive availability on the world scale.

We now have developed an formally supported sample to cowl such use instances. Whereas the system depends on our inner caching infrastructure, you would comply with the identical sample utilizing companies like Amazon ElasticCache or DynamoDB.

Instance use case: Content material efficiency visualization

The historic efficiency of titles is utilized by determination makers to know and enhance the movie and sequence catalog. Efficiency metrics may be complicated and are sometimes greatest understood by people with visualizations that break down the metrics throughout parameters of curiosity interactively. Content material determination makers are outfitted with self-serve visualizations by way of a real-time net software constructed with metaflow.Cache, which is accessed by way of an API supplied with metaflow.Internet hosting.

A day by day scheduled Metaflow job computes combination portions of curiosity in parallel. The job writes a big quantity of outcomes to a web based key-value retailer utilizing metaflow.Cache. A Streamlit app homes the visualization software program and knowledge aggregation logic. Customers can dynamically change parameters of the visualization software and in real-time a message is distributed to a easy Metaflow hosting service which seems up values within the cache, performs computation, and returns the outcomes as a JSON blob to the Streamlit software.

For deployments that require an API and real-time analysis, we offer an built-in mannequin internet hosting service, Metaflow Internet hosting. Though particulars have developed lots, this old talk still gives a good overview of the service.

Metaflow Internet hosting is particularly geared in direction of internet hosting artifacts or fashions produced in Metaflow. This supplies a simple to make use of interface on prime of Netflix’s current microservice infrastructure, permitting knowledge scientists to shortly transfer their work from experimentation to a manufacturing grade net service that may be consumed over a HTTP REST API with minimal overhead.

Its key advantages embrace:

  • Easy decorator syntax to create RESTFull endpoints.
  • The back-end auto-scales the variety of situations used to again your service based mostly on site visitors.
  • The back-end will scale-to-zero if no requests are made to it after a specified period of time thereby saving price significantly in case your service requires GPUs to successfully produce a response.
  • Request logging, alerts, monitoring and tracing hooks to Netflix infrastructure

Think about the service just like managed mannequin internet hosting companies like AWS Sagemaker Model Hosting, however tightly built-in with our microservice infrastructure.

Instance use case: Media

We now have an extended historical past of utilizing machine studying to course of media belongings, as an illustration, to personalize paintings and to assist our creatives create promotional content material effectively. Processing massive quantities of media belongings is technically non-trivial and computationally costly, so over time, we have now developed loads of specialised infrastructure devoted for this goal generally, and infrastructure supporting media ML use instances particularly.

To exhibit the advantages of Metaflow Internet hosting that gives a general-purpose API layer supporting each synchronous and asynchronous queries, contemplate this use case involving Amber, our function retailer for media.

Whereas Amber is a function retailer, precomputing and storing all media options prematurely could be infeasible. As a substitute, we compute and cache options in an on-demand foundation, as depicted under:

When a service requests a function from Amber, it computes the function dependency graph after which sends a number of asynchronous requests to Metaflow Internet hosting, which locations the requests in a queue, ultimately triggering function computations when compute sources turn out to be out there. Metaflow Internet hosting caches the response, so Amber can fetch it after some time. We may have constructed a devoted microservice only for this use case, however due to the pliability of Metaflow Internet hosting, we have been capable of ship the function quicker with no extra operational burden.

Our urge for food to use ML in various use instances is barely growing, so our Metaflow platform will hold increasing its footprint correspondingly and proceed to offer pleasant integrations to methods constructed by different groups at Netlfix. As an example, we have now plans to work on enhancements within the versioning layer, which wasn’t coated by this text, by giving extra choices for artifact and mannequin administration.

We additionally plan on constructing extra integrations with different methods which can be being developed by sister groups at Netflix. For instance, Metaflow Internet hosting fashions are at the moment not properly built-in into mannequin logging amenities — we plan on engaged on bettering this to make fashions developed with Metaflow extra built-in with the suggestions loop important in coaching new fashions. We hope to do that in a pluggable method that may enable different customers to combine with their very own logging methods.

Moreover we need to provide extra methods Metaflow artifacts and fashions may be built-in into non-Metaflow environments and purposes, e.g. JVM based mostly edge service, in order that Python-based knowledge scientists can contribute to non-Python engineering methods simply. This could enable us to raised bridge the hole between the short iteration that Metaflow supplies (in Python) with the necessities and constraints imposed by the infrastructure serving Netflix member going through requests.

In case you are constructing business-critical ML or AI methods in your group, join the Metaflow Slack community! We’re completely happy to share experiences, reply any questions, and welcome you to contribute to Metaflow.


Due to Wenbing Bai, Jan Florjanczyk, Michael Li, Aliki Mavromoustaki, and Sejal Rai for assist with use instances and figures. Due to our OSS contributors for making Metaflow a greater product.