Migrating Policy Delivery Engines with (virtually) No One Knowing | by Pinterest Engineering | Pinterest Engineering Blog


Jeremy Krach | Staff Security Engineer, Platform Security

A few years ago, Pinterest had a brief incident caused by oversights in the policy delivery engine. This engine is the technology that ensures a policy document written by a developer and checked into source control is fully delivered to the production system evaluating that policy, similar to OPAL. This incident began a multi-year journey for our team to rethink policy delivery and migrate hundreds of policies to a new distribution model. We shared details about our former policy delivery system in a conference talk from KubeCon 2019.

At a high level, there are three main architectural decisions we'd like to draw attention to for this story.

Figure 1: Old policy distribution architecture, using S3 and Zookeeper.
  1. Pinterest provides a wrapper service around OPA in order to manage policy distribution, agent configuration, metrics, logging, and simplified APIs.
  2. Policies were fetched automatically via Zookeeper as soon as a new version was published.
  3. Policies lived in a shared Phabricator repository that was published via a CI workflow.

So where did this go wrong? Essentially, bad versions (50+ at the time) of every policy were published simultaneously due to a bad commit to the policy repository. These bad versions were published to S3, with new versions registered in Zookeeper and pulled immediately into production. This caused many of our internal services to fail simultaneously. Fortunately, a quick re-run of our CI published known good versions that were (again) pulled immediately into production.

This incident led several teams to begin rethinking global configuration (like OPA policy). Specifically, the Security team and Traffic team at Pinterest began collaborating on a new configuration delivery system that would provide a mechanism to define deployment pipelines for configuration.

This blog post is focused on how the Security team moved hundreds of policies and dozens of customers from the Zookeeper model to a safer, more reliable, and more configurable config deployment approach.

The core configuration delivery story here isn't the Security team's to tell — Pinterest's Traffic team worked closely with us to understand our requirements, and that team was ultimately responsible for building out the core technology to enable our integration.

Generally speaking, the new configuration management system works as follows:

  1. Config owners create their configuration in a shared repository.
  2. Configs are grouped by service owners into "artifacts" in a DSL in that repository.
  3. Artifacts are configured with a pipeline, also in a DSL in that repository. This defines which systems receive the artifact and when.

Each pipeline defines a set of steps and a set of delivery scopes for each step. These scopes are generated locally on each system that would like to retrieve a configuration. For example, one might define a pipeline that first delivers to the canary system and then the production system (simplified here):
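A pipeline of this shape might look something like the following — a hypothetical Python sketch rather than the actual DSL; the field names, scope strings, and promotion modes are invented for illustration:

```python
# Hypothetical sketch of a delivery pipeline definition. The real DSL and its
# field names are Pinterest-internal and not shown in the post; this structure
# only illustrates the steps/scopes/promotion shape described above.
PIPELINE = {
    "artifact": "authz-policies/example-service",
    "steps": [
        # Each step lists the delivery scopes (systems) that receive the artifact.
        {"name": "canary",
         "scopes": ["example-service/canary"],
         "promotion": "automated-business-hours"},
        {"name": "production",
         "scopes": ["example-service/prod"],
         "promotion": "manual",
         # Do not promote past this step if the metric threshold is exceeded.
         "metric_gate": {"error_rate_pct": 1.0}},
    ],
}

def delivery_order(pipeline):
    """Return step names in the order systems receive the artifact."""
    return [step["name"] for step in pipeline["steps"]]
```

Under this sketch, `delivery_order(PIPELINE)` yields the canary step before production, which is the ordering guarantee the pipeline exists to provide.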

The DSL also allows for configuration around how pipeline steps are promoted — automated (within business hours), automated (24×7), and manual. It also allows for configuration of metric thresholds that must not be exceeded before proceeding to the next step.

The actual distribution technology is not dissimilar to the original architecture. Now, instead of publishing policy in a global CI job, each artifact (a group of policy and other configuration) has a dedicated pipeline to define the scope of delivery and the triggers for the delivery. This ensures each policy rollout is isolated to just that system and can have whatever deployment strategy and safety checks the service owner deems appropriate. A high-level architecture can be seen below.

Figure 2: New policy distribution architecture, using config server/sidecar and a dedicated UI.

Phase 1: Tooling and Inventory

Before we could begin migrating policies from a global, instantaneous deployment model to a targeted, staged deployment model, a lot of information needed to be collected. Specifically, for each policy file in our old configuration repository we needed to identify:

  1. The service and GitHub team associated with the policy
  2. The systems using the policy
  3. The preferred deploy order for the systems using the policy

Fortunately, most of this information was readily available from a handful of data sources at Pinterest. During this first phase of the migration, we developed a script to collect all this metadata about each policy. This involved: reading each policy file to pull the relevant service name from a mandatory tag comment, fetching the GitHub team associated with the service from our internal inventory API, getting metrics for all systems with traffic for the policy, and grouping those systems into a rough classification based on a few common naming conventions. Once this data was generated, we exported it to Google Sheets in order to annotate it with some manual tweaks. Specifically, some systems were misattributed to owners due to stale ownership data, and many systems didn't follow standard, predictable naming conventions.
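The inventory pass might be sketched like this — a minimal illustration only, since the tag-comment format, inventory API, and naming conventions are all assumptions not detailed in the post:

```python
import re

# Assumed format of the mandatory tag comment inside each policy file,
# e.g. "# service: ads-api" — the real convention is not shown in the post.
SERVICE_TAG = re.compile(r"#\s*service:\s*(\S+)")

def classify_system(hostname):
    """Rough grouping by common naming conventions (prefixes are invented)."""
    for prefix, group in (("canary-", "canary"), ("dev-", "dev")):
        if hostname.startswith(prefix):
            return group
    return "prod"

def inventory_policy(policy_text, systems):
    """Collect per-policy metadata: owning service and rough system groups."""
    match = SERVICE_TAG.search(policy_text)
    service = match.group(1) if match else None  # None -> flag for manual review
    groups = sorted({classify_system(host) for host in systems})
    return {"service": service, "system_groups": groups}
```

In practice this row would be joined with the internal inventory API (for the GitHub team) and traffic metrics before export to Google Sheets for manual fix-ups.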

The next piece of tooling we developed was a script that took a few pieces of input: the path to the policy to be migrated, the team names, and the deployment steps. This automatically moved the policy from the old repository to the new one, generated an artifact that included the policy, and defined a deployment pipeline for the relevant systems attributed to the service owner.
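The core of that script might look like the following sketch, where the artifact and pipeline shapes are invented for illustration (the real generated DSL is not shown in the post):

```python
# Hypothetical sketch of the migration script's generation step: given a
# policy path, owning team, and ordered deployment steps, emit the artifact
# and pipeline definitions destined for the new repository.
def generate_migration(policy_path, team, deploy_steps):
    policy_name = policy_path.rsplit("/", 1)[-1]
    artifact = {
        "name": policy_name,
        "owner": team,            # GitHub team from the inventory phase
        "files": [policy_path],   # the policy moved into the new repo
    }
    pipeline = {
        "artifact": policy_name,
        # Steps in the preferred deploy order identified during inventory.
        "steps": [{"name": step} for step in deploy_steps],
    }
    return artifact, pipeline
```

One generated artifact/pipeline pair per policy keeps each rollout isolated, which is what made per-team, single-PR migrations possible later on.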

With all this tooling in hand, we were ready to start testing the migration tooling against some simple examples.

Phase 2: Cutover Logic

Prior to the new policy delivery model, teams would define their policy subscriptions in a config file managed by Telefig. One of our goals for this migration was ensuring a seamless cutover that required minimal or no customer changes. Since the new configuration management provided the concept of scopes and defined the policy subscription in the configuration repository, we could rely purely on the new repository to define where policies were needed. We needed to update our sidecar (the OPA wrapper) to generate subscription scopes locally during start-up based on system attributes. We chose to generate these scopes based on the SPIFFE ID of the system, which allowed us to couple the deployments closely to the service and environment of the host.
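Deriving scopes from the workload identity might look roughly like this — the SPIFFE path layout and the scope string format are assumptions; only the idea that service and environment come from the SPIFFE ID is from the post:

```python
from urllib.parse import urlparse

def scopes_from_spiffe(spiffe_id):
    """Sketch: derive delivery scopes at sidecar start-up from a SPIFFE ID.

    Assumes a path layout of spiffe://<trust-domain>/<environment>/<service>;
    the real layout and scope format at Pinterest are not shown in the post.
    """
    parsed = urlparse(spiffe_id)
    parts = [segment for segment in parsed.path.split("/") if segment]
    environment, service = parts[0], parts[1]
    # Scope ties the subscription to both the service and its environment,
    # so canary and prod hosts naturally land in different pipeline steps.
    return [f"policy/{service}/{environment}"]
```

Because the scope is computed locally from the host's identity, no customer-managed subscription file is needed at all.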

We also recognized that since the configuration system can deliver arbitrary configs, we could also deliver a configuration telling our OPA wrapper to switch its behavior. We implemented this cutover logic as a hot-reload of configuration in the OPA wrapper. When a new configuration file is created, the OPA wrapper detects the new configuration and changes the following properties:

  1. Where the policies are stored on disk (reload of the OPA runtime engine)
  2. How the policies are updated on disk (Zookeeper subscription defined by a customer-managed configuration file vs. doing nothing and allowing the configuration sidecar to manage it)
  3. Metric tags, to allow detection of cutover progress
Figure 3: Flowchart of the policy cutover logic.
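The cutover decision itself can be sketched as a small pure function — field names, paths, and mode values below are invented, and only the three switched properties are from the post:

```python
# Minimal sketch of the cutover branch in the OPA wrapper, assuming the
# hot-reloaded config file carries a single (hypothetical) "delivery" field.
def apply_cutover(config):
    """Pick policy location, update mechanism, and metric tags per mode."""
    if config.get("delivery") == "new":
        return {
            "policy_dir": "/var/config/artifacts",   # written by config sidecar
            "updater": "config-sidecar",              # sidecar manages updates
            "metric_tags": {"delivery": "new"},       # track cutover progress
        }
    # Default to legacy behavior so un-migrated hosts are unaffected.
    return {
        "policy_dir": "/var/policies",                # legacy on-disk location
        "updater": "zookeeper-subscription",          # legacy pull model
        "metric_tags": {"delivery": "legacy"},
    }
```

Keeping legacy as the default branch is what makes the revert path safe: shipping (or removing) one config file flips a host between modes with no downtime.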

One benefit of this approach is that reverting the policy distribution mechanism could be accomplished entirely within the new system. If a service didn't work well with the new deployment system, we could use the new deployment system to update the new configuration file to tell the OPA wrapper to use the legacy behavior. Switching between modes could be accomplished seamlessly, with no downtime or impact to customers using policies.

Since both the policy setup and the cutover configuration could happen in a single repository, each policy or service could be migrated with a single pull request without any need for customer input. All files in the new repository could be generated with our previously-built tooling. This set the stage for a long series of migrations with localized impact to only the policy being migrated.

At this point, the foundation was laid to begin the migration in earnest. Over the course of a month or two, we began auto-generating pull requests scoped to single teams or policies. Primarily Security and Traffic team members generated and reviewed these PRs to ensure the deployments were properly scoped, attributed to the correct teams, and rolled out successfully.

As mentioned before, we had hundreds of policies that needed to be migrated, so this was a gradual but long process of moving policies in chunks. As we gained confidence in our tooling, we ramped up the number of policies migrated in a given PR from 1–2 to 10–20.

As with any plan, there were some unforeseen issues that came up as we deployed policies to a more diverse set of systems. What we found was that some of our older stateful systems were running an older machine image (AMI) that didn't support subscription declaration. This presented an immediate roadblock for progress on systems that could not easily be relaunched.

Fortunately, our Continuous Deployment team was actively revising how the Telefig service receives updates. We worked closely with the CD team to ensure that we dynamically upgraded all systems at Pinterest to use the latest version of Telefig. This unblocked our work and allowed us to continue migrating the remaining use cases.

Once we resolved the old Telefig version issue, we quickly worked with the few teams that owned the bulk of the remaining policies to get everything moved over to the new configuration deployment model. Below is a rough timeline of the migration:

Figure 4: Timeline of the migration to the new policy framework.

Once the metrics above stabilized at 100%, we began cleaning up the old tooling. This allowed us to delete hundreds of lines of code and dramatically simplify the OPA wrapper, since it no longer had to build in policy distribution logic.

At the end of this process, we now have a safer policy deployment platform that allows our teams to have full control over their deployment pipelines and to fully isolate each deployment from policies not included in that deployment.

Migrating things is hard. There's always resistance to a new workflow, and the more people that have to interact with it, the longer the tail on the migration. The main takeaways from this migration are as follows.

First, focus on measurement. In order to stay on track, you need to know who will be impacted, the scope of what work remains, and what big wins are behind you. Having good measurement also helps justify the project and provides a great set of resources to brag about accomplishments at milestones along the way.

Secondly, migrations often follow the Pareto principle. Specifically, 20% of the use cases to be migrated will often account for 80% of the results. This can be seen in the timeline chart above — there are two huge spikes in progress (one in mid-April and one a few weeks later). These spikes represent migrations for two teams, but they account for an outsized proportion of the overall progress. Keep this in mind when prioritizing which systems to migrate, as sometimes spending a lot of time just to migrate one team or system can have a disproportionate payoff.

Finally, anticipate issues but be ready to adapt. Spend time early in the process thinking through your edge cases, but leave yourself extra time on the roadmap to account for issues that you could not predict. A little bit of buffer goes a long way for peace of mind, and if you happen to deliver the results early, that's a great win to celebrate!

This work wouldn’t have been potential with out an enormous group of individuals working collectively over the previous few years to construct the very best system potential.

Huge thanks to our partners on the Traffic team for building out a robust configuration deployment system and onboarding us as the first large-scale production use case. Specifically, thanks to Tian Zhao, who led most of our collaboration and was instrumental in getting our use case onboarded. Additional thanks to Zhewei Hu, James Fish and Scott Beardsley.

The Security team was also a huge help in reviewing the architecture, migration plans and pull requests. In particular, Teagan Todd was a huge help in running many of these migrations. Also Yuping Li, Kevin Hock and Cedric Staub.

When we encountered issues with older systems, Anh Nguyen was a huge help in upgrading systems under the hood.

Finally, thanks to our partners on teams that owned a large number of policies, as they helped us push the migration forward by performing their own migrations: Aneesh Nelavelly, Vivian Huang, James Fraser, Gabriel Raphael Garcia Montoya, Liqi Yi (He Him), Qi LI, Mauricio Rivera and Harekam Singh.

To learn more about engineering at Pinterest, check out the rest of our Engineering Blog and visit our Pinterest Labs site. To explore and apply to open roles, visit our Careers page.