Flexible Continuous Integration for iOS
How Airbnb leverages AWS, Packer, and Terraform to update macOS on hundreds of CI machines in hours instead of days
By: Michael Bachand, Xianwen Chen
At Airbnb, we run a comprehensive suite of continuous integration (CI) jobs before each iOS code change is merged. These jobs ensure that the main branch remains stable by executing critical developer workflows like building the iOS application and running tests. We also schedule jobs that perform periodic tasks like reporting metrics and uploading artifacts.
Many of our iOS CI jobs execute on Macs, which enables running developer tools provided by Apple. CI jobs for all other platforms at Airbnb execute in containers on Amazon EC2 Linux instances. To fulfill the macOS requirement of iOS CI jobs, we have historically maintained alternate CI infrastructure outside of AWS specifically for iOS development. The introduction of Macs to AWS provided an opportunity for us to rethink our approach to iOS CI.
We designed the next iteration of our iOS CI system in late 2021, completed the migration to the new system in mid-2022, and polished the system through the end of 2022. CI for iOS and all other platforms at Airbnb already leveraged Buildkite for dispatching jobs. Now, we deploy iOS CI infrastructure to AWS using Terraform, which helps align CI for iOS with CI for other platforms at Airbnb.
In this article, we’re excited to share with you details of the flexible and easy-to-maintain iOS CI system that we’ve implemented with Amazon EC2 Mac instances.
Historically, we ran Airbnb iOS CI on physical Macs. We enjoyed the speed of running CI without virtualization, but we paid a substantial maintenance cost to run CI jobs directly on physical hardware. An iOS infrastructure engineer individually logged into over 300 machines to perform administrative tasks like enrolling the Mac in our MDM (Mobile Device Management) tool and upgrading macOS. Manual maintenance requirements limited the scalability of the fleet and consumed engineer time that could be better spent on higher-value projects.
Our old CI machines were rarely restarted and too often drifted into a bad state. When this happened, the best-case scenario was that an engineer could log into the machine, diagnose what configuration drift was causing issues, and manually bring the machine back to a good state. More commonly, we shut down the corrupted machine so that it would no longer accept new CI jobs. Periodically, we asked the vendor who managed our physical Macs to restore the corrupted machines to a clean installation of macOS. When the machines eventually came back online, we manually re-enrolled each machine in MDM to bring our fleet back to its full capacity.
Updating to a new version of Xcode was quite error-prone as well. We strive to roll out new Xcode versions regularly since many iOS engineers at Airbnb follow Swift and Xcode releases closely and are eager to adopt new language features and IDE improvements. However, the fixed capacity of our Mac fleet made it difficult for us to verify iOS CI jobs thoroughly against new versions; any machine allocated to testing a new version of Xcode could no longer accept CI jobs from the previous Xcode version. The risk of tackling each Xcode update was increased by the fact that rolling back to a previous version of Xcode across our fleet was not practical.
When evaluating AWS, we were excited by the possibility of launching instances from Amazon Machine Images (AMIs). An AMI is a snapshot of an instance’s state, including its file system contents and other metadata. Amazon provides base AMIs for each macOS version and allows customers to create their own AMIs from running instances.
AMIs allow us to add new instances to our fleet without human intervention. An EC2 Mac bare-metal instance launched from a properly configured AMI is immediately ready to accept new work after initialization. When updating macOS, we no longer need to log into every machine in our fleet. Instead, we log into a single instance launched from the Amazon base AMI for the new macOS version. After performing a handful of manual configuration steps, like enabling automatic login, we create an Airbnb base AMI from that instance.
Initially, we powered our EC2 Mac fleet with manually created AMIs. An engineer would configure a single instance and create an AMI from that instance’s state. Then we could launch any number of additional instances from that AMI. This was a major improvement over managing physical machines since we could spin up an entire fleet of identical instances after configuring only a single instance successfully.
Now, we build AMIs using Packer. Packer programmatically launches and configures an EC2 instance using a template defined in the HashiCorp configuration language (HCL). Packer then creates an AMI from the configured EC2 instance. A Ruby wrapper script invokes Packer consistently and performs helpful validations like checking that the user has assumed the correct AWS role. We check the HCL template code into source control, and all changes to our Packer template and companion scripts are made via GitHub pull requests.
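As a rough illustration of the kind of pre-flight validation such a wrapper might perform, here is a minimal sketch. It is written in Python rather than Ruby and is not Airbnb's script. The role name `ci-ami-builder`, the template file name, and the variable names are hypothetical; only the shape of an STS assumed-role ARN and the `packer build -var` invocation follow real AWS and Packer conventions.

```python
import re

# Hypothetical role name, used only for illustration.
EXPECTED_ROLE = "ci-ami-builder"

# STS reports an assumed role as:
#   arn:aws:sts::<account-id>:assumed-role/<role-name>/<session-name>
_ASSUMED_ROLE_ARN = re.compile(
    r"^arn:aws[\w-]*:sts::\d{12}:assumed-role/(?P<role>[^/]+)/[^/]+$"
)


def assumed_role_name(caller_arn):
    """Return the role name from an assumed-role ARN, or None if it is not one."""
    match = _ASSUMED_ROLE_ARN.match(caller_arn)
    return match.group("role") if match else None


def packer_build_command(template, variables):
    """Build the argument list for `packer build` with -var overrides."""
    command = ["packer", "build"]
    for key in sorted(variables):
        command += ["-var", f"{key}={variables[key]}"]
    command.append(template)
    return command


def plan_build(caller_arn, template, variables):
    """Validate the caller's role, then return the Packer command to run."""
    role = assumed_role_name(caller_arn)
    if role != EXPECTED_ROLE:
        raise PermissionError(f"expected role {EXPECTED_ROLE!r}, got {role!r}")
    return packer_build_command(template, variables)
```

In practice the caller ARN could be obtained from `aws sts get-caller-identity`, and the returned argument list could be passed to `subprocess.run`. Failing before Packer starts avoids launching an instance that the build cannot finish configuring.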
We initially ran Packer from developer laptops, but the laptop needed to be awake and online for the duration of the Packer build. Eventually, we created a dedicated pipeline to build AMIs in the cloud. A developer can trigger a new build on this pipeline with a couple of clicks. A successful build will produce freshly baked and verified AMIs for both the x86 and Arm (Apple Silicon) CPU architectures within a few hours.
Our new CI system leveraging these AMIs consists of many environments, each of which can be managed independently. The central AWS component of each CI environment is an Auto Scaling group, which is responsible for launching the EC2 Mac instances. The number of instances in the Auto Scaling group is determined by the desired capacity property on the group and is bounded by min and max size properties.
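The relationship between the three capacity properties can be sketched with a toy model. This is not the AWS API: the class and method names are hypothetical, and the clamping shown here is an assumption about what a well-behaved client would do before requesting a new desired capacity, since a desired capacity outside the group's bounds is not valid.

```python
from dataclasses import dataclass


@dataclass
class AutoScalingGroup:
    """Toy model of the min size, max size, and desired capacity properties."""

    min_size: int
    max_size: int
    desired_capacity: int

    def __post_init__(self):
        if not 0 <= self.min_size <= self.max_size:
            raise ValueError("require 0 <= min_size <= max_size")
        self.desired_capacity = self.clamp(self.desired_capacity)

    def clamp(self, requested):
        """Bound a requested capacity to the range [min_size, max_size]."""
        return max(self.min_size, min(self.max_size, requested))

    def request(self, requested):
        """Set desired capacity to the clamped request and return the result."""
        self.desired_capacity = self.clamp(requested)
        return self.desired_capacity
```

A group with a minimum size of zero can sit empty until capacity is requested, while the maximum size caps how many instances a burst of demand can launch.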
An Auto Scaling group creates new instances using a launch template. The launch template specifies the configuration of each instance, including the AMI, and allows a “user data” script to run when the instance is launched. Launch templates can be versioned, and each Auto Scaling group is configured to launch instances from a specific version of its launch template.
Although the introduction of environments has made our CI topology more complex, we find that complexity manageable when our infrastructure is defined in code. All of our AWS infrastructure for iOS CI is specified in Terraform code that we check into source control. Each time we merge a pull request related to iOS CI, Terraform Enterprise automatically applies our changes to our AWS account. We have defined a Terraform module that we can call any time we want to instantiate a new CI environment.
An internal scaling service manages the desired capacity of each environment’s Auto Scaling group. This service, a modified fork of buildkite-agent-scaler, increases the desired capacity of an environment’s Auto Scaling group as CI job volume for that environment increases. We specify a maximum number of instances for each CI environment in part because On-Demand EC2 Mac Dedicated Hosts currently have a minimum host allocation and billing duration of 24 hours.
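The scale-out decision described above might look roughly like the following sketch. It is a simplification under stated assumptions, not the algorithm used by buildkite-agent-scaler or by Airbnb's fork: the function names, the `agents_per_instance` parameter, and the default maximum of 10 are hypothetical, and the choice to only ever raise capacity here is an assumption made for the example.

```python
import math


def target_instances(scheduled_jobs, running_jobs, agents_per_instance, max_instances):
    """Instances needed to cover waiting and running jobs, capped at the maximum."""
    if agents_per_instance < 1:
        raise ValueError("agents_per_instance must be at least 1")
    needed = math.ceil((scheduled_jobs + running_jobs) / agents_per_instance)
    return min(needed, max_instances)


def next_desired_capacity(current, scheduled_jobs, running_jobs,
                          agents_per_instance=1, max_instances=10):
    """Raise desired capacity toward the target; never lower it in this sketch."""
    target = target_instances(scheduled_jobs, running_jobs,
                              agents_per_instance, max_instances)
    return max(current, target)
```

Capping the target bounds the number of Dedicated Hosts that a spike in job volume can allocate, which matters when each allocated host is billed for at least 24 hours.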
Each CI environment has a unique Buildkite queue name. Individual CI jobs can target instances in a specific environment by specifying the corresponding queue name. Jobs fall back to the default CI environment when no queue name is explicitly specified.
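The fallback rule can be expressed in a few lines. This sketch assumes a job step is represented as a dictionary with an optional `agents` mapping that may contain a `queue` key, which mirrors how Buildkite pipeline steps target agents; the function name and the queue names in the usage lines are hypothetical.

```python
def target_queue(step, default_queue="default"):
    """Return the queue a step targets, or the default when none is specified."""
    agents = step.get("agents") or {}
    return agents.get("queue") or default_queue


# Hypothetical steps: one pinned to an environment's queue, one left unpinned.
pinned = {"command": "make test", "agents": {"queue": "ios-arm64-xcode14-3"}}
unpinned = {"command": "make lint"}
```

Here `target_queue(pinned)` yields the environment's queue name and `target_queue(unpinned)` yields the default, so most jobs need no queue configuration at all.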
CI Environments Are Highly Flexible
With this new Terraform setup, we are able to support an arbitrary number of CI environments with minimal overhead. We create a new CI environment per CPU architecture and version of Xcode. We can even duplicate these environments across multiple versions of macOS when performing an operating system update across our fleet. We use dedicated staging environments to test CI jobs on instances launched from a new AMI before we roll out that AMI broadly.
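To make the combinatorics concrete, the sketch below enumerates one environment per combination of CPU architecture, Xcode version, and macOS version. The queue naming scheme and both function names are hypothetical and chosen only for illustration; in the system described here, each entry would correspond to one call to the Terraform module.

```python
from itertools import product


def queue_name(arch, xcode, macos, staging=False):
    """Compose a queue name from an environment's dimensions (hypothetical scheme)."""
    parts = [
        "ios",
        arch,
        "xcode" + xcode.replace(".", "-"),
        "macos" + macos.replace(".", "-"),
    ]
    if staging:
        parts.append("staging")
    return "-".join(parts)


def environment_matrix(architectures, xcode_versions, macos_versions):
    """Return one environment description per combination of the inputs."""
    return [
        {"arch": arch, "xcode": xcode, "macos": macos,
         "queue": queue_name(arch, xcode, macos)}
        for arch, xcode, macos in product(architectures, xcode_versions, macos_versions)
    ]
```

With two architectures and two Xcode versions this gives four environments, and the count doubles to eight while two macOS versions are live during an operating system update.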
When we are no longer regularly using a CI environment, we can specify a minimum capacity of zero when calling the Terraform module, which sets the same value on the underlying Auto Scaling group. The Auto Scaling group will then launch instances only when its desired capacity is increased by the scaling service. In practice, we tend to delete older environments from our Terraform code. However, even once an environment has been wound down, reinstating that environment is as simple as reverting a couple of commits in Git and redeploying the scaling service.
Rotation of Instances Increases CI Consistency
To reduce the opportunity for EC2 instances to drift, we terminate all instances each night and replace them each day. This way, we can be confident that our CI fleet is in a known good state at the start of each day.
When an instance is terminated, the underlying Dedicated Host is scrubbed before a new instance can be launched on that host. We terminate instances at a time when CI demand is low to allow the EC2 Mac scrubbing process to complete before we need to launch fresh instances on the same hosts. When an instance terminates itself overnight, it decrements the desired capacity of the Auto Scaling group to which it belongs. As engineers start pushing commits the next day, the scaling service increments the desired capacity on the appropriate Auto Scaling groups, causing new instances to be launched.
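The nightly cycle can be illustrated with a small simulation. This is a toy model with hypothetical names and made-up instance IDs, not AWS code. For reference, the Auto Scaling API offers a `ShouldDecrementDesiredCapacity` option on `TerminateInstanceInAutoScalingGroup` that has this terminate-and-decrement effect; the article does not say which mechanism Airbnb uses.

```python
import itertools


class Fleet:
    """Toy model of one environment's instances and desired capacity."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.instances = set()
        self.desired_capacity = 0

    def set_desired_capacity(self, desired):
        """Launch fresh instances until the fleet matches the desired capacity."""
        self.desired_capacity = desired
        while len(self.instances) < desired:
            self.instances.add(f"i-{next(self._ids):04d}")

    def terminate_and_decrement(self, instance_id):
        """Remove an instance and lower desired capacity so it is not replaced."""
        self.instances.remove(instance_id)
        self.desired_capacity -= 1


fleet = Fleet()
fleet.set_desired_capacity(3)            # daytime demand
day_one = set(fleet.instances)
for instance_id in sorted(day_one):      # overnight self-termination
    fleet.terminate_and_decrement(instance_id)
fleet.set_desired_capacity(2)            # next morning's demand
```

Because desired capacity falls with each termination, nothing is relaunched overnight while hosts are being scrubbed, and every instance serving the next day's jobs is newly launched.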
When an instance does experience configuration drift, we can disconnect that instance from Buildkite with one click. The instance will remain running but will no longer accept new CI jobs. An engineer can log into the instance to investigate its state until the instance is eventually terminated at the end of the day. To keep overall CI capacity stable, we can manually add an additional instance to our fleet, or a replacement will be launched automatically if we terminate the instance early.
We Ship Xcode Versions More Quickly
We appreciate the new capabilities of our upgraded CI system. We can rent additional Dedicated Hosts from Amazon on demand to weather unexpected spikes in CI usage and to test software updates thoroughly. We roll out new AMIs gradually and can roll back painlessly if we encounter unexpected issues.
Together, these capabilities give Airbnb iOS developers access to Swift language features and Xcode IDE improvements more quickly. In fact, with the tailwind of our new CI system, we have seen the pace at which we update Xcode increase by over 20%. As of the time of writing, we have internally rolled out all available major and minor versions of Xcode 14 (14.0–14.3) as they have been released.
Our new CI system ran over 10 million minutes of CI jobs in the last three months of 2022. After upgrading to EC2, we spend meaningfully fewer hours on maintenance despite a growing codebase and consistently high job volume. Our newfound ability to scale CI to meet the evolving needs of the Airbnb iOS community justifies the increased complexity of the rebuilt system.
After the migration to AWS, iOS CI benefits more from shared infrastructure that is already being used successfully within Airbnb. For example, the new iOS CI architecture enabled us to avoid implementing an iOS-specific solution for automatically scaling capacity. Instead, we leverage the aforementioned fork of buildkite-agent-scaler that Airbnb engineers had already converted to an internal Airbnb service complete with a dedicated deployment pipeline. Additionally, we used existing Terraform modules that are maintained by other teams to integrate with IAM and SSM.
We have found that EC2 Mac instances launched from custom AMIs provide many of the benefits of virtualization without the performance penalty of executing within a virtual machine. We consider AWS, Packer, and Terraform to be essential technologies for building a flexible CI system for large-scale iOS development in 2023.