Executing Cron Scripts Reliably At Scale
Cron scripts are responsible for critical Slack functionality. They ensure reminders execute on time, email notifications are sent, and databases are cleaned up, among other things. Over the years, both the number of cron scripts and the amount of data these scripts process have increased. While these cron scripts generally executed as expected, over time the reliability of their execution occasionally faltered, and maintaining and scaling their execution environment became increasingly burdensome. These issues led us to design and build a better way to execute cron scripts reliably at scale.
Running cron scripts at Slack started in the way you might expect. There was one node with a copy of all the scripts to run and one crontab file with the schedules for all the scripts. The node was responsible for executing the scripts locally on their specified schedule. Over time, the number of scripts grew, and the amount of data each script processed also grew. For a while, we could keep moving to bigger nodes with more CPU and more RAM; that kept things running most of the time. But the setup still wasn’t that reliable: with one box running, any issue with provisioning, rotation, or configuration would bring the service to a halt, taking some key Slack functionality with it. After continuously adding more and more patches to the system, we decided it was time to build something new: a reliable and scalable cron execution service. This article details some key components and considerations of this new system.
System Components
When designing this new, more reliable service, we decided to leverage many existing services to decrease the amount we had to build, and thus the amount we have to maintain going forward. The new service consists of three main components:
- A new Golang service called the “Scheduled Job Conductor”, run on Bedrock, Slack’s wrapper around Kubernetes
- Slack’s Job Queue, an asynchronous compute platform that executes a high volume of work quickly and efficiently
- A Vitess table for job deduplication and monitoring, to create visibility around job runs and failures
Scheduled Job Conductor
The Golang service mimicked cron functionality by leveraging a Golang cron library. The library we chose allowed us to keep the same cron string format that we used on the original cron box, which made migration simpler and less error prone. We used Bedrock, Slack’s wrapper around Kubernetes, to allow us to scale up multiple pods easily. We don’t use all the pods to process jobs. Instead, we use Kubernetes Leader Election to designate one pod to do the scheduling and keep the other pods in standby mode so one of them can quickly take over if needed. To make this transition between pods seamless, we implemented logic to prevent the leader pod from going down at the top of a minute when possible since, given the nature of cron, that is when scripts are most likely to need scheduling.
It might first appear that having more nodes processing work instead of just one would better solve our problems, since we wouldn’t have a single point of failure and we wouldn’t have one pod doing the memory- and CPU-intensive work. However, we decided that synchronizing the nodes would be more of a headache than a help. We felt this way for two reasons. First, the pods can switch leaders very quickly, making downtime unlikely in practice. Second, we could offload almost all of the memory- and CPU-intensive work of actually running the scripts to Slack’s Job Queue and use the pod just for the scheduling component. Thus, we have one pod scheduling and several other pods waiting in the wings.
Job Queue
That brings us to Slack’s Job Queue. The Job Queue is an asynchronous compute platform that runs about 9 billion “jobs” (or pieces of work) per day. It consists of a set of logical “queues” that jobs flow through. In simple terms, these “queues” are a logical way to move jobs through Kafka (for durable storage should the system encounter a failure or get backed up) into Redis (for short-term storage that allows additional metadata about who is executing the job to be stored alongside the job) and then finally to a “job worker”, a node ready to execute the code, which actually runs the job. See this article for more detail. In our case, a job was a single script. Even though it is an asynchronous compute platform, it can execute work very quickly if that work is isolated on its own “queue”, which is how we were able to take advantage of this system. Leveraging this platform allowed us to offload our compute and memory concerns onto an existing system that could already handle the load (and much, much more). Additionally, since this system already exists and is critical to how Slack works, we reduced our build time initially and our maintenance effort going forward, which is an excellent win!
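To illustrate why isolating cron work on its own queue keeps it fast, here is a toy in-memory model in Go. It is only a sketch: buffered channels stand in for the real Kafka and Redis stages, and the names `Queues`, `Enqueue`, `Dequeue`, and the queue names used with them are hypothetical, not part of Slack's Job Queue.

```go
package main

// Job is one unit of work. For cron, a job is a single script.
type Job struct {
	Script string
}

// Queues is a toy stand-in for a set of logical queues. Each named queue is
// an independent buffered channel, so a backlog on one does not delay another.
type Queues struct {
	byName map[string]chan Job
}

// NewQueues creates one buffered queue per name, each with the given capacity.
func NewQueues(capacity int, names ...string) *Queues {
	q := &Queues{byName: make(map[string]chan Job)}
	for _, n := range names {
		q.byName[n] = make(chan Job, capacity)
	}
	return q
}

// Enqueue adds a job to the named queue without blocking. It reports false
// if the queue does not exist or is full.
func (q *Queues) Enqueue(name string, j Job) bool {
	ch, ok := q.byName[name]
	if !ok {
		return false
	}
	select {
	case ch <- j:
		return true
	default:
		return false
	}
}

// Dequeue lets a worker assigned to one queue take its next job without
// blocking. It reports false if the queue does not exist or is empty.
func (q *Queues) Dequeue(name string) (Job, bool) {
	ch, ok := q.byName[name]
	if !ok {
		return Job{}, false
	}
	select {
	case j := <-ch:
		return j, true
	default:
		return Job{}, false
	}
}
```

In this model, a worker reading only from a dedicated cron queue sees a newly enqueued script immediately, even while a shared queue is completely backed up.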
Vitess Database Table
Finally, to round our service out, we employed a Vitess table to handle deduplication and report job tracking to internal users (other Slack engineers). Our previous cron system used flock, a Linux utility for managing locks in scripts, to ensure that only one copy of a script is running at a time. Most scripts satisfy this only-one requirement on their own, because they finish before their next scheduled run. However, a few scripts take longer than their recurrence interval, so a second copy could start while the first is still running. In our new system, we record each job execution as a new row in a table and update the job’s state as it moves through the system (enqueued, in progress, done). Thus, when we want to kick off a new run of a job, we can check that there isn’t one running already by querying the table for active jobs. We use an index on script names to make this querying fast.
Furthermore, since we’re recording the job state in the table, the table also serves as the backing for a simple web page with cron script execution information, so that users can easily look up the state of their script runs and any errors they encountered. This page is especially useful because some scripts can take up to an hour to run, so users want to be able to verify that the script is still running and that the work they’re expecting to happen hasn’t failed.
Conclusion
Overall, our new service for executing cron scripts has made the process more reliable, scalable, and user friendly. While having a crontab on a single cron box had gotten us quite far, it started causing us a lot of pain and wasn’t keeping up with Slack’s scale. This new system will give Slack the room needed to grow, both now and far into the future.
Want to help us work on systems like this? We’re hiring!