Tracing Notifications – Slack Engineering
Notifications are a key aspect of the Slack user experience. Users rely on timely notifications of mentions and DMs to stay on top of important information. Poor notification completeness erodes the trust of all Slack users.
Notifications flow through almost all of the systems in our infrastructure. As illustrated in Figure 1 below, a notification request flows through the webapp (our application logic and web / Desktop client monorepo), job queue, push service, and several third-party services before hitting our iOS, Android, Desktop, or web clients.
Further, the decision about when and where to send a notification is also very complicated, as shown in Figure 2 below, which is from our 2017 blog post (also summarized here).
Since 2017, our notification workflow has only grown more complex with the addition of new features like Huddles and Canvas. As a result, fixing notification issues can lead to multi-day debugging sessions across multiple teams. Customer tickets related to notifications also had the lowest NPS scores and took the longest time to resolve compared to other customer issues.
Debugging notification issues within our systems was difficult because each system had a different logging pipeline and data format, making it necessary to look at data in different formats and backends. The context in which events were logged also varied across systems, prolonging any investigation. The result was a time-consuming process that took several days and required expertise in all parts of the stack just to understand what happened.
We began a project to trace the flow of notifications across our systems to address these challenges. The goal was to standardize the data format and semantics of events to make it easier to understand and debug notification data. We wanted to answer questions about a notification such as: whether it was sent, where it was sent, whether it was viewed, and whether the user had opened it. This post documents our multi-quarter, cross-organizational journey of tracing notifications throughout Slack’s backend systems, and how we use this trace data to improve the Slack customer experience for everyone.
Notification flow
The sequence of steps to understand how notifications were sent and received is something we’ve dubbed the “notification flow.” The first step to improve the notification flow was to model the steps in the notification process the same way across all our clients. We also aimed to capture all events consistently in a common data model and format.
We created a notification spec to understand all the events in a notification trace. This involved identifying all the events in a trace, creating an idealized funnel, and setting the context in which each event would be logged. We also had to agree on the semantics of a span and the names of the events, which was a challenging task across different platforms. The result is a notification flow (simplified for this blog post), shown in the image below.
Mapping notification flow to a trace
After we finished planning the flow of our system, we needed to pick a way to keep track of that information. We chose to use SlackTrace because a trace was a natural way to represent a flow, and all the parts of our system could already send information in the span event format. However, we encountered two major challenges when modeling notification flows as traces.
- 100% sampling for notification flows: Unlike backend requests, which were sampled at 1%, notification flows shouldn’t be sampled, since our customer experience (CE) team wanted 100% fidelity to answer all customer requests. In some scenarios like `@here` and `@channel`, a push notification message could potentially be sent to hundreds of thousands of users across multiple devices, resulting in billions of spans for a single trace of a Slack message. A trace with potentially billions of spans would wreak havoc on our trace ingestion pipeline and storage backends. No sampling would also force us to trace every Slack message sent.
- Tracing notifications as a flow separate from the original message sent trace: Currently, OpenTelemetry (OpenTracing) instrumentation tightly couples tracing to a request context. In a notification flow, this tight coupling would break, since the notification flow executes in multiple contexts and doesn’t cleanly map to a single request context. Further, mixing multiple trace contexts also made implementing tracing across our code challenging.
To solve both of these challenges we decided to model each notification sent as its own trace. To tie the sender’s trace to each of the notifications sent, we used span links to causally link the spans together. Each notification was assigned a notification_id, which was used as the trace_id for the notification flow.
This approach has several advantages:
- Since SlackTrace’s instrumentation doesn’t tightly couple trace context propagation with request context propagation, modeling these flows drastically simplified the trace instrumentation.
- Since each notification sent was its own trace, the traces were smaller and easier to store and query.
- It allowed 100% sampling for notification traces, while keeping the sender’s sampling rate at 1%.
- Span linking helped us preserve causality for the trace data.
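A minimal Python sketch of this modeling follows. SlackTrace’s actual API is not shown in this post, so the `Span` class, its field names, and the `fan_out` helper are hypothetical stand-ins: the sender’s `notification:trigger` span stays in the request trace, each recipient gets a fresh notification_id that doubles as the trace_id, and the trigger span records those IDs as span links.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


def new_id() -> str:
    """Generate a random hex identifier (illustrative only)."""
    return uuid.uuid4().hex


@dataclass
class Span:
    # Hypothetical span record; field names are assumptions, not SlackTrace's schema.
    name: str
    trace_id: str
    span_id: str = field(default_factory=new_id)
    parent_id: Optional[str] = None
    links: List[str] = field(default_factory=list)  # trace IDs of causally linked traces
    tags: Dict[str, str] = field(default_factory=dict)


def fan_out(request_id: str, recipients: List[str]) -> Tuple[Span, List[Span]]:
    """Create one trigger span in the sender's trace and one new trace per recipient."""
    trigger = Span(name="notification:trigger", trace_id=request_id)
    notify_spans = []
    for user_id in recipients:
        notification_id = new_id()  # also used as the trace_id of the new trace
        trigger.links.append(notification_id)
        notify_spans.append(
            Span(
                name="notification:notify",
                trace_id=notification_id,
                parent_id=trigger.span_id,
                tags={"user_id": user_id},
            )
        )
    return trigger, notify_spans


trigger, notify_spans = fan_out("request-123", ["U1", "U2", "U3"])
for span in notify_spans:
    print(span.tags["user_id"], "-> trace", span.trace_id)
```

Because each recipient’s trace is independent of the request trace, a sampling decision can be made per trace, and a large `@channel` fan-out produces many small traces rather than one enormous one.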
Different teams worked together to map each step in the notification flow to a span. The result is the table shown below.
Span name | Description | Trace id | Parent span id | Span tags |
notification:trigger | Decide if the notification should be sent or not. | Trace_id is the request id. Span links have a list of the notification_ids sent. | | trigger_type (DM, @here, @channel), user_id, team_id, channel_id, message_ts, notification_id |
notification:notify | Notify the user on all of their clients. | Trace_id is notification_id. | ID of notification:trigger span. | user_id, team_id, channel_id, message_ts |
notification:sent | Notification is sent to all of the Slack clients on the user’s device. | Trace_id is notification_id. | ID of notification:notify span. | channel_id, platform-specific notification tags |
notification:received | Notification is received on the user’s Slack client. | Trace_id is notification_id. | ID of notification:sent span. | Service name is the client name, plus client tags. |
notification:opened | User opened a notification on the device. | Trace_id is notification_id. | ID of notification:received span. | Service name is the client name, plus client tags. |
notification:read in app | User clicked on the notification to view it in the app. The span starts right after opening and ends when the message is rendered in the channel. | Trace_id is notification_id. | ID of notification:opened span. | Service name is the client name, plus client tags. |
Advantages of modeling a notification flow as a trace
Representing the notification flow as a Trace/SpanEvent has the following advantages over our existing methods.
- Consistent data format: Since all the services reported the data as a Span, the data from various backend and client systems was in the same format.
- Service name to identify source: We set the service name field to Desktop, iOS, or Android to uniquely identify the client or service that generated an event.
- Standard names for contexts: We used the span name and service name to uniquely identify an event across systems. For example, the service name for a notification:received event would be iOS, Android, or Web to accurately tag these events. Previously, the events from these three clients had different formats and were hard to query uniformly.
- Standardized timestamps and duration fields: All the events have a consistent timestamp in the same resolution and time zone as the rest of the events. If there is a duration associated with an event, we set the duration field; when reporting a one-off event, we set it to a default value of 1. This provided a single place for storing all of our duration information.
- Built-in sessions: We use the notification ID as the trace ID for the entire flow. As a result, all the events in a flow are already sessionized and there is no need to further sessionize the data. Previously, we couldn’t use the notification ID as the join key everywhere, since only some events carried one. For example, a notification triggered or a notification read event wouldn’t have a notification ID in it. We can now use the trace ID to tie these events together instead of relying on bespoke join keys.
- Clean, simple, and reliable instrumentation: Since a trace is sessionized, we only need to add the tags to the trace once when we model the notification flow as a trace. This also made the instrumentation code cleaner, simpler, and more reliable, since the changes were localized to small parts of the code that can be unit tested well. It also made the data easier to use, since there is only one join key instead of a bespoke join key for some subset of events.
- Flexible data model: This model is also flexible and extensible. If a client needs to add more context, they can add more tags to an existing span. If none of the existing spans are a fit, they can add a new span to the trace without changing the existing trace data or trace queries.
- No duplicate events: The SpanID in the event helped capture the uniqueness of events at the source. This reduced the number of events that were double reported and removed the need to de-dupe events again in our backend. The older method reported Thrift objects without unique IDs, which led to using de-dupe jobs to identify double reporting of events.
- Span linking for tying related traces together: Linking spans across traces helps preserve causality without resorting to ad hoc data modeling.
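To make a few of these properties concrete, here is a small Python sketch. The `SpanEvent` record and its field names are assumptions made for illustration, not Slack’s actual schema. It shows how a consumer of this data could rely on the span ID to ignore double-reported events and on the trace ID as the only join key needed to group a flow, with one-off events carrying the default duration of 1.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class SpanEvent:
    # Hypothetical record; the real span event format is not shown in this post.
    trace_id: str      # the notification ID, shared by every event in the flow
    span_id: str       # unique per event at the source
    name: str          # e.g. "notification:received"
    service: str       # e.g. "iOS", "Android", "Desktop"
    timestamp: int     # same resolution and time zone for all events
    duration: int = 1  # default for one-off events with no real duration


def sessionize(events: Iterable[SpanEvent]) -> Dict[str, List[SpanEvent]]:
    """Group events by trace ID, skipping any span ID already seen."""
    seen_span_ids = set()
    sessions: Dict[str, List[SpanEvent]] = defaultdict(list)
    for event in events:
        if event.span_id in seen_span_ids:
            continue  # double-reported event
        seen_span_ids.add(event.span_id)
        sessions[event.trace_id].append(event)
    for spans in sessions.values():
        spans.sort(key=lambda e: e.timestamp)
    return dict(sessions)


events = [
    SpanEvent("n1", "s2", "notification:received", "iOS", 20),
    SpanEvent("n1", "s1", "notification:sent", "push-service", 10),
    SpanEvent("n1", "s2", "notification:received", "iOS", 20),  # reported twice
    SpanEvent("n2", "s3", "notification:sent", "push-service", 15),
]
for trace_id, spans in sessionize(events).items():
    print(trace_id, [s.name for s in spans])
```

No separate sessionization key is needed here: every event in a flow already shares the same trace ID.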
How we use notification trace data at Slack
After several quarters of hard work by multiple teams, we were able to trace notifications end-to-end across all the Slack clients. Our traces were sent to a real-time store and our data warehouse using the trace ingestion pipeline.
Developers use the notification trace data to triage issues. Previously, tracking notification failures involved going through the logs of multiple systems to understand where a notification was dropped. This process was involved and took several hours of very senior engineers’ time. With notification tracing, anyone can look at the trace of a notification to see precisely where it was sent and where in the flow it was dropped.
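As a rough illustration of that kind of triage, the Python sketch below walks the delivery stages of a single notification trace in order and reports the first stage with no span. The stage names come from the span table earlier in this post. The dictionary-shaped spans and the `first_missing_stage` helper are hypothetical, and a real trace may contain one sent and received span per client, which this sketch ignores.

```python
from typing import Dict, Iterable, Optional

# Delivery stages in flow order. Opened and read-in-app spans are omitted
# because their absence may simply mean the user did not act on the notification.
DELIVERY_STAGES = [
    "notification:notify",
    "notification:sent",
    "notification:received",
]


def first_missing_stage(spans: Iterable[Dict[str, str]]) -> Optional[str]:
    """Return the first delivery stage with no span, or None if all are present."""
    names = {span["name"] for span in spans}
    for stage in DELIVERY_STAGES:
        if stage not in names:
            return stage
    return None


# All spans below share one trace_id (the notification_id).
trace = [
    {"name": "notification:notify", "service": "webapp"},
    {"name": "notification:sent", "service": "push-service"},
]
print(first_missing_stage(trace))
```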
Our customer experience team uses trace data to triage customer issues much faster these days. We now know precisely where in the notification flow a message was dropped. Since our traces are easier to read, our CE engineers can look at a trace to learn what happened to a notification and answer a customer’s query instead of escalating it to the development team, who then had to comb through the various logs. This helped us triage notification issues much more quickly, and reduced the time to triage notification tickets for our CE team by 30%.
Notification analytics
Currently, we ingest notification trace data into Elasticsearch/Grafana and our data warehouse.
Our iOS and Android engineers have started using this data to build Grafana dashboards and alerts to understand the performance of our clients. Typically, client engineers don’t use dashboarding tools like Grafana, but ours have used them very effectively to triage and debug issues in our notification flow.
We have also ingested this data into our data warehouse, where anyone can run complex analytics on it. Initially, data scientists used this data to understand performance regressions in our clients over long periods of time.
The span event format and tracing system also had an unexpected benefit. Our data scientists used this data to build a product analytics dashboard showing funnel analytics on notification flows, to better understand notification open rates. Typically, that product analytics data would be captured by a separate set of instrumentation and ingested via a different pipeline into the data warehouse. However, since we sent the trace data to the data warehouse, our data scientists can use it to compute funnel analytics and get the same insights.
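A funnel over trace data can be computed by counting the distinct traces that reached each stage. The Python sketch below assumes span rows shaped as dictionaries with `trace_id` and `name` keys. The `funnel_counts` and `open_rate` helpers are illustrative, not the queries our data scientists actually run.

```python
from typing import Dict, Iterable, List

FUNNEL_STAGES: List[str] = [
    "notification:sent",
    "notification:received",
    "notification:opened",
]


def funnel_counts(spans: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Count the distinct notification traces that reached each funnel stage."""
    reached = {stage: set() for stage in FUNNEL_STAGES}
    for span in spans:
        if span["name"] in reached:
            reached[span["name"]].add(span["trace_id"])
    return {stage: len(trace_ids) for stage, trace_ids in reached.items()}


def open_rate(counts: Dict[str, int]) -> float:
    """Fraction of sent notifications that were opened."""
    sent = counts["notification:sent"]
    return counts["notification:opened"] / sent if sent else 0.0


rows = [
    {"trace_id": "n1", "name": "notification:sent"},
    {"trace_id": "n1", "name": "notification:received"},
    {"trace_id": "n1", "name": "notification:opened"},
    {"trace_id": "n2", "name": "notification:sent"},
    {"trace_id": "n2", "name": "notification:received"},
    {"trace_id": "n3", "name": "notification:sent"},
    {"trace_id": "n4", "name": "notification:sent"},
]
counts = funnel_counts(rows)
print(counts, open_rate(counts))
```

Counting distinct trace IDs rather than raw rows means a notification delivered to several clients is still counted once per stage.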
An even more remarkable outcome was that the data scientists were able to mine the trace data to identify and report bugs in the application and its instrumentation. In the two years since, notification traces have been used many times outside of the initial use case. This shows the advantage of using trace data as a single source of truth, thanks to its support for multiple use cases.
Conclusion
Modeling flows or funnels as a trace is a great idea, but there are some challenges. In this blog post we’ve shown how Slack modeled notification flows as traces, the challenges we faced, and how we overcame them through careful modeling.
Implementing notification tracing wouldn’t have been possible without decoupling trace context propagation from the request context in the SlackTrace framework. The instrumentation helped us quickly and cleanly implement tracing across multiple backend services, while avoiding the negative side effects of existing libraries, such as cluttered instrumentation and large traces. Currently, we instrument several other flows in the production Slack app using the same strategy.
Modeling notification flows as trace data helped our CE team resolve notification issues 30% faster while also reducing escalations to the development team.
In addition to the original use case of debugging notification issues, notification trace data was also used to calculate funnel analytics for product analytics use cases. Modeling product analytics data as traces provides high-quality data in a consistent format across all of our complex stack. Further, the built-in sessionization of trace data simplified our analytics pipeline by eliminating additional jobs to de-dupe and sessionize the data. In the past two years, backend and frontend developers and data scientists have used the trace data as a single source of truth for multiple use cases.
The success of notification tracing has encouraged several other use cases where flows are modeled as traces at Slack. Today there are at least a dozen tracers running concurrently in the Slack app.
Interested in taking on interesting projects, making people’s work lives easier, or optimizing some code? We’re hiring! 💼 Apply now