Logarithm: A logging engine for AI training workflows and services
- Systems and application logs play a key role in operations, observability, and debugging workflows at Meta.
- Logarithm is a hosted, serverless, multitenant service, used only internally at Meta, that consumes and indexes these logs and provides an interactive query interface to retrieve and view logs.
- In this post, we present the design behind Logarithm, and show how it powers AI training debugging use cases.
Logarithm indexes 100+GB/s of logs in real time and serves thousands of queries a second. We designed the system to support service-level guarantees on log freshness, completeness, durability, query latency, and query result completeness. Users can emit logs using their choice of logging library (the common library at Meta is the Google Logging Library [glog]). Users can query using regular expressions on log lines, on arbitrary metadata fields attached to log lines, and across the log files of hosts and services.
Logarithm is written in C++20 and the codebase follows modern C++ patterns, including coroutines and async execution. This has supported both performance and maintainability, and helped the team move fast, developing Logarithm in just three years.
Logarithm's data model
Logarithm represents logs as a named log stream of (host-local) time-ordered sequences of immutable unstructured text, corresponding to a single log file. A process can emit multiple log streams (stdout, stderr, and custom log files). Each log line can have zero or more metadata key-value pairs attached to it. A common example of metadata is rank ID in machine learning (ML) training, when multiple sequences of log lines are multiplexed into a single log stream (e.g., in PyTorch).
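As a rough sketch of this data model (the class and field names here are illustrative, not Logarithm's actual API), a log stream can be modeled as a named, append-only sequence of immutable, time-ordered lines with optional metadata:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(frozen=True)
class LogLine:
    """One immutable line: host-local timestamp, raw text, optional metadata."""
    timestamp_ns: int
    text: str
    metadata: tuple = ()  # key-value pairs, e.g. (("rank", 3),)

@dataclass
class LogStream:
    """A named, time-ordered sequence of immutable lines (like one log file)."""
    name: str
    lines: List[LogLine] = field(default_factory=list)

    def append(self, timestamp_ns: int, text: str, **metadata):
        # Lines are host-local time-ordered; enforce monotonicity on append.
        if self.lines and timestamp_ns < self.lines[-1].timestamp_ns:
            raise ValueError("log lines must be time-ordered")
        self.lines.append(LogLine(timestamp_ns, text, tuple(metadata.items())))

stream = LogStream("trainer/0/stdout")
stream.append(1, "loss=0.93", rank=0)
stream.append(2, "loss=0.91", rank=0)
```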
Logarithm supports typed structures in two ways: via typed APIs (ints, floats, and strings), and via extraction from a log line using regex-based parse-and-extract rules; a common example is metrics of tensors in ML model logging. The extracted key-value pairs are added to the log line's metadata.
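A minimal sketch of such a regex-based parse-and-extract rule (the rule definitions and field names are hypothetical):

```python
import re

# Hypothetical user-defined extraction rules: pull typed metrics out of a line.
EXTRACT_RULES = [
    ("grad_norm", re.compile(r"grad_norm=(?P<value>[0-9.]+)"), float),
    ("step", re.compile(r"step=(?P<value>\d+)"), int),
]

def extract_metadata(line: str) -> dict:
    """Apply parse-and-extract rules; matches become typed metadata pairs."""
    meta = {}
    for key, pattern, cast in EXTRACT_RULES:
        m = pattern.search(line)
        if m:
            meta[key] = cast(m.group("value"))
    return meta

meta = extract_metadata("step=120 grad_norm=3.5 layer=encoder.0")
```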
AI training debugging with Logarithm
Before diving into Logarithm's internals, we present its support for debugging training-systems and model issues, one of the prominent use cases of Logarithm at Meta. ML model training workflows tend to have a wide range of failure modes, spanning data inputs, model code and hyperparameters, and systems components (e.g., PyTorch, data readers, checkpointing, framework code, and hardware). Further, failure root causes evolve over time faster than in traditional service architectures, due to rapidly evolving workloads, from scale to architectures to sharding and optimizations. In order to triage such a dynamic set of failures, it is necessary to collect detailed systems and model telemetry.
Since training jobs run for extended periods of time, training-systems and model telemetry and state need to be captured continuously, so that a failure can be debugged without reproducing it with additional logging (which may not be deterministic, and wastes GPU resources).
Given the scale of training jobs, systems and model telemetry tend to be detailed and very high-throughput: logs are relatively cheap to write (e.g., compared to metrics, relational tables, and traces) and carry the information content to power debugging use cases.
We stream, index, and query high-throughput logs from the systems and model layers using Logarithm.
Logarithm ingests both systems logs from the training stack and model telemetry from the training jobs that the stack executes. In our setup, each host runs multiple PyTorch ranks (processes), one per GPU, and the processes write their output streams to a single log file. Debugging distributed job failures is ambiguous without rank information in the log lines, and adding it would mean modifying every logging site (including third-party code). With the Logarithm metadata API, process context such as rank ID is attached to every log line: the API adds it to thread-local context and attaches a glog handler.
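The mechanism can be sketched with Python's standard logging module standing in for glog (Logarithm's actual API is C++ and internal; all names here are illustrative): a filter attaches thread-local context to every record, so the logging sites themselves need no changes.

```python
import logging
import threading

_context = threading.local()  # per-thread process context, e.g. rank ID

def set_log_context(**kv):
    _context.kv = dict(kv)

class ContextFilter(logging.Filter):
    """Attach thread-local context to every record; no logging site changes."""
    def filter(self, record):
        record.ctx = getattr(_context, "kv", {})
        return True

logger = logging.getLogger("trainer")
handler = logging.Handler()
records = []
handler.emit = records.append  # capture records in-memory for this example
handler.addFilter(ContextFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

set_log_context(rank=3)
logger.info("starting epoch 0")  # an unmodified logging site
```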
We added UI tools to enable common log-based interactive debugging primitives. The following figures show screenshots of two such features (built on top of Logarithm's filtering operations).
Filter-by-callsite enables hiding known log lines or verbose/noisy logging sites when walking through a log stream. Walking through multiple log streams side by side enables finding rank state that differs from the other ranks (i.e., extra lines or missing lines), which typically is a symptom or a root cause. This follows directly from the single-program, multiple-data nature of production training jobs, where every rank iterates on data batches with the same code (with batch-level barriers).
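The side-by-side idea can be sketched as a small diff over normalized per-rank lines (a toy illustration, not Logarithm's implementation): in an SPMD job every rank runs the same code, so any line that not all ranks share is worth inspecting.

```python
from collections import Counter

def odd_ranks_out(rank_lines: dict) -> dict:
    """Given {rank: [normalized log lines]}, report lines not shared by all ranks,
    mapped to the number of ranks that logged them."""
    n_ranks = len(rank_lines)
    seen_in = Counter()
    for lines in rank_lines.values():
        for line in set(lines):  # de-duplicate within a rank
            seen_in[line] += 1
    return {line: c for line, c in seen_in.items() if c != n_ranks}

anomalies = odd_ranks_out({
    0: ["init", "step 1", "step 2"],
    1: ["init", "step 1", "step 2"],
    2: ["init", "step 1", "NCCL timeout"],  # rank 2 diverged
})
```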
Logarithm ingests continuous model telemetry and summary statistics that span model input and output tensors, model properties (e.g., learning rate), model internal state tensors (e.g., neuron activations), and gradients during training. This powers live training model-monitoring dashboards, such as an internal deployment of TensorBoard, and is used by ML engineers to debug model convergence issues and training failures (due to gradient/loss explosions) using notebooks on the raw telemetry.
Model telemetry tends to be iteration-based tensor timeseries with dimensions (e.g., model architecture, neuron, or module names), and tends to be high-volume and high-throughput (which makes low-cost ingestion in Logarithm a natural choice). Collocating systems and model telemetry enables debugging issues that cascade from one layer to the other. The model telemetry APIs internally write timeseries and dimensions as typed key-value pairs using the Logarithm metadata API. Multimodal data (e.g., images) are captured as references to files written to an external blob store.
Model telemetry dashboards typically consist of a large number of timeseries visualizations organized in a grid; this enables ML engineers to eyeball the spatial and temporal dynamics of the model's external and internal state over time and find anomalies and correlation structure. A single dashboard hence needs to fetch a fairly large number of timeseries and their tensors. In order to render at interactive latencies, dashboards batch and fan out queries to Logarithm using the streaming API. The streaming API returns results in random order, which enables dashboards to incrementally render all plots in parallel: within hundreds of milliseconds to the first set of samples, and within seconds to the full set of points.
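The batch-and-fan-out pattern can be sketched as follows (a toy illustration using threads and a stand-in fetch function; the real dashboards use Logarithm's streaming API):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_timeseries(name):
    """Stand-in for one streamed Logarithm query (hypothetical)."""
    return name, [0.1, 0.2, 0.3]  # in reality, samples arrive incrementally

def render_dashboard(panel_names):
    """Fan out one query per panel; render each panel as soon as it completes,
    rather than waiting for the whole batch."""
    rendered = {}
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(fetch_timeseries, n) for n in panel_names]
        for fut in as_completed(futures):  # completion order, not submit order
            name, samples = fut.result()
            rendered[name] = samples  # incremental render of this panel
    return rendered

plots = render_dashboard(["loss", "grad_norm", "learning_rate"])
```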
Logarithm's system architecture
Our goal with Logarithm is to build a highly scalable and fault-tolerant system that supports high-throughput ingestion and interactive query latencies, and provides strong guarantees on availability, durability, freshness, completeness, and query latency.
At a high level, Logarithm comprises the following components:
- Application processes emit logs using logging APIs. The APIs support emitting unstructured log lines along with typed metadata key-value pairs (per line).
- A host-side agent discovers the format of lines and parses them for common fields, such as timestamp, severity, process ID, and callsite.
- The resulting object is buffered and written to a distributed queue (for that log stream) that provides durability guarantees with days of object lifetime.
- Ingestion clusters read objects from the queues, and support additional parsing based on any user-defined regex extraction rules; the extracted key-value pairs are written to the line's metadata.
- Query clusters support interactive and bulk queries on one or more log streams, with predicate filters on log text and metadata.
Logarithm stores the locality of data blocks in a central locality service. We implement this on a hosted, highly partitioned, and replicated collection of MySQL instances. Every block generated at the ingestion clusters is written as a set of locality rows (one for each log stream in the block) to a deterministic shard, and reads are distributed across the replicas of a shard. For scalability, we do not use distributed transactions, since the workload is append-only. Note that since ingestion processing across log streams is not coordinated, by design (for scalability), federated queries across log streams may not return the same last-logged timestamps for each stream.
Our design decisions center on layering storage, query, and log analytics, and on simplicity in state distribution. We design for two common properties of logs: they are written more than they are queried, and recent logs tend to be queried more than older ones.
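Deterministic shard selection of this kind can be sketched as a stable hash of the log stream name (the shard count and hashing scheme here are assumptions, for illustration):

```python
import hashlib

N_SHARDS = 1024  # assumed shard count

def locality_shard(log_stream: str) -> int:
    """Deterministically map a log stream to a MySQL locality shard.

    Append-only rows plus a deterministic shard mean no distributed
    transactions are needed: writers for the same stream always hit
    the same shard, and readers know exactly where to look.
    """
    digest = hashlib.sha256(log_stream.encode()).digest()
    return int.from_bytes(digest[:8], "big") % N_SHARDS

s1 = locality_shard("trainer/job123/stdout")
s2 = locality_shard("trainer/job123/stdout")
```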
Design decisions
Logarithm stores logs as blocks of text and metadata, and maintains secondary indices to support low-latency lookups on text and/or metadata. Since logs rapidly lose query likelihood with time, Logarithm tiers the storage of logs and secondary indices across physical memory, local SSD, and a remote, durable, and highly available blob storage service (at Meta we use Manifold). In addition to the secondary indices, tiering also ensures the lowest latencies for the most accessed (recent) logs.
Lightweight disaggregated secondary indices. Maintaining secondary indices on disaggregated blob storage magnifies data lookup costs at query time. Logarithm's secondary indices are designed to be lightweight, using Bloom filters. The Bloom filters are prefetched (or loaded on-query) into a distributed cache on the query clusters when blocks are published to disaggregated storage, to hide network latencies on index lookups. We later added support for caching data blocks in the query cache when executing a query. The system tries to collocate data from the same log stream in order to reduce fan-out and stragglers during query processing. The logs and metadata are stored as ORC files. The Bloom filters currently index log stream locality and metadata key-value information (i.e., min-max values and Bloom filters for each column of the ORC stripes).
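A minimal Bloom filter sketch shows why these indices are lightweight: a few hashed bit positions per key, no false negatives, and a small false-positive rate whose only cost is an unnecessary block read (the parameters here are illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: compact bitset, no false negatives,
    a small false-positive rate."""
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, key: str):
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, key: str):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, key: str) -> bool:
        # False means "definitely absent": the block can be skipped at query time.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

# One filter per published block, over e.g. the metadata values in that block.
block_index = BloomFilter()
for value in ["rank=0", "rank=1", "hostA"]:
    block_index.add(value)
empty_index = BloomFilter()
```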
Logarithm separates compute (ingestion and query) and storage so that the volume of log blocks and secondary indices can be scaled out rapidly. The exception is the in-memory memtable on the ingestion clusters, which buffers time-ordered lists of log streams and serves as a staging area for both writes and reads. The memtable is a bounded per-log-stream buffer covering a recent time window of logs long enough to span what is expected to be queried. The ingestion implementation is designed to be I/O-bound, rather than compute- or host-memory-bandwidth-heavy, in order to handle close to GB/s of per-host ingestion streaming. To minimize memtable contention, we maintain multiple memtables: a mutable one for staging, and an immutable prior version that is being serialized to disk. The ingestion implementation follows zero-copy semantics.
Similarly, Logarithm separates ingestion and query resources to ensure that bulk processing (ingestion) and interactive workloads do not impact each other. Note that Logarithm's design uses schema-on-write, but the data model and parsing computation are distributed between the logging hosts (which scales ingestion compute) and, optionally, the ingestion clusters (for user-defined parsing). Customers can provision additional expected capacity for storage (e.g., increased retention limits), ingestion, and query workloads.
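The staging/immutable memtable pair can be sketched as a double-buffer swap (a simplification of the design described above; names are ours):

```python
import threading

class Memtable:
    """Two-buffer memtable: a mutable staging buffer takes new writes while
    the previous, now-immutable buffer is serialized to disk."""
    def __init__(self):
        self._lock = threading.Lock()
        self.active = []       # current staging buffer (mutable)
        self.flushing = None   # immutable prior version being serialized

    def append(self, line):
        with self._lock:
            self.active.append(line)

    def seal(self):
        """Swap buffers; the sealed buffer no longer contends with writers."""
        with self._lock:
            self.flushing, self.active = self.active, []
        return self.flushing

mt = Memtable()
mt.append("line 1")
mt.append("line 2")
sealed = mt.seal()       # hand the old buffer to the serializer
mt.append("line 3")      # writes continue against the fresh staging buffer
```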
Logarithm pushes down distributed state maintenance to the disaggregated storage layers (instead of replicating compute at the ingestion layer). The disaggregated storage in Manifold uses read-write quorums to provide strong consistency, durability, and availability guarantees. The distributed queues in Scribe use LogDevice to maintain objects as a durable replicated log. This simplifies fault tolerance in the ingestion and query tiers. Ingestion nodes stream serialized objects from local SSDs to Manifold in 20-minute epochs, and checkpoint Scribe offsets on Manifold. When a failed ingestion node is replaced, the new node downloads the last epoch of data from Manifold and restarts ingesting raw logs from the last Scribe checkpoint.
Ingestion elasticity. The Logarithm control plane (based on Shard Manager) tracks ingestion node health and log-stream shard-level hotspots, and relocates shards to other nodes when it finds issues or load. When there is an increase in logs written to a log stream, the control plane scales out the shard count and allocates new shards on ingestion nodes with available resources. The system is designed to provide ingestion-time resource isolation between log streams. If there is a significant surge on very short timescales, the distributed queues in Scribe absorb the spikes, but when the queues are full, the log stream can lose logs (until the elasticity mechanisms increase shard counts). Such spikes typically result from logging bugs (e.g., excessive verbosity) in application code.
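The recovery path can be sketched as: restore the last durable epoch, then replay the queue from the checkpointed offset (function and parameter names are ours, for illustration only):

```python
def recover(manifold_epochs, scribe_log, checkpoint_offset):
    """Recovery sketch for a replaced ingestion node: download the last
    durable epoch from blob storage, then re-ingest raw logs from the
    last checkpointed queue offset."""
    restored = list(manifold_epochs[-1]) if manifold_epochs else []
    replayed = scribe_log[checkpoint_offset:]
    return restored + replayed

state = recover(
    manifold_epochs=[["a", "b"], ["c", "d"]],   # 20-min epochs already on blob storage
    scribe_log=["a", "b", "c", "d", "e", "f"],  # durable replicated queue
    checkpoint_offset=4,  # offset checkpointed alongside the last epoch
)
```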
Query processing. Queries are routed randomly across the query clusters. When a query node receives a request, it assumes the role of an aggregator and partitions the request across a bounded subset of query cluster nodes (balancing between cluster load and query latency). The aggregator pushes down filter and sort operators to the query nodes and returns sorted results (an end-to-end blocking operation). The query nodes read their partitions of logs by looking up locality, followed by secondary indices and data blocks; the read can span the query cache, the ingestion nodes (for the most recent logs), and disaggregated storage. We added 2x replication of the query cache to support query cluster load distribution and fast failover (without waiting for cache shard movement). Logarithm also provides a streaming query API with randomized and incremental sampling that returns filtered logs (an end-to-end non-blocking operation) for lower-latency reads and time-to-first-log. Logarithm paginates result sets.
Logarithm can trade off query result completeness or ordering to maintain query latency (and flags to the client when it does so). For example, this can happen when a partition of a query is slow, or when the number of blocks to be read is too high. In the former case, it times out and skips the straggler. In the latter case, it resumes from the skipped blocks (or offsets) when processing the next result page. In practice, we provide guarantees for both result completeness and query latency. This is feasible primarily because the system has mechanisms that reduce the likelihood of the root causes that lead to stragglers. Logarithm also performs query admission control at the client or user level.
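The aggregator's fan-out with filter/sort pushdown followed by a sorted merge can be sketched as a toy in-memory illustration:

```python
import heapq

def query_node(partition, predicate):
    """Pushed-down filter + sort, executed locally on one query node."""
    return sorted(line for line in partition if predicate(line))

def aggregator(partitions, predicate):
    """Fan the request out to the nodes, then k-way merge their
    already-sorted per-node results into one sorted stream."""
    per_node = [query_node(p, predicate) for p in partitions]
    return list(heapq.merge(*per_node))

result = aggregator(
    partitions=[["b err", "a ok"], ["c err", "a err"]],
    predicate=lambda line: "err" in line,
)
```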
The following figures present Logarithm's aggregate production performance and scalability across all log streams. They highlight scalability as a result of design decisions that make the system simpler (spanning disaggregation, ingestion-query separation, indexes, and fault-tolerance design). We present our production service-level objectives (SLOs) over a month, defined as the fraction of time thresholds on availability, durability (including completeness), freshness, and query latency are violated.
Logarithm supports strong security and privacy guarantees. Access control can be enforced at per-log-line granularity at both ingestion and query time. Log streams can have configurable retention windows, with line-level deletion operations.
Next steps
Over the past few years, several use cases have been built on the foundational log primitives that Logarithm implements. Systems such as relational algebra on structured data and log analytics are being layered on top, with Logarithm's query latency guarantees, using pushdowns of search-filter-sort and federated retrieval operations. Logarithm provides a native UI for interactive log exploration, search, and filtering to aid debugging use cases. This UI is embedded as a widget in service consoles across Meta services. Logarithm also provides a CLI for bulk download of service logs for scripted analyses.
The Logarithm design has centered on simplicity for the sake of scalability guarantees. We are continuously building domain-specific and domain-agnostic log analytics capabilities within, or layered on, Logarithm, with appropriate pushdowns for performance optimizations. We continue to invest in storage and query-time improvements, such as lightweight disaggregated inverted indices for text search, storage layouts optimized for queries, and distributed debugging UI primitives for AI systems.
Acknowledgements
We thank the Logarithm team's current and past members, notably our leads: Amir Alon, Stavros Harizopoulos, Rukmani Ravisundaram, Laurynas Sukys, and Hanwen Zhang; and our leadership: Vinay Perneti, Shah Rahman, Nikhilesh Reddy, Gautam Shanbhag, Girish Vaitheeswaran, and Yogesh Upadhay. Thanks to our partners and customers: Sergey Anpilov, Jenya (Eugene) Lee, Aravind Ram, Vikram Srivastava, and Mik Vyatskov.