Enabling static analysis of SQL queries at Meta

  • UPM is our internal standalone library for performing static analysis of SQL code and improving SQL authoring.
  • UPM takes SQL code as input and represents it as a data structure called a semantic tree.
  • Infrastructure teams at Meta leverage UPM to build SQL linters, catch user mistakes in SQL code, and perform data lineage analysis at scale.

Executing SQL queries against our data warehouse is important to the workflows of many engineers and data scientists at Meta for analytics and monitoring use cases, either as part of recurring data pipelines or for ad-hoc data exploration.

While SQL is extremely powerful and very popular among our engineers, we’ve also faced some challenges over the years, namely:

  • A need for static analysis capabilities: In a growing number of use cases at Meta, we must understand programmatically what happens in SQL queries before they are executed against our query engines, a task called static analysis. These use cases range from performance linters (suggesting query optimizations that query engines cannot perform automatically) to data lineage analysis (tracing how data flows from one table to another). This was hard for us to do for two reasons: First, while query engines internally have some capabilities to analyze a SQL query in order to execute it, this query analysis component is typically deeply embedded inside the query engine’s code. It is not easy to extend, and it is not intended for consumption by other infrastructure teams. Second, each query engine has its own analysis logic, specific to its own SQL dialect; as a result, a team that wants to build a piece of analysis for SQL queries would have to reimplement it from scratch inside each SQL query engine.
  • A limiting type system: Initially, we used only the fixed set of built-in Hive data types (string, integer, boolean, etc.) to describe table columns in our data warehouse. As our warehouse grew more complex, this set of types became insufficient, as it left us unable to catch common categories of user errors, such as unit errors (imagine making a UNION between two tables, both of which contain a column called timestamp, but one is encoded in milliseconds and the other one in nanoseconds), or ID comparison errors (imagine a JOIN between two tables, each with a column called user_id, where the IDs are in fact issued by different systems and therefore cannot be compared).

How UPM works

To address these challenges, we have built UPM (Unified Programming Model). UPM takes in an SQL query as input and represents it as a hierarchical data structure called a semantic tree.

For example, if you pass this query to UPM:

SELECT
COUNT(DISTINCT user_id) AS n_users
FROM login_events

UPM will return this semantic tree:

SelectQuery(
    items=[
        SelectItem(
            name="n_users",
            type=upm.Integer,
            value=CallExpression(
                function=upm.builtin.COUNT_DISTINCT,
                arguments=[ColumnRef(name="user_id", parent=Table("login_events"))],
            ),
        )
    ],
    parent=Table("login_events"),
)
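
To make the shape of such a tree concrete, here is a minimal Python sketch that models it with dataclasses and walks it to list the columns a query reads. The class names mirror the nodes printed above, but their fields, the string-valued type and function names, and the helper referenced_columns are assumptions made for illustration; this is not UPM's actual API.

```python
from dataclasses import dataclass

# Hypothetical node classes. The names follow the tree shown above;
# the fields are illustrative assumptions, not UPM's real definitions.

@dataclass(frozen=True)
class Table:
    name: str

@dataclass(frozen=True)
class ColumnRef:
    name: str
    parent: Table

@dataclass(frozen=True)
class CallExpression:
    function: str        # e.g. "COUNT_DISTINCT"
    arguments: tuple     # child expressions

@dataclass(frozen=True)
class SelectItem:
    name: str
    type: str            # e.g. "Integer"
    value: object        # a ColumnRef or CallExpression

@dataclass(frozen=True)
class SelectQuery:
    items: tuple
    parent: Table

def referenced_columns(node):
    """Recursively collect every ColumnRef reachable from a node."""
    if isinstance(node, ColumnRef):
        return [node]
    if isinstance(node, CallExpression):
        return [c for arg in node.arguments for c in referenced_columns(arg)]
    if isinstance(node, SelectItem):
        return referenced_columns(node.value)
    if isinstance(node, SelectQuery):
        return [c for item in node.items for c in referenced_columns(item)]
    return []

login_events = Table("login_events")
tree = SelectQuery(
    items=(
        SelectItem(
            name="n_users",
            type="Integer",
            value=CallExpression(
                function="COUNT_DISTINCT",
                arguments=(ColumnRef("user_id", login_events),),
            ),
        ),
    ),
    parent=login_events,
)

columns = [f"{c.parent.name}.{c.name}" for c in referenced_columns(tree)]
print(columns)
```

Because every node is a plain, immutable value, a tool can traverse the tree without knowing anything about the SQL dialect the query was originally written in.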

Other tools can then use this semantic tree for different use cases, such as:

  1. Static analysis: A tool can inspect the semantic tree and then output diagnostics or warnings about the query (such as a SQL linter).
  2. Query rewriting: A tool can modify the semantic tree to rewrite the query.
  3. Query execution: UPM can act as a pluggable SQL front end, meaning that a database engine or query engine can use a UPM semantic tree directly to generate and execute a query plan. (The term front end in this context is borrowed from the world of compilers; the front end is the part of a compiler that converts higher-level code into an intermediate representation that will ultimately be used to generate an executable program.) Alternatively, UPM can render the semantic tree back into a target SQL dialect (as a string) and pass that to the query engine.
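
The three use cases above can be sketched on a toy tree in Python. Everything here is hypothetical: the node classes, the lint rule (flagging exact distinct counts), the rewrite to APPROX_DISTINCT (a function that exists in Presto as approx_distinct), and the render function are illustrative assumptions, not UPM's real interfaces.

```python
from dataclasses import dataclass, replace

# Hypothetical, simplified node classes (not UPM's actual API).

@dataclass(frozen=True)
class ColumnRef:
    name: str

@dataclass(frozen=True)
class CallExpression:
    function: str
    arguments: tuple

@dataclass(frozen=True)
class SelectItem:
    name: str
    value: object

@dataclass(frozen=True)
class SelectQuery:
    items: tuple
    table: str

def is_count_distinct(expr):
    return isinstance(expr, CallExpression) and expr.function == "COUNT_DISTINCT"

def lint(query):
    """Static analysis: inspect the tree and return warnings."""
    return [f"{item.name}: exact distinct count may be expensive"
            for item in query.items if is_count_distinct(item.value)]

def rewrite(query):
    """Query rewriting: build a new tree with COUNT_DISTINCT swapped out."""
    def fix(item):
        if is_count_distinct(item.value):
            new_value = replace(item.value, function="APPROX_DISTINCT")
            return replace(item, value=new_value)
        return item
    return replace(query, items=tuple(fix(i) for i in query.items))

def render_expr(expr):
    if isinstance(expr, ColumnRef):
        return expr.name
    args = ", ".join(render_expr(a) for a in expr.arguments)
    if expr.function == "COUNT_DISTINCT":
        return f"COUNT(DISTINCT {args})"
    return f"{expr.function}({args})"

def render(query):
    """Rendering: turn the tree back into a SQL string for a query engine."""
    items = ", ".join(f"{render_expr(i.value)} AS {i.name}" for i in query.items)
    return f"SELECT {items} FROM {query.table}"

tree = SelectQuery(
    items=(SelectItem("n_users",
                      CallExpression("COUNT_DISTINCT", (ColumnRef("user_id"),))),),
    table="login_events",
)

print(lint(tree))
print(render(tree))
print(render(rewrite(tree)))
```

Note that rewrite returns a new tree instead of mutating its input, so the original query stays available for comparison or for other tools in the same pipeline.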

A unified SQL language front end

UPM allows us to provide a single language front end to our SQL users so that they only need to work with a single language (a superset of the Presto SQL dialect), whether their target engine is Presto, Spark, or XStream, our in-house stream processing service.

This unification is also beneficial to our data infrastructure teams: teams that own SQL static analysis or rewriting tools can use UPM semantic trees as a standard interop format, without worrying about parsing, analysis, or integration with different SQL query engines and SQL dialects. Similarly, much like Velox can act as a pluggable execution engine for data management systems, UPM can act as a pluggable language front end for data management systems, saving teams the effort of maintaining their own SQL front end.

Enhanced type-checking

UPM also allows us to provide enhanced type-checking of SQL queries.

In our warehouse, each table column is assigned a “physical” type from a fixed list, such as integer or string. Additionally, each column can have an optional user-defined type; while it does not affect how the data is encoded on disk, this type can supply semantic information (e.g., Email, TimestampMilliseconds, or UserID). UPM can take advantage of these user-defined types to improve static type-checking of SQL queries.

For example, an SQL query author might want to UNION data from two tables that contain information about different login events:

In the query on the right, the author is trying to combine timestamps in milliseconds from the table user_login_events_mobile with timestamps in nanoseconds from the table user_login_events_desktop, an understandable mistake, as the two columns have the same name. But because the tables’ schemas have been annotated with user-defined types, UPM’s typechecker catches the error before the query reaches the query engine; it then notifies the author in their code editor. Without this check, the query would have completed successfully, and the author might not have noticed the mistake until much later.
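
A check of this kind can be sketched in a few lines of Python. The Column class, the check_union function, the bigint physical type, and the TimestampNanoseconds type name are assumptions made for illustration; only TimestampMilliseconds and UserID appear in the text above, and UPM's real typechecker is certainly more involved.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Column:
    name: str
    physical_type: str                # how the data is encoded, e.g. "bigint"
    user_type: Optional[str] = None   # optional semantic annotation

def check_union(left, right):
    """Return type errors for a positional UNION of two column lists."""
    if len(left) != len(right):
        return [f"column count mismatch: {len(left)} vs {len(right)}"]
    errors = []
    for l, r in zip(left, right):
        if l.physical_type != r.physical_type:
            errors.append(
                f"{l.name}: cannot UNION {l.physical_type} with {r.physical_type}")
        elif l.user_type and r.user_type and l.user_type != r.user_type:
            # Same encoding on disk, but different meaning.
            errors.append(
                f"{l.name}: cannot UNION {l.user_type} with {r.user_type}")
    return errors

# Hypothetical schemas for the two tables named in the text.
user_login_events_mobile = [
    Column("user_id", "bigint", "UserID"),
    Column("timestamp", "bigint", "TimestampMilliseconds"),
]
user_login_events_desktop = [
    Column("user_id", "bigint", "UserID"),
    Column("timestamp", "bigint", "TimestampNanoseconds"),
]

errors = check_union(user_login_events_mobile, user_login_events_desktop)
print(errors)
```

In this sketch a column without a user-defined type is compatible with anything that shares its physical type, which is one plausible way to let annotated and unannotated tables coexist while stricter rules are rolled out.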

Column-level data lineage

Data lineage, that is, understanding how data flows within our warehouse and through to consumption surfaces, is a foundational piece of our data infrastructure. It enables us to answer data quality questions (e.g., “This data looks incorrect; where is it coming from?” and “Data in this table were corrupted; which downstream data assets were impacted?”). It also helps with data refactoring (“Is this table safe to delete? Is anyone still depending on it?”).

To help us answer these critical questions, our data lineage team has built a query analysis tool that takes UPM semantic trees as input. The tool examines all recurring SQL queries to build a column-level data lineage graph across our entire warehouse. For example, given this query:

INSERT INTO user_logins_daily_agg
SELECT
   DATE(login_timestamp) AS day,
   COUNT(DISTINCT user_id) AS n_users
FROM user_login_events
GROUP BY 1

Our UPM-powered column lineage analysis would deduce these edges:

[
  {
    from: "user_login_events.login_timestamp",
    to: "user_logins_daily_agg.day",
    transform: "DATE"
  },
  {
    from: "user_login_events.user_id",
    to: "user_logins_daily_agg.n_users",
    transform: "COUNT_DISTINCT"
  }
]

By putting this information together for every query executed against our data warehouse each day, the tool shows us a global view of the full column-level data lineage graph.
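
As a rough illustration of how such edges could be derived from a tree, the Python sketch below walks each output expression of an INSERT ... SELECT and emits one edge per source column. The node classes, the InsertQuery shape, and lineage_edges are hypothetical; the actual lineage tool and UPM's node types are not described in enough detail here to reproduce them.

```python
from dataclasses import dataclass

# Hypothetical, simplified node classes (not UPM's actual API).

@dataclass(frozen=True)
class ColumnRef:
    table: str
    name: str

@dataclass(frozen=True)
class CallExpression:
    function: str
    arguments: tuple

@dataclass(frozen=True)
class SelectItem:
    name: str
    value: object

@dataclass(frozen=True)
class InsertQuery:
    target: str      # table being written to
    items: tuple     # output columns of the SELECT

def sources(expr, transform=None):
    """Yield (transform, column) pairs for every column under expr.

    Simplification: only the function applied directly to the column
    is recorded, so nested calls report the innermost one.
    """
    if isinstance(expr, ColumnRef):
        yield transform, expr
    elif isinstance(expr, CallExpression):
        for arg in expr.arguments:
            yield from sources(arg, expr.function)

def lineage_edges(query):
    """Return one edge per (source column, output column) pair."""
    return [
        {
            "from": f"{col.table}.{col.name}",
            "to": f"{query.target}.{item.name}",
            "transform": transform,
        }
        for item in query.items
        for transform, col in sources(item.value)
    ]

query = InsertQuery(
    target="user_logins_daily_agg",
    items=(
        SelectItem("day", CallExpression(
            "DATE", (ColumnRef("user_login_events", "login_timestamp"),))),
        SelectItem("n_users", CallExpression(
            "COUNT_DISTINCT", (ColumnRef("user_login_events", "user_id"),))),
    ),
)

edges = lineage_edges(query)
for edge in edges:
    print(edge)
```

Collecting the edges from many queries into a single set of (from, to) pairs would give a directed graph that can be searched upstream for data quality questions or downstream for impact analysis.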

What’s next for UPM

We look forward to more exciting work as we continue to unlock UPM’s full potential at Meta. Eventually, we hope all Meta warehouse tables will be annotated with user-defined types and other metadata, and that enhanced type-checking will be strictly enforced in every authoring surface. Most tables in our Hive warehouse already leverage user-defined types, but we are rolling out stricter type-checking rules gradually, to facilitate the migration of existing SQL pipelines.

We have already integrated UPM into the main surfaces where Meta’s developers write SQL, and our long-term goal is for UPM to become Meta’s unified SQL front end: deeply integrated into all our query engines, exposing a single SQL dialect to our developers. We also intend to iterate on the ergonomics of this unified SQL dialect (for example, by allowing trailing commas in SELECT clauses and by supporting syntax constructs like SELECT * EXCEPT <some_columns>, which already exist in some SQL dialects) and to ultimately raise the level of abstraction at which people write their queries.