Modeling expressions in Rust, from syntax to execution

A behind-the-scenes look at the next version of Seq

Thanks to a round of Covid, and a very busy month, this second instalment in our series about Seq's upcoming dataflow-oriented query engine has taken a little longer than planned 😅. Thankfully, while we've been slow to put this blog post together, work on Seq vNext has been proceeding at a rapid pace, and we're increasingly excited about the performance, stability, and expressiveness that the new engine makes possible.

The initial post discussed the various stages of query execution - parsing, planning, optimizing, compiling, and executing - that the engine implements. This post dives into one of the core data structures tying all of these stages togther, our multi-level representation of expressions in the Seq query language.

What kinds of expressions are we talking about?

When you write a Seq query like:

select StatusCode, RequestMethod, RequestPath
from stream
where startswith(RequestPath, '/api')

Seq at some point needs to evaluate startswith(RequestPath, '/api'). This is what we mean by an "expression", in contrast to query "clauses" like the select/from/where structures.

An expression is just a little computation that executes in some context to produce a single value. The expression embedded in a where clause is expected to produce a Boolean true or false, and an expression in the select column list might pick out some value from the underlying event.

How are expressions represented?

Here's where things start to get interesting. Typically, expressions are represented as tree structures, with nodes for values, and for operations that perform computations accepting one or more other nodes:

The diagram above is the logical view. In Seq's new query engine, the concrete representation of a function like startswith() is:

a string (the function name) at the syntax layer,
a reference to a function definition at the intermediate layer, and
a pointer to some executable native code at the runtime layer.

Here's a rough sketch of what a Rust implementation of the expression tree for our startswith() call could look like at the syntax layer:

// `Value` represents constants that appear in
// expression tree nodes.
enum Value {
    Null
    Bool(bool),
    Integer(i64),
    String(String),
}

// `Expression` is the recursively-defined type
// that represents an expression tree.
enum Expression {
    // A constant, for example `'/api'`.
    Constant(Value),
    // An identifier like `RequestPath`.
    Identifier(String),
    // A function call with function name and argument expressions.
    Call(String, Vec<Expression>)
}

Putting these together, our example expression is constructed with:

Expression::Call(
    "startswith".into(),
    vec![
        Box::new(Expression::Identifier("RequestPath".into())),
        Box::new(Expression::Constant(Value::String("/api".into())))])

Similar, but different, expressions in context

Seq needs to construct, inspect, and transform expressions at each stage of query execution:

Syntax trees in the parser
Intermediate representation in the planner and optimizer
Executable code for the compiler to bake into a final runtime-callable function

Because the query language is quite sophisticated, defining expression trees and the operations on them is a lot of work. Our current implementation needs ten distinct node types, for everything from constants and function calls, to blocks, conditionals, and lambda functions, and we rely on dozens of accompanying functions to print, compare, search, convert, and reorganize expression trees.

Ideally, we'd like to avoid defining three very similar but separate representations of what are logically the same expressions.

In the earlier C#-based query engine, Seq made use of a single .NET tree structure for all representations. The tree has node types for all syntax, planning, and runtime scenarios, with different node types being valid in different contexts. It takes careful programming to handle an expression tree in the right way, depending on where and at what stage of processing it appears. Over the years, this has been the source of some confusion, and more than a handful of bugs.

In Rust, with the benefit of some experience behind us, we think we can do better!

Parameterizing expressions through language levels

In the query syntax, a function is identified using its name; in the intermediate representation it's essentially an enum member, and at in the executable representation it's a compiler intrinsic or function pointer.

The first version of our multi-level expression tree made use of generics to accept this "call type" as a generic type argument, so we'd have Expression<String>, Expression<planner::Function>, and Expression<runtime::Call>:

enum Expression<T> {
    Constant(Value),
    Identifier(String),
    Call(T, Vec<Expression<T>>)
}

As the engine took shape, we realized that the types of values that could be manipulated in an expression also dependeded on the language level that the expression represented:

In the syntax we need to deal with identifiers
In the intermediate representation, values might be well-known inputs like the current time, or columns flowing between dataflow blocks;
At runtime, values are just memory locations (registers in an abstract machine)

Rather than add a second generic parameter to Expression, we introduced a new trait describing a particular language level:

trait LanguageLevel {
    type Call;
    type Variable;
}

The trait defines two associated types, Call and Variable. Code relying on the trait can refer to Call and Variable, without needing to know what concrete type they map to at a particular language level.

Expression is parameterized with a language level, and the Expression::Call and Expression::Variable members (replacing Identifier) are defined in terms of them:

enum Expression<L: LanguageLevel> {
    Constant(Value),
    Variable(L::Variable),
    Call(L::Call, Vec<Expression<L>>)
}

For each language level, we now implement the trait:

// In the parser:

struct Syntax;

impl LanguageLevel for Syntax {
    type Call = String;
    type Variable = String;
}

// In the planner:

struct Intermediate;

enum Function {
    Startswith,
    // ...
}

enum Data {
    Column(usize),
    Now,
    // ...
}

impl LanguageLevel for Intermediate {
    type Call = Function;
    type Variable = Data;
}

// Similar definitions for `Runtime` omitted.

And then throughout the codebase, functions and types can accept Expression<Syntax>, Expression<Intermediate>, or Expression<Runtime>, and gain correct, matching, strongly-typed representations of variables and function calls in the expression. As the system grows, the LanguageLevel trait can be extended to deal with more varition between levels.

While consumers of Expression can be precise about which language level they operate on, code that provides expression tree manipulation infrastructure can continue to be defined in terms of Expression<L>, and be available at all language levels.

For example, it's possible to write one function to handle node replacement generically, as long as the particular representation supports equality comparisons and cloning:

fn replace_node<L: LanguageLevel> (
    tree: Expression<L>,
    from: &Expression<L>,
    to: &Expression<L>) -> Expression<L>
where
    Expression<L>: Eq, Clone {
    // Snip!
}

Curiously, because of type inference, we gain all of this without much (if any) additional noise in code that manipulates expressions:

// Constructing the `Expression<Syntax>` for `startswith(RequestPath, '/api')`
Expression::Call(
    "startswith".into(),
    vec![
        Box::new(Expression::Variable("RequestPath".into())),
        Box::new(Expression::Constant(Value::String("/api".into())))])

Expressions in the parser, planner, and executor

We'll see Expression and friends some more in the coming weeks, as we dig deeper into how we're implementing the new parser, planner, and executor.

Expression really is a workhorse type that supports a stack of the new query engine's features. We love how Rust's enums, traits, and associated types come together to define a clean and precise model that we can use throughout the codebase.