A behind-the-scenes look at the next version of Seq
Thanks to a round of Covid, and a very busy month, this second instalment in our series about Seq's upcoming dataflow-oriented query engine has taken a little longer than planned 😅. Thankfully, while we've been slow to put this blog post together, work on Seq vNext has been proceeding at a rapid pace, and we're increasingly excited about the performance, stability, and expressiveness that the new engine makes possible.
The initial post discussed the various stages of query execution - parsing, planning, optimizing, compiling, and executing - that the engine implements. This post dives into one of the core data structures tying all of these stages togther, our multi-level representation of expressions in the Seq query language.
What kinds of expressions are we talking about?
When you write a Seq query like:
select StatusCode, RequestMethod, RequestPath
from stream
where startswith(RequestPath, '/api')
Seq at some point needs to evaluate startswith(RequestPath, '/api')
. This is what we mean by an "expression", in contrast to query "clauses" like the select
/from
/where
structures.
An expression is just a little computation that executes in some context to produce a single value. The expression embedded in a where
clause is expected to produce a Boolean true
or false
, and an expression in the select
column list might pick out some value from the underlying event.
How are expressions represented?
Here's where things start to get interesting. Typically, expressions are represented as tree structures, with nodes for values, and for operations that perform computations accepting one or more other nodes:
The diagram above is the logical view. In Seq's new query engine, the concrete representation of a function like startswith()
is:
- a string (the function name) at the syntax layer,
- a reference to a function definition at the intermediate layer, and
- a pointer to some executable native code at the runtime layer.
Here's a rough sketch of what a Rust implementation of the expression tree for our startswith()
call could look like at the syntax layer:
// `Value` represents constants that appear in
// expression tree nodes.
enum Value {
Null
Bool(bool),
Integer(i64),
String(String),
}
// `Expression` is the recursively-defined type
// that represents an expression tree.
enum Expression {
// A constant, for example `'/api'`.
Constant(Value),
// An identifier like `RequestPath`.
Identifier(String),
// A function call with function name and argument expressions.
Call(String, Vec<Expression>)
}
Putting these together, our example expression is constructed with:
Expression::Call(
"startswith".into(),
vec![
Box::new(Expression::Identifier("RequestPath".into())),
Box::new(Expression::Constant(Value::String("/api".into())))])
Similar, but different, expressions in context
Seq needs to construct, inspect, and transform expressions at each stage of query execution:
- Syntax trees in the parser
- Intermediate representation in the planner and optimizer
- Executable code for the compiler to bake into a final runtime-callable function
Because the query language is quite sophisticated, defining expression trees and the operations on them is a lot of work. Our current implementation needs ten distinct node types, for everything from constants and function calls, to blocks, conditionals, and lambda functions, and we rely on dozens of accompanying functions to print, compare, search, convert, and reorganize expression trees.
Ideally, we'd like to avoid defining three very similar but separate representations of what are logically the same expressions.
In the earlier C#-based query engine, Seq made use of a single .NET tree structure for all representations. The tree has node types for all syntax, planning, and runtime scenarios, with different node types being valid in different contexts. It takes careful programming to handle an expression tree in the right way, depending on where and at what stage of processing it appears. Over the years, this has been the source of some confusion, and more than a handful of bugs.
In Rust, with the benefit of some experience behind us, we think we can do better!
Parameterizing expressions through language levels
In the query syntax, a function is identified using its name; in the intermediate representation it's essentially an enum member, and at in the executable representation it's a compiler intrinsic or function pointer.
The first version of our multi-level expression tree made use of generics to accept this "call type" as a generic type argument, so we'd have Expression<String>
, Expression<planner::Function>
, and Expression<runtime::Call>
:
enum Expression<T> {
Constant(Value),
Identifier(String),
Call(T, Vec<Expression<T>>)
}
As the engine took shape, we realized that the types of values that could be manipulated in an expression also dependeded on the language level that the expression represented:
- In the syntax we need to deal with identifiers
- In the intermediate representation, values might be well-known inputs like the current time, or columns flowing between dataflow blocks;
- At runtime, values are just memory locations (registers in an abstract machine)
Rather than add a second generic parameter to Expression
, we introduced a new trait describing a particular language level:
trait LanguageLevel {
type Call;
type Variable;
}
The trait defines two associated types, Call
and Variable
. Code relying on the trait can refer to Call
and Variable
, without needing to know what concrete type they map to at a particular language level.
Expression
is parameterized with a language level, and the Expression::Call
and Expression::Variable
members (replacing Identifier
) are defined in terms of them:
enum Expression<L: LanguageLevel> {
Constant(Value),
Variable(L::Variable),
Call(L::Call, Vec<Expression<L>>)
}
For each language level, we now implement the trait:
// In the parser:
struct Syntax;
impl LanguageLevel for Syntax {
type Call = String;
type Variable = String;
}
// In the planner:
struct Intermediate;
enum Function {
Startswith,
// ...
}
enum Data {
Column(usize),
Now,
// ...
}
impl LanguageLevel for Intermediate {
type Call = Function;
type Variable = Data;
}
// Similar definitions for `Runtime` omitted.
And then throughout the codebase, functions and types can accept Expression<Syntax>
, Expression<Intermediate>
, or Expression<Runtime>
, and gain correct, matching, strongly-typed representations of variables and function calls in the expression. As the system grows, the LanguageLevel
trait can be extended to deal with more varition between levels.
While consumers of Expression
can be precise about which language level they operate on, code that provides expression tree manipulation infrastructure can continue to be defined in terms of Expression<L>
, and be available at all language levels.
For example, it's possible to write one function to handle node replacement generically, as long as the particular representation supports equality comparisons and cloning:
fn replace_node<L: LanguageLevel> (
tree: Expression<L>,
from: &Expression<L>,
to: &Expression<L>) -> Expression<L>
where
Expression<L>: Eq, Clone {
// Snip!
}
Curiously, because of type inference, we gain all of this without much (if any) additional noise in code that manipulates expressions:
// Constructing the `Expression<Syntax>` for `startswith(RequestPath, '/api')`
Expression::Call(
"startswith".into(),
vec![
Box::new(Expression::Variable("RequestPath".into())),
Box::new(Expression::Constant(Value::String("/api".into())))])
Expressions in the parser, planner, and executor
We'll see Expression
and friends some more in the coming weeks, as we dig deeper into how we're implementing the new parser, planner, and executor.
Expression
really is a workhorse type that supports a stack of the new query engine's features. We love how Rust's enums, traits, and associated types come together to define a clean and precise model that we can use throughout the codebase.