Monday, 16 April 2018

An Introduction to CQL


Recently I have been working on a CommonMark extension for PHP7. It is based on the reference implementation in C, linking to it rather than re-implementing the spec.

The reference implementation in C is extremely fast, and so the extension has a focus on performance, trying to create PHP objects only when necessary, among other (boring) optimisations.

In C the iterators provided by the reference implementation are extremely fast; It simply doesn't matter that you might have to accept every node in a document when you're working in C.  In a dynamic language like PHP it really does matter, even if the objects representing the nodes are short lived. Again when you access a parent or child node in C, you are just doing pointer arithmetic (hidden behind function calls), it's all simple stuff. When it comes to a dynamic language there is all kinds of baggage attached to the object (and even the read operation itself), additional allocations and other such instructions must be executed before the C pointer can be passed into user land.

While the iterators from the reference implementation are fast, they are not smart - they don't really need to be, as explained. When it comes to inspecting a document (before conversion for example, or for editing), the kind of code you need to write in any language consists of complicated nested loops and or recursive calls, it's long and complicated, and difficult to get right.

Introducing CQL

CQL - CommonMark Query Language is a feature that has been developed alongside the CommonMark PHP extension, which solves some of the problems of iterating through a tree structure in a dynamic language by allowing the user to express as a string how to travel through the document and which nodes to return.

CQL consists of a lexer and parser, a compiler for a small set of instructions, and a virtual machine for executing the instructions.

Syntax

For the real geeks, they can just look at the context free grammar, for the rest of us, a query describes a path through a document:

/firstChild/children

The above query will return the children of the first child node of a tree.

firstChild, lastChild, parent, next, previous, and children are all accepted paths.

children can accept sub queries (but cannot have other paths following it, because think about it ...):
/firstChild/children[ /children ]
The above query will return the children of the children of the first child node of a tree.

children can also accept a constraint:

/firstChild/children(BlockQuote)

The above query will return the children of the first child node of a tree that are BlockQuote objects.

Constraints may be or'd together:

/firstChild/children(BlockQuote|Paragraph)

The above query will return the children of the first child node of a tree that are BlockQuote or Paragraph objects.

Subqueries with constraints can also have subqueries:

/firstChild/children(BlockQuote)[ /children(Paragraph) ]

The above query will return Paragraphs that are children of BlockQuotes that are children of the first child node of a tree.

Constraints and sub queries may be nested ad-absurdum to describe a path to take through the tree. The form of the queries I have used here is for readability only, whitespace is ignored, and content after # is ignored.

Execution

Having lexed and parsed your query into an abstract syntax tree, CQL compiles the AST into discrete instructions for travelling through the tree. We're going to skip over a description of that AST because it's throw away and boring. Let's have a quick look at the result of compiling the AST, the instructions:

Each instruction has an input value (IV), and an output value (RV) or JMP target (#T), in addition it has an extended value (int) for storing constraints, and probably other things in future.

We'll start simple, with /firstChild/lastChild, which compiles to:

For simplicity, you can consider the numbers in IV and RV columns variables, the first instruction FCN sets 1 to the first child node of 0, the second LCN sets 2 to the last child node of 1, and the third instruction ENT dispatches a call to the caller of the function with the address of the node at 2.

Remembering that these "variables" are just addresses, no zvals, no php vars, all very low level stuff.

It gets a little more complicated when it comes to children, /firstChild/children compiles to the following instructions:

The first instruction FCN sets 1 to the first child node of 0, the second instruction sets the first child node of 1 to 2, the third ENT dispatches the enter call. The next instruction NEN sets 3 to the node next to 2 in the tree, the next SET instruction sets 2 to 3, and the next JMP jumps to ENT if 3 is positive, creating a loop until all the children are consumed.

The textual description of a query like:
/children(List)[ /children(Item)[ /children(Paragraph) ] ]
would be extremely boring, but here's what that query looks like:
The only new instruction is CON, which will skip nodes that do not match the constraint given.

The virtual machine that executes the instructions looks like:

Making execution of the query extremely efficient, much more so than you would be able to write in PHP.

PHP API

Proper documentation for the PHP API will become available in the manual soon, here's a quick description for those that want to get started:

The CommonMark extension declares \CommonMark\CQL:
class \CommonMark\CQL {
    public function __construct(string $query);
    public function __invoke(\CommonMark\Node $node, callable $enter);
}

The callable provided $enter should have the prototype:
function (\CommonMark\Node $root, \CommonMark\Node $node)
and will be invoked by CQL on ENT instructions.

Get Involved or Wait :)

I am not finished writing tests for CQL yet, so it currently lives in a feature branch. It will be included in the next release of the extension, probably in the next couple of weeks.

If you feel like being helpful, you could come and submit a PR for tests ...

Peace out phomies ...

No comments:

Post a Comment