Wednesday, 13 February 2019

Parallel PHP: The Next Chapter

Some years ago, to prove some people on the internet wrong, and because I had a break from normal work - the first such break in years - I decided to write pthreads. My memory fails me a little, but from what I can recall, nobody actually saw that first version, I developed the idea over the following weeks and months and was allowed to publish this work to PECL. It was my introduction to serious internals programming.

The thing I was proving wrong is that PHP is not designed to be used in threads: This is flat out wrong and has been since the 22nd of May in the year 2000, when TSRM was merged into PHP. TSRM - Thread Safe Resource Manager allows you to build PHP such that it can be embedded in a threaded server, like Apache. These builds are colloquially referred to as ZTS - Zend Thread Safe. PHP is very much designed to be used in threads. You may hear people say that TSRM is unstable, or that it's not safe ... there were mistakes in the original implementation, like any software. But it is safe, and it is theoretically sound, and it has been in use on Windows as the primary mode of executing PHP since shortly after it was merged, necessarily so because of a lack of support for proper forking.

What it's not designed to do, or rather I should say what has been given no attention except by me, is exposing threads to userland. The reason for this is the architecture that PHP has, often referred to as "share nothing" seems to be antithetical to threads: Normally, when you start a thread, it operates in the same address space as the thread or process that created it, they share everything. I've been using threads in other languages for a very long time, and the very thing that other people think makes PHP an unsuitable candidate makes me think it more suitable than other languages. The thing that makes programming with threads hard is precisely that they share data, the cognitive overhead increases with each additional thread you create. The models you have to build in your head become unreasonable and prone to mistakes. That is why more modern languages like Go try to hide all of the complexities of threading behind a nice simple API, so that the programmer doesn't actually need to learn about the intricacies of how to use a condition variable, or a mutex, or when to synchronize, they only have to learn how the simple API works.

Having never written a threading API before pthreads, and being left entirely on my own to do it even when the code became public, maybe I made some questionable decisions. I couldn't accept that using parallelism could be easy, I would repeat like a mantra that threading is hard and API's can't solve that. I wanted pthreads to expose the same kind of API that Java has, and my focus could not be shifted by reason. I vaguely remember the first time I went into IRC on internals to talk about pthreads, and people, including Rasmus, tried to reason with me that I was maybe making a mistake, that threading at the frontend of a website doesn't make sense, others said they would have preferred a simpler API ... these pleas fell on deaf ears, and I regret it. I spent many hundreds, possibly thousands of hours writing and rewriting pthreads until it is what you see today, a kind of monster that about 4 people really understand excluding myself, that only the same number of projects have really managed to deploy with any success.

A slight tangent: Threading at the frontend of a website, on the face of it, doesn't make sense: If you have a web request that creates a "reasonable" number of threads, let's say 8, and 1000 clients come along at once, you are asking your hardware to execute 8*1000 threads not including the threads or processes that done the creation, this is unreasonable and cannot possibly scale (with a 1:1 threading model, which is what pthreads has). That said, other languages do manage to integrate parallelism into the web response, but it's tricky, and takes a lot of thought and expertise. I've never suggested that you should build software in this way, and would never suggest it, eventually I prohibited the use of pthreads in anything but the CLI in an attempt to force users to be reasonable.

Tangent over: The great thing about being wrong, and acknowledging that you are wrong, is that you have the opportunity to learn, and do better. If we were never wrong, programming would be boring, we would just chug out code for our entire lives, and miss out on the feeling of truly understanding something for the first time, a feeling I love and cherish, and that keeps me at my keyboard.

I've been aware of my mistakes for some time, but pthreads does have some large projects relying on it, and mostly it's development is controlled by other people now. Occasionally I will give advice or commit something, but I can't, in good conscience, tell anyone to use it, it's simply too hard.

While aware, there was nothing driving me to write another API, until recently when Zend made their intention to merge the JIT  they have been working on for years into PHP. Just think for a moment what it would mean to be able to execute machine code in parallel in user land PHP ... this is not a thing I could have ever imagined happening all those years ago, but it would seem a possibility in the not too distant future.

pthreads has to jump through so many hoops to make threads work and provide the API that it does, that it's not reasonable to talk about it being able to execute machine code.

Recently, I set to work on a new threading API, named Parallel, it is not an exact clone of any existing threading API, it is an API focused on being simple and hiding the complexity inherent in utilising parallelism in your application, it is also focused on being forward compatible with the JIT, for that day when we can actually execute machine code in userland and in parallel.

I should mention now that the current implementation of the JIT, which is not finished, doesn't actually have proper support for TSRM, a fact that only became evident in the past few days. However, the conversation some of us had with Dmitry (the author of the JIT) about the lack of support for ZTS in JIT has inspired him to look at ways to make ZTS with and without a JIT much more efficient. So, it's highly likely that this support will come, although I can't say when.

While the JIT is an exciting prospect for parallelism, I'm also excited to be able to provide a really nice, simple API, that any PHP programmer can understand and use. For the next year at the very least, the JIT doesn't exist for most PHP developers, and you can get to work getting to know how to use parallel ...

Parallel is not complete, but is stable: More features are planned, something like Go's channels would be a nice addition, and I've already started to think about and discuss the implementation of this with other internals developers.

I wish you all the best of luck ... and for those people thinking about using pthreads for new code: don't be silly.

Monday, 28 January 2019

Running for Coverage

Today we're going to look at the history and the future of coverage collection in PHP.

History is the easy bit: For most of the history of PHP, Xdebug has provided the only implementation to php-code-coverage. Simple.

Then in 2015, just after phpdbg was merged into PHP, some clever sausages extended the instruction logging facility that I wrote into phpdbg for internals developers in order to provide another implementation to php-code-coverage. To paraphrase a popular book "and they saw that it was good" ...

But was it good !? It was fast, it didn't add any complication to phpdbg and for a lot of people, they didn't notice (or didn't care about) the mistakes phpdbg was making.

That's right, it was merged with mistakes ...

You might think the job of a coverage collector is just to hook into Zend and find out what lines have been executed any way it can, and on the face of it, that's true (and difficult to get wrong, you might think).

But, think more carefully about it, and you'll realise that actually a coverage collector must know what instructions are executable (and important to the user), if for no other reason than Zend will insert an implicit return statement into all functions (or top level code, file) even if you have an explicit one. It does this so that all functions certainly end with return, this is important for boring internal reasons as well as obvious ones.

A graphic example of why it's important to know which instructions are executable is this:

/* 1 */ function foo($bar) {
/* 2 */    if ($bar) {
/* 3 */        return true;
/* 4 */    }
/* 5 */ }
At the end of this function, on line 5, Zend inserts that implicit return, a collector that doesn't know if that return is an executable instruction must ignore it, and so will report inaccurate coverage of the function if the first control path is taken.

Quite early on in the life of Xdebug, Derick developed branch analysis. At the time, it was the only implementation of branch analysis for PHP code, so a very valuable thing. Branch analysis allows Xdebug to determine that the implicit return is important, and so mark it as coverable/executable code to be included in any trace.

In addition the branch analysis in Xdebug is the basis for its support for branch or path coverage, which is in my opinion, and Derick's, the most valuable feature of Xdebug's coverage, but unfortunately unusable in its current state. Although there are plans to improve that, and I may even help. There's no real bias here.

phpdbg has no such analysis, and used a fast, but inaccurate method that results in ignoring all implicit returns, executable or not. This makes the reports phpdbg generates less than accurate, and less than trustworthy ... but it does do it fast ... ish.

So now it's 2015, we have two debuggers, both with support for coverage, one of them obviously superior to the other, and one of them fast, but inferior.

Now I want you to question whether it makes sense for a debugger to have support for code coverage at all; gdb has no such thing, I haven't used any Java since 2015, but I don't remember seeing code coverage collection as a feature of any debugger I ever used for it, quite late on in the game Visual Studio did get support for coverage, but not as an extension of debugging, but a feature of the IDE itself ...

The answer is no, it doesn't really make sense, the two things interfere with each other. A debugger must gain such a degree of control over the execution environment it is debugging that they come with unavoidable overhead, they may even need to change (or have the ability to change) the path that is taken through code which is antithetical to collecting coverage. A coverage collection tool needs to do the opposite of a debugger: Change as little as possible, try not to slow down the execution of code more than is absolutely necessary to do your job - Not a concern for a debugger, nobody really cares if a debugger is slow - it will spend a lot of its time paused, waiting for you to figure out what to do next !

I should say now that Derick flat disagrees me with here, and his reasoning is not wrong: Xdebug started as a debugging aid, the more fully blown features such as step debugging and profiling were added later. So adding coverage to the set of tools wasn't crazy, it makes sense.

But we have two debuggers that support coverage, we're so spoiled, or unlucky ... or foolhardy ... or doomed ...

The fact is that while Xdebug has superior collection to phpdbg, most people disable Xdebug in their CI, and in their development environments for performance reasons. If they run coverage in CI, it's for a small project, or they use phpdbg and put up with the mistakes (or don't know about  them).

When Sebastian Bergmann was conversing with the clever sausages that made phpdbg support coverage collection, he even warned them they were following a doomed path, and suggested it might be better if code coverage was a standalone extension, nevertheless he merged their work and we all moved forward.

Doing something about this situation has been on my todo list since shortly after the driver for phpdbg was merged into php-code-coverage, not very high up on my todo list, but present.

Quite recently I saw a blog post from Sebastian about making Xdebug collect coverage faster, he failed to mention phpdbg in the whole article which was questioned by reddit users and twitterers alike. He didn't mention it because he knows very well what its limitations are, and didn't want to encourage people to use something that makes mistakes. Totally fair.

But, someone on the internet said a thing and my mind became occupied, completely, with fixing this problem. There's no good reason that in 2019, we can't collect accurate coverage, and fast.

I set to work on PCOV, which is a standalone extension that implements the kind of interface that php-code-coverage needs, it does this with as little overhead as possible, as it should. At first, I copied the faulty method of ignoring executable returns from phpdbg, I done this to prototype it as fast as possible and see just what kind of performance we can get. The results were remarkable, the overhead was so very low that it managed to outpace phpdbg on every test suite I ran, by a considerable margin.

Even though flawed, I thought this is worth sharing, so I made it nice and made a readme and opened a pull request to have the PHP part of the driver merged into php-code-coverage. But I didn't stop thinking ...

I then read from a post on Dericks blog that mentioned he was looking for ways to improve the performance of coverage collection in Xdebug. In the blog post is a one liner about preferring correctness over speed.

I absolutely agree with preferring correctness over speed, and I couldn't sleep knowing that I had just introduced a known flaw in brand new code, sure it was faster than phpdbg, but objectively not better at the job of collecting accurate coverage.

It so happens that Xdebug is not the only software in the ecosystem that performs analysis of code, in fact Optimizer, part of PHP for many years now, also performs analysis and is the "source of truth" for what is an executable Zend instruction, since non-executable ones are destined to be removed automatically during one of its many optimization passes.

You will notice that nobody ever complains that Optimizer is slow to analyze code, the reason for this, is that it's not slow at all, it has a very succinct implementation of a control flow graph ... not much of PHP is succinct, but so well abstracted is this feature of Optimizer that you can lift it from opcache and drop it into whatever you like with very little work.

That is what I did next ... So now PCOV and Zend agree absolutely, and always will, about what is executable code.

It may seem presumptious of me to talk about the future of coverage being PCOV, but humbly, I'd like to suggest that it should be, and like you to consider that I'm talking about the distant future, not tomorrow: I think phpdbg and maybe Xdebug should drop that feature altogether and maybe we can team up and add some really cool but usable and fast features to PCOV that php-code-coverage always wanted, such as branch or path coverage, a much superior criteria than line coverage.

I should make clear at this point that Derick is not so keen on that idea currently, and would like to pursue his own path for Xdebug, with plans to refactor and improve upcoming Xdebug releases, possibly containing the many features within Xdebug - so it is able to behave only as a profiler, or a debugger, or a collection tool. Honestly I would be surprised if he wanted to drop anything from such mature and widely deployed software, but let's see where we are in 5 years, perhaps ...

At this moment, you will find it hard to use PCOV in your projects as I'm waiting for Sebastian to review the pull request and make his decision, presumably about the version of php-code-coverage that PCOV will first be included in.

It's Monday morning, and I've got nothing better to do than write a blog post ... When you can use it easily, another post will follow.

That's all for now, enjoy your week :)

Sunday, 27 January 2019

Faking It

Fig 1. A Mockingbird
As well as mentoring and code review one of my main tasks at work is to improve the test suites and improve the testing and development methodologies we use. This is no small task and has resulted in the publication of a few extensions, one of them is uopz.

Before we continue; I work in the real world, where not all code was written yesterday using the best standards and best methods, and it's fine to say "just fix your code", but totally unrealistic, we have to deal with reality.

You may not hear people raving about using uopz, because for most of us, if you are doing the sort of things that uopz makes possible, there is something wrong with the way you are writing code. This is technically true, of course, nevertheless, we have work to do.

When I first started my current job, our test suite was bound tightly to runkit, and failed all the time. When I wrote uopz and we adopted it, it stabilised and developers could once again get on with their work. All the while we would repeat to each other that we would move away from uopz by improving our tests and code. This hasn't really happened, instead, because uopz allows certain crazy things, they are the crazy things that were being done in tests.

It's five years later, and uopz takes up a fair amount of my time with each new PHP version, it's quite a headache, for a temporary solution. Realising that actually this is entirely my fault because I gave them the tools to work like this, I decided we needed new tools.

Some time ago, I wrote an extension called Componere (latin, composer ... I like latin names), the purpose of Componere is to allow the programmer to (re)define complex types at runtime. I showed this to a couple of colleagues, got some "wow, cool", but they never picked it up. I later realised that they didn't pick it up because it relies on the author having at least some knowledge of the way classes are built internally. So even though extremely powerful, it got ignored.

We required a higher level API, and so I've written a mocking framework in PHP, built on top of Componere that is almost without limitation. It's name is "mimus" which is taken from Latin, "mimic" and shared with the first part of the Latin binomial name for the animals everyone knows as Mockingbirds.

I am fully aware that the ecosystem contains within it many mocking frameworks, however they all have the same set of problems, limitations on what methods you can mock or how. This may be good enough for the vast majority of people, you can just write your code so that you can mock the parts you need at test time. However, if you have 3M LOC, it's not so simple; We need to do some of the things that can't be supported properly if you write the whole framework in PHP, such as stubbing privates, statics, finals. We also have 14 tonnes (number pulled from the air) of tests split across many suites and projects, this makes invoking PHP parsing in PHP several tens of thousands of times unrealistic.  Wipe that cringe off your face, real world, remember ...

While you don't hear people raving about uopz, I happen to know it's used in some very large scale deployments of PHP: I know that there's a number of people out there doing exactly the same sort of horrible things in tests that we were doing, and that mimus is freeing us from, slowly.

We're a month into the switch from uopz and hacking the engine apart to a more modern, more sensible world. The developers are really enjoying themselves using mimus too, which is a bonus, and probably born of the fact that they don't have to feel quite so "dirty" when writing tests.

I'm not going to repeat the readme for mimus here, or show any code, because it's Sunday and you probably don't want that. Tomorrow morning, before you write another test that uses uopz, check out mimus ...

Enjoy the rest of your Sunday ...

Boxes of Sand

Fig 1. A Sandbox

Sandboxing is a technique used in testing and security to execute unsafe, or untrusted code in a safe environment. There are different levels of sandboxing: In security a sandboxed environment may refer to a (virtual) machine dedicated to the execution of unsafe code. In testing, a sandbox may refer to a thread or process dedicated to the same purpose.

For most of the history of PHP, sandboxing has been provided by Runkit, and the first thing I'd like to do is clear up some confusion that is present in the manual:
Instantiating the Runkit_Sandbox class creates a new thread with its own scope and program stack.
This has never actually been true. Runkit never created a thread for the sandbox: What it created was a PHP thread context provided by a much misunderstood layer of software called TSRM. It is TSRM that provides the "share nothing" architecture that PHP requires when running in a threaded SAPI, such as Apache on windows. Builds of PHP that use TSRM are colloquially known as ZTS - Zend Thread Safe(ty).

Runkit "switched" between the parent and the sandbox context to execute code as if it were in another thread, but there was never actually another thread, so this extremely old documentation is very misleading. For the sake of nostalgia, I don't intend to fix the documentation, it has been that way for 13 years, and the feature is all but dead, so it's not affecting anyone anymore.

There is a version of runkit available for PHP7, however, sandboxing was removed because it looks like the new maintainer couldn't figure out how to make it work, maybe mislead by current documentation.

The only people who care about TSRM/ZTS in PHP are the windows people, and myself. The windows people need it for Apache, and I need it for pthreads, what's more I recognise the value in the abilities that TSRM provides. So when PHP 7 quietly improved the performance of ZTS with improvements to TSRM, nobody really noticed, I've even heard people say TSRM is going to be removed. It isn't.

Those improvements broke the ability to abuse TSRM as Runkit did, for many boring reasons, runkits sandbox cannot work as it did before, it's not possible.

I've bumped up against runkit before; uopz exists because 5 years ago (roughly) I started a new job and was presented with test suites that would crash all the time because runkit was doing bad things. This is the reason I didn't and won't work on runkit, but do think a sandbox is a useful thing.

So I wrote Sandbox, it's a PHP 7.1+ extension (because nobody is using 7.0, right !?) that requires a ZTS build of PHP and creates real sandboxes. A Sandbox is truly a separate thread, isolation is provided by TSRM as it always has been, this means a Sandbox thread is as isolated from the thread that created it as are two PHP threads inside apache, almost complete isolation.

While the Runkit implementation provided a lot of methods to affect the context it created, the new implementation provides just an entry point into the sandbox thread:

$sandbox = new \sandbox\Runtime();

    /* I will execute in the sandbox thread */

This doesn't give you multi-threading, the sandbox and parent threads are synchronized so that no user code executes in parallel, this is the safest (only) thing to do for a sandbox.

The code executed in the sandbox may do anything up to but excluding making PHP segfault, and not affect the thread (process) that created it. It may exhaust memory limits and time constraints and not only will the parent thread remain stable, so will the child (you may re-enter after such failures).

By passing an array of INI options to the constructor of the runtime you can gain great control over the sandbox.

I'm afraid I haven't created a PHP manual for Sandbox, so that job is up for grabs ... I'm quite happy to leave it quietly in a corner and have the kind of people who read my blog know about it, but not everyone who reads the manunal, although I won't object to anyone commiting manual entries.

Thanks to my awesome QA/RM team, Anatol and Remi, who I adopted from PHP for my own purposes; Sandbox is available on PECL, windows builds are available, and it's available from Remi's repositories.

So, now we have a sandbox for PHP7.1+ ... have fun using it ...

Sunday, 3 June 2018

Preface to idbg

Fig 1. A tweet from earlier this month
We already have several options for debugging code within the PHP ecosystem. XDebug is extremely mature software, and phpdbg has been slowly gaining traction also, if for no other reason than it's very fast to collect code coverage compared to XDebug.

Although phpdbg and XDebug are different from one another, they have some things in common: They are both complicated (to varying degrees), and they are both written in a language that 97% of the people reading this text do not understand (number pulled from air, based on nothing at all).

Why do we need a debugger at all?

Slightly tangential perhaps ... Debugging is a necessary part of writing code; If you disagree with this statement, then I don't know what you are talking about.

If debugging to you means sprinkling code with debugging statements like var_dump, or print_r, then I implore you to learn how to use a debugger; You are wasting a lot of time. I don't say this from a position of arrogance because I happen to be one of the authors of phpdbg, but from a position of experience; I remember trying to write code before I had a good handle on using a debugger. 

Sprinkling code with debugging statements is like crouching on a stool in the corner of a cockroach infested room and hoping that blowing upon the blanketed floor will destroy and eliminate the little beasts creating the blanket.

Using a debugger is like having an army of nano bots at your disposal, each one trained exquisitely in a top nano-bot-training-camp, they live to kill cockroaches, some of them also have mean looking tattoos, chew tobacco, and spit on the ground at the start of every sentence ...

I think we understand each other ...

Why do we need a debugger written in PHP ?

Here are some statistics (from github api):
  • XDebug has had 50 contributors in the 7 years it has been on github
  • phpdbg has had a handful of contributors (20-30) in the 4 and a half years it has been on github
  • PHPUnit has had 342 contributors in the 8 and a half years it has been on github
  • phpstan has had 70 contributors in the 2 and a half years it has been on github
XDebug predates it's github repository (by a very wide margin), still it doesn't matter for the point I'm trying to make here: In the PHP ecosystem, we have very many very talented programmers, with a whole host of knowledge about how the PHP engine works - they may have been using it for their entire professional career - they are able to write and contribute to arguably comparatively complicated software like PHPUnit or phpstan. Alas we have vanishingly few programmers in the ecosystem that are able to improve, fix, or develop in any way software like phpdbg or XDebug, and I think it's mostly because of the language they are written in.

You might also just like to scan the number of contributors to projects like Laravel and Symfony ... although I think these numbers less relevant, they are surprisingly high.

It's not all about the language, the domain specific knowledge required to implement a debugging engine might not be so disseminated. But maybe it doesn't need to be ...

You may not find these arguments convincing, you may not be convinced that we need another debugger written in any language, after all XDebug is extremely mature, and using phpdbg makes you at least 20% cooler (in the same way as go faster stripes make any vehicle 20% faster). That's a perfectly rationale position to take, and I can't think of another way to argue my case, you can probably stop reading ;)

Domain Specifics

I don't know how obvious it is that it's not reasonable to talk about implementing a debugger entirely in PHP; The kind of control you need over the engine just isn't reasonably attained in userland by default.

The debugger itself, the thing that interacts with a person or an ide can be written in PHP alone, and is much easier to write in PHP. But the core of the "debugging engine" (terminology borrowed from dbgp specification) should be written in C.

krakjoe/inspector is a disasembler and debugging kit for PHP, it exposes the necessary API for the development of a debugger in userland. It is an advanced extension of the existing Reflection API, giving it a shallow learning curve for anyone already familiar with Reflection.

While it's annoying that we still must have a binary dependency, I'm hoping that inspector becomes a defacto part of php installations in the not too distant future. Although I have no intention of making an RFC to merge inspector into core - it belongs outside of core, the release cycle in core does not lend itself to new software and there is nothing to be gained by merging. Being a defacto part of installations doesn't necessarily mean merged into core.

Code or STFU

Fig 1. idbg help
This isn't just pipe dreams, the PHP code exists, it's alpha quality and largely untested ...

There is much to do and you shouldn't design your workflow around this (or any alpha quality software) yet.

What you should do is start reading code, testing, and opening pull requests ... consider me waiting ...

Tuesday, 22 May 2018

PHP allows for the design of X

Fig 1. A thing I said
Starting complicated twitter conversations should be avoided, I know this, and yet blurted this out on twitter recently ...

This was met with a flurry of responses and I couldn't reasonably reply in tweet form. I'm going to respond to some of those tweets (indirectly) and further explain my original tweet.

PHP is not always the right tool for the job

First and foremost, I was misunderstood by some people; They thought I was saying you should use PHP for everything. Obviously, that would be an untenable position, which I do not assume.

Give a task to a polyglot and they won't spend time enumerating for possible exclusion all of the languages they know. It doesn't work like that, you don't start thinking about the most unsuitable language for a task and somehow work your way to a suitable language. Choosing a suitable language is a thing you want to call an instinct, but it isn't an instinct, it's guided by an understanding of the task, and prior knowledge of domain support among your chosen poisons ... I mean languages.

There are totally legitimate reasons to choose other languages over PHP, even in the domain where PHP excels - on the web. But it doesn't have very much to do with PHP, and has everything to do with the chosen language and the task. You likely weren't thinking about PHP when you made the decision.

PHP is not a templating language

Whenever someone says "but PHP was designed as a templating language", I almost want to cry.

Who actually cares what PHP was in the year 1997, the number of lines of code from that software left in PHP is minuscule, if there are any present at all.

In the year 2018, we don't even care what PHP 5 was, we don't care about it's shortcomings, because we should not and do not use it.

Today, when we talk about PHP, we are talking about PHP 7 ... here are some actual facts (the things you can't have your own opinions on):
  • PHP 7 is fast
  • PHP 7 is a general purpose scripting language
I'm a C programmer, I spend most of my time writing C, and spend some time at levels below that playing with machine code, JIT compilers and so on. When I say PHP 7 is fast, I mean to say that as a C programmer, it's difficult to write code (of equivalent complexity) as efficient as the code Zend generates in the vast majority of cases. It's also as near as makes no difference impossible to JIT Zend opcodes into machine code and have them be more efficient, the facts of the matter are that the assembly that is generated when Zend is compiled is as efficient as any assembly you could hand craft inline, or generate just in time (that's not a guess, I've personally tried  both hand crafting inline asm and JIT compilation of ze3 opcodes).

Obviously PHP is stuck with one data structure, but it's not just a dumb HashTable in PHP 7 anymore, it's smart and will perform optimally most of the time. The structure of a HashTable and the shortcomings of those structures are less important when our applications are heavily object orientated. Reading/writing/interacting with properties on PHP 7 objects is almost entirely unaffected by those things; Given a warm runtime cache, reading a property from an object consists a relative load (a very simple machine code instruction), there is no ht lookups involved. This is also true of HashTables in some cases (they can behave like C arrays).

PHP allows for the design of X

When a project like AMP shows up on the scene, you can't say "PHP wasn't designed for asynchronous execution", it's a nonsensical statement since PHP is a general purpose language, given that support has emerged in this new domain, as a matter of fact PHP does support it, and not accidentally.

Is the support for this new domain as mature as another language in your arsenal ? I don't know what languages you have, whatever this is not a reason to assume that the people working on AMP, or X, are wasting their time because of some imaginary (and it is mostly imaginary) shortcoming in PHP.

Reddit was recently discussing a GUI extension I wrote; It's very frustrating to hear people who don't really know what PHP is capable of decrying it as a waste of time, or no better than software from 10 years ago.

Almost every extension I write gets the same sort of response: "Just because you can, doesn't mean you should". What they are communicating is "It doesn't matter if you can, you shouldn't", which is somewhere between silly, and harmful.

It really does matter if you can.

When support emerges for a new problem domain, let's be pragmatic and observe that expanding the horizons of PHP in any direction is good for the community that relies on PHP (and maybe PHP alone) to make a living. Let's not rush to take new solutions to production tomorrow, but let's not dismiss anything out of hand because of some imaginary short coming in PHP. 

Ideally, let's find time to learn about the new solution to see if it's useful to us, perhaps try to use it in our prototypes and drafts, and in so doing improve it.

PHP 7, as a matter of fact, is internally designed to bend to the programmers will ...

When support emerges for a new domain, take that as proof that PHP allowed for the design of X.

Monday, 16 April 2018

An Introduction to CQL

Recently I have been working on a CommonMark extension for PHP7. It is based on the reference implementation in C, linking to it rather than re-implementing the spec.

The reference implementation in C is extremely fast, and so the extension has a focus on performance, trying to create PHP objects only when necessary, among other (boring) optimisations.

In C the iterators provided by the reference implementation are extremely fast; It simply doesn't matter that you might have to accept every node in a document when you're working in C.  In a dynamic language like PHP it really does matter, even if the objects representing the nodes are short lived. Again when you access a parent or child node in C, you are just doing pointer arithmetic (hidden behind function calls), it's all simple stuff. When it comes to a dynamic language there is all kinds of baggage attached to the object (and even the read operation itself), additional allocations and other such instructions must be executed before the C pointer can be passed into user land.

While the iterators from the reference implementation are fast, they are not smart - they don't really need to be, as explained. When it comes to inspecting a document (before conversion for example, or for editing), the kind of code you need to write in any language consists of complicated nested loops and or recursive calls, it's long and complicated, and difficult to get right.

Introducing CQL

CQL - CommonMark Query Language is a feature that has been developed alongside the CommonMark PHP extension, which solves some of the problems of iterating through a tree structure in a dynamic language by allowing the user to express as a string how to travel through the document and which nodes to return.

CQL consists of a lexer and parser, a compiler for a small set of instructions, and a virtual machine for executing the instructions.


For the real geeks, they can just look at the context free grammar, for the rest of us, a query describes a path through a document:


The above query will return the children of the first child node of a tree.

firstChild, lastChild, parent, next, previous, and children are all accepted paths.

children can accept sub queries (but cannot have other paths following it, because think about it ...):
/firstChild/children[ /children ]
The above query will return the children of the children of the first child node of a tree.

children can also accept a constraint:


The above query will return the children of the first child node of a tree that are BlockQuote objects.

Constraints may be or'd together:


The above query will return the children of the first child node of a tree that are BlockQuote or Paragraph objects.

Subqueries with constraints can also have subqueries:

/firstChild/children(BlockQuote)[ /children(Paragraph) ]

The above query will return Paragraphs that are children of BlockQuotes that are children of the first child node of a tree.

Constraints and sub queries may be nested ad-absurdum to describe a path to take through the tree. The form of the queries I have used here is for readability only, whitespace is ignored, and content after # is ignored.


Having lexed and parsed your query into an abstract syntax tree, CQL compiles the AST into discrete instructions for travelling through the tree. We're going to skip over a description of that AST because it's throw away and boring. Let's have a quick look at the result of compiling the AST, the instructions:

Each instruction has an input value (IV), and an output value (RV) or JMP target (#T), in addition it has an extended value (int) for storing constraints, and probably other things in future.

We'll start simple, with /firstChild/lastChild, which compiles to:

For simplicity, you can consider the numbers in IV and RV columns variables, the first instruction FCN sets 1 to the first child node of 0, the second LCN sets 2 to the last child node of 1, and the third instruction ENT dispatches a call to the caller of the function with the address of the node at 2.

Remembering that these "variables" are just addresses, no zvals, no php vars, all very low level stuff.

It gets a little more complicated when it comes to children, /firstChild/children compiles to the following instructions:

The first instruction FCN sets 1 to the first child node of 0, the second instruction sets the first child node of 1 to 2, the third ENT dispatches the enter call. The next instruction NEN sets 3 to the node next to 2 in the tree, the next SET instruction sets 2 to 3, and the next JMP jumps to ENT if 3 is positive, creating a loop until all the children are consumed.

The textual description of a query like:
/children(List)[ /children(Item)[ /children(Paragraph) ] ]
would be extremely boring, but here's what that query looks like:
The only new instruction is CON, which will skip nodes that do not match the constraint given.

The virtual machine that executes the instructions looks like:

Making execution of the query extremely efficient, much more so than you would be able to write in PHP.


Proper documentation for the PHP API will become available in the manual soon, here's a quick description for those that want to get started:

The CommonMark extension declares \CommonMark\CQL:
class \CommonMark\CQL {
    public function __construct(string $query);
    public function __invoke(\CommonMark\Node $node, callable $enter);

The callable provided $enter should have the prototype:
function (\CommonMark\Node $root, \CommonMark\Node $node)
and will be invoked by CQL on ENT instructions.

Get Involved or Wait :)

I am not finished writing tests for CQL yet, so it currently lives in a feature branch. It will be included in the next release of the extension, probably in the next couple of weeks.

If you feel like being helpful, you could come and submit a PR for tests ...

Peace out phomies ...