Monday, 16 April 2018

An Introduction to CQL

Recently I have been working on a CommonMark extension for PHP7. It is based on the reference implementation in C, linking to it rather than re-implementing the spec.

The reference implementation in C is extremely fast, and so the extension has a focus on performance, trying to create PHP objects only when necessary, among other (boring) optimisations.

In C the iterators provided by the reference implementation are extremely fast; it simply doesn't matter that you might have to visit every node in a document when you're working in C. In a dynamic language like PHP it really does matter, even if the objects representing the nodes are short-lived. Again, when you access a parent or child node in C, you are just doing pointer arithmetic (hidden behind function calls); it's all simple stuff. In a dynamic language there are all kinds of baggage attached to the object (and even to the read operation itself): additional allocations and other such instructions must be executed before the C pointer can be passed into userland.

While the iterators from the reference implementation are fast, they are not smart; they don't really need to be, as explained. When it comes to inspecting a document (before conversion, for example, or for editing), the kind of code you need to write in any language consists of complicated nested loops and/or recursive calls; it's long, and difficult to get right.

Introducing CQL

CQL (CommonMark Query Language) is a feature that has been developed alongside the CommonMark PHP extension. It solves some of the problems of iterating through a tree structure in a dynamic language by allowing the user to express, as a string, how to travel through the document and which nodes to return.

CQL consists of a lexer and parser, a compiler for a small set of instructions, and a virtual machine for executing the instructions.


The real geeks can just look at the context free grammar; for the rest of us, a query describes a path through a document:

/firstChild/children

The above query will return the children of the first child node of a tree.

firstChild, lastChild, parent, next, previous, and children are all accepted paths.

children can accept sub queries (but cannot have other paths following it, because think about it ...):
/firstChild/children[ /children ]
The above query will return the children of the children of the first child node of a tree.

children can also accept a constraint:

/firstChild/children(BlockQuote)

The above query will return the children of the first child node of a tree that are BlockQuote objects.

Constraints may be or'd together:


The above query will return the children of the first child node of a tree that are BlockQuote or Paragraph objects.

Subqueries with constraints can also have subqueries:

/firstChild/children(BlockQuote)[ /children(Paragraph) ]

The above query will return Paragraphs that are children of BlockQuotes that are children of the first child node of a tree.
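To make the comparison with hand-written traversal concrete, here is a rough sketch of the nested loops that the query above replaces. It runs on a toy tree of nested arrays, not on the extension's node objects; toy_node() and paragraphs_in_blockquotes() are hypothetical names invented for this illustration.

```php
<?php
// Toy nodes are nested arrays: ['type' => string, 'text' => string, 'children' => array].
// This is not the extension's node API, only a stand-in for illustration.
function toy_node(string $type, array $children = [], string $text = ''): array {
    return ['type' => $type, 'text' => $text, 'children' => $children];
}

// Hand-written equivalent of /firstChild/children(BlockQuote)[ /children(Paragraph) ].
function paragraphs_in_blockquotes(array $root): array {
    $found = [];
    $first = $root['children'][0] ?? null;
    if ($first === null) {
        return $found;
    }
    foreach ($first['children'] as $child) {
        if ($child['type'] !== 'BlockQuote') {
            continue; // constraint: only BlockQuote children are followed
        }
        foreach ($child['children'] as $grandchild) {
            if ($grandchild['type'] === 'Paragraph') {
                $found[] = $grandchild['text'];
            }
        }
    }
    return $found;
}

$doc = toy_node('Document', [
    toy_node('BlockQuote', [
        toy_node('BlockQuote', [toy_node('Paragraph', [], 'p1'), toy_node('List')]),
        toy_node('Paragraph', [], 'p2'),
        toy_node('BlockQuote', [toy_node('Paragraph', [], 'p3')]),
    ]),
    toy_node('Paragraph', [], 'p4'),
]);

$texts = paragraphs_in_blockquotes($doc);
```

Every extra level of nesting in the query adds another loop and another type check here, which is exactly the kind of code that gets long and hard to get right.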

Constraints and sub queries may be nested ad absurdum to describe a path to take through the tree. The form of the queries I have used here is for readability only: whitespace is ignored, and content after # is ignored.


Having lexed and parsed your query into an abstract syntax tree, CQL compiles the AST into discrete instructions for travelling through the tree. We're going to skip over a description of that AST because it's throw away and boring. Let's have a quick look at the result of compiling the AST, the instructions:

Each instruction has an input value (IV), and an output value (RV) or JMP target (#T). In addition, it has an extended value (int) for storing constraints, and probably other things in future.

We'll start simple, with /firstChild/lastChild, which compiles to:

For simplicity, you can consider the numbers in the IV and RV columns to be variables: the first instruction, FCN, sets 1 to the first child node of 0; the second, LCN, sets 2 to the last child node of 1; and the third, ENT, dispatches a call to the caller of the function with the address of the node at 2.

Remember that these "variables" are just addresses: no zvals, no PHP vars, all very low level stuff.

It gets a little more complicated when it comes to children, /firstChild/children compiles to the following instructions:

The first instruction, FCN, sets 1 to the first child node of 0; the second sets 2 to the first child node of 1; the third, ENT, dispatches the enter call. The next instruction, NEN, sets 3 to the node next to 2 in the tree, the SET instruction then sets 2 to 3, and the JMP jumps back to ENT if 3 is positive, creating a loop until all the children are consumed.
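The loop described above can be imitated in a few lines of PHP. This is only a toy model: the real virtual machine is written in C and works on raw addresses, and the instruction layout used here ([opcode, input register, output register or jump target]) is an assumption made for the sketch. ToyNode and toy_run() are hypothetical names.

```php
<?php
// A toy tree node; the real extension works on C pointers, not PHP objects.
final class ToyNode {
    public $name;
    public $firstChild = null;
    public $lastChild = null;
    public $next = null;

    public function __construct(string $name) {
        $this->name = $name;
    }

    public function append(ToyNode $child): ToyNode {
        if ($this->lastChild === null) {
            $this->firstChild = $child;
        } else {
            $this->lastChild->next = $child;
        }
        $this->lastChild = $child;
        return $this;
    }
}

// Executes a list of [opcode, input register, output register or jump target].
function toy_run(array $program, ToyNode $root, callable $enter): void {
    $reg = [0 => $root];
    $pc = 0;
    while ($pc < count($program)) {
        list($op, $iv, $rv) = $program[$pc];
        switch ($op) {
            case 'FCN': $reg[$rv] = $reg[$iv] ? $reg[$iv]->firstChild : null; break;
            case 'NEN': $reg[$rv] = $reg[$iv] ? $reg[$iv]->next : null; break;
            case 'SET': $reg[$rv] = $reg[$iv]; break;
            case 'ENT': if ($reg[$iv] !== null) { $enter($root, $reg[$iv]); } break;
            case 'JMP': if ($reg[$iv] !== null) { $pc = $rv; continue 2; } break;
        }
        $pc++;
    }
}

// /firstChild/children, following the description above.
$program = [
    ['FCN', 0, 1],    // 1 = first child of 0
    ['FCN', 1, 2],    // 2 = first child of 1
    ['ENT', 2, null], // hand the node at 2 to the caller
    ['NEN', 2, 3],    // 3 = node next to 2
    ['SET', 3, 2],    // 2 = 3
    ['JMP', 3, 2],    // back to ENT while 3 holds a node
];

$root = (new ToyNode('doc'))
    ->append((new ToyNode('A'))
        ->append(new ToyNode('a1'))
        ->append(new ToyNode('a2'))
        ->append(new ToyNode('a3')))
    ->append(new ToyNode('B'));

$seen = [];
toy_run($program, $root, function (ToyNode $root, ToyNode $node) use (&$seen) {
    $seen[] = $node->name;
});
```

The callback receives the root and the matched node, mirroring the prototype of the enter callable in the PHP API.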

The textual description of a query like:
/children(List)[ /children(Item)[ /children(Paragraph) ] ]
would be extremely boring, but here's what that query looks like:
The only new instruction is CON, which will skip nodes that do not match the constraint given.

The virtual machine that executes the instructions looks like:

This makes execution of the query extremely efficient, much more so than anything you would be able to write in PHP.


Proper documentation for the PHP API will become available in the manual soon; here's a quick description for those who want to get started:

The CommonMark extension declares \CommonMark\CQL:
class \CommonMark\CQL {
    public function __construct(string $query);
    public function __invoke(\CommonMark\Node $node, callable $enter);
}

The provided callable $enter should have the prototype:
function (\CommonMark\Node $root, \CommonMark\Node $node)
and will be invoked by CQL on ENT instructions.

Get Involved or Wait :)

I am not finished writing tests for CQL yet, so it currently lives in a feature branch. It will be included in the next release of the extension, probably in the next couple of weeks.

If you feel like being helpful, you could come and submit a PR for tests ...

Peace out phomies ...

Tuesday, 16 January 2018

Sensible Targets

Fig 1. Current release cycle graph
There has been a lot of talk recently about which versions of PHP you should support for your new projects or packages.

As a release manager for PHP, someone who watches the way releases evolve extremely closely and has some sway over what gets fixed and what doesn't, and as someone who helped to draft the security classification document for PHP, I feel I have some useful things to say on this subject, so here goes ...

The Release Cycle

Every release of PHP goes through the following cycle:

Pre Release:
  • 3 Alpha, two weeks apart
  • 3 Beta, two weeks apart
  • 6 RC, two weeks apart
  • GA, roughly 6 months after the first alpha
Actively Supported:
  • For two years the PHP team make a patch release every month with bug and security fixes.
Security Only:
  • For the final year of the three year cycle, the PHP team will make a patch release when security fixes warrant a release.
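The arithmetic of the cycle above is simple enough to sketch: two years of active support from GA, then one year of security fixes only. The function name support_phase() and the example GA date are assumptions made for illustration; extensions to the cycle (such as the one granted to 5.6) are not modelled.

```php
<?php
// Classifies a date against the three year cycle described above.
function support_phase(DateTimeImmutable $ga, DateTimeImmutable $today): string {
    $activeUntil = $ga->modify('+2 years');   // end of active support
    $securityUntil = $ga->modify('+3 years'); // end of security only support

    if ($today < $ga) {
        return 'pre-release';
    }
    if ($today < $activeUntil) {
        return 'actively supported';
    }
    if ($today < $securityUntil) {
        return 'security only';
    }
    return 'end of life';
}

$utc = new DateTimeZone('UTC');
$ga = new DateTimeImmutable('2017-11-30', $utc); // example GA date

$phaseNow = support_phase($ga, new DateTimeImmutable('2018-01-16', $utc));
$phaseLater = support_phase($ga, new DateTimeImmutable('2020-01-01', $utc));
$phaseLast = support_phase($ga, new DateTimeImmutable('2021-01-01', $utc));
```

A new project should only ever target a version for which this returns "actively supported".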

What isn't a security issue ?

We have various definitions to classify the threat level posed by any security issue, but importantly for the community at large, we also have a definition of what is not a security issue:
  • requires invocation of specific code, which may be valid but is obviously malicious
  • requires invocation of functions with specific arguments, which may be valid but are obviously malicious
  • requires specific actions to be performed on the server, which are not commonly performed, or are not commonly permissible for the user (uid) executing PHP
  • requires privileges superior to that of the user (uid) executing PHP
  • requires the use of debugging facilities - ex. xdebug, var_dump
  • requires the use of settings not recommended for production - ex. error reporting to output
  • requires the use of non-standard environment variables - ex. USE_ZEND_ALLOC
  • requires the use of non-standard builds - ex. obscure embedded platform, not commonly used compiler
  • requires the use of code or settings known to be insecure
Any issue that falls under any of the above categories, even though it may have security implications for you personally, is not treated as a security issue. It may be fixed as a normal bug, but that fix will not be included in a security fix only release.

In addition, any security issue classified as having a low threat level will not necessarily be included in a security fix only release. The lowest level of threat is defined thus:
This issue allows theoretical compromise of security, but practical attack is usually impossible or extremely hard due to common practices or limitations that are virtually always present or imposed.
This also includes problems with configuration, documentation, and other non-code parts of the PHP project that may mislead users, or cause them to make their system, or their code less secure.
Issues that can trigger unauthorised actions that do not seem to be useful for any practical attack can also be categorised as low severity.
Security issues, that are present only in unstable branches, belong to this category, too. Any branch that has no stable release, is per se not intended for the production use.
Aside from those bugs which may or may not be a security issue, there is a variety of bugs that are definitely not security issues, but may cause your project serious harm, or present serious problems for your package. No fix is forthcoming for these, the vast majority of bugs, while the release is in the security fix only cycle.

What should I target ?

Targeting a security fix only release of PHP for new projects doesn't make any sense: When a release is in security fix only cycle you should be concentrating on getting old projects upgraded, and a year is plenty of time to do that. In the case of PHP 5.6, we extended the security cycle to two years. If you are reading this and thinking a year isn't long enough to do that, then there is something wrong with the way you deploy or support projects or packages: It has to be long enough, running a version of PHP without active support is dangerous for your business, reputation, and soul.

New projects or packages should obviously target an actively developed version of PHP. At the time of writing both 7.1 and 7.2 are being actively developed (there will always be two versions in active development). Whether you choose to use 7.1 or 7.2 depends on your project or package, perhaps you'd like to use some new features, and not have to worry about security cycle for nearly two years and so can reasonably target 7.2. Perhaps you have reasons to stick with 7.1, and are prepared to deal with security cycle before the year is out.

The bottom line is this: New projects or packages must target an actively developed, fully supported version of PHP.

Wednesday, 8 November 2017

Test Etiquette

Fig 1: A brigade of woobles, apparently
Today, we're going to talk about testfest. In case you have no idea what that is, here is an excerpt from the website:
Have you ever wanted to contribute to PHP but have been afraid that your C skills aren’t up for the challenge? Well, have no fear! If you know PHP, you can contribute by writing tests. Through your local user group, PHP TestFest will show you how.
I've long been trying to propagate the view that you don't need to be a master of anything (including C) to be a valuable contributor to PHP. While the language is written in C, absolutely everything else - tests, websites, admin systems, documentation - is written in languages we all understand: PHP, XML, and JavaScript.

Test Fest is focused on turning you into a contributor to php-src by educating you on how to write tests for PHP, but there is lots for everyone to do.

The idea is that you attend local meetups and are guided in the practicalities of writing tests for PHP, with the goal to improve the quality of the test suite that is part of continuous integration and quality control.

I've reviewed nearly all of the pull requests made so far, and approved the vast majority of them. Some of them I'm not able to approve, for a few reasons. First, I want to look at those reasons in the hope that user group organizers and participants will see this information and so reduce the amount of rejection.

Test Coverage

Coverage shouldn't be the primary driver for adding new tests, quality should.

Nevertheless, coverage is a good indicator that you are not duplicating tests - a thing we obviously want to avoid.

There exists a web interface for viewing coverage information for PHP tests at

For anyone organizing a meetup, gcov needs to be front and centre: With such a vast number of tests, and already a reasonably high level of duplication we must make sure that coverage is not further duplicated.

For anyone writing tests: It would be helpful, although not necessary, if your pull request linked to gcov (you can only link to whole files, not individual lines) and included the details of the lines (or functions) you are trying to cover. This makes review that much easier, and so probably quicker.

Testing ZPP

ZPP (zend_parse_parameters) is, in some form, called upon entry to nearly every internal function in PHP. It accepts the number of arguments that were passed, and a specification string describing what the function expects to find on the argument stack. By convention, if the stack does not contain the expected arguments, the function or method should return immediately, leaving the return value set to null (or undefined).

Because this is such a widely used function, it has been tested roughly one squillion times over, and there's not much sense in testing convention.

For anyone organizing a meetup: Please make clear that we don't want to test ZPP.

For anyone writing tests: If you are writing a test and are EXPECT'ing something that matches:
expects parameter %d to be %s, %s given
Then you are testing ZPP, unnecessarily, and while coverage may report an increase, it's improved quality we are looking for.
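A quick way to spot such a test before opening a pull request is to check the expected output against that pattern. The sketch below is an assumption about how you might do this locally, not part of the PHP test tooling; looks_like_zpp_test() is a hypothetical helper, and it only understands the %d and %s placeholders.

```php
<?php
// Returns true when the expected output of a test matches the ZPP warning pattern.
function looks_like_zpp_test(string $expected): bool {
    $pattern = 'expects parameter %d to be %s, %s given';

    // Escape the literal text, then turn the placeholders into regex fragments.
    $regex = preg_quote($pattern, '/');
    $regex = str_replace(['%d', '%s'], ['\d+', '.+?'], $regex);

    return preg_match('/' . $regex . '/', $expected) === 1;
}

$zpp = looks_like_zpp_test('Warning: strlen() expects parameter 1 to be string, array given in %s on line %d');
$useful = looks_like_zpp_test('int(4)');
```

If the check comes back true, the test is probably only exercising convention and is unlikely to be approved.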

Clear Intentions

This may seem more subjective and harder to get right, but it's no different to writing normal code: the intention behind any code should be clear. Just as normal code may need to be refactored if you find yourself writing long comments, a test probably needs to be refactored if you find yourself writing complex and cryptic descriptions.

Lastly I want to encourage anyone and everyone to find a local meetup that is participating in testfest and attend, even if it's your first time: All of us depend on the tests that PHP is distributed with, and all of us are qualified to improve the quality of those tests.

Imagery is © 2017 PHP Community Foundation.

Wednesday, 2 November 2016

Expanding Horizons

Fig 1. My view of the horizon this morning.

Recently I have been working on a new extension. It is a wrapper around libui, which is a cross platform user interface development library, that allows the creation of native look and feel interfaces in the environments it supports.

The gravitas of this may not hit you in the face, until you see something like this:

That's a few hundred lines of PHP 7 code, moulded into an imitation of the snake game we all used to have on our phones.

We've seen other user interface extensions before in PHP, there's even a modified PHP runtime that will allow you to write GTK+ applications.

I don't know anyone that ever deployed any of those extensions, and for very good reasons: PHP5 can barely do anything without allocating a bunch of memory and doing a bunch of other extremely inefficient things; almost everything it does is inefficient. Beyond a basic forms-like application, PHP5 is close to useless.

The snake game runs on my machine using a tiny amount of CPU, and a tiny amount of RAM, while more complex things can take up more resources, they can also achieve very decent frame rates.

PHP 7 is not just fast, it's efficient; it's not unreasonable to expect it to achieve 60 frames per second.

(Note: download the video above from github, streaming it in a browser may not work).

libui is in its infancy, and is not complete; however, there is nothing that is out of bounds for us in PHP once libui supports it.

The UI extension is documented in the PHP manual, and is already on v2. It will move forward about as quickly as libui.

I'm not a game or UI designer, I don't like to do anything you look at normally.

I wrote this extension to expand our horizons, so that you can create things I cannot imagine ... enjoy it ...

Monday, 26 September 2016

me et mentis morbum

Fig 1. What I aspire to be ...

This morning, I want to talk about mental illness ... and apologize.

The preceding sentence was rewritten 42 times, and contains the words from the first revision.

In some sense, it is difficult to use the words "mental illness" when we are talking about ourselves.

There is a kind of stigma attached to even thinking the words. Before you can think the words, you must first have the revelation that there is something wrong with the way you behave and or feel.

Some or all of the following things happen when I have to mix with other people in real life:
  • I will try to avoid it, completely.
  • I can't remember where I am, or where I am going, and appear and feel lost.
  • I struggle to recognize people, or remember names.
  • I don't move among humans like other humans do; I bump into them in closed spaces (for example, supermarkets) because I can't read their behaviour.
  • I become outwardly awkward, developing tics (for example blinking), stuttering and repeating myself.
  • I become inwardly slow, which results in not being able to join in conversations properly:
    • Struggle to figure out if I agree or disagree with any statement.
    • Wonder if it is worth saying what I want to say.
    • Wonder if I might be stupid.
    • Fail to form a response before the conversation moved on.
    • Malform a response, with stuttering and repetition abound.
If I become conscious of any of the behaviours above, my physiology changes; I begin to radiate heat and sweat profusely. This always results in my ending it early and just leaving, often without announcing my departure.

There are a few people who make me feel somewhat at ease, just by their presence. People who are physically large (I am rather small), and attentive (go out of their way to patiently interact with me). I can't really explain this, but they seemingly have some immunity.

To compound the problems I have with social or stranger interactions, and possibly even as a result of it; I cannot remember, in any detail, for any appreciable amount of time, what has been said. With the occasional exception where the interaction resulted in a changed or new understanding of something I am interested in, I will normally have forgotten, everything, including the detail of where I was when anything happened, and usually how I got there (having driven, or ridden a motorcycle to the location), by the next day. I retain new technical information, or more generally ideas, usually, sometimes incorrectly. Of course I could be wrong about that, since I don't remember what I don't remember, but it feels that way. In the mundane, I might retain the price of a thing I'm interested in buying, or where to get it from, but may not remember where the information came from. I will eventually have some familiarity with a face, but it takes longer to remember the correct name, and doesn't always happen. I might never remember any particular location, and don't seem to possess the intuition I see others exercising when navigating the world, or even any particular building, or location of a certain size.

That memory problem might sound quite comical, even advantageous given our profession, but it results in a person that may give the first or persistent impression that they can't remember you, your face, your name, if you have children, or are married, or anything else about you. It creates a person that seems incapable of paying attention to other humans, that will deny having even met you, or having had that conversation which you really enjoyed, or I enjoyed.

For most of my professional life, I have been isolated from the real world. Communication with non familial humans has been mostly electronic, and while it may be a tempting explanation to invoke, it's not correct. These symptoms and behaviours pre-date my professional career, some existed as a pre-teen child.

Most of the time, I have a way above average output, not to boast, but I (usually) live to code ... my children have been late for school before (not by much) so I can finish writing or testing a bit of code - because driving to school while thinking about code is, of course, impossible. When I'm given a task, especially one that involves my doing stuff that hasn't been done before, or is exceptionally challenging, it consumes me until it is done. My enthusiasm delivers my above average output.

Twice now, the enthusiasm has been sucked out completely, and replaced by such a strong feeling of dread and worthlessness, that lethargy takes over and prohibits normal function in every area of life.

Many of you will be quick to label this as "burnout": When you can't find the enthusiasm you require to take your kids anywhere, or go to a school event for them, or go shopping for food, or make love to your wife, it is not burnout.

In our technical world, it's all too easy to dismiss behaviours that are symptomatic of mental illness as a product of our geeky-ness, and our working environments. I think this is one of the reasons why these behaviours have been accepted as normal (or normal-for-geeks) and then ignored, even by those closest to me.

The first time the enthusiasm disappeared, life pretty much fell apart, I had no idea what was happening.

I think, I am just recovering (with life intact) from the second time ...

It was extraordinarily difficult to approach my wife, and soul mate of more than a decade, and say out loud "I think I have a mental health problem". They are the words I used, I needed to be direct, unambiguous. By the time I had finished the sentence, I could see the understanding written on her face, the relief ... I'm lucky to have her.

Doctors' appointments and a diagnosis (acute social anxiety and depression) followed, and treatment is ongoing.

The antecedent factors that led to these difficulties don't interest me very much, and are shrouded in the blurriness of my memory whatever. Getting at the reasons seems all but impossible. What concerns me is how to live and function with these symptoms today, not the days or months from my childhood that may or may not have helped to form or exacerbate them.

Compared to the discomfort of telling my wife, telling my manager was easy. My manager has a beard, which obviously helps ... while my wife has no beard. In all seriousness, it was only easy because of the understanding nature of the response I received.

I'm not back to normal yet, there are good days and bad, but none of them are normal. I don't fully understand why this has happened, and tend to think it is not fully understandable in principle whatever.

I'm saying all of this in public for two reasons: I think I owe the community an explanation for my recent absence, an absence which may well affect your ability to do your job properly. Most importantly, I feel obliged to speak in order to normalize talking about it, and directly encourage anyone who thinks "that's me" when reading this, to get some help.

Even the internal dialogue that led to the realization that I have to tell my wife anything at all, was, in some small way, liberating. Surely this helped to provide whatever I needed to first utter those words.

I can't apologize for being unwell, but I can and should apologize if I let you down without explanation; I'm sorry.

Wednesday, 13 April 2016

Breaking Badly

If you're not running PHP 7 already, you are either crazy, or else your unit tests rely on software that I wrote for PHP 5 ... uopz.

uopz is a runtime hacking extension of the runkit and scary stuff genre.

When I first wrote uopz, PHP 5 was almost in a state of equilibrium. There were minor changes affecting the kind of stuff those extensions do, but by the time 5.5 came the platform was more or less stable.

When PHP 7 came along, it was a major roadblock that we could not run our unit tests. Badoo and many other large projects also had the same problem.

While there were tickets open requesting support for PHP 7, I omitted to answer the tickets in favour of working on the code. I wrongly assumed, if you were waiting for updates you would be watching closely. I also omitted to tag any issues in any commits I was making simply because I'm bad at git.

Various tweets and tickets went unanswered ... sorry about that, but ... code and ... me.

For many months, you could compile uopz for PHP 7 and it would "work", but it was terribly unstable.

I have my fingers in many pies, maintain many things; uopz was not the only road block and not the most important thing to do. It did receive attention ... but I will admit, not enough attention.

More precisely, not enough of the right kind of attention. Every time I came to work on uopz I was focused on making it do exactly the same things it did before, in exactly the same way.

Some of the stuff uopz does is semi-ordinary, it even calls Zend API in a lot of cases. However, copying functions (in the bitwise+instruction by instruction sense), manipulating the global function table, and class function tables, is not ordinary; this is what makes uopz or runkit scary, and useful.

PHP 7 is vastly different to PHP 5 internally, the VM is a much more complicated place to try and get work done.

If we think about a year from now, or two years from now:
  • What happens when Zend has a JIT ? 
  • What happens if Opcache makes class entries immutable, and so shares them ?
  • What happens if Opcache makes function entries immutable and re-entrant, and so shares them ?
I do not know the answers to those questions, but they are good questions.

When these thoughts are communicated clearly, it might be obvious that uopz cannot work in the same way, and be stable, or forward compatible.

Function Mockery v5 


You do not need to delete, rename, or otherwise modify functions, or function tables.

The purpose of allowing you to delete, or rename, a function was to allow you to create another one in its place.

The purpose of allowing you to create a new function in the place of the deleted function, was so that you could define new behaviour.

This is complicated by the fact that your new function may need to invoke the original function with certain parameters.

Almost certainly, at some time the original function will be explicitly restored, possibly after a group of tests (tearDownAfterClass perhaps).

Even if restoration is not performed explicitly, at request shutdown everything must be restored - you cannot leave a user function in the function table of an internal class, nor can you delete internal function entries earlier than the engine is expecting.

Simply deleting the function is not an option, you have to keep it, and ideally provide a way to copy it into a closure in userland.

That's a rather roundabout way of doing things, don't you think ?

Here's a better way:
function uopz_set_return(string class, string method, mixed value [, bool execute = false]) : bool;
function uopz_set_return(string function, mixed value [, bool execute = false]) : bool;

This new API does not modify the function table, instead it intercepts the execution of an existing function and allows you to set a return value.
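The difference between replacing a function and intercepting a call can be sketched in plain PHP. Nothing below is uopz code: ToyInterceptor is a hypothetical userland stand-in in which calls are routed through a dispatcher, whereas uopz does the interception inside the engine so that ordinary calls are affected.

```php
<?php
// Userland stand-in for the interception idea; none of this is uopz code.
final class ToyInterceptor {
    private $returns = [];

    // Mirrors the shape of uopz_set_return(function, value, execute).
    public function setReturn(string $function, $value, bool $execute = false): void {
        $this->returns[strtolower($function)] = [$value, $execute];
    }

    // Calls go through here, so the original function is never touched.
    public function call(string $function, ...$args) {
        $key = strtolower($function);
        if (!isset($this->returns[$key])) {
            return $function(...$args);
        }
        list($value, $execute) = $this->returns[$key];
        if ($execute && $value instanceof Closure) {
            return $value(...$args); // run the closure in place of the original
        }
        return $value; // plain value: returned as-is
    }
}

$calls = new ToyInterceptor();
$plain = $calls->call('strlen', 'four'); // no override yet

$calls->setReturn('strlen', function (string $s): int {
    return strlen($s) * 2; // the original is still there to be called
}, true);
$doubled = $calls->call('strlen', 'four');

$calls->setReturn('strlen', 42);
$fixed = $calls->call('strlen', 'four');
```

Because the original function is never deleted or renamed, there is nothing to restore at shutdown other than forgetting the overrides.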

The return value can be any variable, or a Closure to be executed in place of the original function, but still without modifying any function tables:
uopz_set_return('strlen', function(string $string) : int {
 return strlen($string) * 2;
}, true);

The code above yields:
In some cases, you do not want to modify the behaviour of the original function, but rather modify some state or perform some other action upon entry to a particular function:
uopz_set_hook('strlen', function(string $string) {
 echo "Expect: int(4)\n";
});

The code above yields:
Expect: int(4)
Hook and return closures are bound to the current scope at runtime:
class Foo {
 private $bar = true;

 public function qux() {
  return false;
 }
}

uopz_set_return(Foo::class, 'qux', function() {
 return $this->bar;
}, true);

$foo = new Foo();

The code above yields:
Setting hooks and returns should have most use cases covered, still there are times when you need to add a non-existent function:
function uopz_function_add(string class, string method, Closure handler [, int flags = ZEND_ACC_PUBLIC]);
function uopz_function_add(string function, Closure handler [, int flags]);
This new API allows you to do that; it is similar to uopz_function, but can only add functions, it will not replace them.

I was reluctant to allow adding functions at all: it makes everything slower, because it means we have to disable function entry caching. All of the uses we have left now are questionable; I suspect the same is always true.

The majority of the time, you were only adding a function because there wasn't a better way ... use the better way :)

Class Mockery v5


Allowing userland code to override opcode handlers was always bat shit crazy!

In PHP 5, to provide a mock class at test time, you had to overload ZEND_NEW and change the name of the class.

Not only is that crazy, it's bad.

If you happen to be running tests that create 1000 objects of a particular kind, but could use the same object, you are wasting the resources consumed by the creation of 999 objects. In test suites where we have tens of thousands of tests, this can have a dramatic effect.

In PHP 7 we have anonymous classes, this allows us to have a rather beautiful API:
interface IFoo {
 public function bar() : bool;
}

function consumer(IFoo $foo) : bool {
 return $foo->bar();
}

uopz_set_mock(Foo::class, new class implements IFoo {
 public function bar() : bool {
  return true;
 }
});

var_dump(consumer(new Foo()));
The code above yields
uopz_set_mock can also accept a class name as the second parameter, the following code will behave identically to the code above:
interface IFoo {
 public function bar() : bool;
}

function consumer(IFoo $foo) : bool {
 return $foo->bar();
}

class Mock implements IFoo {
 public function bar() : bool {
  return true;
 }
}

uopz_set_mock(Foo::class, Mock::class);

var_dump(consumer(new Foo())); 
Anonymous classes have superseded the role of uopz_compose, which used to allow composition of classes at runtime, in a rather awkward way. While uopz_compose was a nice toy, we are looking for stability, and forward compatibility, which are guaranteed if we are relying on language features.


I broke BC, and I feel bad about that ... my pain is eased by the thought that I provided a more stable, superior API to work with, that has a chance of being forward compatible with whatever Zend does next.

I also feel bad that I haven't had time to update the documentation for uopz yet; for now the README is the documentation.

If anyone feels like helping with documentation, that would be much appreciated.

Happy testing :)

Tuesday, 15 March 2016

Hacking PHP 7

Recently, I have taken part in some screen casts with my good friends at 3devs.

The subject of the screen casts is extension development for, and hacking, PHP 7 (Part 1, Part 2).

Screen casting is a medium I haven't mastered, or had very much practice at.

While I'm trying to plan the content for the show, I can't help but be reminded of every lecturer and tutor that stood in front of me repeating facts, sometimes literally reading from a book, or their own notes.

As a rule of thumb, if someone says something to me, and my life does not depend on my retention of the information contained in the statement, I will immediately, and without prejudice to the speaker, forget what was said.

One of the first things we learn as children is our multiplication tables, and we first remember those which have a pattern, or some trick, to describe or determine the sequence of numbers. We don't remember the sequence until we have uttered it at least a few hundred times. The same goes for the alphabet: we roughly remember a kind of melody, and get most of the letters in the second half of the alphabet wrong for quite some time.

As education progresses, perhaps for reasons of logistical expediency, the focus does shift from inspiring us, with the beauty of the rainbow, to learn how the rainbow works, to barking facts at us about the past wars and rulers of our respective country, and having us try to memorize various tables of information.

This sucks, it sucks so hard: just when our minds become fully primed, and developed enough to really respond to inspiration, the inspiration is almost completely squashed out of our education process.

Some may be lucky enough to have a better experience of education, but broadly speaking, until higher education anyway, this is how we were, and our children still are "taught".

It is the job of the teacher to inspire listeners to learn for themselves; it is not the job of a teacher to bark facts. I've really tried to convey that in the material we prepared.

Screen casting is a difficult medium, however, so this blog post accompanies the screen cast.

Writing extensions is fun, but it's not as fun as hacking PHP. So, we're going to focus on hacking: we're going to imagine that we are introducing some new language feature, by RFC.

Without focusing on the RFC process itself, you need to know which are the relevant parts of PHP you need to change, in order to introduce new language features.

You also need to know how PHP 7 works, about each stage of turning text into Zend opcodes ...

In the Beginning: Lexing

When the interpreter is instructed to execute a PHP file, the first thing that happens is lexing, or if you prefer, lexical analysis.

The lexer accepts a stream of characters as input, and emits a stream of tokens as output.

The input to the lexer is the characters of the code. The output is those sequences that the lexer recognizes, identified in a useful way.

The following function is illustrative of what a lexer, or lexical analysis, does:
function lexer(string $bytes) {
    switch (true) {
        case substr($bytes, 0, 2) === "if":
            return TOKEN_IF;
        /* ... one case for every other sequence the lexer recognizes ... */
    }
}

A lexer doesn't actually work like that, but it's enough to understand what it does; you can read the associated documentation, or source code, to discover how it does it.
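
For something a little closer to the metal, here is a minimal sketch of a hand-written lexer in C. It is not what re2c generates, and the names token_kind, token, next_token and the TOK_* identifiers are all made up for illustration. It only shows the shape of the job: look at the next bytes, decide what they are, and hand back an identified token.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token identifiers, invented for this sketch. */
typedef enum { TOK_EOF, TOK_IF, TOK_IDENT, TOK_CHAR } token_kind;

typedef struct {
    token_kind  kind;
    const char *start;  /* first byte of the matched text */
    size_t      len;    /* number of bytes matched */
} token;

/* Recognise one token at *cursor and advance the cursor past it. */
static token next_token(const char **cursor)
{
    const char *p = *cursor;
    token t;

    assert(p != NULL);

    /* Skip whitespace between tokens. */
    while (isspace((unsigned char) *p)) {
        p++;
    }

    t.start = p;

    if (*p == '\0') {
        t.kind = TOK_EOF;
        t.len  = 0;
    } else if (isalpha((unsigned char) *p) || *p == '_') {
        /* Consume the longest run of identifier characters ... */
        while (isalnum((unsigned char) *p) || *p == '_') {
            p++;
        }
        t.len = (size_t) (p - t.start);
        /* ... then decide whether it is the keyword or a plain name. */
        t.kind = (t.len == 2 && memcmp(t.start, "if", 2) == 0)
            ? TOK_IF : TOK_IDENT;
    } else {
        /* Anything else is emitted as a single-character token. */
        p++;
        t.kind = TOK_CHAR;
        t.len  = 1;
    }

    *cursor = p;
    return t;
}
```

Calling next_token repeatedly on the same cursor walks the input one token at a time until TOK_EOF comes back. Matching the longest run first is what stops a name that merely begins with "if" from being mistaken for the keyword.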

The lexer function itself is generated by software known as a "lexer generator"; the one that PHP uses is named re2c.

The input file to the lexer generator is a file which contains a set of "rules" in a specific format.

I'll just take one illustrative excerpt from the input for the lexer generator:
<ST_IN_SCRIPTING>"if" {
 RETURN_TOKEN(T_IF);
}

This can be roughly translated to "if we are in a scripting state, and find the sequence 'if', return the identifier T_IF".

What "scripting state" means becomes clear when you remember that PHP used to be embedded in HTML: not all of a file the interpreter reads is executable PHP code.

Consuming Tokens: Parsing

The output from lexical analysis now goes through the process of parsing, or if you prefer, syntactical analysis.

You can think of the parser as a syntax2structure function, which will take your stream of tokens produced by lexical analysis, and create some kind of representative data structure.

Once again, the parser function is generated by software known as a "parser generator"; the one that PHP uses is bison.

The input file to the parser generator, which is referred to as the "grammar", is another set of rules.

Here's my illustrative excerpt from the input for the parser generator:
    T_IF '(' expr ')' statement 
      { /* emit structure here */ }

This can be roughly translated as "if we find the syntax described, execute the code in braces".
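
To make "find the syntax, then run the action" concrete, here is a minimal sketch in C. It is a hand-rolled matcher, not the table-driven parser that bison generates, and every name in it (tok, if_node, parse_if, the TK_* identifiers) is invented for illustration. To keep it small, expr and statement are each assumed to be a single name token.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token identifiers, invented for this sketch. */
typedef enum { TK_END, TK_IF, TK_LPAREN, TK_RPAREN, TK_NAME } tok;

/* The "representative data structure" produced when the rule matches. */
typedef struct {
    int    valid;  /* 1 when the rule matched, 0 otherwise */
    size_t cond;   /* index of the token standing in for expr */
    size_t body;   /* index of the token standing in for statement */
} if_node;

/* Match: TK_IF '(' expr ')' statement */
static if_node parse_if(const tok *tokens, size_t count)
{
    static const tok rule[] = { TK_IF, TK_LPAREN, TK_NAME, TK_RPAREN, TK_NAME };
    const size_t rule_len = sizeof(rule) / sizeof(rule[0]);
    if_node node = { 0, 0, 0 };
    size_t i;

    assert(tokens != NULL);

    /* The syntax must match token for token ... */
    if (count < rule_len) {
        return node;
    }
    for (i = 0; i < rule_len; i++) {
        if (tokens[i] != rule[i]) {
            return node;
        }
    }

    /* ... and only then does the "code in braces" run. */
    node.valid = 1;
    node.cond  = 2;
    node.body  = 4;
    return node;
}
```

A real grammar lets expr and statement expand recursively into further rules, which is exactly the bookkeeping a parser generator takes off your hands.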

Historically, pre PHP 7, parsing was the final step in the compilation of your code: the parser emitted opcodes directly, so we had one-pass compilation.

Today, in PHP 7, parsing is the penultimate step in compilation: the parser emits an abstract syntax tree, so we now have multiple-pass compilation.

In the End: Consuming AST

The final pass is when we emit the target form, Zend opcodes.

Each node in the tree is passed in to a compiler function, depending on its kind, which in turn passes its child nodes in to their respective compiler functions, and so on ...

Here's my (abbreviated) illustrative excerpt:
void zend_compile_stmt(zend_ast *ast) /* {{{ */
{
 if (!ast) {
  return;
 }

 /* ... */

 switch (ast->kind) {
  /* ... */
  case ZEND_AST_IF:
   zend_compile_if(ast);
   break;
  /* ... */
  default:
  {
   znode result;
   zend_compile_expr(&result, ast);
   zend_do_free(&result);
  }
 }
}
/* }}} */

A new feature may not always require new abstract syntax; if this is the case, you can implement your RFC by using the AST and Compiler API.

New abstract syntax will require a new compiler function, but may emit existing opcodes.
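
The recursive shape of this pass can be sketched in a few lines of C. This is a toy under stated assumptions, not the Zend compiler: the node kinds, the opcodes and the functions emit and compile_expr are invented, and the toy targets a stack-style instruction set purely because it keeps the example short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node kinds and opcodes, invented for this sketch. */
typedef enum { AST_CONST, AST_ADD } ast_kind;
typedef enum { OP_PUSH, OP_ADD } opcode;

typedef struct ast {
    ast_kind    kind;
    int         value;     /* used by AST_CONST */
    struct ast *child[2];  /* used by AST_ADD */
} ast;

typedef struct {
    opcode code;
    int    operand;
} op;

typedef struct {
    op     ops[32];
    size_t count;
} op_array;

/* Append one opcode to the output. */
static void emit(op_array *out, opcode code, int operand)
{
    assert(out->count < sizeof(out->ops) / sizeof(out->ops[0]));
    out->ops[out->count].code    = code;
    out->ops[out->count].operand = operand;
    out->count++;
}

/* Dispatch on the node kind; a parent compiles its children first. */
static void compile_expr(op_array *out, const ast *node)
{
    switch (node->kind) {
        case AST_CONST:
            emit(out, OP_PUSH, node->value);
            return;
        case AST_ADD:
            compile_expr(out, node->child[0]);
            compile_expr(out, node->child[1]);
            emit(out, OP_ADD, 0);
            return;
    }
}
```

Compiling the tree for 1 + 2 emits a push for each constant followed by the add. Supporting a new kind of node means one more case in the switch, which is the same move the hack below makes in the real compiler.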

The VM

The virtual machine is the thing that executes opcodes; it's as much like a CPU as opcodes are like instructions for a CPU.
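
The CPU comparison can be sketched as a fetch, dispatch, execute loop in C. This is a toy, not the Zend VM: the opcodes, the vm_op structure and vm_run are invented for illustration, and a single accumulator stands in for real operands.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes, invented for this sketch. */
typedef enum { VM_LOAD, VM_JMPZ, VM_ADD, VM_HALT } vm_opcode;

typedef struct {
    vm_opcode code;
    int       operand;  /* immediate value, or jump target for VM_JMPZ */
} vm_op;

/* Execute ops until VM_HALT (or the end) and return the accumulator. */
static int vm_run(const vm_op *ops, size_t count)
{
    size_t pc  = 0;  /* index of the op being executed */
    int    acc = 0;

    while (pc < count) {
        const vm_op *opline = &ops[pc];

        switch (opline->code) {
            case VM_LOAD:
                acc = opline->operand;
                pc++;
                break;
            case VM_ADD:
                acc += opline->operand;
                pc++;
                break;
            case VM_JMPZ:
                /* Jump when the value is falsy, otherwise fall through. */
                if (acc == 0) {
                    assert(opline->operand >= 0);
                    pc = (size_t) opline->operand;
                } else {
                    pc++;
                }
                break;
            case VM_HALT:
                return acc;
        }
    }

    return acc;
}
```

Each case plays the role of a handler: it does its work and then decides which opline runs next, either the following one or a jump target.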

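A handler exists for each opcode; here is my (abbreviated) illustrative excerpt:
ZEND_VM_HANDLER(43, ZEND_JMPZ, CONST|TMPVAR|CV, ANY)
{
 zend_free_op free_op1;
 zval *val;

 /* ... fetch val from op1 ... */

 if (Z_TYPE_INFO_P(val) == IS_TRUE) {
  ZEND_VM_SET_NEXT_OPCODE(opline + 1);
  /* ... */
 } else if (EXPECTED(Z_TYPE_INFO_P(val) <= IS_TRUE)) {
  if (OP1_TYPE == IS_CV && UNEXPECTED(Z_TYPE_INFO_P(val) == IS_UNDEF)) {
   ZEND_VM_JMP(OP_JMP_ADDR(opline, opline->op2));
  } else {
   ZEND_VM_SET_OPCODE(OP_JMP_ADDR(opline, opline->op2));
  }
 }

 if (i_zend_is_true(val)) {
  opline++;
 } else {
  opline = OP_JMP_ADDR(opline, opline->op2);
 }
 if (UNEXPECTED(EG(exception) != NULL)) {
  HANDLE_EXCEPTION();
 }
 /* ... */
}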

New opcodes are going to need new handlers, which you simply add to zend_vm_def.h, at the end of the list, using the next available opcode number.

The prototype for the macro ZEND_VM_HANDLER can be assumed:
    ZEND_VM_HANDLER(opnum, opname, op1_type, op2_type)

This is obviously not standard C code; this definition header file is used as input to generate the real handler functions for all the permutations of operand types used.

When changes are made to the header, you must regenerate the VM by executing zend_vm_gen.php (in the Zend folder).

The Hack

The first hack introduced in the screen cast was a hipster expression. The hipster expression looks like a function call (à la include_once). If the result of the expression passed to hipster is a string, the result is copied to the return value, or else an exception is thrown.

This is a completely useless feature, other than its illustrative value.

Since we are introducing a new token "hipster", we must start by editing the lexer to include the following code:
<ST_IN_SCRIPTING>"hipster" {
 RETURN_TOKEN(T_HIPSTER);
}

Care should be taken to make edits in sensible places in these generator input files.

The next thing to do is to make the parser aware of the new token.

Open up the parser generator input file and search for "%token T_INCLUDE"; something like the following will be found:
%token T_INCLUDE    "include (T_INCLUDE)"

%tokens have the form %token ID FRIENDLY_NAME, so we add our new token:
%token T_HIPSTER "hipster (T_HIPSTER)"

Now search for "%left" (or scroll down) to find the block of %left, %right and %nonassoc declarations.

This section of the file is where operator associativity is set, and precedence is implied by the order.

Since we are introducing a non-associative token, we add the following line in an appropriate place, at the end of the first block of %nonassoc tokens, for example:
%nonassoc T_HIPSTER

Now we must add a new rule to the parser in an appropriate place; since we are adding a new expression, we extend the expr_without_variable rule.

In the example hack, we add:
 | T_HIPSTER '(' expr ')' { $$ = zend_ast_create(ZEND_AST_HIPSTER, $3); }

 | internal_functions_in_yacc { $$ = $1; }

Notice we are going to use a new kind of AST node, so the next thing to do is edit zend_ast.h to add our new ZEND_AST_HIPSTER to the enumerated _zend_ast_kind.

The enumerated kinds have a magical format that requires studying to determine where to add the new kind: we add it among the kinds that have one child node, because our new node will have a single child, the expression.
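
The "magical format" is easier to see in miniature. The sketch below assumes the layout mirrors what the grouping in zend_ast.h suggests, namely that the number of children is stored in the upper bits of the kind value; the shift of 8 and every name here (NUM_CHILDREN_SHIFT, kind, KIND_*, num_children) are stand-ins for illustration, not the real definitions.

```c
#include <assert.h>

/* Assumed layout: the child count lives in the bits above this shift. */
#define NUM_CHILDREN_SHIFT 8

/* Hypothetical kinds, grouped by how many children each node has. */
typedef enum {
    /* 1 child node */
    KIND_VAR = 1 << NUM_CHILDREN_SHIFT,
    KIND_HIPSTER,  /* a new one-child kind joins this group */

    /* 2 child nodes */
    KIND_DIM = 2 << NUM_CHILDREN_SHIFT,
    KIND_PROP
} kind;

/* Recover the child count from the kind value alone. */
static unsigned num_children(kind k)
{
    assert((int) k >= 0);
    return (unsigned) k >> NUM_CHILDREN_SHIFT;
}
```

Because each group starts at a multiple of the shifted value and simply counts upwards, where a kind is placed decides how many children the compiler will expect it to have. That is why the position of the new kind matters.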

The next thing to do is to edit the switch in the zend_compile_expr compiler function, adding, in an appropriate place:
  case ZEND_AST_HIPSTER:
   zend_compile_hipster(result, ast);
   return;

Then we must define zend_compile_hipster:
void zend_compile_hipster(znode *result, zend_ast *ast) /* {{{ */
{
 zend_ast *expr_ast = ast->child[0];
 znode expr_node;

 zend_compile_expr(&expr_node, expr_ast);

 zend_emit_op(result, ZEND_HIPSTER, &expr_node, NULL);
} /* }}} */

At this point, the lexer, parser, and compiler are all aware of our new feature, but we emit an opcode that does not exist.

So, we edit zend_vm_def.h to add our new opcode handler:
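/* ZEND_VM_HANDLER(opnum, ZEND_HIPSTER, op1_type, op2_type) */
{
 USE_OPLINE
 zend_free_op free_op1;
 zval *op1;

 SAVE_OPLINE();
 op1 = GET_OP1_ZVAL_PTR(BP_VAR_R);

 if (Z_TYPE_P(op1) != IS_STRING) {
  zend_throw_exception_ex(NULL, 0,
   "hipster expects a string !");
  HANDLE_EXCEPTION();
 }

 ZVAL_COPY(EX_VAR(opline->result.var), op1);

 FREE_OP1();
 ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION();
}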

Remember to regenerate the VM after this change.

We're done, so execute make, and then marvel at:
sapi/cli/php -r "echo hipster('PHP 7 rocks') . PHP_EOL;"


The best way to learn about PHP internals is to dig in: if all of us got together in a room tomorrow, and did not leave until we had produced a volume of encyclopedic knowledge, the best way to learn would still be to dig in.

If everything were explained in exquisite detail here, you probably wouldn't have got to the end ... You may have to read this more than once, I'm sure you'll have to do a bunch of research besides ...

Hopefully, I've given you enough links, terms to research, and things to think about to inspire you to do just that ...

Happy hacking :)