Sunday, 3 June 2018

Preface to idbg

Fig 1. A tweet from earlier this month
We already have several options for debugging code within the PHP ecosystem. XDebug is extremely mature software, and phpdbg has been slowly gaining traction too, if for no other reason than that it collects code coverage much faster than XDebug.

Although phpdbg and XDebug are different from one another, they have some things in common: They are both complicated (to varying degrees), and they are both written in a language that 97% of the people reading this text do not understand (number pulled from air, based on nothing at all).

Why do we need a debugger at all?

Slightly tangential perhaps ... Debugging is a necessary part of writing code; If you disagree with this statement, then I don't know what you are talking about.

If debugging to you means sprinkling code with debugging statements like var_dump, or print_r, then I implore you to learn how to use a debugger; You are wasting a lot of time. I don't say this from a position of arrogance because I happen to be one of the authors of phpdbg, but from a position of experience; I remember trying to write code before I had a good handle on using a debugger. 

Sprinkling code with debugging statements is like crouching on a stool in the corner of a cockroach infested room and hoping that blowing upon the blanketed floor will destroy and eliminate the little beasts creating the blanket.

Using a debugger is like having an army of nano bots at your disposal, each one trained exquisitely in a top nano-bot-training-camp, they live to kill cockroaches, some of them also have mean looking tattoos, chew tobacco, and spit on the ground at the start of every sentence ...

I think we understand each other ...

Why do we need a debugger written in PHP ?

Here are some statistics (from github api):
  • XDebug has had 50 contributors in the 7 years it has been on github
  • phpdbg has had a handful of contributors (20-30) in the 4 and a half years it has been on github
  • PHPUnit has had 342 contributors in the 8 and a half years it has been on github
  • phpstan has had 70 contributors in the 2 and a half years it has been on github
XDebug predates its github repository (by a very wide margin), but that doesn't matter for the point I'm trying to make here: in the PHP ecosystem, we have a great many very talented programmers with a whole host of knowledge about how the PHP engine works. They may have been using it for their entire professional career, and they are able to write and contribute to arguably comparatively complicated software like PHPUnit or phpstan. Alas, we have vanishingly few programmers in the ecosystem who are able to improve, fix, or develop in any way software like phpdbg or XDebug, and I think it's mostly because of the language they are written in.

You might also just like to scan the number of contributors to projects like Laravel and Symfony ... although I think these numbers are less relevant, they are surprisingly high.

It's not all about the language: the domain specific knowledge required to implement a debugging engine might not be widely disseminated. But maybe it doesn't need to be ...

You may not find these arguments convincing, you may not be convinced that we need another debugger written in any language, after all XDebug is extremely mature, and using phpdbg makes you at least 20% cooler (in the same way as go faster stripes make any vehicle 20% faster). That's a perfectly rational position to take, and I can't think of another way to argue my case, so you can probably stop reading ;)

Domain Specifics

I don't know how obvious it is that it's not reasonable to talk about implementing a debugger entirely in PHP; The kind of control you need over the engine just isn't reasonably attained in userland by default.

The debugger itself, the thing that interacts with a person or an IDE, can be written in PHP alone, and is much easier to write in PHP. But the core of the "debugging engine" (terminology borrowed from the DBGp specification) should be written in C.

krakjoe/inspector is a disassembler and debugging kit for PHP; it exposes the necessary API for the development of a debugger in userland. It is an advanced extension of the existing Reflection API, giving it a shallow learning curve for anyone already familiar with Reflection.

While it's annoying that we still must have a binary dependency, I'm hoping that inspector becomes a de facto part of PHP installations in the not too distant future. I have no intention of making an RFC to merge inspector into core, though: it belongs outside of core, the release cycle in core does not lend itself to new software, and there is nothing to be gained by merging. Being a de facto part of installations doesn't necessarily mean being merged into core.

Code or STFU

Fig 1. idbg help
This isn't just a pipe dream: the PHP code exists, but it's alpha quality and largely untested ...

There is much to do and you shouldn't design your workflow around this (or any alpha quality software) yet.

What you should do is start reading code, testing, and opening pull requests ... consider me waiting ...

Tuesday, 22 May 2018

PHP allows for the design of X

Fig 1. A thing I said
Starting complicated twitter conversations should be avoided, I know this, and yet I blurted this out on twitter recently ...

This was met with a flurry of responses and I couldn't reasonably reply in tweet form. I'm going to respond to some of those tweets (indirectly) and further explain my original tweet.

PHP is not always the right tool for the job

First and foremost, I was misunderstood by some people; They thought I was saying you should use PHP for everything. Obviously, that would be an untenable position, which I do not assume.

Give a task to a polyglot and they won't spend time enumerating for possible exclusion all of the languages they know. It doesn't work like that, you don't start thinking about the most unsuitable language for a task and somehow work your way to a suitable language. Choosing a suitable language is a thing you want to call an instinct, but it isn't an instinct, it's guided by an understanding of the task, and prior knowledge of domain support among your chosen poisons ... I mean languages.

There are totally legitimate reasons to choose other languages over PHP, even in the domain where PHP excels - on the web. But it doesn't have very much to do with PHP, and has everything to do with the chosen language and the task. You likely weren't thinking about PHP when you made the decision.

PHP is not a templating language

Whenever someone says "but PHP was designed as a templating language", I almost want to cry.

Who actually cares what PHP was in the year 1997, the number of lines of code from that software left in PHP is minuscule, if there are any present at all.

In the year 2018, we don't even care what PHP 5 was, we don't care about its shortcomings, because we should not and do not use it.

Today, when we talk about PHP, we are talking about PHP 7 ... here are some actual facts (the things you can't have your own opinions on):
  • PHP 7 is fast
  • PHP 7 is a general purpose scripting language
I'm a C programmer, I spend most of my time writing C, and spend some time at levels below that playing with machine code, JIT compilers and so on. When I say PHP 7 is fast, I mean to say that as a C programmer, it's difficult to write code (of equivalent complexity) as efficient as the code Zend generates in the vast majority of cases. It's also as near as makes no difference impossible to JIT Zend opcodes into machine code and have them be more efficient. The fact of the matter is that the assembly generated when Zend is compiled is as efficient as any assembly you could hand craft inline, or generate just in time (that's not a guess, I've personally tried both hand crafting inline asm and JIT compilation of ze3 opcodes).

Obviously PHP is stuck with one data structure, but it's not just a dumb HashTable in PHP 7 anymore, it's smart and will perform optimally most of the time. The structure of a HashTable and the shortcomings of that structure are less important when our applications are heavily object orientated. Reading/writing/interacting with properties on PHP 7 objects is almost entirely unaffected by those things; given a warm runtime cache, reading a property from an object consists of a relative load (a very simple machine code instruction), and there are no hash table lookups involved. This is also true of HashTables in some cases (they can behave like C arrays).

PHP allows for the design of X

When a project like AMP shows up on the scene, you can't say "PHP wasn't designed for asynchronous execution". It's a nonsensical statement: PHP is a general purpose language, and given that support has emerged in this new domain, PHP does, as a matter of fact, support it, and not accidentally.

Is the support for this new domain as mature as in another language in your arsenal? I don't know what languages you have. Whatever they are, this is not a reason to assume that the people working on AMP, or X, are wasting their time because of some imaginary (and it is mostly imaginary) shortcoming in PHP.

Reddit was recently discussing a GUI extension I wrote; It's very frustrating to hear people who don't really know what PHP is capable of decrying it as a waste of time, or no better than software from 10 years ago.

Almost every extension I write gets the same sort of response: "Just because you can, doesn't mean you should". What they are communicating is "It doesn't matter if you can, you shouldn't", which is somewhere between silly, and harmful.

It really does matter if you can.

When support emerges for a new problem domain, let's be pragmatic and observe that expanding the horizons of PHP in any direction is good for the community that relies on PHP (and maybe PHP alone) to make a living. Let's not rush to take new solutions to production tomorrow, but let's not dismiss anything out of hand because of some imaginary shortcoming in PHP.

Ideally, let's find time to learn about the new solution to see if it's useful to us, perhaps try to use it in our prototypes and drafts, and in so doing improve it.

PHP 7, as a matter of fact, is internally designed to bend to the programmer's will ...

When support emerges for a new domain, take that as proof that PHP allowed for the design of X.

Monday, 16 April 2018

An Introduction to CQL


Recently I have been working on a CommonMark extension for PHP 7. It is based on the reference implementation in C, linking to it rather than re-implementing the spec.

The reference implementation in C is extremely fast, and so the extension has a focus on performance, trying to create PHP objects only when necessary, among other (boring) optimisations.

In C the iterators provided by the reference implementation are extremely fast; it simply doesn't matter that you might have to accept every node in a document when you're working in C. In a dynamic language like PHP it really does matter, even if the objects representing the nodes are short lived. Again, when you access a parent or child node in C, you are just doing pointer arithmetic (hidden behind function calls), it's all simple stuff. When it comes to a dynamic language there are all kinds of baggage attached to the object (and even the read operation itself): additional allocations and other such instructions must be executed before the C pointer can be passed into user land.

While the iterators from the reference implementation are fast, they are not smart - they don't really need to be, as explained. When it comes to inspecting a document (before conversion for example, or for editing), the kind of code you need to write in any language consists of complicated nested loops and/or recursive calls; it's long and complicated, and difficult to get right.
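To make that concrete, here is a minimal sketch of the hand-written traversal needed just to collect the paragraphs that sit inside block quotes under the first child of a document. The ToyNode class and the paragraphsInBlockQuotes function are hypothetical stand-ins invented for this illustration, not classes from the extension:

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a document node; NOT the extension's Node class.
final class ToyNode
{
    public $type;
    public $text;
    /** @var ToyNode[] */
    public $children = [];

    public function __construct(string $type, string $text = '')
    {
        $this->type = $type;
        $this->text = $text;
    }

    // Append child nodes and return $this so trees can be built inline.
    public function add(ToyNode ...$nodes): self
    {
        foreach ($nodes as $node) {
            $this->children[] = $node;
        }
        return $this;
    }
}

// Collect the text of every Paragraph that is a child of a BlockQuote
// that is itself a child of the document's first child.
function paragraphsInBlockQuotes(ToyNode $document): array
{
    $found = [];
    $first = $document->children[0] ?? null;
    if ($first === null) {
        return $found;
    }
    foreach ($first->children as $child) {
        if ($child->type !== 'BlockQuote') {
            continue;
        }
        foreach ($child->children as $grandchild) {
            if ($grandchild->type === 'Paragraph') {
                $found[] = $grandchild->text;
            }
        }
    }
    return $found;
}

// A toy tree: an outer block quote holding a paragraph and a nested block quote.
$document = (new ToyNode('Document'))->add(
    (new ToyNode('BlockQuote'))->add(
        new ToyNode('Paragraph', 'skipped'),
        (new ToyNode('BlockQuote'))->add(
            new ToyNode('Paragraph', 'kept one'),
            new ToyNode('Paragraph', 'kept two')
        )
    )
);

$result = paragraphsInBlockQuotes($document);
```

Every extra level of nesting or extra node type adds another loop or another branch, and with real extension objects each step also creates a PHP object for a node you may immediately discard.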

Introducing CQL

CQL - CommonMark Query Language is a feature that has been developed alongside the CommonMark PHP extension, which solves some of the problems of iterating through a tree structure in a dynamic language by allowing the user to express as a string how to travel through the document and which nodes to return.

CQL consists of a lexer and parser, a compiler for a small set of instructions, and a virtual machine for executing the instructions.

Syntax

The real geeks can just look at the context free grammar; for the rest of us, a query describes a path through a document:

/firstChild/children

The above query will return the children of the first child node of a tree.

firstChild, lastChild, parent, next, previous, and children are all accepted paths.

children can accept sub queries (but cannot have other paths following it, because think about it ...):

/firstChild/children[ /children ]

The above query will return the children of the children of the first child node of a tree.

children can also accept a constraint:

/firstChild/children(BlockQuote)

The above query will return the children of the first child node of a tree that are BlockQuote objects.

Constraints may be or'd together:

/firstChild/children(BlockQuote|Paragraph)

The above query will return the children of the first child node of a tree that are BlockQuote or Paragraph objects.

Subqueries with constraints can also have subqueries:

/firstChild/children(BlockQuote)[ /children(Paragraph) ]

The above query will return Paragraphs that are children of BlockQuotes that are children of the first child node of a tree.

Constraints and sub queries may be nested ad absurdum to describe a path to take through the tree. The form of the queries I have used here is for readability only: whitespace is ignored, and content after # is ignored.

Execution

Having lexed and parsed your query into an abstract syntax tree, CQL compiles the AST into discrete instructions for travelling through the tree. We're going to skip over a description of that AST because it's throw away and boring. Let's have a quick look at the result of compiling the AST, the instructions:

Each instruction has an input value (IV), and an output value (RV) or JMP target (#T), in addition it has an extended value (int) for storing constraints, and probably other things in future.

We'll start simple, with /firstChild/lastChild, which compiles to:

For simplicity, you can consider the numbers in the IV and RV columns to be variables: the first instruction FCN sets 1 to the first child node of 0, the second LCN sets 2 to the last child node of 1, and the third instruction ENT dispatches a call to the caller of the function with the address of the node at 2.

Remembering that these "variables" are just addresses, no zvals, no php vars, all very low level stuff.

It gets a little more complicated when it comes to children, /firstChild/children compiles to the following instructions:

The first instruction FCN sets 1 to the first child node of 0, the second instruction sets 2 to the first child node of 1, and the third, ENT, dispatches the enter call. The next instruction NEN sets 3 to the node next to 2 in the tree, the next SET instruction sets 2 to 3, and the final JMP jumps back to ENT if 3 is positive, creating a loop until all the children are consumed.
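The loop described above can be imitated in plain PHP. This is only a sketch: the real virtual machine is written in C and works on raw addresses, whereas here the VmNode class, the [opcode, input, output-or-target] tuple layout, and the decision to stop when a slot is empty are all assumptions made for illustration. Only the opcode names come from the description above, and the use of FCN for the second instruction is a guess:

```php
<?php
declare(strict_types=1);

// Hypothetical node type for illustration; the real VM works on C pointers.
final class VmNode
{
    public $name;
    public $parent = null;
    public $children = [];

    public function __construct(string $name, VmNode ...$children)
    {
        $this->name = $name;
        foreach ($children as $child) {
            $child->parent = $this;
            $this->children[] = $child;
        }
    }

    public function firstChild(): ?VmNode
    {
        return $this->children[0] ?? null;
    }

    // The sibling that follows this node, or null when it is the last one.
    public function next(): ?VmNode
    {
        if ($this->parent === null) {
            return null;
        }
        $index = array_search($this, $this->parent->children, true);
        return $this->parent->children[$index + 1] ?? null;
    }
}

// Assumed encoding of /firstChild/children: [opcode, input slot, output slot or jump target].
$program = [
    ['FCN', 0, 1],    // slot 1 = first child of slot 0
    ['FCN', 1, 2],    // slot 2 = first child of slot 1 (opcode assumed)
    ['ENT', 2, null], // hand the node in slot 2 to the caller
    ['NEN', 2, 3],    // slot 3 = node next to slot 2
    ['SET', 3, 2],    // slot 2 = slot 3
    ['JMP', 3, 2],    // if slot 3 holds a node, jump back to instruction 2 (ENT)
];

function run(array $program, VmNode $root, callable $enter): void
{
    $slots = [0 => $root];
    $pc = 0;
    while ($pc < count($program)) {
        [$op, $iv, $rv] = $program[$pc];
        $input = $slots[$iv] ?? null;
        switch ($op) {
            case 'FCN':
                if ($input === null) {
                    return; // nothing to descend into (sketch-only behaviour)
                }
                $slots[$rv] = $input->firstChild();
                break;
            case 'NEN':
                $slots[$rv] = $input === null ? null : $input->next();
                break;
            case 'SET':
                $slots[$rv] = $input;
                break;
            case 'ENT':
                if ($input === null) {
                    return; // no node to report (sketch-only behaviour)
                }
                $enter($input);
                break;
            case 'JMP':
                if ($input !== null) {
                    $pc = $rv;
                    continue 2; // restart the loop without advancing $pc
                }
                break;
        }
        $pc++;
    }
}

$tree = new VmNode(
    'document',
    new VmNode('list', new VmNode('a'), new VmNode('b'), new VmNode('c')),
    new VmNode('paragraph')
);

$visited = [];
run($program, $tree, function (VmNode $node) use (&$visited) {
    $visited[] = $node->name;
});
```

The point of the design is that the whole walk stays inside the machine; only nodes that reach ENT ever have to be handed back to the caller.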

The textual description of a query like:

/children(List)[ /children(Item)[ /children(Paragraph) ] ]

would be extremely boring, but here's what that query looks like:

The only new instruction is CON, which will skip nodes that do not match the constraint given.

The virtual machine that executes the instructions looks like:

This makes execution of the query extremely efficient, much more so than anything you would be able to write in PHP.

PHP API

Proper documentation for the PHP API will become available in the manual soon, here's a quick description for those that want to get started:

The CommonMark extension declares \CommonMark\CQL:
class \CommonMark\CQL {
    public function __construct(string $query);
    public function __invoke(\CommonMark\Node $node, callable $enter);
}

The callable provided as $enter should have the prototype:
function (\CommonMark\Node $root, \CommonMark\Node $node)
and will be invoked by CQL on ENT instructions.
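
A minimal usage sketch follows. It assumes the CommonMark extension is loaded with CQL support, and that the extension's \CommonMark\Parse function is used to build the document (that function is not described in this post). The sample Markdown and the guard around the calls are illustrative only, so the script still runs, and simply matches nothing, when the extension is missing:

```php
<?php
declare(strict_types=1);

$paragraphs = [];

// Assumption: the CommonMark extension (with CQL) is installed.
// Without it, the block below is skipped and nothing is collected.
if (class_exists('CommonMark\CQL') && function_exists('CommonMark\Parse')) {
    $document = \CommonMark\Parse("> quoted text\n\nplain text\n");

    // Compile the query once; the object can then be invoked on any node.
    $query = new \CommonMark\CQL('/children(Paragraph)');

    // The callable is invoked for each node the query enters.
    $query($document, function (\CommonMark\Node $root, \CommonMark\Node $node) use (&$paragraphs) {
        $paragraphs[] = $node;
    });
}

printf("matched %d paragraph(s)\n", count($paragraphs));
```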

Get Involved or Wait :)

I am not finished writing tests for CQL yet, so it currently lives in a feature branch. It will be included in the next release of the extension, probably in the next couple of weeks.

If you feel like being helpful, you could come and submit a PR for tests ...

Peace out phomies ...

Tuesday, 16 January 2018

Sensible Targets

Fig 1. Current release cycle graph
There has been a lot of talk recently about which versions of PHP you should support for your new projects or packages.

As a release manager for PHP, someone who watches the way releases evolve extremely closely and has some sway over what gets fixed and what doesn't, and as someone who helped to draft the security classification document for PHP, I feel I have some useful things to say on this subject, so here goes ...

The Release Cycle

Every release of PHP goes through the following cycle:

Pre Release:
  • 3 Alpha, two weeks apart
  • 3 Beta, two weeks apart
  • 6 RC, two weeks apart
Release:
  • GA, roughly 6 months after the first alpha
Actively Supported:
  • For two years the PHP team make a patch release every month with bug and security fixes.
Security Only:
  • For the final year of the three year cycle, the PHP team will make a patch release when security fixes warrant a release.
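
The two-year and three-year windows above are easy to turn into a quick check. This is a sketch only: the supportStatus function is a hypothetical helper, it applies the standard cycle with no exceptions, and the GA dates used at the bottom are the public release dates of PHP 7.0, 7.1 and 7.2, supplied by hand:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: classify a release on a given day, using the
// standard cycle of two years active support plus one year security only.
function supportStatus(string $gaDate, string $today): string
{
    $ga  = new DateTimeImmutable($gaDate);
    $now = new DateTimeImmutable($today);

    if ($now < $ga) {
        return 'unreleased';
    }
    if ($now < $ga->modify('+2 years')) {
        return 'active support';
    }
    if ($now < $ga->modify('+3 years')) {
        return 'security fixes only';
    }
    return 'end of life';
}

// GA dates entered by hand; the check date is the date of this post.
$releases = ['7.0' => '2015-12-03', '7.1' => '2016-12-01', '7.2' => '2017-11-30'];

$status = [];
foreach ($releases as $version => $ga) {
    $status[$version] = supportStatus($ga, '2018-01-16');
}
```

A check like this will not know about exceptions to the cycle, so treat it as a rule of thumb rather than a source of truth.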

What isn't a security issue ?

We have various definitions to classify the threat level posed by any security issue, but importantly for the community at large, we also have a definition of what is not a security issue:
  • requires invocation of specific code, which may be valid but is obviously malicious
  • requires invocation of functions with specific arguments, which may be valid but are obviously malicious
  • requires specific actions to be performed on the server, which are not commonly performed, or are not commonly permissible for the user (uid) executing PHP
  • requires privileges superior to that of the user (uid) executing PHP
  • requires the use of debugging facilities - ex. xdebug, var_dump
  • requires the use of settings not recommended for production - ex. error reporting to output
  • requires the use of non-standard environment variables - ex. USE_ZEND_ALLOC
  • requires the use of non-standard builds - ex. obscure embedded platform, not commonly used compiler
  • requires the use of code or settings known to be insecure
Any issue that falls under any of the above categories, even though it may have security implications for you personally, is not treated as a security issue. It may be fixed as a normal bug, but that fix will not be included in a security fix only release.

In addition, any security issue classified as having a low threat level will not necessarily be included in a security fix only release. The lowest level of threat is defined thus:
This issue allows theoretical compromise of security, but practical attack is usually impossible or extremely hard due to common practices or limitations that are virtually always present or imposed.
This also includes problems with configuration, documentation, and other non-code parts of the PHP project that may mislead users, or cause them to make their system, or their code less secure.
Issues that can trigger unauthorised actions that do not seem to be useful for any practical attack can also be categorised as low severity.
Security issues, that are present only in unstable branches, belong to this category, too. Any branch that has no stable release, is per se not intended for the production use.
Aside from those bugs which may or may not be a security issue, there is a variety of bugs that are definitely not a security issue, but may cause your project serious harm, or present serious problems for your package. No fix is forthcoming for these, the vast majority of bugs, while the release is in the security fix only cycle.

What should I target ?

Targeting a security fix only release of PHP for new projects doesn't make any sense: when a release is in the security fix only cycle you should be concentrating on getting old projects upgraded, and a year is plenty of time to do that. In the case of PHP 5.6, we extended the security cycle to two years. If you are reading this and thinking a year isn't long enough to do that, then there is something wrong with the way you deploy or support projects or packages. It has to be long enough: running a version of PHP without active support is dangerous for your business, reputation, and soul.

New projects or packages should obviously target an actively developed version of PHP. At the time of writing both 7.1 and 7.2 are being actively developed (there will always be two versions in active development). Whether you choose to use 7.1 or 7.2 depends on your project or package: perhaps you'd like to use some new features, and not have to worry about the security cycle for nearly two years, and so can reasonably target 7.2. Perhaps you have reasons to stick with 7.1, and are prepared to deal with the security cycle before the year is out.

The bottom line is this: New projects or packages must target an actively developed, fully supported version of PHP.