Tuesday, 15 March 2016

Hacking PHP 7


Recently, I have taken part in some screen casts with my good friends at 3devs.

The subject of the screen casts is extension development for, and hacking, PHP 7 (Part 1, Part 2).

Screen casting is a medium I haven't mastered, or had very much practice at.

While I'm trying to plan the content for the show, I can't help but be reminded of every lecturer and tutor that stood in front of me repeating facts, sometimes literally reading from a book, or their own notes.

As a rule of thumb, if someone says something to me, and my life does not depend on my retention of the information contained in the statement, I will immediately, and without prejudice to the speaker, forget what was said.

One of the first things we learn as children is our multiplication tables, and we first remember those which have a pattern, or some trick, to describe or determine the sequence of numbers. We don't remember the sequence until we have uttered it at least a few hundred times. The same goes for the alphabet: we roughly remember a kind of melody, and get most of the letters in the second half of the alphabet wrong for quite some time.

As education progresses, perhaps for reasons of logistical expediency, the focus shifts from inspiring us, with the beauty of the rainbow, to learn how the rainbow works, to barking facts at us about the past wars and rulers of our respective countries, and having us try to memorize various tables of information.

This sucks, it sucks so hard: just when our minds become fully primed, and developed enough to really respond to inspiration, the inspiration is almost completely squashed out of our education process.

Some may be lucky enough to have a better experience of education, but broadly speaking, until higher education anyway, this is how we were, and how our children still are, "taught".

It is the job of the teacher to inspire listeners to learn for themselves; it is not the job of a teacher to bark facts. I've really tried to convey that in the material we prepared.

Screen casting is a difficult medium, however, so this blog post accompanies the screen cast.

Writing extensions is fun, but it's not as fun as hacking PHP. So, we're going to focus on hacking: we're going to imagine that we are introducing some new language feature, by RFC.

Without focusing on the RFC process itself, you need to know which parts of PHP you must change in order to introduce new language features.

You also need to know how PHP 7 works: each stage of turning text into Zend opcodes ...

In the Beginning: Lexing


When the interpreter is instructed to execute a PHP file, the first thing that happens is lexing, or if you prefer, lexical analysis.

The lexer accepts a stream of characters as input, and emits a stream of tokens as output.

The input to the lexer is the characters of the code. The output is those sequences that the lexer recognizes, identified in a useful way.

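The following function is illustrative of what a lexer, or lexical analysis, does:
<?php
const TOKEN_IF = 1; // illustrative token identifier

function lexer(string $bytes) {
    // switch (true) runs the first case whose condition holds
    switch (true) {
        case substr($bytes, 0, 2) === "if":
            return TOKEN_IF;
    }

    return null; // no known sequence at the start of $bytes
}
?>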

A lexer doesn't actually work like that, but it's enough to understand what it does; you can read the associated documentation, or source code, to discover how it does it.
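
To make the idea a little more concrete, here is a minimal sketch in C of a hand-written function that recognizes one keyword. The names toy_token, toy_lex and TOY_T_IF are invented for illustration; nothing here is taken from PHP, whose scanner is generated rather than written by hand.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token identifiers, invented for this sketch. */
typedef enum { TOY_T_EOF, TOY_T_IF, TOY_T_UNKNOWN } toy_token;

/* Recognize the token at the start of `input`.
 * On return, *len holds the number of bytes consumed. */
toy_token toy_lex(const char *input, size_t *len)
{
    assert(input != NULL && len != NULL);

    if (input[0] == '\0') {
        *len = 0;
        return TOY_T_EOF;
    }

    /* "if" only counts as a keyword when it is not the prefix of a
     * longer identifier such as "iffy". */
    if (strncmp(input, "if", 2) == 0 &&
        !isalnum((unsigned char) input[2]) && input[2] != '_') {
        *len = 2;
        return TOY_T_IF;
    }

    *len = 1; /* skip one byte we do not understand */
    return TOY_T_UNKNOWN;
}
```

A caller would invoke toy_lex repeatedly, advancing the input by len bytes each time, until it sees TOY_T_EOF.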

The lexer function itself is generated by software known as a "lexer generator"; the one that PHP uses is named re2c.

The input file to the lexer generator (Zend/zend_language_scanner.l in the PHP source) contains a set of "rules" in a specific format.

I'll just take one illustrative excerpt from the input for the lexer generator:
<ST_IN_SCRIPTING>"if" {
    RETURN_TOKEN(T_IF);
}

This can be roughly translated to "if we are in a scripting state, and find the sequence 'if', return the identifier T_IF".

What "scripting state" means becomes clear when you remember that PHP used to be embedded in HTML: Not all of a file the interpreter reads is executable PHP code.
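
The idea of scanner states can be sketched in a few lines of C. This is an assumption-laden toy: the state names TOY_INITIAL and TOY_IN_SCRIPTING and the function toy_count_if are hypothetical, only the long open tag is handled, and the real scanner has more states than these two.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical scanner states, invented for this sketch. */
typedef enum { TOY_INITIAL, TOY_IN_SCRIPTING } toy_state;

/* Count occurrences of the byte sequence "if" seen while in the
 * scripting state; anything outside the tags is passed over. */
int toy_count_if(const char *src)
{
    toy_state state = TOY_INITIAL;
    int count = 0;

    assert(src != NULL);

    while (*src != '\0') {
        if (state == TOY_INITIAL) {
            if (strncmp(src, "<?php", 5) == 0) {
                state = TOY_IN_SCRIPTING; /* open tag: start scanning code */
                src += 5;
            } else {
                src++;                    /* inline HTML: not our business */
            }
        } else if (strncmp(src, "?>", 2) == 0) {
            state = TOY_INITIAL;          /* close tag: back to HTML */
            src += 2;
        } else if (strncmp(src, "if", 2) == 0) {
            count++;                      /* the rule only fires in this state */
            src += 2;
        } else {
            src++;
        }
    }

    return count;
}
```

The same two characters are ignored before the open tag and after the close tag, which is all that "in a scripting state" means in the rule above.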

Consuming Tokens: Parsing


The output from lexical analysis now goes through the process of parsing, or if you prefer, syntactical analysis.

You can think of the parser as a syntax2structure function, which will take your stream of tokens produced by lexical analysis, and create some kind of representative data structure.

Once again, the parser function is generated, by software known as a "parser generator"; the one that PHP uses is bison.

The input file to the parser generator (Zend/zend_language_parser.y), which is referred to as the "grammar", is another set of rules.

Here's my illustrative excerpt from the input for the parser generator:
    T_IF '(' expr ')' statement 
      { /* emit structure here */ }

This can be roughly translated as "if we find the syntax described, execute the code in braces".

Historically, pre PHP 7, parsing the code was the final step in the compilation of your code: the parser emitted opcodes directly, so we had one-pass compilation.

Today, in PHP 7, parsing the code is the penultimate step in compilation: the parser emits an abstract syntax tree, so we now have multiple-pass compilation.
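
If "abstract syntax tree" sounds intimidating, this minimal C sketch may help: a node is little more than a kind and some children. The toy_ast type and the toy_ast_create, toy_ast_destroy and toy_if_action functions are hypothetical names made up for this post, not the Zend structures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical node kinds, invented for this sketch. */
typedef enum { TOY_AST_CONST, TOY_AST_ECHO, TOY_AST_IF } toy_ast_kind;

typedef struct toy_ast {
    toy_ast_kind kind;
    int value;                /* payload, used by TOY_AST_CONST only */
    struct toy_ast *child[2]; /* unused slots are NULL */
} toy_ast;

/* Allocate a node; the node takes ownership of its children. */
toy_ast *toy_ast_create(toy_ast_kind kind, int value, toy_ast *c0, toy_ast *c1)
{
    toy_ast *node = malloc(sizeof *node);

    assert(node != NULL); /* sketch only: real code would report the failure */
    node->kind = kind;
    node->value = value;
    node->child[0] = c0;
    node->child[1] = c1;
    return node;
}

/* Free a node and, recursively, everything below it. */
void toy_ast_destroy(toy_ast *node)
{
    if (node == NULL) {
        return;
    }
    toy_ast_destroy(node->child[0]);
    toy_ast_destroy(node->child[1]);
    free(node);
}

/* Roughly what the action for  T_IF '(' expr ')' statement  could do:
 * wrap the already-built condition and body in one TOY_AST_IF node. */
toy_ast *toy_if_action(toy_ast *expr, toy_ast *statement)
{
    return toy_ast_create(TOY_AST_IF, 0, expr, statement);
}
```

The parser's only job in this arrangement is to build the tree; deciding which opcodes the tree turns into is left to a later pass.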

In the End: Consuming AST


The final pass is when we emit the target form: Zend opcodes.

Each node in the tree is passed in to a compiler function, depending on its kind, which in turn passes its child nodes in to their respective compiler functions, and so on ...

Here's my (abbreviated) illustrative excerpt:
void zend_compile_stmt(zend_ast *ast) /* {{{ */
{
 if (!ast) {
  return;
 }

 /* ... */

 switch (ast->kind) {
  /* ... */
  case ZEND_AST_IF:
   zend_compile_if(ast);
   break;
  /* ... */
  default:
  {
   znode result;
   zend_compile_expr(&result, ast);
   zend_do_free(&result);
  }
 }
}
/* }}} */

A new feature may not always require new abstract syntax; if this is the case, you can implement your RFC by using the existing AST and compiler API.

New abstract syntax will require a new compiler function, but may emit existing opcodes.
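
The recursive dispatch on node kind can be sketched in miniature. Everything in this C example is hypothetical (toy_ast, toy_op, toy_compile_stmt and friends are my own names), and it assumes conditions and echoed values are plain integer constants, which keeps the sketch short at the cost of realism.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TOY_AST_CONST, TOY_AST_ECHO, TOY_AST_IF } toy_ast_kind;

typedef struct toy_ast {
    toy_ast_kind kind;
    int value;                      /* payload of TOY_AST_CONST */
    const struct toy_ast *child[2]; /* unused slots are NULL */
} toy_ast;

typedef enum { TOY_ECHO, TOY_JMPZ } toy_opcode;

typedef struct {
    toy_opcode opcode;
    int op1;    /* constant operand */
    size_t op2; /* jump target, used by TOY_JMPZ only */
} toy_op;

typedef struct {
    toy_op ops[16];
    size_t last; /* number of ops emitted so far */
} toy_op_array;

/* Append one op and return its index. */
static size_t toy_emit_op(toy_op_array *oa, toy_opcode opcode, int op1)
{
    assert(oa->last < 16);
    oa->ops[oa->last].opcode = opcode;
    oa->ops[oa->last].op1 = op1;
    oa->ops[oa->last].op2 = 0;
    return oa->last++;
}

/* Dispatch on the node kind, recursing into child statements. */
void toy_compile_stmt(toy_op_array *oa, const toy_ast *ast)
{
    if (!ast) {
        return;
    }

    switch (ast->kind) {
        case TOY_AST_IF: {
            /* Emit the jump first, compile the body, then patch the
             * jump so that it lands just past the body. */
            size_t jmpz = toy_emit_op(oa, TOY_JMPZ, ast->child[0]->value);
            toy_compile_stmt(oa, ast->child[1]);
            oa->ops[jmpz].op2 = oa->last;
            break;
        }
        case TOY_AST_ECHO:
            toy_emit_op(oa, TOY_ECHO, ast->child[0]->value);
            break;
        case TOY_AST_CONST:
            break; /* a bare constant used as a statement emits nothing */
    }
}
```

Note that the new kind of syntax (the if) needed its own case, yet it emitted nothing more exotic than a conditional jump.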

The VM


The virtual machine is the thing that executes opcodes: it's as much like a CPU as opcodes are like instructions for a CPU.
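
A minimal sketch of that idea in C: a loop that looks at the current instruction, does what it says, and moves on. The opcodes, the toy_op layout and toy_execute are hypothetical and bear only a family resemblance to the Zend VM.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TOY_ECHO, TOY_JMPZ, TOY_RETURN } toy_opcode;

typedef struct {
    toy_opcode opcode;
    int op1;    /* constant operand */
    size_t op2; /* jump target (index into the op array) */
} toy_op;

/* Run `ops` until TOY_RETURN. Echoed values are stored in `out`
 * (at most out_max of them); the number stored is returned. */
size_t toy_execute(const toy_op *ops, int *out, size_t out_max)
{
    const toy_op *opline = ops;
    size_t n = 0;

    assert(ops != NULL && out != NULL);

    for (;;) {
        switch (opline->opcode) {
            case TOY_JMPZ:
                /* jump when the operand is zero, else fall through */
                opline = opline->op1 ? opline + 1 : ops + opline->op2;
                break;
            case TOY_ECHO:
                if (n < out_max) {
                    out[n++] = opline->op1;
                }
                opline++;
                break;
            case TOY_RETURN:
            default:
                return n; /* unknown opcodes also stop the machine */
        }
    }
}
```

Each case of the switch plays the role of a handler: it performs the operation and decides which instruction comes next.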

For each opcode there exists a handler; here is my illustrative excerpt:
ZEND_VM_HANDLER(43, ZEND_JMPZ, CONST|TMPVAR|CV, JMP_ADDR)
{
 USE_OPLINE
 zend_free_op free_op1;
 zval *val;

 val = GET_OP1_ZVAL_PTR_UNDEF(BP_VAR_R);
 
 if (Z_TYPE_INFO_P(val) == IS_TRUE) {
  ZEND_VM_SET_NEXT_OPCODE(opline + 1);
  ZEND_VM_CONTINUE();
 } else if (EXPECTED(Z_TYPE_INFO_P(val) <= IS_TRUE)) {
  if (OP1_TYPE == IS_CV && UNEXPECTED(Z_TYPE_INFO_P(val) == IS_UNDEF)) {
   SAVE_OPLINE();
   GET_OP1_UNDEF_CV(val, BP_VAR_R);
   ZEND_VM_JMP(OP_JMP_ADDR(opline, opline->op2));
  } else {
   ZEND_VM_SET_OPCODE(OP_JMP_ADDR(opline, opline->op2));
   ZEND_VM_CONTINUE();
  }
 }

 SAVE_OPLINE();
 if (i_zend_is_true(val)) {
  opline++;
 } else {
  opline = OP_JMP_ADDR(opline, opline->op2);
 }
 FREE_OP1();
 if (UNEXPECTED(EG(exception) != NULL)) {
  HANDLE_EXCEPTION();
 }
 ZEND_VM_JMP(opline);
}

New opcodes are going to need new handlers, which you simply add to zend_vm_def.h, at the end of the list, using the next available opcode number.

The prototype for the macro ZEND_VM_HANDLER can be assumed to be:
    ZEND_VM_HANDLER(opnum, opname, op1_type, op2_type)

This is obviously not standard C code: this definition header file is used as input to generate the real handler functions for all the permutations of operand types used.
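
As a loose analogy only, the "write it once, generate one function per operand type" idea can be imitated with the C preprocessor. PHP does not do it this way (it uses a generator script), and every name below (toy_operand, TOY_DEFINE_NEGATE, the fetch macros) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* A toy operand: either an inline constant or a pointer to a variable. */
typedef struct {
    int constant;
    const int *cv;
} toy_operand;

/* How to read the operand, one macro per operand type. */
#define TOY_FETCH_CONST(op) ((op)->constant)
#define TOY_FETCH_CV(op)    (*(op)->cv)

/* The handler body is written once; each use of this macro stamps out
 * a specialized function with the fetch strategy baked in. */
#define TOY_DEFINE_NEGATE(suffix, FETCH)                 \
    int toy_negate_##suffix(const toy_operand *op1)      \
    {                                                    \
        assert(op1 != NULL);                             \
        return -(FETCH(op1));                            \
    }

TOY_DEFINE_NEGATE(constant, TOY_FETCH_CONST) /* defines toy_negate_constant */
TOY_DEFINE_NEGATE(cv, TOY_FETCH_CV)          /* defines toy_negate_cv */
```

The point of specializing is that no handler has to ask at run time what kind of operand it was given; that question was answered when the function was generated.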

When changes are made to the header, you must regenerate the VM by executing zend_vm_gen.php (in the Zend folder).

The Hack


The first hack introduced in the screen cast was a hipster expression. The hipster expression looks like a function call (à la include_once). If the result of the expression passed to hipster is a string, the result is copied to the return value, or else an exception is thrown.

This is a completely useless feature, other than its illustrative value.
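
Stripped of everything Zend, the behaviour we are after fits in a few lines of C. This is a sketch under obvious assumptions: toy_val is a made-up stand-in for a PHP value, toy_hipster is a hypothetical name, and a -1 return code stands in for the exception.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TOY_IS_LONG, TOY_IS_STRING } toy_type;

/* A made-up tagged value; only the member matching `type` is meaningful. */
typedef struct {
    toy_type type;
    long lval;
    const char *str;
} toy_val;

/* Copy op1 to result if it is a string and return 0;
 * otherwise leave result untouched and return -1. */
int toy_hipster(toy_val *result, const toy_val *op1)
{
    assert(result != NULL && op1 != NULL);

    if (op1->type != TOY_IS_STRING) {
        return -1; /* "hipster expects a string !" */
    }

    *result = *op1;
    return 0;
}
```

The rest of this post is about teaching each stage of PHP to get from the text hipster(...) to the equivalent of that function.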

Since we are introducing a new token "hipster", we must start by editing the lexer to include the following code:
<ST_IN_SCRIPTING>"hipster" {
 RETURN_TOKEN(T_HIPSTER);
} 

Care should be taken to make edits in sensible places in these generator input files.

The next thing to do is make the parser aware of the new token.

Open up the parser generator input file and search for "%token T_INCLUDE", something like the following will be found:
%token T_INCLUDE    "include (T_INCLUDE)"

Token declarations have the form %token ID FRIENDLY_NAME, so we must add our new token:
%token T_HIPSTER "hipster (T_HIPSTER)"

Now search for "%left" (or scroll down), something like the following should be found:
%left T_INCLUDE T_INCLUDE_ONCE T_EVAL T_REQUIRE T_REQUIRE_ONCE

This section of the file is where operator associativity is set, and precedence implied in the order.

Since we are introducing a non-associative token, we add the following line in an appropriate place (at the end of the first block of %nonassoc tokens, for example):
%nonassoc T_HIPSTER

Now we must add a new rule to the parser in an appropriate place; since we are adding a new expression, we use the expr_without_variable rule.

In the example hack, we add:
 | T_HIPSTER '(' expr ')' { $$ = zend_ast_create(ZEND_AST_HIPSTER, $3); }

Underneath:
 | internal_functions_in_yacc { $$ = $1; }

Notice we are going to use a new kind of AST node, so the next thing to do is edit zend_ast.h to add our new ZEND_AST_HIPSTER to the _zend_ast_kind enumeration.

The enumerated kinds have a magical format: they are grouped by the number of child nodes each kind takes, so the list requires some study to determine where to add the new kind. Because our new node will have a single child, the expression, we add it among the kinds that take one child:
    ZEND_AST_CONTINUE,
    ZEND_AST_HIPSTER,

The next thing to do is edit the switch in the zend_compile_expr compiler function (in zend_compile.c), adding, in an appropriate place:
    case ZEND_AST_HIPSTER:
        zend_compile_hipster(result, ast);
        return;

Then we must define zend_compile_hipster:
void zend_compile_hipster(znode *result, zend_ast *ast) /* {{{ */
{
 zend_ast *expr_ast = ast->child[0];
 znode expr_node;

 zend_compile_expr(&expr_node, expr_ast);

 zend_emit_op(result, ZEND_HIPSTER, &expr_node, NULL);
} /* }}} */

At this point, the lexer, parser, and compiler are all aware of our new feature, but we emit an opcode that does not exist.

So, we edit zend_vm_def.h to add our new opcode handler:
ZEND_VM_HANDLER(184, ZEND_HIPSTER, ANY, ANY) 
{
 USE_OPLINE
 zend_free_op free_op1;
 zval *op1;

 SAVE_OPLINE();
 op1 = GET_OP1_ZVAL_PTR_UNDEF(BP_VAR_R);

 if (Z_TYPE_P(op1) != IS_STRING) {
  zend_throw_exception_ex(NULL, 0,
   "hipster expects a string !");
  FREE_OP1(); /* release a temporary operand before unwinding */
  HANDLE_EXCEPTION();
 }

 ZVAL_COPY(EX_VAR(opline->result.var), op1);

 FREE_OP1();
 ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION();
}

Remember to regenerate the VM after this change.

We're done, so execute make, and then marvel at:
sapi/cli/php -r "echo hipster('PHP 7 rocks') . PHP_EOL;"

Finally


The best way to learn about PHP internals is to dig in: If all of us got together in a room tomorrow, and did not leave until we had produced a volume of encyclopedic knowledge, the best way to learn would still be to dig in.

If everything were explained in exquisite detail here, you probably wouldn't have got to the end ... You may have to read this more than once, I'm sure you'll have to do a bunch of research besides ...

Hopefully, I've given you enough links, terms to research, and things to think about to inspire you to do just that ...

Happy hacking :)

Wednesday, 2 March 2016

Picking an Approach

Fig 1. Several Languages
I should hope that the majority of people reading this consider themselves polyglots.

A polyglot is a person able to speak in many languages; it's almost a requirement of programming that we should know more than one language.

Using the right language for the job is a worthy aspiration to have.

When you, or I, as a programmer are setting out to write a new product, or application, we should definitely consider our options: It's a matter of fact that no language is best for everything.

The right tool for the job makes sense as part of our approach to writing applications ...

Choosing Wisely


Fig 2. PHP Internals

When we approach the design of a language, I submit that it doesn't make sense to use the "right tool for the job" argument against having a feature.

We need to aim for something in our approach, but what should it be?

Before I try to define that, we should probably admit that whatever our aim, we might miss by quite wide margins.

It's difficult to organize groups of individuals spread out across the world when there isn't so much as a single person you could identify as "manager". Your team at work might be spread out across the world too, but it is carefully managed.

We can choose an idyllic aim, even knowing that we will probably miss: being anywhere close to an ideal is better than having no aim at all!

I think our aim should be to provide the best tool for the job.

It's true that no language is best at everything; it's also true that no language is good at everything. However, it has to be part of everybody's ideal that such a language should exist: one that is good at everything.

We will probably miss our target, PHP will never be best at everything, but in aiming for that, we have a good chance, over a long period and many versions, of really having a language that is genuinely good at everything (that we care about at the time).

It still won't be the right tool for every job; there will be better languages, forever.

I want to see people stop advancing "the right tool for the job" as an argument; I want to see people accept foreign ideas more willingly if it creates a better tool for programming.

In the not too distant future, when things like async/await are suggested, don't be tempted to rebel against that because there are better tools for the job.

Don't shout "use Go", even though we all know it's probably the best tool for that kind of work today.

Let us make PHP a realistic option for tomorrow, let us at least try to provide the best tool for the job ...

Back to code ...