Tuesday, 15 March 2016

Hacking PHP 7


Recently, I have taken part in some screen casts with my good friends at 3devs.

The subject of the screen casts is extension development for, and hacking of, PHP 7 (Part 1, Part 2).

Screen casting is a medium I haven't mastered, or had very much practice at.

While I'm trying to plan the content for the show, I can't help but be reminded of every lecturer and tutor that stood in front of me repeating facts, sometimes literally reading from a book, or their own notes.

As a rule of thumb, if someone says something to me, and my life does not depend on my retention of the information contained in the statement, I will immediately, and without prejudice to the speaker, forget what was said.

One of the first things we learn as children is our multiplication tables, and we first remember those which have a pattern, or some trick, to describe or determine the sequence of numbers. We don't remember the sequence until we have uttered it at least a few hundred times. The same goes for the alphabet; we roughly remember a kind of melody, and get most of the letters in the second half of the alphabet wrong for quite some time.

As education progresses, perhaps for reasons of logistical expediency, the focus does shift from inspiring us, with the beauty of the rainbow, to learn how the rainbow works, to barking facts at us about the past wars and rulers of our respective country, and having us try to memorize various tables of information.

This sucks, it sucks so hard: just when our minds become fully primed, and developed enough to really respond to inspiration, the inspiration is almost completely squashed from our education process.

Some may be lucky enough to have a better experience of education, but broadly speaking, until higher education anyway, this is how we were, and our children still are, "taught".

It is the job of the teacher to inspire listeners to learn for themselves; it is not the job of a teacher to bark facts. I've really tried to convey that in the material we prepared.

Screen casting is a difficult medium, however, so this blog post accompanies the screencast.

Writing extensions is fun, but it's not as fun as hacking PHP. So, we're going to focus on hacking: we're going to imagine that we are introducing some new language feature, by RFC.

Without focusing on the RFC process itself, you need to know which are the relevant parts of PHP you need to change, in order to introduce new language features.

You also need to know how PHP 7 works, about each stage of turning text into Zend opcodes ...

In the Beginning: Lexing


When the interpreter is instructed to execute a PHP file, the first thing that happens is lexing, or if you prefer, lexical analysis.

The lexer accepts a stream of characters as input, and emits a stream of tokens as output.

The input to the lexer is the characters of the code. The output is those sequences that the lexer recognizes, identified in a useful way.

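The following function is illustrative of what a lexer, or lexical analyser, does:
<?php
const TOKEN_IF = 1; // illustrative token identifier

function lexer($bytes /* , ... */) {
    // Recognize the sequence "if" at the start of the input
    if (substr($bytes, 0, 2) === "if") {
        return TOKEN_IF;
    }
    return null; // nothing recognized
}
?>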

A lexer doesn't actually work like that, but it's enough to understand what it does; you can read the associated documentation, or source code, to discover how it does it.

The lexer function itself is generated by software known as a "lexer generator"; the one that PHP uses is named re2c.

The input file to the lexer generator is a file which contains a set of "rules" in a specific format.

I'll just take one illustrative excerpt from the input for the lexer generator:
<ST_IN_SCRIPTING>"if" {
    RETURN_TOKEN(T_IF);
}

This can be roughly translated to "if we are in a scripting state, and find the sequence 'if', return the identifier T_IF".

What "scripting state" means becomes clear when you remember that PHP used to be embedded in HTML: Not all of a file the interpreter reads is executable PHP code.
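To make the idea of lexer states concrete, here is a minimal hand-written sketch in C. It is not what re2c generates, and every name in it (toy_lexer, toy_next, the TOY_ tokens) is invented for illustration. It assumes only "<?php", "?>" and "if" are worth recognizing, and it does not check word boundaries, so treat it as a picture of the idea rather than a real lexer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token identifiers, invented for this sketch */
typedef enum {
    TOY_END, TOY_INLINE_HTML, TOY_OPEN_TAG, TOY_CLOSE_TAG, TOY_IF, TOY_OTHER
} toy_token;

/* The two states: outside PHP tags, and inside them */
typedef enum { TOY_INITIAL, TOY_IN_SCRIPTING } toy_state;

typedef struct {
    const char *cursor; /* next unread character */
    toy_state   state;
} toy_lexer;

static int toy_starts_with(const char *s, const char *prefix) {
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Consume one token from the input and return its identifier */
toy_token toy_next(toy_lexer *lex) {
    assert(lex != NULL && lex->cursor != NULL);

    if (*lex->cursor == '\0') {
        return TOY_END;
    }

    if (lex->state == TOY_INITIAL) {
        if (toy_starts_with(lex->cursor, "<?php")) {
            lex->cursor += 5;
            lex->state = TOY_IN_SCRIPTING;
            return TOY_OPEN_TAG;
        }
        /* Everything up to the next open tag is inline HTML */
        const char *tag = strstr(lex->cursor, "<?php");
        lex->cursor = tag ? tag : lex->cursor + strlen(lex->cursor);
        return TOY_INLINE_HTML;
    }

    /* TOY_IN_SCRIPTING: skip whitespace, then match known sequences */
    while (*lex->cursor == ' ' || *lex->cursor == '\n') {
        lex->cursor++;
    }
    if (*lex->cursor == '\0') {
        return TOY_END;
    }
    if (toy_starts_with(lex->cursor, "?>")) {
        lex->cursor += 2;
        lex->state = TOY_INITIAL;
        return TOY_CLOSE_TAG;
    }
    if (toy_starts_with(lex->cursor, "if")) {
        lex->cursor += 2;
        return TOY_IF;
    }
    lex->cursor++; /* anything else: one character at a time */
    return TOY_OTHER;
}
```

Feed it the input "if <?php if ?>" and the first "if" comes back as inline HTML, while the second comes back as TOY_IF: the same characters, a different state.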

Consuming Tokens: Parsing


The output from lexical analysis now goes through the process of parsing, or if you prefer, syntactical analysis.

You can think of the parser as a syntax2structure function, which will take your stream of tokens produced by lexical analysis, and create some kind of representative data structure.

Once again, the parser function is generated by software known as a "parser generator"; the one that PHP uses is bison.

The input file to the parser generator, which is referred to as the "grammar", is another set of rules.

Here's my illustrative excerpt from the input for the parser generator:
    T_IF '(' expr ')' statement 
      { /* emit structure here */ }

This can be roughly translated as "if we find the syntax described, execute the code in braces".

Historically, before PHP 7, parsing was the final step in the compilation of your code: the parser emitted opcodes directly, so we had one-pass compilation.

Today, in PHP 7, parsing is the penultimate step in compilation: the parser emits an abstract syntax tree, so we now have multiple-pass compilation.
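If "tokens in, tree out" feels abstract, the following sketch may help. It is a hand-written recursive descent function in C for the single rule shown earlier, which is not how a bison-generated parser works internally (that is table driven). All names (toy_parser, toy_parse_if, the TOK_ and NODE_ identifiers) are invented, and expr and statement are each assumed to be a single token to keep it short.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tokens and node kinds, invented for this sketch */
typedef enum { TOK_IF, TOK_LPAREN, TOK_RPAREN, TOK_EXPR, TOK_STMT, TOK_END } toy_tok;
typedef enum { NODE_EXPR, NODE_STMT, NODE_IF } toy_kind;

typedef struct toy_node {
    toy_kind kind;
    struct toy_node *child[2]; /* NODE_IF: condition and body */
} toy_node;

typedef struct {
    const toy_tok *tokens; /* token stream, terminated by TOK_END */
    size_t pos;
    toy_node pool[8];      /* fixed storage keeps the sketch free of malloc */
    size_t used;
} toy_parser;

static toy_node *toy_new(toy_parser *p, toy_kind kind, toy_node *a, toy_node *b) {
    assert(p->used < 8);
    toy_node *n = &p->pool[p->used++];
    n->kind = kind;
    n->child[0] = a;
    n->child[1] = b;
    return n;
}

/* Consume the expected token, or report failure without consuming */
static int toy_accept(toy_parser *p, toy_tok expected) {
    if (p->tokens[p->pos] != expected) {
        return 0;
    }
    p->pos++;
    return 1;
}

/* Rule: T_IF '(' expr ')' statement  ->  NODE_IF(expr, statement) */
toy_node *toy_parse_if(toy_parser *p) {
    if (!toy_accept(p, TOK_IF) || !toy_accept(p, TOK_LPAREN)) {
        return NULL;
    }
    if (!toy_accept(p, TOK_EXPR)) {
        return NULL;
    }
    toy_node *cond = toy_new(p, NODE_EXPR, NULL, NULL);
    if (!toy_accept(p, TOK_RPAREN) || !toy_accept(p, TOK_STMT)) {
        return NULL;
    }
    toy_node *body = toy_new(p, NODE_STMT, NULL, NULL);
    return toy_new(p, NODE_IF, cond, body);
}
```

The last line of toy_parse_if plays the part of the code in braces in the grammar excerpt: once the syntax has matched, a structure is emitted.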

In the End: Consuming AST


The final pass is when we emit the target form, Zend opcodes.

Each node in the tree is passed in to a compiler function, depending on its kind, which in turn passes its child nodes in to their respective compiler functions, and so on ...

Here's my (abbreviated) illustrative excerpt:
void zend_compile_stmt(zend_ast *ast) /* {{{ */
{
 if (!ast) {
  return;
 }

 /* ... */

 switch (ast->kind) {
  /* ... */
  case ZEND_AST_IF:
   zend_compile_if(ast);
   break;
  /* ... */
  default:
  {
   znode result;
   zend_compile_expr(&result, ast);
   zend_do_free(&result);
  }
 }
}
/* }}} */

A new feature may not always require new abstract syntax; if this is the case, you can implement your RFC by using the AST and Compiler API.

New abstract syntax will require a new compiler function, but may emit existing opcodes.
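The shape of that recursion is easier to see in miniature. Below is a toy compiler in C that walks a three-kind AST and emits opcodes for an imaginary stack machine. Zend does not use a stack machine (its opcodes carry operands and results), and every name here (toy_compile, toy_emit, the AST_ and OP_ identifiers) is made up; only the dispatch-on-kind-then-recurse pattern is meant to carry over.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node kinds and opcodes, invented for this sketch */
typedef enum { AST_CONST, AST_ADD, AST_ECHO } toy_ast_kind;
typedef enum { OP_PUSH, OP_ADD, OP_ECHO } toy_opcode;

typedef struct toy_ast {
    toy_ast_kind kind;
    int value;                      /* used by AST_CONST only */
    const struct toy_ast *child[2];
} toy_ast;

typedef struct { toy_opcode opcode; int operand; } toy_op;

typedef struct {
    toy_op ops[16];
    size_t count;
} toy_op_array;

/* Append one opcode to the output */
static void toy_emit(toy_op_array *out, toy_opcode opcode, int operand) {
    assert(out->count < 16);
    out->ops[out->count].opcode = opcode;
    out->ops[out->count].operand = operand;
    out->count++;
}

/* Dispatch on the node kind; compile children before the node itself */
void toy_compile(toy_op_array *out, const toy_ast *ast) {
    if (!ast) {
        return;
    }
    switch (ast->kind) {
        case AST_CONST:
            toy_emit(out, OP_PUSH, ast->value);
            break;
        case AST_ADD:
            toy_compile(out, ast->child[0]);
            toy_compile(out, ast->child[1]);
            toy_emit(out, OP_ADD, 0);
            break;
        case AST_ECHO:
            toy_compile(out, ast->child[0]);
            toy_emit(out, OP_ECHO, 0);
            break;
    }
}
```

Compiling the tree for "echo 1 + 2" yields four opcodes in order: push 1, push 2, add, echo. A new kind of node means a new case in the switch, even when the opcodes it emits already exist.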

The VM


The virtual machine is the thing that executes opcodes; it's as much like a CPU as opcodes are like instructions for a CPU.
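Before looking at a real handler, it may help to see the CPU comparison as a dozen lines of C. This is a toy with a single accumulator and four invented opcodes (VM_LOAD, VM_ADD, VM_JMPZ, VM_HALT); the Zend VM is far richer and is organised as one handler per opcode rather than one big switch, so take this only as a sketch of "fetch an opline, execute it, move on".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes, invented for this sketch */
typedef enum { VM_LOAD, VM_ADD, VM_JMPZ, VM_HALT } vm_opcode;

typedef struct { vm_opcode opcode; int operand; } vm_op;

/* Run ops until VM_HALT (or the end) and return the accumulator */
int vm_execute(const vm_op *ops, size_t count) {
    assert(ops != NULL || count == 0);

    int acc = 0;
    size_t pc = 0; /* index of the opline being executed */

    while (pc < count) {
        const vm_op *opline = &ops[pc];
        switch (opline->opcode) {
            case VM_LOAD:
                acc = opline->operand;
                pc++;
                break;
            case VM_ADD:
                acc += opline->operand;
                pc++;
                break;
            case VM_JMPZ:
                /* jump to the operand if acc is zero, else fall through */
                pc = (acc == 0) ? (size_t)opline->operand : pc + 1;
                break;
            case VM_HALT:
                return acc;
        }
    }
    return acc;
}
```

The VM_JMPZ case is the toy cousin of ZEND_JMPZ: test a value, then either step to the next opline or jump to the address in the operand.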

A handler exists for each opcode; here is my illustrative excerpt:
ZEND_VM_HANDLER(43, ZEND_JMPZ, CONST|TMPVAR|CV, JMP_ADDR)
{
 USE_OPLINE
 zend_free_op free_op1;
 zval *val;

 val = GET_OP1_ZVAL_PTR_UNDEF(BP_VAR_R);
 
 if (Z_TYPE_INFO_P(val) == IS_TRUE) {
  ZEND_VM_SET_NEXT_OPCODE(opline + 1);
  ZEND_VM_CONTINUE();
 } else if (EXPECTED(Z_TYPE_INFO_P(val) <= IS_TRUE)) {
  if (OP1_TYPE == IS_CV && UNEXPECTED(Z_TYPE_INFO_P(val) == IS_UNDEF)) {
   SAVE_OPLINE();
   GET_OP1_UNDEF_CV(val, BP_VAR_R);
   ZEND_VM_JMP(OP_JMP_ADDR(opline, opline->op2));
  } else {
   ZEND_VM_SET_OPCODE(OP_JMP_ADDR(opline, opline->op2));
   ZEND_VM_CONTINUE();
  }
 }

 SAVE_OPLINE();
 if (i_zend_is_true(val)) {
  opline++;
 } else {
  opline = OP_JMP_ADDR(opline, opline->op2);
 }
 FREE_OP1();
 if (UNEXPECTED(EG(exception) != NULL)) {
  HANDLE_EXCEPTION();
 }
 ZEND_VM_JMP(opline);
}

New opcodes are going to need new handlers, which you simply add to zend_vm_def.h, at the end of the list, using the next available opcode number.

The prototype for the macro ZEND_VM_HANDLER can be assumed:
    ZEND_VM_HANDLER(opnum, opname, op1_type, op2_type)

This is obviously not standard C code; this definition header file is used as input to generate the real handler functions for all the permutations of operand types used.
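One way to picture "one definition, many generated functions" is with the C preprocessor standing in for the generator. The real generator is a PHP script and works quite differently, so everything below is an analogy: the names (SPEC_NEGATE, spec_operand, spec_vars) are invented, and there are only two operand types and a single operand.

```c
#include <assert.h>

/* Hypothetical operand types, invented for this sketch */
typedef enum { SPEC_CONST, SPEC_CV } spec_type;

typedef struct { spec_type type; int value; } spec_operand;

/* Storage for "compiled variables"; a CV operand holds an index into it */
int spec_vars[4];

/* How to fetch the operand's value, per operand type */
#define SPEC_FETCH_CONST(op) ((op)->value)
#define SPEC_FETCH_CV(op)    (spec_vars[(op)->value])

/* One handler body, written once, stamped out per operand type */
#define SPEC_NEGATE(suffix, FETCH) \
    static int spec_negate_##suffix(const spec_operand *op) { \
        return -FETCH(op); \
    }

SPEC_NEGATE(CONST, SPEC_FETCH_CONST)
SPEC_NEGATE(CV, SPEC_FETCH_CV)

typedef int (*spec_handler)(const spec_operand *);

/* Table indexed by operand type, filled with the specialized functions */
static const spec_handler spec_negate_handlers[] = {
    spec_negate_CONST, /* SPEC_CONST */
    spec_negate_CV     /* SPEC_CV */
};

/* Pick the specialized handler for this operand and run it */
int spec_negate(const spec_operand *op) {
    assert(op->type == SPEC_CONST || op->type == SPEC_CV);
    return spec_negate_handlers[op->type](op);
}
```

Each specialized function contains no check of the operand type at all; that decision was made when the function was generated, which is the point of specializing.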

When changes are made to the header, you must regenerate the VM by executing zend_vm_gen.php (in the Zend folder).

The Hack


The first hack introduced in the screen cast was a hipster expression. The hipster expression looks like a function call (à la include_once). If the result of the expression passed to hipster is a string, the result is copied to the return value, or else an exception is thrown.

This is a completely useless feature, other than its illustrative value.

Since we are introducing a new token "hipster", we must start by editing the lexer to include the following code:
<ST_IN_SCRIPTING>"hipster" {
 RETURN_TOKEN(T_HIPSTER);
} 

Care should be taken to make edits in sensible places in these generator input files.

The next thing to do is make the parser aware of the new token.

Open up the parser generator input file and search for "%token T_INCLUDE", something like the following will be found:
%token T_INCLUDE    "include (T_INCLUDE)"

%token declarations have the form %token ID FRIENDLY_NAME; we must add our new token:
%token T_HIPSTER "hipster (T_HIPSTER)"

Now search for "%left" (or scroll down), something like the following should be found:
%left T_INCLUDE T_INCLUDE_ONCE T_EVAL T_REQUIRE T_REQUIRE_ONCE

This section of the file is where operator associativity is set, and precedence implied in the order.

Since we are introducing a non-associative token we add the line:
%nonassoc T_HIPSTER

Add it in an appropriate place, at the end of the first block of %nonassoc tokens, for example.

Now we must add a new rule to the parser in an appropriate place; since we are adding a new expression, we use the expr_without_variable rule.

In the example hack, we add:
 | T_HIPSTER '(' expr ')' { $$ = zend_ast_create(ZEND_AST_HIPSTER, $3); }

Underneath:
 | internal_functions_in_yacc { $$ = $1; }

Notice we are going to use a new kind of AST node, so the next thing to do is edit zend_ast.h to add our new ZEND_AST_HIPSTER to the enumerated _zend_ast_kind.

The enumerated kinds have a magical format that requires studying to determine where to add the new kind; we add it here:
        
    ZEND_AST_CONTINUE,
    ZEND_AST_HIPSTER,

We add it there because our new node will have a single child, the expression.
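As far as I can tell from studying the header, the magic is that the number of children is encoded in the value of the kind itself, so position in the enum matters. Here is a cut-down sketch of that idea in C. The SKETCH_ names, the shift value and the handful of kinds are stand-ins chosen for illustration, so check zend_ast.h in your own tree rather than trusting the numbers here.

```c
#include <assert.h>

/* Assumed shift: the bits above it hold the number of children */
#define SKETCH_NUM_CHILDREN_SHIFT 8

typedef enum {
    /* 0 child nodes */
    SKETCH_AST_MAGIC_CONST = 0 << SKETCH_NUM_CHILDREN_SHIFT,
    SKETCH_AST_TYPE,

    /* 1 child node */
    SKETCH_AST_VAR = 1 << SKETCH_NUM_CHILDREN_SHIFT,
    SKETCH_AST_CONTINUE,
    SKETCH_AST_HIPSTER, /* the new kind, last in the one-child group */

    /* 2 child nodes */
    SKETCH_AST_DIM = 2 << SKETCH_NUM_CHILDREN_SHIFT,
    SKETCH_AST_PROP
} sketch_kind;

/* Recover the child count from the kind alone */
unsigned sketch_num_children(sketch_kind kind) {
    return (unsigned)kind >> SKETCH_NUM_CHILDREN_SHIFT;
}

/* Guard a node constructor might use: the kind dictates the child count */
void sketch_check_children(sketch_kind kind, unsigned given) {
    assert(sketch_num_children(kind) == given);
}
```

Put the new kind in the wrong group and anything that walks the tree will look for the wrong number of children, which is why it goes directly after the last one-child kind.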

The next thing to do is edit the switch in the zend_compile_expr compiler function, adding, in an appropriate place:
    case ZEND_AST_HIPSTER:
        zend_compile_hipster(result, ast);
        return;

Then we must define zend_compile_hipster:
void zend_compile_hipster(znode *result, zend_ast *ast) /* {{{ */
{
 zend_ast *expr_ast = ast->child[0];
 znode expr_node;

 zend_compile_expr(&expr_node, expr_ast);

 zend_emit_op(result, ZEND_HIPSTER, &expr_node, NULL);
} /* }}} */

At this point, the lexer, parser, and compiler are all aware of our new feature, but we emit an opcode that does not exist.

So, we edit zend_vm_def.h to add our new opcode handler:
ZEND_VM_HANDLER(184, ZEND_HIPSTER, ANY, ANY) 
{
 USE_OPLINE
 zend_free_op free_op1;
 zval *op1;

 SAVE_OPLINE();
 op1 = GET_OP1_ZVAL_PTR_UNDEF(BP_VAR_R);

 if (Z_TYPE_P(op1) != IS_STRING) {
  zend_throw_exception_ex(NULL, 0,
   "hipster expects a string !");
  HANDLE_EXCEPTION();
 }

 ZVAL_COPY(EX_VAR(opline->result.var), op1);

 FREE_OP1();
 ZEND_VM_NEXT_OPCODE_CHECK_EXCEPTION();
}

Remember to regenerate the VM after this change.

We're done, so execute make, and then marvel at:
sapi/cli/php -r "echo hipster('PHP 7 rocks') . PHP_EOL;"

Finally


The best way to learn about PHP internals is to dig in: if all of us got together in a room tomorrow, and did not leave until we had produced a volume of encyclopedic knowledge, the best way to learn would still be to dig in.

If everything were explained in exquisite detail here, you probably wouldn't have got to the end ... You may have to read this more than once, I'm sure you'll have to do a bunch of research besides ...

Hopefully, I've given you enough links, terms to research, and things to think about to inspire you to do just that ...

Happy hacking :)

7 comments:

  1. Thanks Joe, Really helpful article!

  2. Great article, I wish you published it a few months ago, before I dived in :D

    PS : a funny typo in your article :
    "The fist hack introduced"

  3. Awesome article! Thank u Joe!
    About a week ago, I take this article to my reading plan. When I read this just now, I'm very exciting and pity for reading latly. Can I translate it to Chinese and share to Chinese phper?

  4. These are great lexical analysis that you include in your article and I think this will be useful for me while doing my first project are http://essaywriting.education/ in which it involves some coding terms to comply my writing perspectives

  5. Thanks for the tutorial. Many moons ago I used yacc, and more recently been playing with ANTLR2.
    I'm not able to get your example to work probably around the `%nonassoc T_HIPSTER` code. I'll keep stabbing it until I can get it to bleed. Probably something stupid simple like a missing semicolon. Anyway thanks for sharing your knowledge.
