Friday, 20 November 2015

APC and Me

Fig 1. An APC logo.

When it was decided that Zend's Optimizer Plus would be merged into PHP, APC was already in a pretty poor state; there hadn't been a stable release for quite some time.

We were moving towards having a built-in (abandoned in php-src/ext) opcode caching solution, but it was not obvious that APC was going to keep being maintained.

I think all of us assumed that the bugs that were being experienced were entirely down to the opcode caching parts of APC, since they were the most complex and most frequently altered parts of APC.

So, one day (a few nights), I stripped APC of opcode caching.

One of the things I decided I would tidy up was the implementation and usage of locking.

The supported kinds of locks are:
  • File locking
  • Mutex
  • Read/Write Locks
  • Spin Locks
Brief explanations follow.

File Locking


I'm going to assume that everybody reading knows what this is, and even without experience, can sense that it is probably the most inferior in the list.

I sincerely hope that nobody uses file locking today.

It exists because people deploy PHP in all kinds of places, places we don't get to hear about until something goes wrong. Those places might not have support for anything other than file locking, so it stays.

Mutex


This is your most basic kind of synchronization. Mutex means Mutual Exclusion, so we know that this kind of synchronization is exclusive.

This was the default locking for APC.

Read/Write Locks


Read/Write locking allows a shared lock to be acquired for reading; this means many readers can proceed at once without excluding one another. An exclusive lock is only required for writing.

This is the default for APCu.
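
To make the shared-versus-exclusive distinction concrete, here is a minimal sketch in userland PHP. It borrows flock()'s LOCK_SH and LOCK_EX modes purely as an illustration; APCu's own read/write locks are implemented in C and are not shown here. The temporary file and the variable names are made up for the example, and the behaviour assumes a platform such as Linux where each fopen() handle holds its own lock.

```php
<?php
// Illustration only: flock()'s LOCK_SH / LOCK_EX modes are borrowed to show
// shared-versus-exclusive semantics. This is not how APCu locks its cache.
$path = tempnam(sys_get_temp_dir(), 'rw');

$readerA = fopen($path, 'r');
$readerB = fopen($path, 'r');
$writer  = fopen($path, 'r+');

// Two readers can hold the shared lock at the same time.
$a = flock($readerA, LOCK_SH | LOCK_NB);
$b = flock($readerB, LOCK_SH | LOCK_NB);

// A writer needs the exclusive lock. LOCK_NB makes the attempt report
// failure immediately instead of blocking while readers are active.
$whileReading = flock($writer, LOCK_EX | LOCK_NB);

// Once every reader has released, the writer can get in.
flock($readerA, LOCK_UN);
flock($readerB, LOCK_UN);
$afterReaders = flock($writer, LOCK_EX | LOCK_NB);

var_dump($a, $b, $whileReading, $afterReaders);

// Tidy up the temporary file and handles.
flock($writer, LOCK_UN);
fclose($readerA);
fclose($readerB);
fclose($writer);
unlink($path);
```

The point to take away is the asymmetry: both readers get in together, while the writer only succeeds once nobody else holds the lock.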

Spin Locks


Speaking as a programmer who spends a lot of time writing multi-threaded code, in various languages: a spin lock is about the worst kind of synchronization imaginable; it is basically a predicated busy wait loop.

This remains in APCu for the same reason file locking does. I have actually heard of people using it, and I can't convince them to do otherwise.
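
For anyone who has not met one, a "predicated busy wait loop" looks roughly like the sketch below. It is userland PHP, with a non-blocking flock() attempt standing in for the atomic test that a real spin lock performs in C. spin_acquire() and its $maxSpins cap are hypothetical names invented for the illustration; they are not part of APCu, and the demo assumes per-handle locks as found on Linux.

```php
<?php
// Illustration only: spin_acquire() is a hypothetical helper, not an APCu API.
// It retries a non-blocking lock attempt in a tight loop, burning CPU for as
// long as it waits, which is the core problem with spinning.
function spin_acquire($handle, int $maxSpins): int
{
    for ($spins = 0; $spins < $maxSpins; $spins++) {
        if (flock($handle, LOCK_EX | LOCK_NB)) {
            return $spins;   // acquired after this many failed attempts
        }
        // No sleep, no yield: the loop just asks again.
    }
    return -1;               // gave up; a real spin lock would keep going
}

$path   = tempnam(sys_get_temp_dir(), 'spin');
$holder = fopen($path, 'r+');
$waiter = fopen($path, 'r+');

flock($holder, LOCK_EX);                    // somebody else holds the lock
$contended = spin_acquire($waiter, 1000);   // every attempt fails

flock($holder, LOCK_UN);                    // the holder lets go
$free = spin_acquire($waiter, 1000);        // the next attempt can succeed

var_dump($contended, $free);

// Tidy up the temporary file and handles.
flock($waiter, LOCK_UN);
fclose($holder);
fclose($waiter);
unlink($path);
```

The cap exists only so the demo terminates; without it, the waiter would occupy a CPU core for as long as the holder keeps the lock.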


Scary Things


I didn't know the APC code base before that first night; I had never read any of it.

Whenever a programmer reads code from a prospective project that another programmer wrote, they have criticisms. Some of them are just our ego talking, some of them are wrong, some of them are probably a mechanism to motivate us to keep working. Hardly any of them are worth mentioning, or doing anything about.

The task at hand is to get the thing working, not fix every problem that nobody ever had.

So I cleaned up: I worked on it for a few days and pushed it to PECL.


OMG


So, this blog post started out as something completely different.

APCu was demonstrably unstable, and I thought this was because it broke a basic rule. It seemed to acquire a read lock, when it should be acquiring a write lock.

There seemed to be a clear path to race conditions, in other words.

While trying to find the code in the APC code base that caused the problem I made this search: apc_cache_release

The APCu version omits atomicity, and it omits safety. In my haste to clean up, I made a horrible mistake.

There was even a pull request, from one of the elders of PHP, that I chose not to merge, for three years.

The worst thing about being wrong is that I inevitably feel dumb as rocks.

The best thing about being wrong is things that didn't make sense before, like this, or this, and many other bugs besides, start to make sense.

So, I've found the reason that APCu was unstable: it was me.

This probably caused a problem for a fairly large number of people.

Sorry about that, I'll get to fixing it ...

I'm afraid I don't have time to blog now, too much work to do :(

Thursday, 12 November 2015

The Problem with Caching

Fig 1. Stampeding Elephants.

We cache things to avoid unnecessary load on our servers. It might therefore surprise you to learn that when you are most vulnerable, the kind of shared memory cache that is APC(u) will stab you in the face ...

APC(u) has had stampede protection for a long time, however, it is perniciously named, and ineffective when it is most needed.

You are most vulnerable to high load when the cache is empty, because the work you were trying to avoid executing unnecessarily, by caching its result, must execute.

The problem is that the expensive-work-worth-avoiding is going to be executed in more than one process; it will be executed by however many processes you are using that are able to spin up in the time it takes the first process to warm the cache.

If you have large pools of processes, or generation of the entry is expensive, this becomes a very real problem.

The stampede protection built into APC(u) only stops a stampede on the cache itself. It doesn't stop the stampede, or competition, for CPU time while dozens, possibly hundreds, of processes execute the very same code paths, only for most of them to fail to store the data because the current stampede protection prohibits the write.
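
The arithmetic of a cold start can be sketched with a small simulation. Nothing here touches APC(u): a plain array stands in for the shared cache, a loop stands in for worker processes that all start before the cache is warm, and the protection is modelled crudely as "the first write wins". simulate_cold_start() and the "config" key are hypothetical names for the illustration.

```php
<?php
// Simulation only: no real cache or processes are involved.
function simulate_cold_start(int $workers): array
{
    $cache       = [];  // stand-in for the empty shared cache
    $generations = 0;   // how many times the expensive work ran
    $rejected    = 0;   // writes refused because the entry already existed

    // Step 1: every worker looks up the key before any of them has stored it.
    $missed = [];
    for ($w = 0; $w < $workers; $w++) {
        $missed[$w] = !array_key_exists('config', $cache);
    }

    // Step 2: each worker that missed does the expensive work, then tries to store.
    for ($w = 0; $w < $workers; $w++) {
        if (!$missed[$w]) {
            continue;
        }
        $generations++;  // the expensive work happens here, once per worker
        if (array_key_exists('config', $cache)) {
            $rejected++; // modelled protection: the write is refused
        } else {
            $cache['config'] = "generated by worker $w";
        }
    }

    return ['generations' => $generations, 'rejected' => $rejected];
}

print_r(simulate_cold_start(50));
```

With 50 simulated workers the expensive work runs 50 times and 49 of the results are thrown away: the cache was protected, but the CPU was not.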

What not to do ...


The most obvious, but most dangerous, solution would be to expose a lock and unlock function to userland PHP.

If some process calls lock and, before it gets to call unlock, experiences an uncaught exception or some other fatal error, it will deadlock the server.

Since so much can result in that kind of fatal error, it doesn't seem worth the risk, considering the price of failure is catastrophic deadlock.


The Solution


APCu >= 5.1.0 (PHP7+) will have the following function:

function apcu_entry(string $key, callable $generator, int $ttl = 0);

If the entry identified by key cannot be found (or is invalidated by the ttl it was stored with), generator is called and the result stored with the optionally specified ttl.

All of this is done while an exclusive lock is held on the cache; this means that if many processes hit the same code path, only one of them will carry out generation.
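
The idea of "look up, generate, and store under one exclusive lock" can be sketched in userland PHP without APCu at all. The function below is a hypothetical stand-in, not the extension's implementation: entry_sketch() is an invented name, a serialized file plays the part of the shared cache, flock() plays the part of the cache lock, and the ttl is left out to keep it short.

```php
<?php
// Hypothetical userland stand-in for the get-or-generate idea. A file holds
// the cache and flock() provides the exclusive lock; APCu itself uses shared
// memory and C-level locks instead.
function entry_sketch(string $cacheFile, string $key, callable $generator)
{
    $fh = fopen($cacheFile, 'c+');  // create if missing, never truncate
    if ($fh === false) {
        throw new RuntimeException("cannot open $cacheFile");
    }
    if (!flock($fh, LOCK_EX)) {     // blocks until no other process holds it
        fclose($fh);
        throw new RuntimeException("cannot lock $cacheFile");
    }
    try {
        $raw  = stream_get_contents($fh);
        $data = ($raw === '' || $raw === false) ? [] : unserialize($raw);

        if (!array_key_exists($key, $data)) {
            // Only the lock holder reaches this point for a missing key.
            $data[$key] = $generator($key);
            ftruncate($fh, 0);
            rewind($fh);
            fwrite($fh, serialize($data));
            fflush($fh);
        }
        return $data[$key];
    } finally {
        // Runs on normal return and on exceptions thrown by the generator.
        flock($fh, LOCK_UN);
        fclose($fh);
    }
}

$file  = tempnam(sys_get_temp_dir(), 'entry');
$calls = 0;
$generator = function ($key) use (&$calls) {
    $calls++;                       // count how often the work really runs
    return "expensive value for $key";
};

$first  = entry_sketch($file, 'config', $generator);
$second = entry_sketch($file, 'config', $generator);

var_dump($first === $second, $calls);
unlink($file);
```

The second call finds the stored value and never invokes the generator. Note that calling entry_sketch() from inside a generator would try to take the same lock a second time and, where locks are per handle, block on itself; nested generation needs explicit support for recursion.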

Used correctly, this allows truly atomic cache warming to take place, like so:

<?php
// The outer generator only runs when "config" is missing from the cache.
$config = apcu_entry("config", function($key) {
    return [
        // Nested entries are generated from inside the outer generator.
        "child" => apcu_entry("config.child", function($key) {
            return "Hello World";
        }),
        "next" => apcu_entry("config.next", function($key) {
            return "We've only just begun ...";
        }),
        /* ... */
    ];
});

var_dump($config);
?>

Recursion is only natively supported when rwlocks are disabled (--disable-apcu-rwlocks) at build time; otherwise it is supported with a thread-local counter.

This requires much testing, consider this a call for testers ...

Tuesday, 10 November 2015

Help Required with Broken Things

Fig 1. A priceless artefact from antiquity.

Humans are a terrible bunch; give them ancient, priceless artefacts to care for, and they'll snap the beards off them and stick them back together with pound shop (99c store) epoxy.

In August 2014, that literally happened.

Apparently, the museum has world-class conservation equipment and experts, so the repair was carried out internally, at the museum.

Imagine that you were responsible for advertising for the job of "Repair Man for King Tut's Beard".

What would that advert look like?

Fig 2. Required: Repair Man for King Tut's Beard

Using just the features of blogger, I have conveyed everything important about the problem with a picture and a few words.

The words could have been almost anything; "Help Required with Broken Things" would have done the job.

Asking for Help

 

Every one of us depends directly on code that was authored by somebody else. That may have always been the case to some degree, but in the modern ecosystem, we can directly contact the authors the majority of the time; we can open a bug on GitHub, or some other bug tracking software.

Directly interacting with the authors of some code, so quickly, and with such little effort, feels pretty new to me; a product of our collaborative ecosystem.

Another, less obvious, product might be the programming language barriers erected when one ecosystem supports another, mutually or otherwise, such as the internals ecosystem, which speaks in C, and the userland ecosystem, which speaks in PHP.

Language barrier or not, we appear to make assumptions about the authors of some code, that we know are not true for code that we have authored ourselves.

This can lead us to omit almost everything important from our bug reports, or when asking for help ... 

Our bug reports and questions can tend to have content that amounts to "Help Required with Broken Things".

The Ideal

 

If a bug is caused or affected by configuration or environment, then that's important information; however, it is rarely enough to describe the configuration, environment, or even content of your code.

In the cases where only description is enough, it is obvious.

In cases where we think code is paramount, we are mostly aware.

What we are not always sure of, is how to include "our whole application" in a bug report, or question.

Stack Overflow has an excellent description of what reproducing code should look like, and even how to create it.

The Real World

 

We don't live in an ideal world, and we can't always create MCVEs (Minimal, Complete, and Verifiable Examples).

I don't want to discourage anyone from reporting bugs or asking questions, at all.

So, open your issue, report your bug, ask your question, with whatever information you have.

But you already know that it may not be actionable until it contains an MCVE.

A worthy observation is that in the course of trying to create an MCVE, you will, at the very least, find that you are able to describe the problem in more and more detail.

Trying can either lead to reproducing code, or the kind of detail that might make your query actionable.

Please try ...