Wednesday, 13 February 2019

Parallel PHP: The Next Chapter

Some years ago, to prove some people on the internet wrong, and because I had a break from normal work - the first such break in years - I decided to write pthreads. My memory fails me a little, but from what I can recall, nobody actually saw that first version, I developed the idea over the following weeks and months and was allowed to publish this work to PECL. It was my introduction to serious internals programming.

The thing I was proving wrong is that PHP is not designed to be used in threads: This is flat out wrong and has been since the 22nd of May in the year 2000, when TSRM was merged into PHP. TSRM - Thread Safe Resource Manager allows you to build PHP such that it can be embedded in a threaded server, like Apache. These builds are colloquially referred to as ZTS - Zend Thread Safe. PHP is very much designed to be used in threads. You may hear people say that TSRM is unstable, or that it's not safe ... there were mistakes in the original implementation, like any software. But it is safe, and it is theoretically sound, and it has been in use on Windows as the primary mode of executing PHP since shortly after it was merged, necessarily so because of a lack of support for proper forking.

What it's not designed to do, or rather I should say what has been given no attention except by me, is exposing threads to userland. The reason for this is the architecture that PHP has, often referred to as "share nothing" seems to be antithetical to threads: Normally, when you start a thread, it operates in the same address space as the thread or process that created it, they share everything. I've been using threads in other languages for a very long time, and the very thing that other people think makes PHP an unsuitable candidate makes me think it more suitable than other languages. The thing that makes programming with threads hard is precisely that they share data, the cognitive overhead increases with each additional thread you create. The models you have to build in your head become unreasonable and prone to mistakes. That is why more modern languages like Go try to hide all of the complexities of threading behind a nice simple API, so that the programmer doesn't actually need to learn about the intricacies of how to use a condition variable, or a mutex, or when to synchronize, they only have to learn how the simple API works.

Having never written a threading API before pthreads, and being left entirely on my own to do it even when the code became public, maybe I made some questionable decisions. I couldn't accept that using parallelism could be easy, I would repeat like a mantra that threading is hard and API's can't solve that. I wanted pthreads to expose the same kind of API that Java has, and my focus could not be shifted by reason. I vaguely remember the first time I went into IRC on internals to talk about pthreads, and people, including Rasmus, tried to reason with me that I was maybe making a mistake, that threading at the frontend of a website doesn't make sense, others said they would have preferred a simpler API ... these pleas fell on deaf ears, and I regret it. I spent many hundreds, possibly thousands of hours writing and rewriting pthreads until it is what you see today, a kind of monster that about 4 people really understand excluding myself, that only the same number of projects have really managed to deploy with any success.

A slight tangent: Threading at the frontend of a website, on the face of it, doesn't make sense: If you have a web request that creates a "reasonable" number of threads, let's say 8, and 1000 clients come along at once, you are asking your hardware to execute 8*1000 threads not including the threads or processes that done the creation, this is unreasonable and cannot possibly scale (with a 1:1 threading model, which is what pthreads has). That said, other languages do manage to integrate parallelism into the web response, but it's tricky, and takes a lot of thought and expertise. I've never suggested that you should build software in this way, and would never suggest it, eventually I prohibited the use of pthreads in anything but the CLI in an attempt to force users to be reasonable.

Tangent over: The great thing about being wrong, and acknowledging that you are wrong, is that you have the opportunity to learn, and do better. If we were never wrong, programming would be boring, we would just chug out code for our entire lives, and miss out on the feeling of truly understanding something for the first time, a feeling I love and cherish, and that keeps me at my keyboard.

I've been aware of my mistakes for some time, but pthreads does have some large projects relying on it, and mostly it's development is controlled by other people now. Occasionally I will give advice or commit something, but I can't, in good conscience, tell anyone to use it, it's simply too hard.

While aware, there was nothing driving me to write another API, until recently when Zend made their intention to merge the JIT  they have been working on for years into PHP. Just think for a moment what it would mean to be able to execute machine code in parallel in user land PHP ... this is not a thing I could have ever imagined happening all those years ago, but it would seem a possibility in the not too distant future.

pthreads has to jump through so many hoops to make threads work and provide the API that it does, that it's not reasonable to talk about it being able to execute machine code.

Recently, I set to work on a new threading API, named Parallel, it is not an exact clone of any existing threading API, it is an API focused on being simple and hiding the complexity inherent in utilising parallelism in your application, it is also focused on being forward compatible with the JIT, for that day when we can actually execute machine code in userland and in parallel.

I should mention now that the current implementation of the JIT, which is not finished, doesn't actually have proper support for TSRM, a fact that only became evident in the past few days. However, the conversation some of us had with Dmitry (the author of the JIT) about the lack of support for ZTS in JIT has inspired him to look at ways to make ZTS with and without a JIT much more efficient. So, it's highly likely that this support will come, although I can't say when.

While the JIT is an exciting prospect for parallelism, I'm also excited to be able to provide a really nice, simple API, that any PHP programmer can understand and use. For the next year at the very least, the JIT doesn't exist for most PHP developers, and you can get to work getting to know how to use parallel ...

Parallel is not complete, but is stable: More features are planned, something like Go's channels would be a nice addition, and I've already started to think about and discuss the implementation of this with other internals developers.

I wish you all the best of luck ... and for those people thinking about using pthreads for new code: don't be silly.