Wednesday, 13 February 2019

Parallel PHP: The Next Chapter


Some years ago, to prove some people on the internet wrong, and because I had a break from normal work (the first such break in years), I decided to write pthreads. My memory fails me a little, but from what I can recall, nobody actually saw that first version. I developed the idea over the following weeks and months, and was allowed to publish the work to PECL. It was my introduction to serious internals programming.

The claim I was proving wrong is that PHP is not designed to be used in threads. This is flat out wrong, and has been since the 22nd of May in the year 2000, when TSRM was merged into PHP. TSRM, the Thread Safe Resource Manager, allows you to build PHP such that it can be embedded in a threaded server, like Apache. These builds are colloquially referred to as ZTS, for Zend Thread Safe. PHP is very much designed to be used in threads. You may hear people say that TSRM is unstable, or that it's not safe ... there were mistakes in the original implementation, as in any software. But it is safe, it is theoretically sound, and it has been in use on Windows as the primary mode of executing PHP since shortly after it was merged, necessarily so because Windows lacks support for proper forking.

What it's not designed to do, or rather what has been given no attention except by me, is expose threads to userland. The reason is that PHP's architecture, often referred to as "share nothing", seems to be antithetical to threads: normally, when you start a thread, it operates in the same address space as the thread or process that created it, and they share everything. I've been using threads in other languages for a very long time, and the very thing that other people think makes PHP an unsuitable candidate makes me think it more suitable than other languages. The thing that makes programming with threads hard is precisely that they share data: the cognitive overhead increases with each additional thread you create. The models you have to build in your head become unreasonable and prone to mistakes. That is why more modern languages like Go try to hide all of the complexities of threading behind a nice simple API, so that the programmer doesn't actually need to learn the intricacies of how to use a condition variable or a mutex, or when to synchronize; they only have to learn how the simple API works.

Having never written a threading API before pthreads, and being left entirely on my own to do it even when the code became public, maybe I made some questionable decisions. I couldn't accept that using parallelism could be easy; I would repeat like a mantra that threading is hard and APIs can't solve that. I wanted pthreads to expose the same kind of API that Java has, and my focus could not be shifted by reason. I vaguely remember the first time I went into IRC on internals to talk about pthreads, and people, including Rasmus, tried to reason with me that I was maybe making a mistake, that threading at the frontend of a website doesn't make sense. Others said they would have preferred a simpler API ... these pleas fell on deaf ears, and I regret it. I spent many hundreds, possibly thousands, of hours writing and rewriting pthreads until it became what you see today: a kind of monster that about 4 people really understand, excluding myself, and that only about the same number of projects have really managed to deploy with any success.

A slight tangent: threading at the frontend of a website, on the face of it, doesn't make sense. If you have a web request that creates a "reasonable" number of threads, let's say 8, and 1000 clients come along at once, you are asking your hardware to execute 8*1000 = 8000 threads, not including the threads or processes that did the creation. This is unreasonable and cannot possibly scale (with a 1:1 threading model, which is what pthreads has). That said, other languages do manage to integrate parallelism into the web response, but it's tricky, and takes a lot of thought and expertise. I've never suggested that you should build software in this way, and would never suggest it; eventually I prohibited the use of pthreads in anything but the CLI in an attempt to force users to be reasonable.

Tangent over. The great thing about being wrong, and acknowledging that you are wrong, is that you have the opportunity to learn and do better. If we were never wrong, programming would be boring: we would just churn out code for our entire lives, and miss out on the feeling of truly understanding something for the first time, a feeling I love and cherish, and that keeps me at my keyboard.

I've been aware of my mistakes for some time, but pthreads does have some large projects relying on it, and its development is mostly controlled by other people now. Occasionally I will give advice or commit something, but I can't, in good conscience, tell anyone to use it: it's simply too hard.

While I was aware of these mistakes, there was nothing driving me to write another API until recently, when Zend made known their intention to merge the JIT they have been working on for years into PHP. Just think for a moment what it would mean to be able to execute machine code in parallel in userland PHP ... this is not a thing I could have ever imagined happening all those years ago, but it would seem to be a possibility in the not too distant future.

pthreads has to jump through so many hoops to make threads work and provide the API that it does, that it's not reasonable to talk about it being able to execute machine code.

Recently, I set to work on a new threading API, named Parallel. It is not an exact clone of any existing threading API; it is an API focused on being simple and on hiding the complexity inherent in utilising parallelism in your application. It is also focused on being forward compatible with the JIT, for that day when we can actually execute machine code in userland and in parallel.

I should mention now that the current implementation of the JIT, which is not finished, doesn't actually have proper support for TSRM, a fact that only became evident in the past few days. However, the conversation some of us had with Dmitry (the author of the JIT) about the lack of support for ZTS in the JIT has inspired him to look at ways to make ZTS, with and without a JIT, much more efficient. So it's highly likely that this support will come, although I can't say when.

While the JIT is an exciting prospect for parallelism, I'm also excited to be able to provide a really nice, simple API that any PHP programmer can understand and use. For the next year at the very least, the JIT doesn't exist for most PHP developers, so you can get to work getting to know how to use Parallel ...

Parallel is not complete, but it is stable. More features are planned: something like Go's channels would be a nice addition, and I've already started to think about and discuss the implementation of this with other internals developers.

I wish you all the best of luck ... and for those people thinking about using pthreads for new code: don't be silly.

7 comments:

  1. That's awesome, looking forward to giving Parallel a try. One question though: does that mean that pthreads will be deprecated and no longer supported in the near future?

  2. Joe! Thank you for writing pthreads! I am one of the silly people who have used it in a couple of projects. Yeah, it took a little tinkering to get there, but I had a vision of how I wanted these to work and pthreads enabled me to do that, so thanks.
    I'm interested to take a peek at Parallel now!

  3. I had a lot of fun using pthreads in my pet projects. It helped me understand more of threaded computing. Thank you for your work!

  4. I gave pthreads a shot when creating Rubix ML, a machine learning library for PHP. https://github.com/RubixML/RubixML

    There were many aspects I liked, including the object-oriented API, but ultimately the decision not to include it in the library came down to support.

    I'm looking forward to seeing Parallel PHP. Feel free to drop me a line so we can experiment on some learning algorithms.

  5. The native functions will further help the developers to keep the application code clean and readable. They can easily gather information about the native functions and their usage by referring to the PHP user manual.

  6. Hah, fun to read this considering I literally just started a new project with pthreads. It's pretty simple though, only spawning a couple of workers listening to a RabbitMQ server. They don't share any data beyond the initial setup, so it runs mostly fine (had some occasional segfaults, but they magically disappeared at some point). Good to know I maybe should migrate the whole thing to separate processes soon ;)

  7. So would it be accurate to say that pthreads just starts additional threads, but they share the same environment, and therefore can easily be unsafe if you don't know how to do multi-threading safely, while parallel starts a new process, with its own environment, and so it is harder to accidentally create parallelism bugs? Or is it more correct to just say that parallel is a simplified API? Also, wouldn't building a new process every time you want to run something in parallel cause a lot more performance issues (it has to start the process, run the autoloader, and include classes all over again from scratch)? I mean, I guess that is way safer, but I imagine there will be times you just want to run some code that you know is composed entirely of pure functions. Also, if the closures running in parallel can't accept or return objects, this basically ruins their ability to be or provide a good API, at least compared to the code I write using a technique I call type oriented programming. I would rather be able to pass and return objects that I designed to be immutable.
