Wednesday, 30 June 2021

Only Complete Applications

Fig 1. PFA RFC


Today the vote will close on Partial Function Application, and the feature has been refused.

It was less effort to write a debugger for PHP than it was to make partial function application work! 

The debugger, a few of us wrote in a few days. Partial application took up many weeks of my life, including most of the night time.

It actually ran me into the ground. 

I'm not a very good communicator. I like blogs because I can perform an editorial process, re-arrange my thoughts and move toward the perfect words. 

But, other humans, in general, baffle me. I can't tell how other people think, and what they know, so I can't understand what they are saying or what they need me to say a lot of the time. When you ask me a question, even if I definitely know the answer, I'll spend some time paralyzed by anxiety, to some degree caused by all my previous failures to communicate with humans properly.

A lot of questions were being asked of me the whole time, and this took a toll ... simply put, I spent quite a lot of this time anxious, exhausted, and sad.

In addition, I became physically sick toward the end with suspected Covid. But, I tested negative, had a 24 (maybe 20) hour break and more or less carried on at the same pace.

It is a huge understatement to say I put a lot of effort into this thing ...

In that time, we had at least two different full implementations of the idea. 

The first implementation had only one placeholder, and the complexity of the implementation at this point was quite low. However, it resulted in semantics that apparently didn't make sense to anyone, the main gripe being that the placeholder's semantics depended on its position in a list of arguments.

Any implementation of this is so involved with details of the engine that, inevitably, bike shedding ensued.

Rather than looking around at other languages, we decided we needed multiple placeholder symbols, support for partial application of the new operator, out of order application by supporting named arguments, and even named placeholders - essentially changing the feature into something other than partial application - that's actually very definitely function (API) redefinition.

Between the first and last implementation, I attempted to move toward the semantics people wanted, while retaining as much simplicity in the implementation as possible.

At one point, I had an implementation that supported all of the crazy things people were bike shedding about, including re-ordering named parameters (necessary for named placeholder support). 

What became clear in this time is that we needed to define the semantics in such a way that limits complexity.

We dropped named placeholders, and settled on two symbols with easy to understand semantics. While we're left with semantics and rules that you can write in a few lines, that only limits complexity; you are still left with something complicated.

Then we get to the last implementation, which I stayed awake for more than 30 hours, while sick, to write.

I made some glaring (to some people) omissions, but overall we had a solid implementation with easy to understand semantics, and soon after, the vote started.

Why did this fail ?

If you asked me why it failed, I would have to say bike shedding is somewhat to blame.

Read the next sentence with the logical bit of your brain:

People who don't know how to implement something are not well equipped to decide how that something should work.

This seems to be an obvious truth, but may come across as elitist ... I don't actually care. 

Elitism is good - You want an elite doctor to perform your eye surgery, you want elite scientists doing the research that will save the world. I'm all for elitism ...

I'm not saying we shouldn't listen to feedback - I did, remember, even when it was detrimental to the implementation, not to mention detrimental to my physical and mental health.

I know that the people in the bike shed, making suggestions, making complaints, in some cases explicitly bike shedding ("I know this is bike shedding, but"), they think they are helping the conversation along by talking about things they understand, while ignoring all the stuff they don't understand outside the bike shed. 

I'm willing to admit that sometimes they do move the conversation along, but I think it's by accident; the conversation would have moved along anyway, possibly faster if they hadn't intervened.

Here, bike shedding resulted in a lot of wasted time, that's a fact.

The other reason is complexity. There are two distinct kinds of complexity here:
  • language complexity - what do people that are writing PHP have to know in order to understand code containing partial application ?
  • implementation complexity - what do people that are maintaining, debugging, or developing the engine have to know ?
When it comes to language complexity, this is mostly determined by semantics. Once we landed on semantics we can explain in a few sentences, we've reduced that as much as we can.

When it comes to internal, implementation complexity ...

Why is this complicated ?

We can all knock up a class in 10 minutes that performs something that looks a bit like partial application, we can do that in userland.

What it won't be, is partial application: You cannot do the things we do internally from userland; You don't have the ability to rebuild prototypes (in any sensible way), manipulate the stack, interact with the GC in certain ways, and a list of a million other things.
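To make the gap concrete, here is roughly what the ten minute userland class might look like. This is a minimal sketch, not the RFC's implementation: the `Hole` and `Partial` names are made up for illustration, and the hole-filling rule is just the simplest one I could write down.

```php
<?php
declare(strict_types=1);

// Hypothetical placeholder marker; the name is made up for this sketch.
final class Hole {}

// Hypothetical userland wrapper: it remembers some arguments and fills
// the holes when it is finally called.
final class Partial
{
    /** @var callable */
    private $fn;
    private array $bound;

    public function __construct(callable $fn, ...$bound)
    {
        $this->fn = $fn;
        $this->bound = $bound;
    }

    public function __invoke(...$rest)
    {
        $args = [];
        foreach ($this->bound as $arg) {
            // Each hole consumes the next argument supplied at call time.
            $args[] = $arg instanceof Hole ? array_shift($rest) : $arg;
        }
        // Anything left over is appended after the bound arguments.
        return ($this->fn)(...$args, ...$rest);
    }
}

$clamp = new Partial('max', 0, new Hole());
echo $clamp(-5), "\n"; // calls max(0, -5)
echo $clamp(7), "\n";  // calls max(0, 7)

// What the wrapper loses: its only visible signature is (...$rest), so
// reflection and type checks know nothing about the original function.
$params = (new ReflectionMethod($clamp, '__invoke'))->getParameters();
echo $params[0]->getName(), "\n";
```

It works for simple calls, but the wrapper has thrown away the prototype of the function it wraps, which is exactly the kind of thing that can only be fixed inside the engine.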

The fact is that any proper implementation of partial application is inherently complicated.

Some of that complexity is due to the semantics you choose for placeholder symbols (even when reduced as much as possible), and some is due to the interactions between the engine and this new kind of object, created by interrupting a call where the engine does not expect this kind of interruption.

If we're going to have a proper implementation of partial application, that retains type information, is efficient (cumulative, as partial application is meant to be), and has semantics that are useful and easy to understand, then the implementation carries with it complexity that cannot be reduced.

Was the right decision made ?

Yes.

Although other contributors to the RFC were focused on the use case of pipes, I don't really even like pipes.

My motivation for doing any of this, is that it was interesting to write. My motivation for wanting it to be actually merged is that I'm interested in the use cases that would have been found beyond those we suggested. I don't know what they look like, and guess I'll never find out.

It's highly likely those use cases, which I imagined existed, simply do not exist.

Those voters that could think of use cases, but voted no because they couldn't look past language or implementation complexity, were absolutely right to do so.

Complexity must be justified, and if it isn't, we should not add the feature.

That's all I have to say about that right now ...

Peace out phomies :)

Monday, 28 June 2021

Literally Internals

Fig 1. Some Magnified Strings

How much magnification does it take to make something quite tidy, like a piece of string, look an utter mess ?

There is an RFC in progress called is_literal which I'm providing the implementation for.

I want to talk a little bit about that ...

Where did it start ?

You might imagine I'm going to talk about the RFC now, but I'm not; it started 25 and some odd years ago.

When we come to write an RFC, we have to deal with PHP the way that it is today, after more than a quarter of a century of development. In particular, and most importantly, we have to deal with extremely aggressive optimizations to the source that have been performed since NG; we also have to deal with optimizations performed by Opcache and the subtle differences that Opcache introduces.

It can be difficult to make changes in this system without inadvertently affecting other parts of the system - and so code - that is not even using the feature.

Sometimes, it's possible to add something quite complicated and not have an effect on the complexity or functionality of the rest of the engine. 

For example, as complicated as the internals of the Fiber implementation are, they are self-contained, mostly don't affect other parts of the engine, and we're not responsible for the maintenance of the most complicated code it uses. There is still complexity above and beyond the code we don't own (boost owns it), but it looks manageable because it's contained.

What is going on ?

The is_literal RFC seeks to provide a tool for userland that can help to avoid injection vulnerabilities, where strings composed of literal values and user provided input (strings) may lead to injection.

It would seem simple enough to make a flag on literal values, that the programmer typed in their code, and allow them to detect the literalness of any variable at runtime.

But we have all of history bearing down on us, and it's not so straightforward.

We began to focus on strings alone; the reason is that we can find space on the structure that represents a string to set a flag, and avoiding user input strings is obviously required for any implementation.

Because of optimizations in NG - scalars with types below string are stored on the stack, and are not refcounted - and one optimization that came after, there is no usable space on every variable for a flag.

There is space, but in order to use it, we would have to disable an optimization that assumes there is only one flag set in the only place where we could set a flag. This would not be an acceptable implementation detail, and so is not possible.

For string support to be generally useful, the engine must produce a literal where all of the input to an instruction (or function) is literal. This allows the programmer to reason easily about how concatenation (or any other function that is literal aware) works.

Concatenation is how people tend to build their queries; even if they are using parameterized queries, even if they are using a query builder, concatenation is still used.
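The rule that an operation only produces a literal when all of its inputs are literal can be modelled in userland. This is a sketch under loud assumptions: the `Tracked` class and its methods are hypothetical, and in the engine the flag would live on the string structure itself rather than in a wrapper object.

```php
<?php
declare(strict_types=1);

// Hypothetical userland model of the rule; not the engine implementation.
final class Tracked
{
    public function __construct(
        public string $value,
        public bool $literal
    ) {}

    // Stands in for a string the programmer typed in the source.
    public static function fromSource(string $value): self
    {
        return new self($value, true);
    }

    // Stands in for anything that arrived at runtime (request data etc.).
    public static function fromInput(string $value): self
    {
        return new self($value, false);
    }

    // The result is literal only when every operand is literal.
    public static function concat(self ...$parts): self
    {
        $value = '';
        $literal = true;
        foreach ($parts as $part) {
            $value .= $part->value;
            $literal = $literal && $part->literal;
        }
        return new self($value, $literal);
    }
}

$query = Tracked::concat(
    Tracked::fromSource('SELECT * FROM users WHERE '),
    Tracked::fromSource('id = ?')
);
var_dump($query->literal);

$tainted = Tracked::concat($query, Tracked::fromInput(" OR '1'='1'"));
var_dump($tainted->literal);
```

One non-literal operand anywhere in the chain is enough to clear the flag on the result, which is what makes the behaviour easy to reason about.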

What's the problem ?

Early on in the discussion, a couple of people requested that we allow strings and integers to be concatenated to produce a literal. When this was requested, at least one person who requested it knew that we wouldn't be able to track the source of integers.

I didn't like this idea at first; it's less than pure. Nevertheless, I spent rather a long time thinking about it, before determining that nothing dangerous can happen: if we're talking about injection, concatenating a string and an integer cannot lead to it.
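That reasoning can be spot-checked in a few lines. The sketch below is an illustration on a handful of values, not a proof; the list of control characters and the query prefix are assumptions chosen for the example.

```php
<?php
declare(strict_types=1);

// Illustrative (not exhaustive) set of characters that act as
// control characters in SQL or HTML contexts.
$control = ["'", '"', '<', '>', ';', '&', '\\'];

// A handful of integers, including the extremes of the platform range.
$samples = [0, 7, -42, PHP_INT_MAX, PHP_INT_MIN];

$prefix = 'SELECT * FROM posts LIMIT ';
$allClean = true;

foreach ($samples as $int) {
    $query = $prefix . $int;                  // string . int
    $tail  = substr($query, strlen($prefix)); // what the integer contributed

    // The integer contributes digits and, at most, a leading minus sign.
    $digitsOnly = preg_match('/\A-?[0-9]+\z/', $tail) === 1;

    // None of the control characters appear in that contribution.
    $hasControl = strpbrk($tail, implode('', $control)) !== false;

    $allClean = $allClean && $digitsOnly && !$hasControl;
}

var_dump($allClean);
```

The query may still be wrong (a negative limit is a mistake), but being wrong is a different problem from being injectable.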

At compile time, before any user input has been provided, the engine may optimize certain concatenations and even function calls, that produce literal strings, and may contain types other than string. In other words, the engine is allowed to concatenate whatever it wants if it can determine there are no side effects and it has all of the information available to perform the concatenation (or indeed, function call) early. None of these concatenations may include user input. So it's "safe" in the narrow sense that the programmer provided all of the data, there may be a mistake in the query, but an injection is not possible.

I came around to seeing that including support for integers, even though we are not able to track their source, creates some symmetry - runtime concatenation or calls will behave the same way as the compiler and opcache with regard to string and integer values.

A wave of fear washed over internals, and some very loud people objected, and coloured the conversation.

I'm not a security expert, just a code monkey. I tried to reason with some of these people and it failed hard.

So I reached out to somebody that everybody recognizes as a security expert in the PHP ecosystem, Scott Arciszewski from Paragon Initiative Enterprises. 

I was quite ready to admit I was wrong when I asked them the question "Is it reasonable to include support for concatenating a string and an integer even when the source of the integer is unknown?"

Here's an excerpt from their response to the mailing list:

Injection attacks (SQL injection, LDAP injection, XSS, etc.) are, at
their core, an instance of type confusion between data and code. In
order for the injection to *do* anything, it needs to be in the same
input domain as the language the code is written in. Try as you might,
there is no integer that will, upon concatenation with a string,
produce a control character for HTML (i.e. `>`) or SQL (i.e. `'`).

I really thought this would help. I know that lots of the people reading internals aren't security experts, and clear words, and clear thoughts, from someone who is should help them to make good decisions.

It didn't really help, people just started to argue ... which I found embarrassing ...

Because of a bad naming decision (for a little while, the RFC was called is_trusted), there is this idea that if is_literal (or whatever you want to call it) returns true, the value is safe to use in all circumstances, that not only should it be free of injection, but it should be free of mistakes.

What we wanted at this point was to rebrand, we wanted to frame the thing we are introducing as the concept of Nobility. The name and idea having been suggested by Scott.

We never got to do this, but I think it would have been our best move. We get to define what nobility is, what kind of data it includes, and how they interact (or fail to, because noble) with other variables.

Instead, we had to remove the support for integers that made the feature easier to reason about and more generally useful.

Where are we now ?

Without support for integers, we are left with something that may look inconsistent if you pay close enough attention.

We cannot disable the optimizations in the compiler or opcache that lead to the production of literal values inclusive of integers (and other types). That would obviously not be an acceptable implementation detail.

So now, whether or not the engine produces a literal depends on the very fine details of how you wrote the code or performed a call. Without detailed knowledge of the engine, this makes is_literal look unpredictable and difficult to reason about.

In addition, we've broken a basic, and safe (in the narrow sense we are talking about) use case - you can no longer rely on the concatenation of a string and an integer producing a literal.

What do we do now ?

I'm not sure. 

I'm disappointed that the expert opinion I solicited did not change the direction of the conversation. If you're not going to listen to an expert in the field of security, about security things, I think you're not really going to listen to anyone; maybe you consider yourself the expert.

I would like to be free to define the concept of nobility, and I'd like people to approach that discussion armed with the expert advice we've had, and free of the notion that we are trying to protect you from mistakes in general.

I'm unsure of our next move ...

Peace out Phomies.

Saturday, 19 June 2021

Wasting Time

 

Fig 1. A Bin

Most days, I try to find some time to work on PHP. I consider it my mission to push this thing forward. Recently, I've also made it my mission to get your boss to pay you to push this thing forward. 

One of the problems with this, that I hear all the time:

What if we allow one of our employees to spend a bunch of time working on a feature only for that feature to be refused ?

It looks like I'm asking you to potentially waste your company resources on things that might never get voted into PHP.

How it Works

For those of you that don't know, I'm going to lay out the path for a feature, from inception to inclusion:
  1. You think of a feature (or borrow one from another language)
  2. You approach internals (by sending a mail to a mailing list or opening a PR)
  3. You try to gather consensus on the mailing list and the PR
  4. You request access to create a Wiki document (an RFC)
  5. You spend two weeks (at least) discussing the addition on internals and responding to feedback
  6. You open a vote, which lasts two weeks.
A two thirds majority (two yes votes for every no vote) is required for the feature to pass and be included.
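
The threshold is simple arithmetic. As a sketch, with a hypothetical `passes()` helper and made-up tallies, and assuming that exactly two thirds counts as a pass:

```php
<?php
declare(strict_types=1);

// Hypothetical helper: true when yes votes make up at least two thirds
// of the votes cast, i.e. at least two yes votes for every no vote.
function passes(int $yes, int $no): bool
{
    // Integer arithmetic avoids floating point rounding at the boundary.
    return $yes + $no > 0 && $yes >= 2 * $no;
}

var_dump(passes(20, 10)); // exactly two yes votes per no vote
var_dump(passes(19, 10)); // one yes vote short
```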

Minimally then, I'm asking you to spend at least a month, likely taking up at least some time every day to respond to conversation on internals or queries on the pull request.

At the end of this, if the feature doesn't get voted in, have you wasted your time ?

The Question

Whether or not you feel like your time has been wasted depends on what you think you are doing when you propose an RFC.

There are a lot of questions that can be answered instantly, for example: Who is the best power ranger ? It's the pink one, obviously. Other questions may not be answered in your lifetime, such as Can we eradicate cancer ?

The question you ask when you propose an RFC is somewhere in between, it's going to take at least a month to answer, and is more complicated than choosing the best power ranger.

The question is this:
Do we want to include this feature, as proposed, at this time ?

If you consider this the question, and consider the RFC process the means by which to answer the question, whatever the answer to the question is, you haven't actually wasted any time - you set out to answer a question and have done so.

The Reality

Obviously, you are likely to be biased as the proposer of the feature, and would prefer a positive response. But, it's important to note that a negative result doesn't equate to "We never want this feature".

There may be things you can do to change your proposal so that it is more palatable for internals, and so more acceptable at a later time.

When you spend a month or more working on something, and thinking about it, you become invested in the idea, and I recognize that. But, you are one person (or a small group of people) acting on behalf of millions. When things don't go the way you would prefer, you have to accept that you are simply wrong.

There's nothing wrong with being wrong, it just means there's more to learn, more work to do, or more to understand. Since learning, working, and understanding are things that, as programmers, we enjoy, it may even be better to be wrong than right, in a sense.

It may not be obvious, but even if your feature doesn't get in, you have pushed PHP forward a little bit: By dispersing your ideas you may inspire others, you may have come up with the answer to a question that hasn't been asked yet, and you've answered the question you set out to answer.

The code you wrote may end up in the bin, that is an unavoidable fact. 

However, that doesn't at all mean that you've wasted your time.

That's all I have to say about that right now.

Peace out phomies :)

Wednesday, 9 June 2021

Untangling Fibers

Fig 1. Some Fiber

 

Fibers are going to be available from PHP 8.1; they were voted in 50 to 14 ... I was one of the 14.

I've said before that I think merging Fibers was a mistake, but at this point, it doesn't matter what I think or thought: They are in fact part of the source code, and we have to work on this code together. So, I've tried to familiarize myself with evolving implementation details, be involved in the conversation, review pull requests, and generally make myself as useful as possible.

During the discussion phase of the RFC, the Swoole maintainers made clear that they did not approve of the implementation for various technical reasons.

There now appears to be some perceived friction with internals.

I want to deal with a couple of comments that I've heard repeated, or sentiments expressed:

Swoole is being treated with mistrust, possibly because it's Chinese.

This is a bizarre statement, and doesn't bear any resemblance to the truth.

While there may be some language barriers that make it hard for us to communicate in human words, especially on very technical matters: All of us speak the language of C. Even if nobody from Swoole spoke a word of English, we could, and would try to communicate in code.

There are also many members of internals whose first language is not English. There are pull requests that come with no words at all for that reason - albeit much simpler in general. Nevertheless, it illustrates that we do not need to understand each other's tongues as much as you might think.

The idea that we would, should, or are exhibiting mistrust toward anyone because of their country of origin is abhorrent.

Swoole have years of experience in this field and we are discounting their opinions.


Swoole is a very clever extension, very clearly written by skilled and driven individuals; it's also a much more complete solution. I've also heard the suggestion that because it is a complete solution it makes sense to adopt it.

Swoole maintainers have many years of experience in the field of developing an extension that provides a complete solution to the problems of co-op PHP. 

However, when internals votes, and votes with the kind of turnout the Fibers RFC had, we do so with maybe a couple of hundred years of collective experience in the field of developing a programming language.

Developing a language is the relevant field here. 

When we are writing extensions and we make them do bad things - and I know a little bit about extensions that do bad things - we make our peace with it, because they make our cool idea work.

We don't import that level of magic into PHP, it just isn't going to and should not happen.

Internals voted on the simple, bare bones implementation of Fibers quite purposely. When we did that, we left open many questions that the Fiber RFC did not intend to answer, purposely, around how to actually deploy Fibers in an application.
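
For a sense of how bare bones that is, here is a minimal sketch using the `Fiber` class as it ships in PHP 8.1. The closure and the strings are made up for the example; everything about scheduling, event loops, or I/O is left to whoever calls `resume()`.

```php
<?php
declare(strict_types=1);

// Requires PHP 8.1 or later, where the Fiber class is built in.
$fiber = new Fiber(function (string $greeting): string {
    // Hand control back to the caller, passing a value out ...
    $name = Fiber::suspend($greeting . ', who is there?');
    // ... and continue from here once resume() is called.
    return $greeting . ', ' . $name;
});

$question = $fiber->start('Hello'); // runs until the first suspend
$fiber->resume('phomies');          // runs to completion
$answer = $fiber->getReturn();

echo $question, "\n", $answer, "\n";
```

Start, suspend, resume: deciding which fiber runs next, and when, is the part the RFC deliberately did not answer.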

The adoption of a thing that looks like (all of, or any significant part of) Swoole, whether it comes in one part or one hundred parts is utterly out of the question because there is no mandate for that.

This does not mean that we don't find their feedback important, and useful: It means, quite clearly, there is some disagreement about the question of preparing the rest of PHP for Fibers. Swoole having found answers to those questions naturally think they have the correct answers, because they work.

I'll just re-iterate, and clarify: What happens in extensions, stays in extensions ...

What is actually happening ?


Usually, we like an implementation for an RFC to be ready at merge time. 

In this case, because of the obvious need to develop the internal API and implementation details of Fiber, we decided to merge the implementation as it was, and iterate on that implementation with small pull requests - like most projects, we prefer small focused pull requests.

We want to move the implementation in a direction that is useful to everyone, within the confines of the mandate that the RFC gave us.

We are trying hard to do that, there's no ulterior motive to exclude anyone, and no mistrust, or anything of the sort.

We are simply trying to work together on this thing .... that is all.

That's all I have to say about that right now ...

Peace out, phomies :)