Monday, 28 June 2021

Literally Internals

Fig 1. Some Magnified Strings

How much magnification does it take to make something quite tidy, like a piece of string, look an utter mess ?

There is an RFC in progress called is_literal which I'm providing the implementation for.

I want to talk a little bit about that ...

Where did it start ?

You might imagine I'm going to talk about the RFC now, but I'm not. It started 25 and some odd years ago.

When we come to write an RFC, we have to deal with PHP the way that it is today, after more than a quarter of a century of development. In particular, and most importantly, we have to deal with the extremely aggressive optimizations that have been made to the source since NG. We also have to deal with optimizations performed by Opcache, and the subtle differences that Opcache introduces.

It can be difficult to make changes in this system without inadvertently affecting other parts of the system, and so code that is not even using the feature.

Sometimes, it's possible to add something quite complicated and not have an effect on the complexity or functionality of the rest of the engine. 

For example, as complicated as the internals of the Fiber implementation are, they are self contained, mostly don't affect other parts of the engine, and we're not responsible for the maintenance of the most complicated code it uses. There is still complexity above and beyond the code we don't own (boost owns it), but it looks manageable because it's contained.

What is going on ?

The is_literal RFC seeks to provide a tool for userland that can help to avoid injection vulnerabilities, where strings composed of literal values and user provided input (strings) may lead to injection.

It would seem simple enough to make a flag on literal values, that the programmer typed in their code, and allow them to detect the literalness of any variable at runtime.

But we have all of history bearing down on us, and it's not so straightforward.

We began by focusing on strings alone. The reason is that we can find space on the structure that represents a string to set a flag, and avoiding user input strings is obviously required for any implementation.

Because of optimizations in NG - scalars with types below string are stored on the stack, and are not refcounted - and one optimization that came after, there is no usable space on every variable for a flag.

There is space, but in order to use it, we would have to disable an optimization that assumes there is only one flag set in the only place where we could set a flag. This would not be an acceptable implementation detail, and so is not possible.

For string support to be generally useful, the engine must produce a literal where all of the input to an instruction (or function) is literal. This allows the programmer to reason easily about how concatenation (or other functions that are literal aware) work.
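A rough way to picture that rule is the sketch below, written in Python rather than engine C so it stays short. The `Str` class and the `concat` and `is_literal` functions here are made up for illustration; they are assumptions about the shape of the idea, not the engine's structures or the RFC's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Str:
    """Hypothetical stand-in for an engine string that carries a literal flag."""
    value: str
    literal: bool = False


def concat(*parts: Str) -> Str:
    """Concatenate strings; the result is literal only if every input is."""
    return Str(
        "".join(p.value for p in parts),
        all(p.literal for p in parts),
    )


def is_literal(s: Str) -> bool:
    """Report whether the flag survived on this string."""
    return s.literal


# Strings the programmer typed in source are created with the flag set.
query = concat(Str("SELECT * FROM users ", True), Str("WHERE id = ?", True))

# Anything that arrives at runtime (request input, for example) is not flagged.
user_input = Str("1 OR 1=1", False)
tainted = concat(Str("SELECT * FROM users WHERE id = ", True), user_input)
```

In this model `is_literal(query)` holds and `is_literal(tainted)` does not, which is the property a programmer would want to reason about: one unflagged input anywhere in the chain clears the flag on the result.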

Concatenation is how people tend to build their queries. Even if they are using parameterized queries, even if they are using a query builder, concatenation is still used.

What's the problem ?

Early on in the discussion, a couple of people requested that we allow strings and integers to be concatenated to produce a literal. When this was requested, at least one person who requested it knew that we wouldn't be able to track the source of integers.

I didn't like this idea at first; it's less than pure. Nevertheless, I spent rather a long time thinking about it, before determining that nothing dangerous can happen: if we're talking about injection, concatenating a string and an integer cannot lead to it.

At compile time, before any user input has been provided, the engine may optimize certain concatenations and even function calls, that produce literal strings, and may contain types other than string. In other words, the engine is allowed to concatenate whatever it wants if it can determine there are no side effects and it has all of the information available to perform the concatenation (or indeed, function call) early. None of these concatenations may include user input. So it's "safe" in the narrow sense that the programmer provided all of the data: there may be a mistake in the query, but an injection is not possible.

I came around to seeing that including support for integers, even though we are not able to track their source, creates some symmetry - runtime concatenation or calls will behave the same way as the compiler and opcache with regard to string and integer values.
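To make the integer rule concrete, here is a small Python model of it. Again, `Str` and `concat` are hypothetical names for illustration only, and the choice to reject every other type is an assumption made to keep the sketch small, not something the RFC specified.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Str:
    """Hypothetical stand-in for an engine string that carries a literal flag."""
    value: str
    literal: bool = False


def concat(*parts: Union[Str, int]) -> Str:
    """Concatenate strings and integers.

    Integers carry no flag of their own, so they are simply allowed through:
    they never clear the flag on the result. Unflagged strings still do.
    """
    pieces = []
    literal = True
    for part in parts:
        if isinstance(part, Str):
            pieces.append(part.value)
            literal = literal and part.literal
        elif isinstance(part, int) and not isinstance(part, bool):
            pieces.append(str(part))
        else:
            raise TypeError(f"unsupported operand: {type(part).__name__}")
    return Str("".join(pieces), literal)


page_size = 10  # an integer whose source we cannot track
with_int = concat(Str("SELECT * FROM posts LIMIT ", True), page_size)
with_str = concat(Str("SELECT * FROM posts LIMIT ", True), Str("10", False))
```

Both results hold the same text, but only `with_int` keeps the flag. An unflagged string still clears it, so the rule relaxes nothing about user provided strings.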

A wave of fear washed over internals, and some very loud people objected, and coloured the conversation.

I'm not a security expert, just a code monkey. I tried to reason with some of these people and it failed hard.

So I reached out to somebody that everybody recognizes as a security expert in the PHP ecosystem, Scott Arciszewski from Paragon Initiative Enterprises. 

I was quite ready to admit I was wrong when I asked them the question "Is it reasonable to include support for concatenating a string and an integer even when the source of the integer is unknown?"

Here's an excerpt from their response to the mailing list:

Injection attacks (SQL injection, LDAP injection, XSS, etc.) are, at
their core, an instance of type confusion between data and code. In
order for the injection to *do* anything, it needs to be in the same
input domain as the language the code is written in. Try as you might,
there is no integer that will, upon concatenation with a string,
produce a control character for HTML (i.e. `>`) or SQL (i.e. `'`).

I really thought this would help. I know that lots of the people reading internals aren't security experts, and clear words, and clear thoughts, from someone who is should help them to make good decisions.

It didn't really help, people just started to argue ... which I found embarrassing ...

Because of a bad naming decision (for a little while, the RFC was called is_trusted), there is this idea that if is_literal (or whatever you want to call it) returns true, the value is safe to use in all circumstances, that not only should it be free of injection, but it should be free of mistakes.

What we wanted at this point was to rebrand: we wanted to frame the thing we are introducing as the concept of Nobility, a name and idea suggested by Scott.

We never got to do this, but I think it would have been our best move. We get to define what nobility is, what kind of data it includes, and how they interact (or fail to, because noble) with other variables.

Instead, we had to remove the support for integers that made the feature easier to reason about and more generally useful.

Where are we now ?

Without support for integers, we are left with something that may look inconsistent if you pay close enough attention.

We cannot disable the optimizations in the compiler or opcache that lead to the production of literal values inclusive of integers (and other types). That would obviously not be an acceptable implementation detail.

So now, whether or not the engine produces a literal depends on the very fine details of how you wrote the code or performed a call. Without detailed knowledge of the engine, this makes is_literal look unpredictable and difficult to reason about.

In addition, we've broken a basic, and safe (in the narrow sense we are talking about) use case - you can no longer rely on the concatenation of a string and an integer producing a literal.
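The mismatch can be sketched with a toy model. This is not how the compiler or Opcache are written; `fold_constants` and `runtime_concat` are invented names that only imitate the two behaviours described above, under the assumption that a folded expression is stored as one literal string while the runtime rule now ignores integers.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Str:
    """Hypothetical stand-in for an engine string that carries a literal flag."""
    value: str
    literal: bool = False


def fold_constants(*operands: Union[str, int]) -> Str:
    """Model of a compile-time fold: every operand was written in source,
    so the folded result is stored as a single literal string."""
    return Str("".join(str(o) for o in operands), True)


def runtime_concat(*operands: Union[Str, int]) -> Str:
    """Model of runtime concatenation without integer support: any operand
    that is not a flagged string clears the flag on the result."""
    text = "".join(o.value if isinstance(o, Str) else str(o) for o in operands)
    literal = all(isinstance(o, Str) and o.literal for o in operands)
    return Str(text, literal)


# Written as one constant expression, the optimizer can fold it early.
folded = fold_constants("LIMIT ", 10)

# Written with the integer in a variable, the work happens at runtime.
limit = 10
at_runtime = runtime_concat(Str("LIMIT ", True), limit)
```

In this model the two values hold identical text, yet `folded.literal` is true and `at_runtime.literal` is false. Which one you get depends only on how the expression was written.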

What do we do now ?

I'm not sure. 

I'm disappointed that the expert opinion I solicited did not change the direction of the conversation. If you're not going to listen to an expert in the field of security, about security things, I think you're not really going to listen to anyone; maybe you consider yourself the expert.

I would like to be free to define the concept of nobility, and I'd like people to approach that discussion armed with the expert advice we've had, and free of the notion that we are trying to protect you from mistakes in general.

I'm unsure of our next move ...

Peace out Phomies.