Tuesday, 21 October 2014

Unicorn or Unicode

Fig 1. A Unicorn
Sometimes we get an idea for something that our limited foresight says will be beautiful. We chase after the idea, regardless of everything else.

We have unicorns in programming.

This morning I want to talk briefly about the approach some of us are taking to add Unicode string support to PHP7.

Unicode all the things!

PHP6 was a real thing; a bunch of effort went into it. Among other things, it aimed to introduce Unicode string support at the level of the language, so that all strings were Unicode.

For many practical reasons, the project was all but abandoned. All of the features bar the Unicode string support were backported from the 6 branch to 5.3 and released.

I'm going to assert that Unicode everywhere in PHP is a unicorn. History bears my assertion out: we could not overcome the performance problem inherent in treating all strings as Unicode. There was no beautiful animal.

It was absolutely worth doing. Until code is released it is research, valuable research that all of us can learn from.

Some time in the future, it might be worth trying again.

Be sensible, Unicode some of the things!

PHP7 got fast, but the problem of not having any decent Unicode string support in PHP is hanging around like a bad smell. I don't think any of us want to destroy our newfound performance by chasing unicorns.

So some of us have put together a wrapper around ICU's UnicodeString class. The PHP API is derived, in part, from work done by Nikita Popov (@nikita_ppv).

The extension developed into something that could support backends other than ICU; a native Windows backend is in the pipeline, for example.

It performs just enough internals magic to provide a decent iterator, sensible casts, and the ability to read dimensions.

Sure, you have to choose which strings you want to work with as Unicode strings. That seems like the only sensible option to me anyway: I never really needed the string passed to the include construct to be Unicode.
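To make "a decent iterator, sensible casts, and the ability to read dimensions" a little more concrete, here is a rough userland sketch of the idea. It is not the extension's API: the class name CodePointString and everything in it are made up for illustration, it assumes PHP 7.4 or later and valid UTF-8 input, and it works on code points using PCRE rather than going through ICU.

```php
<?php
declare(strict_types=1);

// Hypothetical illustration only: NOT the extension's API.
// Wraps a UTF-8 byte string and exposes it by code point.
final class CodePointString implements IteratorAggregate, ArrayAccess, Countable
{
    /** @var string[] one UTF-8 encoded code point per element */
    private array $points;

    public function __construct(string $utf8)
    {
        // With the /u modifier PCRE treats the subject as UTF-8, so the
        // empty pattern splits between code points rather than bytes.
        $parts = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
        if ($parts === false) {
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        $this->points = $parts;
    }

    // "A decent iterator": foreach yields code points, not bytes.
    public function getIterator(): Iterator
    {
        return new ArrayIterator($this->points);
    }

    // "Read dimensions": $s[2] returns the third code point.
    public function offsetExists($offset): bool
    {
        return is_int($offset) && isset($this->points[$offset]);
    }

    public function offsetGet($offset): string
    {
        if (!$this->offsetExists($offset)) {
            throw new OutOfRangeException('No code point at that offset');
        }
        return $this->points[$offset];
    }

    // Dimensions are read-only in this sketch.
    public function offsetSet($offset, $value): void
    {
        throw new LogicException('CodePointString is read-only');
    }

    public function offsetUnset($offset): void
    {
        throw new LogicException('CodePointString is read-only');
    }

    public function count(): int
    {
        return count($this->points);
    }

    // "Sensible casts": (string) gives back the original UTF-8 bytes.
    public function __toString(): string
    {
        return implode('', $this->points);
    }
}

$s = new CodePointString("na\u{EF}ve \u{2603}");

echo strlen((string) $s), "\n"; // 10 (bytes in the plain PHP string)
echo count($s), "\n";           // 7  (code points)
echo $s[2], "\n";               // the third code point, U+00EF

foreach ($s as $index => $codePoint) {
    printf("%d: %s\n", $index, $codePoint);
}
```

Only the strings you wrap pay the cost of being split into code points; everything else stays an ordinary byte string, which is the whole point of opting in per string.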

You can have a look at the RFC, prepared by Phil Sturgeon (@philsturgeon) here.

The extension is available here.

Let the discussion begin.