Pushing crt-pi shader optimization even further

battaglia01

WARNING: long post. Hoping to get @davej's attention on this one... :)

I've been trying to push the optimization on the crt-pi shader even further than what it is now. From my initial look at this thing, it seems like a lot of the work takes place in the gamma correction code. There are a couple of really easy optimizations that can be done there.

To start, let's go to the GitHub source (https://github.com/libretro/glsl-shaders/blob/master/crt/shaders/crt-pi.glsl#L190) and look at lines 190-208.

There are a lot of macros here determining how this code works. To make things simple at first, let's look at the behavior of the simplified FAKE_GAMMA correction, which means that the macros SCANLINES, GAMMA, and FAKE_GAMMA are all set. Then lines 190-208 simplify to the following:

colour = colour * colour;
scanLineWeight *= BLOOM_FACTOR;
colour *= scanLineWeight;
colour = sqrt(colour);

This code has three multiplications, a call to sqrt, and four assignments (maybe optimized out?). We can replace the entire thing with this algebraically equivalent expression:

colour *= sqrt(scanLineWeight * BLOOM_FACTOR);

This definitely saves one multiplication per pixel. It also probably saves an unnecessary read/write to scanLineWeight, which is never used again so we don't need to change its value. Depending on compiler optimization, it might save three more reads/writes to colour.
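Here is a quick numerical sanity check of that equivalence, written as a rough Python sketch rather than shader code. It models one colour channel as a scalar in [0, 1]; the function names and the BLOOM_FACTOR value of 1.5 are only assumptions picked for illustration.

```python
import math

def fake_gamma_original(colour, scan_line_weight, bloom_factor):
    """Scalar model of the four original FAKE_GAMMA statements."""
    colour = colour * colour
    scan_line_weight *= bloom_factor
    colour *= scan_line_weight
    return math.sqrt(colour)

def fake_gamma_simplified(colour, scan_line_weight, bloom_factor):
    """Scalar model of: colour *= sqrt(scanLineWeight * BLOOM_FACTOR)."""
    return colour * math.sqrt(scan_line_weight * bloom_factor)

def max_abs_difference(steps=20, bloom_factor=1.5):
    """Largest difference between the two forms on a grid of colour/weight values."""
    worst = 0.0
    for i in range(steps + 1):
        for j in range(steps + 1):
            c = i / steps  # colour channel in [0, 1]
            w = j / steps  # scanline weight in [0, 1]
            diff = abs(fake_gamma_original(c, w, bloom_factor)
                       - fake_gamma_simplified(c, w, bloom_factor))
            worst = max(worst, diff)
    return worst

# Should print a value at floating-point rounding level.
print(max_abs_difference())
```

The identity relies on colour being non-negative, since sqrt(colour * colour) only equals colour in that case.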

By the way, note that this simplification makes clear that the FAKE_GAMMA routine, as currently written, does not actually do any gamma correction at all!! colour is squared and then square-rooted, so its net exponent is 1. If we really did want "cheap gamma correction" with gamma=1.5, we could do that by changing the above to

colour *= sqrt(colour * scanLineWeight * BLOOM_FACTOR);

or for gamma=2.0, we would use

colour *= colour * sqrt(scanLineWeight * BLOOM_FACTOR);

And now we have the same number of multiplications as the original, but we actually get gamma correction, and still save a read/write to scanLineWeight and possibly three more reads/writes to colour.
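To double-check that these two one-liners really give exponents of 1.5 and 2.0 on colour, here is a small Python sketch (again a scalar model of one channel, not shader code; the helper names are made up for this example). It compares each form against colour^gamma * sqrt(scanLineWeight * BLOOM_FACTOR).

```python
import math

def cheap_gamma_1_5(colour, weight, bloom):
    """Scalar model of: colour *= sqrt(colour * scanLineWeight * BLOOM_FACTOR)."""
    return colour * math.sqrt(colour * weight * bloom)

def cheap_gamma_2_0(colour, weight, bloom):
    """Scalar model of: colour *= colour * sqrt(scanLineWeight * BLOOM_FACTOR)."""
    return colour * (colour * math.sqrt(weight * bloom))

def reference(colour, weight, bloom, gamma):
    """colour raised to gamma, times the square root of the scanline/bloom term."""
    return colour ** gamma * math.sqrt(weight * bloom)

# Assumed sample values, chosen only for illustration.
for c in (0.1, 0.5, 0.9):
    print(c, cheap_gamma_1_5(c, 0.8, 1.5), reference(c, 0.8, 1.5, 1.5))
    print(c, cheap_gamma_2_0(c, 0.8, 1.5), reference(c, 0.8, 1.5, 2.0))
```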

FWIW I have used this and it works pretty well! In informal testing it seems faster than both the original gamma and the fake gamma correction, although I could use some tips on how to do more formal benchmarking.

Things get a bit more complicated for the true gamma correction, but we can obtain a comparable speedup if we're willing to tweak things just slightly.
If we instead assume FAKE_GAMMA is not set, then lines 190-208 simplify to the following:

colour = pow(colour, vec3(INPUT_GAMMA));
scanLineWeight *= BLOOM_FACTOR;
colour *= scanLineWeight;
colour = pow(colour, vec3(1.0/OUTPUT_GAMMA));

This turns out to be equivalent to the following:

colour = pow(pow(colour, vec3(INPUT_GAMMA)) * scanLineWeight * BLOOM_FACTOR, vec3(1.0 / OUTPUT_GAMMA));

So far this isn't quite as good - we've saved on the one read/write to scanLineWeight, and maybe a couple read/writes to colour, although compiler optimizations might render those things moot.
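The same kind of scalar Python check works for the true gamma path. This is only a sketch: the gamma values 2.4 and 2.2 and the bloom value 1.5 are assumed for illustration, and the function names are mine.

```python
import math

def true_gamma_original(colour, weight, bloom, input_gamma, output_gamma):
    """Scalar model of the four statements with FAKE_GAMMA unset."""
    colour = colour ** input_gamma
    weight *= bloom
    colour *= weight
    return colour ** (1.0 / output_gamma)

def true_gamma_nested(colour, weight, bloom, input_gamma, output_gamma):
    """Scalar model of the single nested-pow expression."""
    return (colour ** input_gamma * weight * bloom) ** (1.0 / output_gamma)

# Assumed sample settings, for illustration only.
for c in (0.1, 0.5, 0.9):
    a = true_gamma_original(c, 0.8, 1.5, 2.4, 2.2)
    b = true_gamma_nested(c, 0.8, 1.5, 2.4, 2.2)
    print(c, a, b, math.isclose(a, b, rel_tol=1e-12))
```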

What really makes things much faster, so that we need only one call to pow, is if we tweak how we weight the scanlines and BLOOM_FACTOR so that they are affected equally by both gamma correction stages. That is, right now they are being weighted by only the output gamma and not the input gamma. If we are willing to bring them inside the input gamma weighting as well, then we would instead get

colour = pow(pow(colour * scanLineWeight * BLOOM_FACTOR, vec3(INPUT_GAMMA)), vec3(1.0 / OUTPUT_GAMMA));

which we can simplify to

colour = pow(colour * scanLineWeight * BLOOM_FACTOR, vec3(INPUT_GAMMA / OUTPUT_GAMMA));

And now we only need one call to pow. Furthermore, at this point, we no longer even need to have separate input and output gamma settings; we can simply have one TOTAL_GAMMA that represents the combined exponent of the system, so that we get

colour = pow(colour * scanLineWeight * BLOOM_FACTOR, vec3(TOTAL_GAMMA));

And now we're really doing quite well. One call to pow, two multiplications, two reads, and one write. If you retune your BLOOM_FACTOR setting (the exact match for the bloom term is BLOOM_FACTOR^(1/INPUT_GAMMA), which is a lower value whenever BLOOM_FACTOR is above 1) you can find something exactly equivalent to the old one as far as bloom goes, and if you adjust your scanline width (which will now be slightly "fatter" for the same value) you can find something that approximates the behavior of the original decently well and is much faster.
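The BLOOM_FACTOR retuning can be worked out explicitly. With scanLineWeight fixed at 1, the single-pow form matches the two-pow form when the new bloom value is BLOOM_FACTOR^(1/INPUT_GAMMA). Here is a rough Python sketch of that; the gamma values 2.4 and 2.2 and the bloom value 1.5 are assumptions for illustration, and the function names are hypothetical.

```python
import math

INPUT_GAMMA = 2.4   # assumed illustration value
OUTPUT_GAMMA = 2.2  # assumed illustration value
TOTAL_GAMMA = INPUT_GAMMA / OUTPUT_GAMMA

def two_pow(colour, weight, bloom):
    """Scalar model of the original two-pow path."""
    return (colour ** INPUT_GAMMA * weight * bloom) ** (1.0 / OUTPUT_GAMMA)

def single_pow(colour, weight, bloom):
    """Scalar model of: pow(colour * scanLineWeight * BLOOM_FACTOR, TOTAL_GAMMA)."""
    return (colour * weight * bloom) ** TOTAL_GAMMA

def retuned_bloom(bloom):
    """Bloom value that makes single_pow match two_pow when weight == 1."""
    return bloom ** (1.0 / INPUT_GAMMA)

# With weight == 1 the two forms agree once bloom is retuned.
for c in (0.1, 0.5, 0.9):
    print(c, two_pow(c, 1.0, 1.5), single_pow(c, 1.0, retuned_bloom(1.5)))

# With weight < 1 they no longer agree, which is the scanline change described above.
print(two_pow(0.5, 0.5, 1.5), single_pow(0.5, 0.5, retuned_bloom(1.5)))
```

Note that for a bloom value above 1 the retuned value comes out smaller than the original one.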

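Note also that if TOTAL_GAMMA equals exactly 1.5 or 2, colour gets the same exponent as in my corrected FAKE_GAMMA settings described previously, which need only a call to sqrt rather than a call to pow. The scanLineWeight * BLOOM_FACTOR term still differs (it is raised to TOTAL_GAMMA here and to 0.5 there), so those two settings would need retuning to match exactly.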

It is also noteworthy that you could also just decide to take BLOOM_FACTOR and scanLineWeight out of the pow call, to obtain

colour = pow(colour, vec3(TOTAL_GAMMA)) * scanLineWeight * BLOOM_FACTOR;

This will also require us to retune BLOOM_FACTOR (the exact match to the original two-pow code is BLOOM_FACTOR^(1/OUTPUT_GAMMA)), and now scanlines will be "thinner" before compensation, but we can likewise tweak things to dial it in similarly. This should be no faster or slower than the single-pow version above but may be conceptually cleaner, and we can likewise change the FAKE_GAMMA to match this.
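For this variant the compensation is easy to state: if both scanLineWeight and BLOOM_FACTOR are raised to 1/OUTPUT_GAMMA before being multiplied in, the result matches the original two-pow code. A minimal Python sketch of that, with assumed gamma values (2.4 and 2.2) and made-up helper names:

```python
import math

INPUT_GAMMA = 2.4   # assumed illustration value
OUTPUT_GAMMA = 2.2  # assumed illustration value
TOTAL_GAMMA = INPUT_GAMMA / OUTPUT_GAMMA

def two_pow(colour, weight, bloom):
    """Scalar model of the original two-pow path."""
    return (colour ** INPUT_GAMMA * weight * bloom) ** (1.0 / OUTPUT_GAMMA)

def pow_outside(colour, weight, bloom):
    """Scalar model of: pow(colour, TOTAL_GAMMA) * scanLineWeight * BLOOM_FACTOR."""
    return colour ** TOTAL_GAMMA * weight * bloom

def compensate(value):
    """Weight or bloom value that reproduces the two-pow result in pow_outside."""
    return value ** (1.0 / OUTPUT_GAMMA)

# Assumed sample values, for illustration only.
for c in (0.1, 0.5, 0.9):
    for w in (0.2, 0.8):
        print(c, w, two_pow(c, w, 1.5),
              pow_outside(c, compensate(w), compensate(1.5)))
```

In the shader the scanline compensation would of course have to come from retuning the scanline settings rather than from an extra pow, so this only shows what the target is.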

That is all for now - this post is pretty long and I would appreciate any thoughts if anyone has them. In particular, a lot of my testing on this has been informal and I would appreciate any tips for how to do strict benchmarking. I have spent a lot of time on this (more than I would like to admit) and I'd like to do more work really dialing this shader in.

EDIT: Also, quick add, I wrote most of this up in a GitHub issue last year (https://github.com/libretro/glsl-shaders/issues/35), though this is much more detailed. It seems like @davej is more active responding to things on this forum than in that GitHub repo though.

battaglia01

Actually, thinking about this further, it might really be best to move the scanLineWeight * BLOOM_FACTOR out of the pow call. All this does is apply a nonlinear filter to the scanlines (why?), which should increase aliasing artifacts.

Another thought I had considered was using a quadratic approximation to the gamma filter, given by x^a ~= x*(1+(a-1)(x-1)), which comes from a first-order Taylor expansion of x^(a-1) around x=1 and works pretty well. A cubic approximation is even closer (but may not be necessary). But I note that someone seems to have had a similar idea with these new "zfast" shaders, so perhaps those are already doing this sort of thing...
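To get a feel for how close the quadratic approximation is, here is a small Python sketch that measures its worst-case error against the exact power over [0, 1]. The function names and the grid size are just choices made for this example.

```python
def gamma_approx(x, a):
    """Quadratic approximation: x^a ~= x * (1 + (a - 1) * (x - 1))."""
    return x * (1.0 + (a - 1.0) * (x - 1.0))

def max_error(a, steps=100):
    """Largest absolute error of the approximation on an even grid over [0, 1]."""
    worst = 0.0
    for i in range(steps + 1):
        x = i / steps
        worst = max(worst, abs(x ** a - gamma_approx(x, a)))
    return worst

# The approximation is exact for a = 1 and a = 2; in between it peaks mid-range.
for a in (1.0, 1.5, 2.0):
    print(a, max_error(a))
```

For a = 1.5 the error peaks at x = 0.25, where the exact value is 0.125 and the approximation gives 0.15625. For exponents above 2 the approximation goes negative near x = 0, so it would need clamping there.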