Pushing crtpi shader optimization even further

WARNING: long post. Hoping to get @davej's attention on this one... :)
I've been trying to push the optimization on the crtpi shader even further than what it is now. From my initial look at this thing, it seems like a lot of the work takes place in the gamma correction code. There are a couple of really easy optimizations that can be done there.
To start, let's go to the GitHub source (https://github.com/libretro/glslshaders/blob/master/crt/shaders/crtpi.glsl#L190) and look at lines 190-208.
There are a lot of macros here determining how this code works. To keep things simple at first, let's look at the behavior of the simplified FAKE_GAMMA correction, which means that the macros SCANLINES, GAMMA, and FAKE_GAMMA are all set. Then lines 190-208 simplify to the following:
    colour = colour * colour;
    scanLineWeight *= BLOOM_FACTOR;
    colour *= scanLineWeight;
    colour = sqrt(colour);
This code has three multiplications, a call to sqrt, and four assignments (maybe optimized out?). We can replace the entire thing with this algebraically equivalent expression (equivalent as long as colour is non-negative, which should hold for sampled texture values):
    colour *= sqrt(scanLineWeight * BLOOM_FACTOR);
This definitely saves one multiplication per pixel. It also probably saves an unnecessary read/write to scanLineWeight, which is never used again so we don't need to change its value. Depending on compiler optimization, it might save three more reads/writes to colour.
By the way, note that this simplification makes clear that the FAKE_GAMMA routine, as currently written, does not actually do any gamma correction at all!! colour never ends up raised to a power. If we really did want "cheap gamma correction" with gamma=1.5, we could do that by changing the above to
    colour *= sqrt(colour * scanLineWeight * BLOOM_FACTOR);
or for gamma=2.0, we would use
    colour *= colour * sqrt(scanLineWeight * BLOOM_FACTOR);
And now we have the same number of multiplications as the original, but we actually get gamma correction, and still save a read/write to scanLineWeight and possibly three more reads/writes to colour.
FWIW I have used this and it works pretty well! In informal testing it seems faster than both the original gamma and the fake gamma correction, although I could use some tips on how to do more formal benchmarking.
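For anyone who wants to double-check the FAKE_GAMMA algebra above without touching the shader, here is a minimal Python sketch that treats a single colour channel as a float. The function names and the sample values for scanLineWeight and BLOOM_FACTOR are made up for illustration; they are not taken from the shader source.

```python
import math


def fake_gamma_original(colour, scan_line_weight, bloom_factor):
    """Step-by-step form of the FAKE_GAMMA path quoted above."""
    colour = colour * colour
    scan_line_weight *= bloom_factor
    colour *= scan_line_weight
    return math.sqrt(colour)


def fake_gamma_simplified(colour, scan_line_weight, bloom_factor):
    """Single expression: colour *= sqrt(scanLineWeight * BLOOM_FACTOR)."""
    return colour * math.sqrt(scan_line_weight * bloom_factor)


def cheap_gamma_1_5(colour, scan_line_weight, bloom_factor):
    """colour *= sqrt(colour * scanLineWeight * BLOOM_FACTOR)."""
    return colour * math.sqrt(colour * scan_line_weight * bloom_factor)


def cheap_gamma_2_0(colour, scan_line_weight, bloom_factor):
    """colour *= colour * sqrt(scanLineWeight * BLOOM_FACTOR)."""
    return colour * colour * math.sqrt(scan_line_weight * bloom_factor)


if __name__ == "__main__":
    weight, bloom = 0.6, 1.5  # hypothetical sample values
    for c in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(
            c,
            fake_gamma_original(c, weight, bloom),
            fake_gamma_simplified(c, weight, bloom),
            cheap_gamma_1_5(c, weight, bloom),
            cheap_gamma_2_0(c, weight, bloom),
        )
```

The first two columns after the input should agree to floating-point precision, while the last two bend the curve as colour to the power 1.5 and 2.0 respectively, each scaled by the same sqrt(scanLineWeight * BLOOM_FACTOR) factor.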
Things get a bit more complicated for the true gamma correction, but we can obtain a comparable speedup if we're willing to tweak things just slightly.
If we instead assume FAKE_GAMMA is not set, then lines 190-208 simplify to the following:
    colour = pow(colour, vec3(INPUT_GAMMA));
    scanLineWeight *= BLOOM_FACTOR;
    colour *= scanLineWeight;
    colour = pow(colour, vec3(1.0/OUTPUT_GAMMA));
This turns out to be equivalent to the following:
    colour = pow(pow(colour, vec3(INPUT_GAMMA)) * scanLineWeight * BLOOM_FACTOR, vec3(1.0/OUTPUT_GAMMA));
So far this isn't quite as good: we've saved on the one read/write to scanLineWeight, and maybe a couple read/writes to colour, although compiler optimizations might render those things moot.
What really makes things much faster, so that we need only one call to pow, is if we tweak how we weight the scanlines and BLOOM_FACTOR so that they are affected equally by both gamma correction stages. That is, right now they are being weighted by only the output gamma and not the input gamma. If we are willing to bring them inside the input gamma weighting as well, then we would instead get
    colour = pow(pow(colour * scanLineWeight * BLOOM_FACTOR, vec3(INPUT_GAMMA)), vec3(1.0/OUTPUT_GAMMA));
which we can simplify to
    colour = pow(colour * scanLineWeight * BLOOM_FACTOR, vec3(INPUT_GAMMA/OUTPUT_GAMMA));
And now we only need one call to pow. Furthermore, at this point, we no longer even need to have separate input and output gamma settings; we can simply have one TOTAL_GAMMA that represents the combined exponent of the system, so that we get
    colour = pow(colour * scanLineWeight * BLOOM_FACTOR, vec3(TOTAL_GAMMA));
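As a rough numeric check of the two steps above, the following Python sketch works on one colour channel. INPUT_GAMMA = 2.4 and OUTPUT_GAMMA = 2.2 are assumed example values, and the function names are invented for this illustration. It confirms that the nested pow collapses into a single pow, and shows how the single-pow version differs from the original two-pow code: only the exponent applied to scanLineWeight * BLOOM_FACTOR changes.

```python
import math

INPUT_GAMMA = 2.4   # assumed example value
OUTPUT_GAMMA = 2.2  # assumed example value
TOTAL_GAMMA = INPUT_GAMMA / OUTPUT_GAMMA


def original_two_pow(colour, weight, bloom):
    """Original path: only colour goes through the input gamma."""
    return (colour ** INPUT_GAMMA * weight * bloom) ** (1.0 / OUTPUT_GAMMA)


def nested_pow(colour, weight, bloom):
    """Tweaked path: the weights also go through the input gamma."""
    return ((colour * weight * bloom) ** INPUT_GAMMA) ** (1.0 / OUTPUT_GAMMA)


def single_pow(colour, weight, bloom):
    """Same as nested_pow, collapsed into one pow with TOTAL_GAMMA."""
    return (colour * weight * bloom) ** TOTAL_GAMMA


if __name__ == "__main__":
    c, w, b = 0.5, 0.6, 1.5  # hypothetical sample values
    print("nested pow :", nested_pow(c, w, b))
    print("single pow :", single_pow(c, w, b))
    # Ratio of the tweaked result to the original result; algebraically this
    # is (w * b) ** ((INPUT_GAMMA - 1) / OUTPUT_GAMMA), independent of colour.
    print("ratio      :", single_pow(c, w, b) / original_two_pow(c, w, b))
    print("predicted  :", (w * b) ** ((INPUT_GAMMA - 1.0) / OUTPUT_GAMMA))
```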
And now we're really doing quite well. One call to pow, two multiplications, two reads, and one write. If you adjust your BLOOM_FACTOR setting you can find something exactly equivalent to the old one, and if you increase your scanline width (which will now be slightly "fatter" for the same value) you can find something that approximates the behavior of the original decently well and is much faster.
Note also that if TOTAL_GAMMA equals exactly 1.5 or 2, colour is raised to the same power as in my corrected FAKE_GAMMA settings described previously (only the exponent applied to scanLineWeight * BLOOM_FACTOR differs), and those need just one call to sqrt rather than a call to pow.
It is also noteworthy that you could just decide to take BLOOM_FACTOR and scanLineWeight out of the pow call, to obtain
    colour = pow(colour, vec3(TOTAL_GAMMA)) * scanLineWeight * BLOOM_FACTOR;
This will require us to retune BLOOM_FACTOR as well, and now scanlines will be "thinner" before compensation, but we can likewise tweak things to dial it in similarly. This should be no faster or slower than the original but may be conceptually cleaner, and we can likewise change the FAKE_GAMMA to match this.
That is all for now; this post is pretty long and I would appreciate any thoughts if anyone has them. In particular, a lot of my testing on this has been informal and I would appreciate any tips for how to do strict benchmarking. I have spent a lot of time on this (more than I would like to admit) and I'd like to do more work really dialing this shader in.
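To make the BLOOM_FACTOR retuning mentioned above a bit more concrete, here is a small Python sketch for one colour channel. The gamma values and all function names are assumptions for illustration only. It computes the BLOOM_FACTOR that reproduces the original two-pow output when the weights are either inside or outside the single pow call, with scanLineWeight held at 1.0 so that only the bloom term matters.

```python
import math


def original(colour, weight, bloom, input_gamma, output_gamma):
    """Two-pow path: weights see only the output gamma."""
    return (colour ** input_gamma * weight * bloom) ** (1.0 / output_gamma)


def weights_inside_pow(colour, weight, bloom, total_gamma):
    """pow(colour * scanLineWeight * BLOOM_FACTOR, TOTAL_GAMMA)."""
    return (colour * weight * bloom) ** total_gamma


def weights_outside_pow(colour, weight, bloom, total_gamma):
    """pow(colour, TOTAL_GAMMA) * scanLineWeight * BLOOM_FACTOR."""
    return colour ** total_gamma * weight * bloom


def matching_bloom_inside(bloom, input_gamma):
    """Solve b2 ** (in/out) == bloom ** (1/out) for b2."""
    return bloom ** (1.0 / input_gamma)


def matching_bloom_outside(bloom, output_gamma):
    """Solve b2 == bloom ** (1/out) for b2."""
    return bloom ** (1.0 / output_gamma)


if __name__ == "__main__":
    in_g, out_g = 2.4, 2.2          # assumed example gammas
    total = in_g / out_g
    bloom = 1.5                     # hypothetical BLOOM_FACTOR
    c = 0.5
    print(original(c, 1.0, bloom, in_g, out_g))
    print(weights_inside_pow(c, 1.0, matching_bloom_inside(bloom, in_g), total))
    print(weights_outside_pow(c, 1.0, matching_bloom_outside(bloom, out_g), total))
```

All three printed values should agree to floating-point precision. Whether the matching value ends up larger or smaller than the original BLOOM_FACTOR depends on whether BLOOM_FACTOR is below or above 1. The scanline weight itself is computed per pixel, so its changed exponent cannot be absorbed into a single constant this way.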
EDIT: Also, quick add, wrote most of this up in a Github issue last year (https://github.com/libretro/glslshaders/issues/35), though this is much more detailed. It seems like @davej is more active responding to things on this forum than in that Github repo though.

Actually, thinking about this further, it might really be best to move the scanLineWeight * BLOOM_FACTOR out of the pow call. All that keeping it inside does is apply a nonlinear filter to the scanlines (why?), which should increase aliasing artifacts.
Another thought I had considered was using a quadratic approximation to the gamma filter (x times the first-order Taylor expansion of x^(a-1) about x = 1), given by x^a ~= x*(1+(a-1)(x-1)), which works pretty well. A cubic approximation is even closer (but may not be necessary). But I note that someone seems to have had a similar idea with these new "zfast" shaders, so perhaps those are already doing this sort of thing...
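To get a feel for how close the quadratic approximation x*(1+(a-1)(x-1)) is, here is a minimal Python sketch that scans x over [0, 1] and reports the largest absolute difference from x^a. The function names, the sample count, and the list of exponents are arbitrary choices for illustration.

```python
def gamma_quadratic(x, a):
    """Quadratic stand-in for x ** a: x * (1 + (a - 1) * (x - 1))."""
    return x * (1.0 + (a - 1.0) * (x - 1.0))


def max_abs_error(a, samples=1001):
    """Largest |approximation - x ** a| on an even grid over [0, 1]."""
    worst = 0.0
    for i in range(samples):
        x = i / (samples - 1)
        worst = max(worst, abs(gamma_quadratic(x, a) - x ** a))
    return worst


if __name__ == "__main__":
    # The approximation is exact at x = 0 and x = 1 for any a, and for
    # a = 2 it reduces to x * x, so the error there is only rounding noise.
    for a in (1.5, 2.0, 2.2, 2.4):
        print(a, max_abs_error(a))
```

In a shader this form costs a couple of multiply-adds per channel instead of a pow call, which is presumably the appeal; whether the error is visually acceptable for a given exponent is something to judge on screen.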