Pushing crt-pi shader optimization even further
battaglia01 last edited by battaglia01
WARNING: long post. Hoping to get @davej's attention on this one... :)
I've been trying to push the optimization on the crt-pi shader even further than what it is now. From my initial look at this thing, it seems like a lot of the work takes place in the gamma correction code. There are a couple of really easy optimizations that can be done there.
To start, let's go to the GitHub source (https://github.com/libretro/glsl-shaders/blob/master/crt/shaders/crt-pi.glsl#L190) and look at lines 190-208.
There are a lot of macros here determining how this code works. To make things simple at first, let's look at the behavior of the simplified FAKE_GAMMA correction, which means that the relevant macros, FAKE_GAMMA included, are all set. Then lines 190-208 simplify to the following:
colour = colour * colour;
scanLineWeight *= BLOOM_FACTOR;
colour *= scanLineWeight;
colour = sqrt(colour);
This code has three multiplications, a call to sqrt, and four assignments (maybe optimized out?). We can replace the entire thing with this algebraically equivalent expression:
colour *= sqrt(scanLineWeight * BLOOM_FACTOR);
This definitely saves one multiplication per pixel. It also probably saves an unnecessary read/write to scanLineWeight, which is never used again so we don't need to change its value. Depending on compiler optimization, it might save three more reads/writes to colour.
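A quick way to sanity-check that equivalence numerically is sketched below. This is Python rather than GLSL, it works on one colour channel at a time, and the BLOOM_FACTOR value of 1.5 is only an assumption for illustration. The two versions should agree for non-negative inputs, which is the only case where sqrt(colour * colour) gives back colour.

```python
import math

BLOOM_FACTOR = 1.5  # assumed value, for illustration only


def fake_gamma_original(colour, scan_line_weight):
    """The four shader statements, applied to a single colour channel."""
    colour = colour * colour
    scan_line_weight *= BLOOM_FACTOR
    colour *= scan_line_weight
    return math.sqrt(colour)


def fake_gamma_simplified(colour, scan_line_weight):
    """The proposed one-line replacement."""
    return colour * math.sqrt(scan_line_weight * BLOOM_FACTOR)


def max_difference(steps=20):
    """Largest gap between the two versions over a grid of inputs in [0, 1]."""
    worst = 0.0
    for i in range(steps + 1):
        for j in range(steps + 1):
            colour, weight = i / steps, j / steps
            gap = abs(fake_gamma_original(colour, weight)
                      - fake_gamma_simplified(colour, weight))
            worst = max(worst, gap)
    return worst


print(max_difference())  # should be zero up to floating-point rounding
```

This only checks the arithmetic. It says nothing about how either version performs on the GPU.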
By the way, note that this simplification makes clear that the FAKE_GAMMA routine, as currently written, does not actually do any gamma correction at all!! colour is only ever scaled, never raised to a power. If we really did want "cheap gamma correction" with gamma=1.5, we could do that by changing the above to
colour *= sqrt(colour * scanLineWeight * BLOOM_FACTOR);
or for gamma=2.0, we would use
colour *= colour * sqrt(scanLineWeight * BLOOM_FACTOR);
And now we have the same number of multiplications as the original, but we actually get gamma correction, and still save a read/write to scanLineWeight and possibly three more reads/writes to colour.
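To see which power colour actually ends up raised to in each variant, one option is to measure the slope of log(output) against log(input). The sketch below does that in Python for a single colour channel. BLOOM_FACTOR = 1.5 and the helper name effective_exponent are my own assumptions for illustration, not anything from the shader.

```python
import math

BLOOM_FACTOR = 1.5  # assumed value, for illustration only


def fake_gamma_current(colour, weight):
    # colour *= sqrt(scanLineWeight * BLOOM_FACTOR)
    return colour * math.sqrt(weight * BLOOM_FACTOR)


def cheap_gamma_1_5(colour, weight):
    # colour *= sqrt(colour * scanLineWeight * BLOOM_FACTOR)
    return colour * math.sqrt(colour * weight * BLOOM_FACTOR)


def cheap_gamma_2_0(colour, weight):
    # colour *= colour * sqrt(scanLineWeight * BLOOM_FACTOR)
    return colour * colour * math.sqrt(weight * BLOOM_FACTOR)


def effective_exponent(shade, weight=0.8, low=0.25, high=0.5):
    """Slope of log(output) against log(input): the power colour is raised to."""
    ratio = shade(high, weight) / shade(low, weight)
    return math.log(ratio) / math.log(high / low)


for shade in (fake_gamma_current, cheap_gamma_1_5, cheap_gamma_2_0):
    print(shade.__name__, round(effective_exponent(shade), 6))
```

The current FAKE_GAMMA path should report an exponent of 1, and the two variants 1.5 and 2.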
FWIW I have used this and it works pretty well! In informal testing it seems faster than both the original gamma and the fake gamma correction, although I could use some tips on how to do more formal benchmarking.
Things get a bit more complicated for the true gamma correction, but we can obtain a comparable speedup if we're willing to tweak things just slightly.
If we instead assume FAKE_GAMMA is not set, then lines 190-208 simplify to the following:
colour = pow(colour, vec3(INPUT_GAMMA));
scanLineWeight *= BLOOM_FACTOR;
colour *= scanLineWeight;
colour = pow(colour, vec3(1.0/OUTPUT_GAMMA));
This turns out to be equivalent to the following:
colour = pow(pow(colour, vec3(INPUT_GAMMA)) * scanLineWeight * BLOOM_FACTOR, vec3(1.0/OUTPUT_GAMMA));
So far this isn't quite as good - we've saved on the one read/write to scanLineWeight, and maybe a couple read/writes to colour, although compiler optimizations might render those things moot.
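It can help to pull the nested expression apart into a colour term and a weighting term. The Python sketch below does this for one colour channel. The gamma values of 2.4 and 2.2 and BLOOM_FACTOR = 1.5 are assumptions for illustration. The factored form shows that colour ends up raised to INPUT_GAMMA/OUTPUT_GAMMA, while scanLineWeight * BLOOM_FACTOR is only raised to 1/OUTPUT_GAMMA.

```python
import math

INPUT_GAMMA = 2.4   # assumed values, for illustration only
OUTPUT_GAMMA = 2.2
BLOOM_FACTOR = 1.5


def true_gamma_steps(colour, weight):
    """The four shader statements, applied to a single colour channel."""
    colour = colour ** INPUT_GAMMA
    weight *= BLOOM_FACTOR
    colour *= weight
    return colour ** (1.0 / OUTPUT_GAMMA)


def true_gamma_factored(colour, weight):
    """Same result with the colour and weighting terms pulled apart."""
    colour_term = colour ** (INPUT_GAMMA / OUTPUT_GAMMA)
    weight_term = (weight * BLOOM_FACTOR) ** (1.0 / OUTPUT_GAMMA)
    return colour_term * weight_term


for colour, weight in ((0.2, 1.0), (0.6, 0.5), (0.9, 0.25)):
    print(true_gamma_steps(colour, weight), true_gamma_factored(colour, weight))
```

Each printed pair should match up to rounding.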
What really makes things much faster, so that we need only one call to pow, is if we tweak how we weight the scanlines and BLOOM_FACTOR so that they are affected equally by both gamma correction stages. That is, right now they are being weighted by only the output gamma and not the input gamma. If we are willing to bring them inside the input gamma weighting as well, then we would instead get
colour = pow(pow(colour * scanLineWeight * BLOOM_FACTOR, vec3(INPUT_GAMMA)), vec3(1.0/OUTPUT_GAMMA));
which we can simplify to
colour = pow(colour * scanLineWeight * BLOOM_FACTOR, vec3(INPUT_GAMMA/OUTPUT_GAMMA));
And now we only need one call to pow. Furthermore, at this point, we no longer even need to have separate input and output gamma settings; we can simply have one TOTAL_GAMMA that represents the combined exponent of the system, so that we get
colour = pow(colour * scanLineWeight * BLOOM_FACTOR, vec3(TOTAL_GAMMA));
And now we're really doing quite well. One call to pow, two multiplications, two reads, and one write. If you retune your BLOOM_FACTOR setting you can find something exactly equivalent to the old one, and if you increase your scanline width (which will now be slightly "fatter" for the same value) you can find something that approximates the behavior of the original decently well and is much faster.
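To make the BLOOM_FACTOR retuning concrete: with TOTAL_GAMMA = INPUT_GAMMA/OUTPUT_GAMMA, the single-pow version reproduces the old output at a scanline weight of 1 when the new BLOOM_FACTOR is the old one raised to 1/INPUT_GAMMA. The Python sketch below checks this for one colour channel. The gamma values, the old BLOOM_FACTOR of 1.5 and the helper name matching_bloom are assumptions for illustration.

```python
import math

INPUT_GAMMA = 2.4   # assumed values, for illustration only
OUTPUT_GAMMA = 2.2
TOTAL_GAMMA = INPUT_GAMMA / OUTPUT_GAMMA


def original(colour, weight, bloom):
    """Two-pow version: weighting applied between the gamma stages."""
    return (colour ** INPUT_GAMMA * weight * bloom) ** (1.0 / OUTPUT_GAMMA)


def single_pow(colour, weight, bloom):
    """One-pow version: weighting applied before the combined exponent."""
    return (colour * weight * bloom) ** TOTAL_GAMMA


def matching_bloom(bloom):
    """BLOOM_FACTOR for single_pow that reproduces original at weight == 1."""
    return bloom ** (1.0 / INPUT_GAMMA)


old_bloom = 1.5
new_bloom = matching_bloom(old_bloom)
for weight in (1.0, 0.75, 0.5):
    print(weight,
          original(0.6, weight, old_bloom),
          single_pow(0.6, weight, new_bloom))
```

At weight 1 the two columns should agree. At lower weights the single-pow column falls off faster, which is the part that has to be dialed in through the scanline settings.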
Note also that if TOTAL_GAMMA equals exactly 1.5 or 2, we get the same colour curve as my corrected FAKE_GAMMA versions described previously (only the weighting of scanLineWeight * BLOOM_FACTOR differs), which use one call to sqrt rather than a call to pow.
You could also just decide to take scanLineWeight out of the pow call, to obtain
colour = pow(colour, vec3(TOTAL_GAMMA)) * scanLineWeight * BLOOM_FACTOR;
This will require us to retune BLOOM_FACTOR as well, and now scanlines will be "thinner" before compensation, but we can likewise tweak things to dial it in similarly. This should be no faster or slower than the original but may be conceptually cleaner, and we can likewise change the FAKE_GAMMA version to match this.
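A sketch of this variant for one colour channel, again in Python with assumed gamma values and an assumed old BLOOM_FACTOR of 1.5. Here the weighting is a plain linear scale on the gamma-corrected colour, and matching the old output at a scanline weight of 1 takes the old BLOOM_FACTOR raised to 1/OUTPUT_GAMMA (the helper name matching_bloom_outside is mine).

```python
import math

INPUT_GAMMA = 2.4   # assumed values, for illustration only
OUTPUT_GAMMA = 2.2
TOTAL_GAMMA = INPUT_GAMMA / OUTPUT_GAMMA


def original(colour, weight, bloom):
    """Two-pow version: weighting applied between the gamma stages."""
    return (colour ** INPUT_GAMMA * weight * bloom) ** (1.0 / OUTPUT_GAMMA)


def pow_outside(colour, weight, bloom):
    """One pow on colour only; the weighting is applied afterwards."""
    return colour ** TOTAL_GAMMA * weight * bloom


def matching_bloom_outside(bloom):
    """BLOOM_FACTOR for pow_outside that reproduces original at weight == 1."""
    return bloom ** (1.0 / OUTPUT_GAMMA)


new_bloom = matching_bloom_outside(1.5)
print(original(0.6, 1.0, 1.5), pow_outside(0.6, 1.0, new_bloom))
# Halving the weight halves the output, since the weighting is linear here.
print(pow_outside(0.6, 0.5, new_bloom), 0.5 * pow_outside(0.6, 1.0, new_bloom))
```

Both printed pairs should match up to rounding.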
That is all for now - this post is pretty long and I would appreciate any thoughts if anyone has them. In particular, a lot of my testing on this has been informal and I would appreciate any tips for how to do strict benchmarking. I have spent a lot of time on this (more than I would like to admit) and I'd like to do more work really dialing this shader in.
EDIT: Also, quick add, wrote most of this up in a GitHub issue last year (https://github.com/libretro/glsl-shaders/issues/35), though this is much more detailed. It seems like @davej is more active responding to things on this forum than in that GitHub repo though.
battaglia01 last edited by battaglia01
Actually, thinking about this further, it might really be best to move the scanLineWeight * BLOOM_FACTOR out of the pow call. All that keeping it inside does is apply a nonlinear filter to the scanlines (why?), which should increase aliasing artifacts.
Another thought I had considered was using a quadratic Taylor series approximation to the gamma filter given by x^a ~= x*(1+(a-1)(x-1)), which works pretty well. A cubic approximation is even closer (but may not be necessary). But I note that someone seems to have had a similar idea with these new "zfast" shaders, so perhaps those are already doing this sort of thing...
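For a rough feel of how close the quadratic approximation gets, here is a small Python sketch that scans x over [0, 1] and reports the largest absolute error against the exact power. The exponents tried are my own picks for illustration.

```python
def gamma_approx(x, a):
    """x**a approximated by x * (1 + (a - 1) * (x - 1)), expanded around x = 1."""
    return x * (1.0 + (a - 1.0) * (x - 1.0))


def max_abs_error(a, steps=1000):
    """Largest absolute error of the approximation for x in [0, 1]."""
    worst = 0.0
    for i in range(steps + 1):
        x = i / steps
        worst = max(worst, abs(gamma_approx(x, a) - x ** a))
    return worst


for a in (1.5, 2.0, 2.2):
    print(a, max_abs_error(a))
```

For a = 2 the formula reduces to x*x, so the error there is just rounding. One thing to watch: for a > 2 the approximation dips slightly below zero close to x = 0, so a shader using it would probably want a clamp.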