Fast branchless RGB to HSV conversion in GLSL
Some time ago I devised an original algorithm to convert from RGB to HSV using very few CPU instructions and I wrote a small article about it.
When looking for a GLSL or HLSL conversion routine, I have found implementations of my own algorithm. However they were almost all straightforward, failing to take full advantage of the GPU’s advanced swizzling features.
So here it is, the best version I could come up with:
vec3 rgb2hsv(vec3 c) { vec4 K = vec4(0.0, -1.0 / 3.0, 2.0 / 3.0, -1.0); vec4 p = mix(vec4(c.bg, K.wz), vec4(c.gb, K.xy), step(c.b, c.g)); vec4 q = mix(vec4(p.xyw, c.r), vec4(c.r, p.yzx), step(p.x, c.r)); float d = q.x - min(q.w, q.y); float e = 1.0e-10; return vec3(abs(q.z + (q.w - q.y) / (6.0 * d + e)), d / (q.x + e), q.x); }
Update: Emil Persson suggests using the ternary operator explicitly to force compilers into using a fast conditional move instruction:
vec4 p = c.g < c.b ? vec4(c.bg, K.wz) : vec4(c.gb, K.xy); vec4 q = c.r < p.x ? vec4(p.xyw, c.r) : vec4(c.r, p.yzx);
And because a lot of people get it wrong, too, here is the reverse operation in GLSL. It is the algorithm almost everyone uses (or should use):
vec3 hsv2rgb(vec3 c) { vec4 K = vec4(1.0, 2.0 / 3.0, 1.0 / 3.0, 3.0); vec3 p = abs(fract(c.xxx + K.xyz) * 6.0 - K.www); return c.z * mix(K.xxx, clamp(p - K.xxx, 0.0, 1.0), c.y); }
Porting to HLSL is straightforward: replace vec3
and vec4
with float3
and float4
, mix
with lerp
, fract
with frac
, and clamp(…, 0.0, 1.0)
with saturate(…)
.
GLSL code snippet: choosing from 4 vectors by Z value
There aren’t many low-level GLSL optimisation resources out there, so I decided that I would share my thoughts when working on some specific parts of my code.
The basic, complete shader function
This one time I had four vec3
vectors, with an xy
texture coordinate, and a weight stored in z
. The code to compute the final pixel value was:
vec4 getColor(vec3 a, vec3 b, vec3 c, vec3 d) { vec4 pa = texture2D(tex, a.xy) * a.z; vec4 pb = texture2D(tex, b.xy) * b.z; vec4 pc = texture2D(tex, c.xy) * c.z; vec4 pd = texture2D(tex, d.xy) * d.z; return (pa + pb + pc + pd) / (a.z + b.z + c.z + d.z); }
That is four texture lookups, which is expensive.
The lightweight version for coarse levels of detail
If I wanted a more lightweight fragment shader, for instance when implementing variable levels of shader complexity, I would want to do only one texture lookup, and use the vector with the largest weight:
vec4 getColorFast(vec3 a, vec3 b, vec3 c, vec3 d) { if (a.z < c.z) // These two tests are a = c; // likely to be run in if (b.z < d.z) // parallel because they use b = d; // independent data. if (a.z < b.z) a = b; return texture2D(tex, a.xy); }
Only one texture lookup, but three branches. Branches are expensive and should be avoided.
Fortunately, GLSL provides step()
and mix()
(in HLSL or Cg, step()
and lerp()
) that let us do things similar to fsel
on the PowerPC, or vec_sel
in AltiVec: a branch-free select.
vec4 getColorFaster(vec3 a, vec3 b, vec3 c, vec3 d) { a = mix(a, c, step(a.z, c.z)); // Again, potentially good b = mix(b, d, step(b.z, d.z)); // scheduling between these lines a = mix(a, b, step(a.z, b.z)); return texture2D(tex, a.xy); }
Excellent! Only six instructions in addition to the texture lookup.
But if you are familiar with SSE or AltiVec-style SIMD programming on the CPU, you will know this is not the usual way to do. Rather than 4 vectors of 3 elements, SIMD programming prefers to work in parallel on 3 vectors of 4 X
, Y
and Z
components:
vec4 getColorShuffled(vec4 allx, vec4 ally, vec4 allz) { /* Now what do we do here? */ }
One nice thing to realise is that the equivalent of our previous step(a.z, c.z)
and step(b.z, d.z)
tests can be done in parallel:
vec4 getColorShuffled(vec4 allx, vec4 ally, vec4 allz) { // compare a.z >? c.z and b.z >? d.z in parallel vec2 t = step(vec2(allz[0], allz[2]), vec2(allz[1], allz[3])); // choose between a and c using t[0], between b and d using t[1] vec2 twox = mix(vec2(allx[0], allx[2]), vec2(allx[1], allx[3]), t); vec2 twoy = mix(vec2(ally[0], ally[2]), vec2(ally[1], ally[3]), t); vec2 twoz = mix(vec2(allz[0], allz[2]), vec2(allz[1], allz[3]), t); // compare a.z and b.z float s = step(twoz[0], twoz[1]); // now choose between a and b using s vec2 best = vec2(mix(twox[0], twox[1], t2), mix(twoy[0], twoy[1], s)); return texture2D(tex, best); }
Wow, that’s a bit complex. And even if we’re doing two calls to step()
instead of three, there are now five calls to mix()
instead of three. Fortunately, thanks to swizzling, we can combine most of these calls to mix()
:
vec4 getColorShuffledFast(vec4 allx, vec4 ally, vec4 allz) { vec2 t = step(allz.ag, allz.rb); vec4 twoxy = mix(vec4(allx.ag, ally.ag), vec4(allx.rb, ally.rb), t.xyxy); vec2 twoz = mix(allz.ag, allz.rb, t); float t2 = step(twoz.a, twoz.r); vec2 best = mix(twoxy.ag, twoxy.rb, t2); return texture2D(tex, best); }
That’s it! Only three mix()
and two step()
instructions. Quite a few swizzles, but these are extremely cheap on modern GPUs.
Afterthoughts
The above transformation was at the “cost” of a big data layout change known as array of structures to structure of arrays. When working in parallel on similar data, it is very often a good idea, and the GPU was no exception here.
This was actually a life saver when trying to get a fallback version of a shader to work on an i915 card, where mix
and step
must be emulated using ALU instructions, up to a maximum of 64. The result can be seen in this NaCl plugin.