# GLSL code snippet: choosing from 4 vectors by Z value

There aren’t many low-level GLSL optimisation resources out there, so I decided that I would share my thoughts when working on some specific parts of my code.

## The basic, complete shader function

This one time I had four `vec3` vectors, with an `xy` texture coordinate, and a weight stored in `z`. The code to compute the final pixel value was:

vec4 getColor(vec3 a, vec3 b, vec3 c, vec3 d) { vec4 pa = texture2D(tex, a.xy) * a.z; vec4 pb = texture2D(tex, b.xy) * b.z; vec4 pc = texture2D(tex, c.xy) * c.z; vec4 pd = texture2D(tex, d.xy) * d.z; return (pa + pb + pc + pd) / (a.z + b.z + c.z + d.z); }

That is four texture lookups, which is expensive.

## The lightweight version for coarse levels of detail

If I wanted a more lightweight fragment shader, for instance when implementing **variable levels of shader complexity**, I would want to do only one texture lookup, and use the vector with the largest weight:

vec4 getColorFast(vec3 a, vec3 b, vec3 c, vec3 d) { if (a.z < c.z) // These two tests are a = c; // likely to be run in if (b.z < d.z) // parallel because they use b = d; // independent data. if (a.z < b.z) a = b; return texture2D(tex, a.xy); }

Only one texture lookup, but three branches. Branches are expensive and should be avoided.

Fortunately, GLSL provides `step()` and `mix()` (in HLSL or Cg, `step()` and `lerp()`) that let us do things similar to `fsel` on the PowerPC, or `vec_sel` in AltiVec: a branch-free select.

vec4 getColorFaster(vec3 a, vec3 b, vec3 c, vec3 d) { a = mix(a, c, step(a.z, c.z)); // Again, potentially good b = mix(b, d, step(b.z, d.z)); // scheduling between these lines a = mix(a, b, step(a.z, b.z)); return texture2D(tex, a.xy); }

Excellent! Only six instructions in addition to the texture lookup.

But if you are familiar with SSE or AltiVec-style SIMD programming on the CPU, you will know this is not the usual way to do. Rather than 4 vectors of 3 elements, SIMD programming prefers to work in parallel on 3 vectors of 4 `X`, `Y` and `Z` components:

vec4 getColorShuffled(vec4 allx, vec4 ally, vec4 allz) { /* Now what do we do here? */ }

One nice thing to realise is that the equivalent of our previous `step(a.z, c.z)` and `step(b.z, d.z)` tests can be done in parallel:

vec4 getColorShuffled(vec4 allx, vec4 ally, vec4 allz) { // compare a.z >? c.z and b.z >? d.z in parallel vec2 t = step(vec2(allz[0], allz[2]), vec2(allz[1], allz[3])); // choose between a and c using t[0], between b and d using t[1] vec2 twox = mix(vec2(allx[0], allx[2]), vec2(allx[1], allx[3]), t); vec2 twoy = mix(vec2(ally[0], ally[2]), vec2(ally[1], ally[3]), t); vec2 twoz = mix(vec2(allz[0], allz[2]), vec2(allz[1], allz[3]), t); // compare a.z and b.z float s = step(twoz[0], twoz[1]); // now choose between a and b using s vec2 best = vec2(mix(twox[0], twox[1], t2), mix(twoy[0], twoy[1], s)); return texture2D(tex, best); }

Wow, that’s a bit complex. And even if we’re doing two calls to `step()` instead of three, there are now five calls to `mix()` instead of three. Fortunately, thanks to swizzling, we can combine most of these calls to `mix()`:

vec4 getColorShuffledFast(vec4 allx, vec4 ally, vec4 allz) { vec2 t = step(allz.ag, allz.rb); vec4 twoxy = mix(vec4(allx.ag, ally.ag), vec4(allx.rb, ally.rb), t.xyxy); vec2 twoz = mix(allz.ag, allz.rb, t); float t2 = step(twoz.a, twoz.r); vec2 best = mix(twoxy.ag, twoxy.rb, t2); return texture2D(tex, best); }

That’s it! Only three `mix()` and two `step()` instructions. Quite a few swizzles, but these are extremely cheap on modern GPUs.

## Afterthoughts

The above transformation was at the “cost” of a big data layout change known as **array of structures to structure of arrays**. When working in parallel on similar data, it is very often a good idea, and the GPU was no exception here.

This was actually a life saver when trying to get a fallback version of a shader to work on an i915 card, where `mix` and `step` must be emulated using ALU instructions, up to a maximum of 64. The result can be seen in this NaCl plugin.