Posts for the month of November 2011

GLSL code snippet: choosing from 4 vectors by Z value

There aren’t many low-level GLSL optimisation resources out there, so I decided to share my thoughts as I work on specific parts of my code.

The basic, complete shader function

This one time I had four vec3 vectors, each holding a texture coordinate in xy and a weight in z. The code to compute the final pixel value was:

vec4 getColor(vec3 a, vec3 b, vec3 c, vec3 d)
{
    vec4 pa = texture2D(tex, a.xy) * a.z;
    vec4 pb = texture2D(tex, b.xy) * b.z;
    vec4 pc = texture2D(tex, c.xy) * c.z;
    vec4 pd = texture2D(tex, d.xy) * d.z;
    return (pa + pb + pc + pd) / (a.z + b.z + c.z + d.z);
}

That is four texture lookups, which is expensive.
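
For context, getColor() relies on a sampler named tex declared at global scope. Here is a minimal sketch of the surrounding fragment shader, assuming the four vectors arrive as varyings; the names v_a to v_d are hypothetical and not taken from the original shader. It also assumes the four weights never sum to zero, since the function divides by that sum.

```glsl
// Sketch only: the varying names are hypothetical, not from the original shader.
#ifdef GL_ES
precision mediump float;
#endif

uniform sampler2D tex; // the sampler read by getColor()

// Each varying carries a texture coordinate in xy and a weight in z.
varying vec3 v_a;
varying vec3 v_b;
varying vec3 v_c;
varying vec3 v_d;

vec4 getColor(vec3 a, vec3 b, vec3 c, vec3 d)
{
    vec4 pa = texture2D(tex, a.xy) * a.z;
    vec4 pb = texture2D(tex, b.xy) * b.z;
    vec4 pc = texture2D(tex, c.xy) * c.z;
    vec4 pd = texture2D(tex, d.xy) * d.z;
    // Weighted average; assumes the weights do not sum to zero.
    return (pa + pb + pc + pd) / (a.z + b.z + c.z + d.z);
}

void main()
{
    gl_FragColor = getColor(v_a, v_b, v_c, v_d);
}
```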

The lightweight version for coarse levels of detail

If I wanted a more lightweight fragment shader, for instance when implementing variable levels of shader complexity, I would want to do only one texture lookup, and use the vector with the largest weight:

vec4 getColorFast(vec3 a, vec3 b, vec3 c, vec3 d)
{
    if (a.z < c.z) // These two tests are
        a = c;     // likely to be run in
    if (b.z < d.z) // parallel because they use
        b = d;     // independent data.
    if (a.z < b.z)
        a = b;
    return texture2D(tex, a.xy);
}

Only one texture lookup, but three branches. Branches can be expensive on GPUs and are best avoided where possible.

Fortunately, GLSL provides step() and mix() (in HLSL or Cg, step() and lerp()) that let us do things similar to fsel on the PowerPC, or vec_sel in AltiVec: a branch-free select.

vec4 getColorFaster(vec3 a, vec3 b, vec3 c, vec3 d)
{
    a = mix(a, c, step(a.z, c.z)); // Again, potentially good
    b = mix(b, d, step(b.z, d.z)); // scheduling between these lines
    a = mix(a, b, step(a.z, b.z));
    return texture2D(tex, a.xy);
}

Excellent! Only six instructions in addition to the texture lookup.
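
The select performed by each line of getColorFaster() can be written out as a small helper. This is only a sketch, and the name selectLargerZ is hypothetical. One difference from the branching version is worth knowing: step() returns 1.0 on a tie, so the second argument wins when the weights are equal, whereas `if (a.z < c.z)` keeps the first. The arithmetic form of mix() also assumes finite inputs.

```glsl
// step(edge, x) returns 0.0 when x < edge and 1.0 otherwise.
// mix(p, q, t) returns p * (1.0 - t) + q * t,
// so t = 0.0 yields p and t = 1.0 yields q.

// Hypothetical helper: returns whichever of p and q has the larger z.
// On a tie, step() returns 1.0 and q is chosen.
vec3 selectLargerZ(vec3 p, vec3 q)
{
    float t = step(p.z, q.z); // 1.0 if q.z >= p.z, else 0.0
    return mix(p, q, t);      // branch-free select between p and q
}
```

With this helper, the body of getColorFaster() reads as three calls: one for a and c, one for b and d, and one for the two winners.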

But if you are familiar with SSE or AltiVec-style SIMD programming on the CPU, you will know this is not the usual way to do things. Rather than 4 vectors of 3 elements, SIMD programming prefers to work in parallel on 3 vectors of 4 elements: one holding all the X components, one all the Y components and one all the Z components:

vec4 getColorShuffled(vec4 allx, vec4 ally, vec4 allz)
{
    /* Now what do we do here? */
}

One nice thing to realise is that the equivalent of our previous step(a.z, c.z) and step(b.z, d.z) tests can be done in parallel:

vec4 getColorShuffled(vec4 allx, vec4 ally, vec4 allz)
{
    // compare a.z with b.z and c.z with d.z in parallel
    vec2 t = step(vec2(allz[0], allz[2]), vec2(allz[1], allz[3]));
    // choose between a and b using t[0], between c and d using t[1]
    vec2 twox = mix(vec2(allx[0], allx[2]), vec2(allx[1], allx[3]), t);
    vec2 twoy = mix(vec2(ally[0], ally[2]), vec2(ally[1], ally[3]), t);
    vec2 twoz = mix(vec2(allz[0], allz[2]), vec2(allz[1], allz[3]), t);
    // compare the z values of the two remaining candidates
    float s = step(twoz[0], twoz[1]);
    // now choose between the two candidates using s
    vec2 best = vec2(mix(twox[0], twox[1], s), mix(twoy[0], twoy[1], s));
    return texture2D(tex, best);
}

Wow, that’s a bit complex. And even if we’re doing two calls to step() instead of three, there are now five calls to mix() instead of three. Fortunately, thanks to swizzling, we can combine most of these calls to mix():

vec4 getColorShuffledFast(vec4 allx, vec4 ally, vec4 allz)
{
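    // first round: compare components (w, y) against (x, z) in parallel
    vec2 t = step(allz.ag, allz.rb);
    vec4 twoxy = mix(vec4(allx.ag, ally.ag), vec4(allx.rb, ally.rb), t.xyxy);
    vec2 twoz  = mix(allz.ag, allz.rb, t);
    // second round: compare the z values of the two remaining candidates
    float t2 = step(twoz.g, twoz.r);
    // twoxy holds (x0, x1, y0, y1): .rb is candidate 0, .ga is candidate 1
    vec2 best = mix(twoxy.ga, twoxy.rb, t2);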
    return texture2D(tex, best);
}

That’s it! Only three mix() and two step() instructions. Quite a few swizzles, but these are extremely cheap on modern GPUs.

Afterthoughts

The above transformation was at the “cost” of a big data layout change known as array of structures to structure of arrays. When working in parallel on similar data, it is very often a good idea, and the GPU was no exception here.
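
To make the layout change concrete, here is a sketch of the transposition from four (x, y, weight) vectors to one vec4 per component. The helper name toStructureOfArrays is hypothetical, and the lane order (a, b, c, d) is an assumption. In practice the data would ideally be produced in this layout upstream, because transposing inside the fragment shader costs extra moves.

```glsl
// Hypothetical helper: transposes four (x, y, weight) vectors
// (array of structures) into one vec4 per component (structure of arrays).
// Lane i of each output belongs to the same input vector;
// the order a, b, c, d is an assumption made for this sketch.
void toStructureOfArrays(vec3 a, vec3 b, vec3 c, vec3 d,
                         out vec4 allx, out vec4 ally, out vec4 allz)
{
    allx = vec4(a.x, b.x, c.x, d.x); // all X texture coordinates
    ally = vec4(a.y, b.y, c.y, d.y); // all Y texture coordinates
    allz = vec4(a.z, b.z, c.z, d.z); // all weights
}
```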

This was actually a life saver when trying to get a fallback version of a shader to work on an i915 card, where mix and step must be emulated using ALU instructions, up to a maximum of 64. The result can be seen in this NaCl plugin.

  • Posted: 2011-11-30 23:53 (Updated: 2011-12-04 11:53)
  • Author: sam
  • Categories: glsl optim