Posts by author sam

Damping with delta-time

Just a quick tip on how to convert usual damping code to something framerate-independent.

Most of us have probably, at some point, written code resembling this:

// Perform velocity damping
velocity -= velocity * 0.01f;

… or probably the more correct:

// Per-second damping coefficient
float const D = 10.0f;

// Damp velocity according to timestep
velocity -= velocity * D * delta_time;

Yet this is not fully framerate-independent; results are slightly different at 30fps and 60fps, and more importantly, spikes in the framerate cause lots of weird artifacts, causing developers to attempt to fix the situation by clamping delta_time, which is not ideal.

The exponentiation method

Here is one way to fix it: assume that the code works correctly at 60 fps. This means that each frame, velocity is effectively multiplied by 1 - D / 60.

After one second, i.e. 60 frames, velocity has been multiplied by (1 - D / 60) ^ 60.

After two seconds, it has been multiplied by (1 - D / 60) ^ (60 * 2).

After N seconds, it has been multiplied by (1 - D / 60) ^ (60 * N).

So, there, we have a formula that tells us what happens after N seconds, and it’s a continuous function. We can therefore choose N as we like, and especially N = delta_time:

// Per-second damping coefficient
float const D = 10.0f;

// Damp velocity (framerate-independent)
velocity *= pow(1.f - D / 60.f, 60.f * delta_time);

Which can be conveniently rewritten as:

// Per-second damping coefficient
float const D = 10.0f;
// Exponentiation base for velocity damping
float const D2 = pow(1.f - D / 60.f, 60.f);

// Damp velocity (framerate-independent)
velocity *= pow(D2, delta_time);

Use with lerp

The same method can be adapted to uses of linear interpolation such as this one:

// Perform velocity damping
velocity = lerp(velocity, target_velocity, K * delta_time);

Which we replace with:

// Damp velocity (framerate-independent)
velocity = lerp(velocity, target_velocity,
                1.f - pow(1.f - K / 60.f, 60.f * delta_time));

Quaternion from two vectors: the final version

I have published a first article about how to build a quaternion from two arbitrary direction vectors that will transform one of the directions into the other. That article was deliberately omitting the special case when the two vectors were facing away from each other, which required special treatment. So I wrote a second article explaining how to quickly pick an orthogonal vector.

Combining the results from the two articles should be straightforward. Here is the resulting, final function:

/* Build a unit quaternion representing the rotation
 * from u to v. The input vectors need not be normalised. */
quat quat::fromtwovectors(vec3 u, vec3 v)
{
    float norm_u_norm_v = sqrt(dot(u, u) * dot(v, v));
    float real_part = norm_u_norm_v + dot(u, v);
    vec3 w;

    if (real_part < 1.e-6f * norm_u_norm_v)
    {
        /* If u and v are exactly opposite, rotate 180 degrees
         * around an arbitrary orthogonal axis. Axis normalisation
         * can happen later, when we normalise the quaternion. */
        real_part = 0.0f;
        w = abs(u.x) > abs(u.z) ? vec3(-u.y, u.x, 0.f)
                                : vec3(0.f, -u.z, u.y);
    }
    else
    {
        /* Otherwise, build quaternion the standard way. */
        w = cross(u, v);
    }

    return normalize(quat(real_part, w.x, w.y, w.z));
}

If you know you are exclusively dealing with unit vectors, you can replace all occurrences of norm_u_norm_v with the value 1.0f in order to avoid a useless square root.

On picking an orthogonal vector (and combing coconuts)

Consider the following problem:

Given a non-zero vector v in 3D space, find a non-zero vector w orthogonal to v (i.e. such that v.w = 0).

At first sight, this seems easy. For every possible v there are an infinite number of orthogonal vectors lying in a plane:

We just need to pick one. But it’s not as trivial as it may seem.

First attempts

The following vector satisfies v.w1 = 0:

$$\mathbf{v} = \begin{bmatrix}x \\ y \\ z \end{bmatrix}
\quad
\mathbf{w_1} = \begin{bmatrix}-y \\ x \\ 0 \end{bmatrix}
\quad
\mathbf{v} . \mathbf{w_1} = x * -y + y * x + z * 0 = 0
$$

Unfortunately that w1 vector can be zero, for instance when v = [0 0 1]. The original problem requires a non-zero vector.

Let’s try another one:

$$\mathbf{v} = \begin{bmatrix}x \\ y \\ z \end{bmatrix}
\quad
\mathbf{w_2} = \begin{bmatrix}0 \\ -z \\ y \end{bmatrix}
\quad
\mathbf{v} . \mathbf{w_2} = x * 0 + y * -z + z * y = 0
$$

Again, it works most of the time, except in some cases, for instance v = [1 0 0].

How about adding them together? Certainly w1 and w2 cannot be both zero at the same time!

$$\mathbf{v} = \begin{bmatrix}x \\ y \\ z \end{bmatrix}
\quad
\mathbf{w_3} = \begin{bmatrix}-y \\ x-z \\ y \end{bmatrix}
\quad
\mathbf{v} . \mathbf{w_3} = x * -y + y * (x-z) + z * y = 0
$$

Alas, this still doesn’t work all the time: it fails with v = ![1 0 1].

Should we try harder? Isn’t there a linear combination of these that will work all the time?

Well, no. There isn’t.

You can’t comb a coconut

Sadly, the hairy ball theorem, a consequence of Brouwer’s fixed-point theorem, states that any continuous function f that maps the set of unit vectors to orthogonal vectors takes the value zero somewhere.

Whichever way you look at the sphere of unit vectors, a continuous tangent field will always have a zero point:

Most of the maths operations we usually do (adding, multiplying, taking square roots, etc.) are continuous. Fortunately, we are not limited to them: we can use a simple if() construct to create a discontinuity that will save us.

A very good implementation

Here is my personal implementation of the orthogonal vector problem. It is very good. It has excellent numerical stability. It only does one test. It is short. It is beautiful. I really hope you consider using it.

/* Always works if the input is non-zero.
 * Doesn’t require the input to be normalised.
 * Doesn’t normalise the output. */
vec3 orthogonal(vec3 v)
{
    return abs(v.x) > abs(v.z) ? vec3(-v.y, v.x, 0.0)
                               : vec3(0.0, -v.z, v.y);
}

Many implementations, such as Ogre3D’s, have additional tests or operations to perform this task, and use epsilons and normalised vectors to avoid singularities.

But our only test is actually enough: if |x|>|z|, it means the largest component in absolute value in v is either x or y. So we build a vector using x and y. Otherwise, the largest component in absolute value is either y or z. So we build a vector using y and z. That way, we ensure that the length of our returned vector never, ever comes near zero.

Whether some given code will cause inefficient branching is often unclear. In our case, it is very likely that the ternary operator will actually be branchless, with some help of the compiler and the hardware.

That said, how about we try to get rid of the ternary operator, just for fun?

Going branchless for fun

Let’s see. We had these two candidate vectors, w1 and w2, which worked almost always except when specific values of v caused them to be zero. And whatever the constant k we may pick, a vector of the form w1 + k w2 will eventually take the value zero for some value of v, too, because of the hairy ball theorem.

Now here comes the trick. Instead of using a constant k, we use a function f(x,y,z). This is what our w vector looks like:

$$\mathbf{v} = \begin{bmatrix}x \\ y \\ z \end{bmatrix}
\quad
\mathbf{w} = \begin{bmatrix}-y \\ x- z * f(x,y,z) \\ y * f(x,y,z) \end{bmatrix}
\quad
\mathbf{v} . \mathbf{w} = \cdots = 0
$$

From now I shall assume that v is a unit vector.

What can cause w to be zero, then?

One necessary condition is y = 0. When y ≠ 0 we can chose anything we want for f(x,y,z), it’ll always give us a good orthogonal vector. This restricts the problem to the y = 0 circle, giving us the useful equality x² + z² = 1.

The other condition is x = z*f(x,y,z). So if we manage to build a function f such that f(x,y,z) never equals x/z, we win.

Using x² + z² = 1 we can plot all the possible values for x/z as a function of x. It will show us, for a given x, the values that f cannot take:

The almost vertical slopes you see go to infinity upwards and downwards. As expected, this prevents us from using a continuous function: it would obviously cross one of these walls at some point.

Well, let’s try a non-continuous function, then. What are our options?

  • fmod
  • floor, ceil, round
  • fract

Here is one that I came up with and which works reasonably well:

$$f(x,y,z) = \mathrm{fract}(|x| + \frac{1}{2})$$

Look how it nicely avoids x/z:

And finally, our resulting branchless code:

/* Requires the input to be normalised.
 * Doesn’t normalise the output. */
vec3 orthogonal(vec3 v)
{
    float k = fract(abs(v.x) + 0.5);
    return vec3(-v.y, v.x - k * v.z, k * v.y);
}

I find it highly unlikely that this second version will perform better than the branching one. However, I haven’t put much thought into it and maybe someone will come up with a much better solution using the ideas presented here. Have fun!

  • Posted: 2013-09-21 11:25 (Updated: 2013-09-21 11:34)
  • Author: sam
  • Categories: maths optim
  • Comments (726)

Beautiful maths simplification: quaternion from two vectors

In this article I would like to explain the thought process behind the derivation of a widely used formula: finding a quaternion representing the rotation between two 3D vectors. Nothing really new, but hopefully a few ideas could be reused at other times.

Note: the routine presented here is incomplete on purpose. For a version that can be used in production code, see the next article instead, Quaternion from two vectors: the final version.

Naive method

A rotation is best visualised using a rotation axis and an angle. Except in degenerate cases, the rotation axis can be obtained by computing the cross product of the two original vectors:

Then the angle can be obtained using the properties of the cross product and/or the dot product:

\begin{align}
\boldsymbol{u} . \boldsymbol{v} &= |\boldsymbol{u}| . |\boldsymbol{v}| . \cos\theta \\
||\boldsymbol{u} \times \boldsymbol{v}|| &= |\boldsymbol{u}| . |\boldsymbol{v}| . |\sin\theta|
\end{align}

Since θ is always between 0 and π, we only care about the dot product. This gives us some obvious code to create the quaternion (omitting corner cases such as θ = 0 for clarity):

quat quat::fromtwovectors(vec3 u, vec3 v)
{
    float cos_theta = dot(normalize(u), normalize(v));
    float angle = acos(cos_theta);
    vec3 w = normalize(cross(u, v));
    return quat::fromaxisangle(angle, w);
}

This is naive but it works. Googling for “quaternion from two vectors” shows this forum post and this SO question where this method or a variation thereof is suggested.

Looking under the hood

Let’s have a look at what happens when building the quaternion from an axis and an angle:

$$ q = \cos\frac{\theta}{2} + \boldsymbol{i}\sin\frac{\theta}{2}v_x + \boldsymbol{j}\sin\frac{\theta}{2}v_y + \boldsymbol{k}\sin\frac{\theta}{2}v_z $$

This means the code for quat::fromaxisangle would look somewhat like this:

quat quat::fromaxisangle(float angle, vec3 axis)
{
    float half_sin = sin(0.5f * angle);
    float half_cos = cos(0.5f * angle);
    return quat(half_cos,
                half_sin * axis.x,
                half_sin * axis.y,
                half_sin * axis.z);
}

Avoiding trigonometry

If you read Iñigo Quilez’s recent article about avoiding trigonometry you’ll have probably frowned at the fact that we computed θ from cos(θ), then computed sin(θ/2) and cos(θ/2).

Indeed, it happens that there is a much simpler way to do it; the half-angle formulas from precalculus tell us the following:

\begin{align}
\sin\frac{\theta}{2} &= \sqrt{\frac{1 - \cos\theta}{2}} \\
\cos\frac{\theta}{2} &= \sqrt{\frac{1 + \cos\theta}{2}}
\end{align}

This allows us to simplify our quaternion creation code:

quat quat::fromtwovectors(vec3 u, vec3 v)
{
    float cos_theta = dot(normalize(u), normalize(v));
    float half_cos = sqrt(0.5f * (1.f + cos_theta));
    float half_sin = sqrt(0.5f * (1.f - cos_theta));
    vec3 w = normalize(cross(u, v));
    return quat(half_cos,
                half_sin * w.x,
                half_sin * w.y,
                half_sin * w.z);
}

This is pretty nice. By using well known trigonometry formulas, we got rid of all trigonometry function calls!

Avoiding square roots

It happens that we can do slightly better. Note that we normalize three vectors: u, v and cross(u, v). That’s three square roots. The thing is, we already know the norm of w through this formula:

$$||\boldsymbol{u} \times \boldsymbol{v}|| = |\boldsymbol{u}| . |\boldsymbol{v}| . |\sin\theta|$$

And we know sin(θ) from precalculus again:

$$\sin\theta = 2 \sin\frac{\theta}{2} \cos\frac{\theta}{2}$$

Also, using the fact that sqrt(a)sqrt(b) = sqrt(ab) lets us perform one less square root.

We can therefore come up with the following performance improvement:

quat quat::fromtwovectors(vec3 u, vec3 v)
{
    float norm_u_norm_v = sqrt(sqlength(u) * sqlength(v));
    float cos_theta = dot(u, v) / norm_u_norm_v;
    float half_cos = sqrt(0.5f * (1.f + cos_theta));
    float half_sin = sqrt(0.5f * (1.f - cos_theta));
    vec3 w = cross(u, v) / (norm_u_norm_v * 2.f * half_sin * half_cos);
    return quat(half_cos,
                half_sin * w.x,
                half_sin * w.y,
                half_sin * w.z);
}

Oh wait! We divide by sin(θ/2) to compute w, then we multiply by sin(θ/2) again. This means we don’t even need that variable, and we can simplify even further:

quat quat::fromtwovectors(vec3 u, vec3 v)
{
    float norm_u_norm_v = sqrt(sqlength(u) * sqlength(v));
    float cos_theta = dot(u, v) / norm_u_norm_v;
    float half_cos = sqrt(0.5f * (1.f + cos_theta));
    vec3 w = cross(u, v) / (norm_u_norm_v * 2.f * half_cos);
    return quat(half_cos, w.x, w.y, w.z);
}

This is more or less the code used by the Ogre3D engine in OgreVector3.h, except they perform an additional normalisation step on the final result. This is mathematically useless, but due to numerical stability issues, it is probably safe to do so nonetheless.

This final normalisation step is actually an opportunity to simplify the code even further.

Improving on Ogre3D

We are down to two square roots and four divisions, plus quite a few mul/adds. Depending on the platform that we are running on, it is possible to simplify even further and improve performance. For instance, on many SIMD architectures, normalising a quaternion can be very fast.

This is the code we get if we multiply every component of the quaternion by 2.f * half_cos and let normalize() do the rest of the job:

quat quat::fromtwovectors(vec3 u, vec3 v)
{
    float norm_u_norm_v = sqrt(sqlength(u) * sqlength(v));
    float cos_theta = dot(u, v) / norm_u_norm_v;
    float half_cos = sqrt(0.5f * (1.f + cos_theta));
    vec3 w = cross(u, v) / norm_u_norm_v;
    return normalize(quat(2.f * half_cos * half_cos, w.x, w.y, w.z));
}

Now half_cos only appears in its squared form, and since it comes from a square root, we can simply omit that square root:

quat quat::fromtwovectors(vec3 u, vec3 v)
{
    float norm_u_norm_v = sqrt(sqlength(u) * sqlength(v));
    float cos_theta = dot(u, v) / norm_u_norm_v;
    vec3 w = cross(u, v) / norm_u_norm_v;
    return normalize(quat(1.f + cos_theta, w.x, w.y, w.z));
}

And using the same reasoning we can multiply every quaternion component by norm_u_norm_v:

quat quat::fromtwovectors(vec3 u, vec3 v)
{
    float norm_u_norm_v = sqrt(sqlength(u) * sqlength(v));
    vec3 w = cross(u, v);
    quat q = quat(norm_u_norm_v + dot(u, v), w.x, w.y, w.z);
    return normalize(q);
}

We are still doing two square roots and four divisions, some of which are hidden in normalize(), but the code is considerably shorter now.

Final form

If u and v can be enforced to be unit vectors, norm_u_norm_v can be omitted and simply replaced with 1.0f:

quat quat::fromtwovectors(vec3 u, vec3 v)
{
    vec3 w = cross(u, v);
    quat q = quat(1.f + dot(u, v), w.x, w.y, w.z);
    return normalize(q);
}

Isn’t it beautiful, considering the sin(), cos() and acos() ridden mess we started with?

This algorithm can be found all over the Internet, but I do not know who first came up with it. Also, a lot of 3D engines (both publicly available and slightly more private) could benefit from it.

Update (06/01/2014)

In the comments below, Michael Norel provides the following improvement to the non-unit version. Since the values d = dot(u, v) and w = cross(u, v) are computed no matter what, the value sqlength(u) * sqlength(v) could be computed in a different way, *i.e.* d * d + sqlength(w). The following code does at least three multiplications less:

quat quat::fromtwovectors(vec3 u, vec3 v)
{
    vec3 w = cross(u, v);
    quat q = quat(dot(u, v), w.x, w.y, w.z);
    q.w += length(q);
    return normalize(q);
}

Also, Marc B. Reynolds notes that, in the unit version, the final normalisation factor is sqrt((1 + dot(u, v))² + sqlength(cross(u, v))) which reduces to sqrt(2 + 2 dot(u, v)) thanks to the sin² + cos² identity. It leads to the following possibly improved version:

quat quat::fromtwovectors(vec3 u, vec3 v)
{
    float m = sqrt(2.f + 2.f * dot(u, v));
    vec3 w = (1.f / m) * cross(u, v);
    return quat(0.5f * m, w.x, w.y, w.z);
}
  • Posted: 2013-09-18 18:13 (Updated: 2014-06-01 11:32)
  • Author: sam
  • Categories: maths optim
  • Comments (409)

Fast branchless RGB to HSV conversion in GLSL

Some time ago I devised an original algorithm to convert from RGB to HSV using very few CPU instructions and I wrote a small article about it.

When looking for a GLSL or HLSL conversion routine, I have found implementations of my own algorithm. However they were almost all straightforward, failing to take full advantage of the GPU’s advanced swizzling features.

So here it is, the best version I could come up with:

vec3 rgb2hsv(vec3 c)
{
    vec4 K = vec4(0.0, -1.0 / 3.0, 2.0 / 3.0, -1.0);
    vec4 p = mix(vec4(c.bg, K.wz), vec4(c.gb, K.xy), step(c.b, c.g));
    vec4 q = mix(vec4(p.xyw, c.r), vec4(c.r, p.yzx), step(p.x, c.r));

    float d = q.x - min(q.w, q.y);
    float e = 1.0e-10;
    return vec3(abs(q.z + (q.w - q.y) / (6.0 * d + e)), d / (q.x + e), q.x);
}

Update: Emil Persson suggests using the ternary operator explicitly to force compilers into using a fast conditional move instruction:

    vec4 p = c.g < c.b ? vec4(c.bg, K.wz) : vec4(c.gb, K.xy);
    vec4 q = c.r < p.x ? vec4(p.xyw, c.r) : vec4(c.r, p.yzx);

And because a lot of people get it wrong, too, here is the reverse operation in GLSL. It is the algorithm almost everyone uses (or should use):

vec3 hsv2rgb(vec3 c)
{
    vec4 K = vec4(1.0, 2.0 / 3.0, 1.0 / 3.0, 3.0);
    vec3 p = abs(fract(c.xxx + K.xyz) * 6.0 - K.www);
    return c.z * mix(K.xxx, clamp(p - K.xxx, 0.0, 1.0), c.y);
}

Porting to HLSL is straightforward: replace vec3 and vec4 with float3 and float4, mix with lerp, fract with frac, and clamp(…, 0.0, 1.0) with saturate(…).

  • Posted: 2013-07-27 15:23 (Updated: 2013-07-28 11:54)
  • Author: sam
  • Categories: glsl optim
  • Comments (704)

A fast RGB to HSV floating point conversion

The operations typically performed to convert from RGB to HSV are the following:

  • find the largest RGB component
  • find the smallest RGB component
  • compute V and S
  • select the main circular sector for H
  • compute H

Here is, to my knowledge, the most commonly used RGB to HSV routine for floating point, with an extra minor optimisation (adding 1e-20f to divisors to avoid the need to care about divisions by zero):

static void RGB2HSV(float r, float g, float b,
                    float &h, float &s, float &v)
{
    float rgb_max = std::max(r, std::max(g, b));
    float rgb_min = std::min(r, std::min(g, b));
    float delta = rgb_max - rgb_min;
    s = delta / (rgb_max + 1e-20f);
    v = rgb_max;

    float hue;
    if (r == rgb_max)
        hue = (g - b) / (delta + 1e-20f);
    else if (g == rgb_max)
        hue = 2 + (b - r) / (delta + 1e-20f);
    else
        hue = 4 + (r - g) / (delta + 1e-20f);
    if (hue < 0)
        hue += 6.f;
    h = hue * (1.f / 6.f);
}

Several things seem worth noticing already:

  • Most of the complexity comes from the hue calculation.
  • Four min/max operations are performed to find rgb_max and rgb_min; however, sorting three values can be done with only 3 comparisons. This is not necessarily problematic because min/max could be wired in an efficient way depending on the CPU.
  • Two additional tests are performed to compare r and g to rgb_max; if rgb_max and rgb_min were computed using tests, this is a waste of time to compare them again.
  • Adding 6.f to the final hue value only has a 16.6% chance of happening.

The actual hue calculation depends on how r, g, and b are ordered:

$\operatorname{Hue_{0\dots 6}}(r,g,b)=\begin{cases}
    (g - b) / (r - b), & \text{if $r \ge g \ge b$}.\\
    6 + (g - b) / (r - g), & \text{if $r \ge b \ge g$}.\\
    2 + (b - r) / (g - r), & \text{if $g \ge b \ge r$}.\\
    2 + (b - r) / (g - b), & \text{if $g \ge r \ge b$}.\\
    4 + (r - g) / (b - g), & \text{if $b \ge r \ge g$}.\\
    4 + (r - g) / (b - r), & \text{if $b \ge g \ge r$}.\\
  \end{cases}$

But let’s rewrite this in terms of x, y and z, where x is the largest of (r,g,b), z is the smallest of the three, and y is inbetween:

$\operatorname{Hue_{0\dots 6}}(R,G,B)=\begin{cases}
    (y - z) / (x - z), & \text{if $r \ge g \ge b$}.\\
    6 + (z - y) / (x - z), & \text{if $r \ge b \ge g$}.\\
    2 + (y - z) / (x - z), & \text{if $g \ge b \ge r$}.\\
    2 + (z - y) / (x - z), & \text{if $g \ge r \ge b$}.\\
    4 + (y - z) / (x - z), & \text{if $b \ge r \ge g$}.\\
    4 + (z - y) / (x - z), & \text{if $b \ge b \ge r$}.\\
  \end{cases}$

There are a lot of similarities here. We can push it even further, using the fact that x ≥ z and y ≥ z by definition:

$\operatorname{Hue_{0\dots 6}}(R,G,B)=\left|K + \dfrac{y - z}{x - z}\right|,
 \text{with $K =\begin{cases}
    0, & \text{if $r \ge g \ge b$}.\\
    -6, & \text{if $r \ge b \ge g$}.\\
    2, & \text{if $g \ge b \ge r$}.\\
    -2, & \text{if $g \ge r \ge b$}.\\
    4, & \text{if $b \ge r \ge g$}.\\
    -4, & \text{if $b \ge b \ge r$}.\\
  \end{cases}$}$

That’s actually the same calculation! Only the hue offset K changes. The idea now is the following:

  • Sort the triplet (r,g,b) using comparisons
  • Build K while sorting the triplet
  • Perform the final calculation

Putting the idea into practice gives us the following code:

static void RGB2HSV(float r, float g, float b,
                    float &h, float &s, float &v)
{
    float K = 0.f;

    if (g < b)
    {
        float tmp = g; g = b; b = tmp;
        K = -1.f;
    }

    if (r < g)
    {
        float tmp = r; r = g; g = tmp;
        K = -2.f / 6.f - K;
    }

    if (g < b)
    {
        float tmp = g; g = b; b = tmp;
        K = -K;
    }

    float chroma = r - b;
    h = fabs(K + (g - b) / (6.f * chroma + 1e-20f));
    s = chroma / (r + 1e-20f);
    v = r;
}

You can check for yourself that the values for K explicited above are properly generated by that function. There were many other ways to sort (r,g,b) but this specific one lets us do one final optimisation.

We notice that the last swap effectively changes the sign of K and the sign of g - b. Since both are then added and passed to fabs(), the sign reversal can actually be omitted.

That additional trickery gives us this final code:

static void RGB2HSV(float r, float g, float b,
                    float &h, float &s, float &v)
{
    float K = 0.f;

    if (g < b)
    {
        std::swap(g, b);
        K = -1.f;
    }

    if (r < g)
    {
        std::swap(r, g);
        K = -2.f / 6.f - K;
    }

    float chroma = r - std::min(g, b);
    h = fabs(K + (g - b) / (6.f * chroma + 1e-20f));
    s = chroma / (r + 1e-20f);
    v = r;
}

That’s 2 tests and 1 std::min call instead of the previous 3 tests and 4 std::min/max calls. We really should see some kind of performance gain here.

And as expected, benchmarks indicate a performance increase of 25 to 40 % with a great variety of CPUs, compilers and compiler flags. The following graph (average nanoseconds per conversion) is on a Core i7-2600K CPU, using g++ 4.7.2 with -O3 -ffast-math:

  • Posted: 2013-01-13 06:19 (Updated: 2013-02-11 13:48)
  • Author: sam
  • Categories: optim c++
  • Comments (823)

Announce: VsLol 1.0.0.8 released

Version 1.0.0.8 of VsLol has been released. It fixes an unfortunate crash that could be triggered when IntelliSense tried to display an autocompletion menu for a constructor call.

You can learn more about VsLol on its homepage or on the Visual Studio Gallery.

Announce: VsLol 1.0.0.5 released

Following my previous rant about the sorry state of syntax highlighting in Visual Studio 2010, I have decided to make my small add-in public.

Long story short, the add-in fixes the following issues in the code editor:

You can learn more about VsLol on its homepage or on the Visual Studio Gallery.

Update: I have been informed that VsLol works properly on Visual Studio 2012 and fixes the same issues; I have therefore uploaded version 1.0.0.7 and activated the VS2012 flags.

Fuck you, Microsoft: the sorry state of Visual Studio syntax highlighting

TL;DR: you can’t even choose a different colour for “return” and “float” in Visual Studio.

I mostly use Vim as my everyday editor. This is not the place to go through the details of why, but basically it does exactly what I want. One thing Vim does well for me is syntax highlighting. Here is a bit of Lol Engine C++ using Vim’s default colour scheme:

The colour scheme is simple here:

  • yellow for control flow
  • green for types and qualifiers
  • magenta for constants
  • light grey for everything else

You may notice that the vec3 or quat types, which are not C++ base types but which behave exactly as if, are coloured just like float. This is simply done by adding the following line to my main config file or to a separate configuration file:

au Syntax cpp syn keyword cType vec2 vec3 vec4 quat

Okay now I would like you to read that again.

  • yellow for control flow
  • green for types (including custom types) and qualifiers
  • I get all that shit by adding one single configuration line

And I am going to show you the pain it is to do the same in Visual Studio.

First Visual Studio attempt: usertype.dat

Since at least Visual Studio 2003, the usertype.dat file can be used to add a list of user types. This is still true in Visual Studio 2010. So let’s add those new types to usertype.dat:

vec2
vec3
vec4
quat

Not quite there.

M_PI is not coloured, and if I add it to the list it becomes green instead of magenta, but let’s forget about it for now.

float and const are still yellow and there is no way to fix that! Of course I thought about adding them to the list of user types, but the scanner decides that they’re C++ keywords way before it checks for user types.

Second Visual Studio attempt: a Language Service add-in

Visual Studio add-ins are very powerful. Almost any part of the editor can be modified and augmented. There are hundreds of add-ins available and you can write your own.

This looks promising: you can write your own language handler, but you can certainly modify an existing one quite easily (or so I thought).

To handle a new language, you need to subclass the LanguageService class and you register that new class in Visual Studio. One of the methods you need to implement in the language service is GetScanner:

class MyLanguageService : LanguageService
{
    [...]

    public override IScanner GetScanner(IVsTextLines buffer)
    {
        return new MyScanner(buffer);
    }
}

And the IScanner interface has a ScanTokenAndProvideInfoAboutIt method that is responsible for choosing the colour of tokens:

class MyScanner : IScanner
{
    [...]

    bool IScanner.ScanTokenAndProvideInfoAboutIt(TokenInfo tokeninfo, ref int state)
    {
        [...]

        if (FoundKeyword)
        {
            tokeninfo.Color = TokenColor.Keyword;
            return true;
        }

        [...]
    }
}

This is brilliant. I mean, with such an architecture, you can implement your own sublanguage, add colour schemes, modify whatever you want. Honestly, this is perfect for my needs.

So here is the plan:

  1. find the LanguageService class responsible for parsing C++
  2. inherit from it and reimplement GetScanner() to use our own IScanner
  3. in our version of ScanTokenAndProvideInfoAboutIt, just call the C++ scanner’s version of ScanTokenAndProvideInfoAboutIt and inspect tokeninfo
  4. if the token info says this is a keyword, match that keyword with our list of types, and change its colour if necessary
  5. register our new language service with a higher priority

This sounds pretty simple and elegant. It’s some kind of two-level proxy pattern.

Except it has absolutely no chance to work. Because there is no language service class for C++.

That’s right. Visual Studio does not use its own advertised architecture to handle the C++ language. If you want to slightly change the behaviour of the C++ language service, you need to fully reimplement it. And by fully reimplement, that means fully, even the completion stuff for IntelliSense.

Third Visual Studio attempt: a Classifier add-in

A classifier add-in differs from a regular language add-in in that it only affects the text that is being displayed. It has no knowledge of the language syntax or structure or what the underlying parser has analysed, but it does know about what the underlying classifier did. For instance, it doesn't know whether a given chunk of text is a C-style or a C++-style comment, but it does know that it was classified as "comment".

This proved to be the correct thing to use! My Visual Studio colour scheme now looks a lot more like my Vim setup:

There are still limitations, but it's a good start. When another plugin comes in and has higher priority, it undoes everything my add-in did, which is arguably the fault of those other plugins, but I believe the lack of a properly pluggable architecture is definitely the issue.

Further thoughts

I know this is a rant, but I will nonetheless add my own constructive information here, as well as anything readers may wish to contribute.

There are other paths I have not explored yet:

  • disassemble the Visual Studio DLLs

I am pretty sure people will suggest that I use VAX (Visual Assist X). I am already using it. I am even a paying customer. In fact I asked for that feature more than three years ago and was more or less ignored (the only answer I got was about a minor point where the person thought I was wrong — I wasn’t). While most of the bugs I reported against VAX were fixed, I have a problem with their stance on accessibility, illustrated by their attitude on this bug. My general feeling is that VAX is a pathetic, slow and annoying piece of crap. The only reason I do not rant more about it is that I know how painful it is to write Visual Studio extensions.

I asked for advice on StackOverflow but since my problem is very specific and probably has no solution, it’s not surprising that I haven’t got any answers yet.

Someone wanted to extend the syntax colouring but was told that apparently “this can't be accomplished with an add-in”, “you may be looking at implementing a full language service to provide this feature” and “Todays language services are currently not architected to be extensible”. One suggestion was to replace a whole COM object using only its CLSID.

Another person wanted to leverage existing language services from within Visual Studio in order to use it for his own language, and was told it was not possible. The workaround mentioned in that thread involves creating a whole new virtual project that would mirror files, hide them, rename them to .c or .h, and analyse the result.

Conclusion

Honestly, the only reasons I still use Visual Studio are:

  • I use it at work
  • a lot of people use it and I need to provide them with a usable environment
  • there’s no other acceptable way to develop for the Xbox 360

But given how it sucks, and has sucked for years, and made my life miserable, and how some of the bugs I have reported back in 1997 are still present, I can only hope that this pathetic piece of crap either becomes opensource (wishful thinking) or just dies and we get something really extensible instead.

The stolen bytes: Visual Studio, virtual methods and data alignment

This article describes a design choice in the C++ ABI of the Visual Studio compiler that I believe should be considered a bug. I propose a trivial workaround at the end.

TL;DR — if the topmost polymorphic class in a hierarchy has members with alignment requirement N where N > sizeof(void *), the Visual Studio compiler may add up to N bytes of useless padding to your objects.

Update: be sure to read the explanation by Jan Gray, who designed the relevant part of the MS C++ ABI some 22 years ago, in the comments section below.

My colleague Benlitz first hit the problem when trying to squeeze memory out of some of our game’s most often instantiated classes. I think it is best illustrated with the following minimal example:

class Foo
{
    virtual void Hello() {}

    float f;     /* 4 bytes */
};
class Bar
{
    virtual void Hello() {}

    float f;     /* 4 bytes */
    double d;    /* 8 bytes */
};

This is the size of Foo and Bar on various 32-bit platforms:

Platform sizeof(Foo) sizeof(Bar) Madness?
Linux x86 (gcc) 8 16 no
Linux ARMv9 (gcc) 8 16 no
Win32 (gcc) 8 16 no
Win32 (Visual Studio 2010) 8 24 yes
Xbox 360 (Visual Studio 2010) 8 24 yes
PlayStation 3 (gcc) 8 16 no
PlayStation 3 (SNC) 8 16 no
Mac OS X x86 (gcc) 8 16 no

There is no trick. This is by design. The Visual Studio compiler is literally stealing 8 bytes from us!

What the fuck is happening?

This is the memory layout of Foo on all observed platforms:

\begin{tabular}{|r|llll|llll|}
\hline
byte & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7 \\
\hline
field & \multicolumn{4}{|c|}{\textit{vfptr}}
      & \multicolumn{4}{|c|}{\texttt{float f;}} \\
\hline
\end{tabular}

The vfptr field is a special pointer to the vtable. The vtable is probably the most widespread compiler-specific way to implement virtual methods. Since all the platforms studied here are 32-bit, this pointer requires 4 bytes. A float requires 4 bytes, too. The total size of the class is therefore 8 bytes.

This is the memory layout of Bar on eg. Linux using GCC:

\begin{tabular}{|r|llll|llll|llllllll|}
\hline
byte & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7
     & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\
\hline
field & \multicolumn{4}{|c|}{\textit{vfptr}}
      & \multicolumn{4}{|c|}{\texttt{float f;}}
      & \multicolumn{8}{|c|}{\texttt{double d;}} \\
\hline
\end{tabular}

The double type has an alignment requirement of 8 bytes, which makes it fit perfectly at byte offset 8.

And finally, this is the memory layout of Bar on Win32 using Visual Studio 2010:

\begin{tabular}{|r|llll|llll|llll|llll|llllllll|}
\hline
byte & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7
     & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15
     & 16 & 17 & 18 & 19 & 20 & 21 & 22 & 23 \\
\hline
field & \multicolumn{4}{|c|}{\textit{vfptr}}
      & \multicolumn{4}{|c|}{\textit{padding}}
      & \multicolumn{4}{|c|}{\texttt{float f;}}
      & \multicolumn{4}{|c|}{\textit{padding}}
      & \multicolumn{8}{|c|}{\texttt{double d;}} \\
\hline
\end{tabular}

This is madness! The requirement for the class to be 8-byte aligned causes the first element of the class to be 8-byte aligned, too! I demand a rational explanation for this design choice.

The problem is that the compiler decides to add the vtable pointer after it has aligned the class data, resulting in excessive realignment.

Compilers affected

The Visual Studio compilers for Win32, x64 and Xbox 360 all appear to create spurious padding in classes.

Though this article focuses on 32-bit platforms for the sake of simplicity, 64-bit Windows is affected, too.

The problem becomes even worse with larger alignment requirements, for instance with SSE3 or AltiVec types that require 16-byte storage alignment such as _FP128:

class Quux
{
    virtual void Hello() {}

    float f;     /* 4 bytes */
    _FP128 dd;   /* 16 bytes */
};

This is the GCC memory layout on both 32-bit and 64-bit platforms:

\begin{tabular}{|r|c|c|c|c|c|c|c|c|}
\hline
byte & 0--3 & 4--7 & 8--11 & 12--15
     & 16--19 & 20--23 & 24--27 & 28--31 \\
\hline
\hline
field (32-bit) & \textit{vfptr}
                & \texttt{float f;}
                & \multicolumn{2}{|c|}{\textit{padding}}
                & \multicolumn{4}{|c|}{\texttt{\_FP128 dd;}} \\
\hline
field (64-bit) & \multicolumn{2}{|c|}{\textit{vfptr}}
               & \texttt{float f;}
               & \textit{padding}
               & \multicolumn{4}{|c|}{\texttt{\_FP128 dd;}} \\
\hline
\end{tabular}

The padding there is perfectly normal and expected, because of the alignment requirements for dd.

But this is how Visual Studio decides to lay it out:

\begin{tabular}{|r|c|c|c|c|c|c|c|c|c|c|c|c|}
\hline
byte & 0--3 & 4--7 & 8--11 & 12--15
     & 16--19 & 20--23 & 24--27 & 28--31
     & 32--35 & 36--39 & 40--43 & 44--47 \\
\hline
\hline
field (32-bit) & \textit{vfptr}
               & \multicolumn{3}{|c|}{\textit{padding}}
               & \texttt{float f;}
               & \multicolumn{3}{|c|}{\textit{padding}}
               & \multicolumn{4}{|c|}{\texttt{\_FP128 dd;}} \\
\hline
field (64-bit) & \multicolumn{2}{|c|}{\textit{vfptr}}
               & \multicolumn{2}{|c|}{\textit{padding}}
               & \texttt{float f;}
               & \multicolumn{3}{|c|}{\textit{padding}}
               & \multicolumn{4}{|c|}{\texttt{\_FP128 dd;}} \\
\hline
\end{tabular}

That is 16 lost bytes, both on 32-bit and 64-bit versions of Windows.

Workaround

There is fortunately a workaround if you want to get rid of the useless padding. It is so trivial that it actually makes me angry that the problem exists in the first place.

This will get you your bytes back:

class EmptyBase
{
protected:
    virtual ~EmptyBase() {}
};

class Bar : public EmptyBase
{
    virtual void Hello() {}

    float f;     /* 4 bytes */
    double d;    /* 8 bytes */
};

And this is the size of Bar on the same 32-bit platforms:

Platform sizeof(Bar)
Linux x86 (gcc) 16
Linux ARMv9 (gcc) 16
Win32 (gcc) 16
Win32 (Visual Studio 2010) 16
Xbox 360 (Visual Studio 2010) 16
PlayStation 3 (gcc) 16
PlayStation 3 (SNC) 16
Mac OS X x86 (gcc) 16

Phew. Sanity restored.

\begin{tabular}{|r|cccc|cccc|cccccccc|}
\hline
byte & 0 & 1 & 2 & 3 & 4 & 5 & 6 & 7
     & 8 & 9 & 10 & 11 & 12 & 13 & 14 & 15 \\
\hline
\texttt{EmptyBase} fields & \multicolumn{4}{|c|}{\textit{vfptr}}
                          & \multicolumn{12}{|c|}{} \\
\hline
\texttt{Bar} fields & \multicolumn{4}{c}{\texttt{EmptyBase}}
                    & \multicolumn{4}{|c|}{\texttt{float f;}}
                    & \multicolumn{8}{|c|}{\texttt{double d;}} \\
\hline
\end{tabular}

The compiler is a lot less confused now: it no longer has to create space for a vfptr in Bar since it is technically already part of EmptyBase.

Conclusion

Lessons learned:

  • The pointer to the vtable isn’t just like any other pointer.
  • Various C++ ABIs have different stances on padding and alignment.
  • Inheriting from an empty abstract class can make your objects smaller on Windows and Xbox 360!
  • Design decisions can haunt you for decades!

The workaround is so simple that it sounds like a good idea to always use it, preemptively.

Setting up a real Compose key on Mac OS X

The Compose key is my method of choice for international character input. It lets me type characters as diverse as é Â ẃ ṗ § … « ¿ ¥ ¹ ½ © × using simple and intuitive key combinations.

How intuitive exactly? Let’s see:

  • Compose + C + , gives me Ç.
  • Compose + l + / gives me ł.
  • Compose + < + < gives me «.
  • Compose + - + > gives me →.
  • Compose + < + 3 gives me ♥.
  • Compose + C + C + C + P gives me ☭ (no kidding).

You can of course set up your own rules. This shit is so powerful that I cannot imagine I could ever use any other input method.

So, one would think that with all its glorious Unix heritage, Mac OS X would let you get the most out of your keyboard like the good old X11 system does. Well it turns out it’s possible, but not straightforward.

Fortunately, other people already did all the work. I will just indicate how to put their stuff together.

Step 1: choose a Compose key

Choose the Compose key so that it is easily accessible but does not prevent you from doing anything you ordinarily do. Fortunately, modern keyboards come with more and more idiotic and useless keys.

I use the Right Alt key as my Compose key. I already have a Left Alt key so the right one is a bit useless to me. And it somehow matches the position of the Compose key on old Sun keyboards.

That would be Right Option on a Mac keyboard. I recommend that.

Step 2: remap the Compose key

The problem is that the Mac OS X keyboard preferences:

  • do not let you differentiate between Left and Right Option keys
  • only let you remap modifier keys to another modifier key (or to nothing)

Fortunately, there is KeyRemap4MacBook that lets you do very low level things with your keyboard. Install it.

We will now remap our compose key to something that the next layer will understand. I chose Shift-Control-F13 for that. It is very unlikely you will need that key combination.

In the file ~/Library/Application Support/KeyRemap4MacBook/private.xml put the following:

<?xml version="1.0"?>
<root>
  <item>
    <name>Send Shift-Ctrl-F13 for Right Option</name>
    <identifier>private.send_shift_ctrl_f13_for_ropt</identifier>
    <autogen>__KeyToKey__ KeyCode::OPTION_R,
                          KeyCode::F13, ModifierFlag::SHIFT_L
                                      | ModifierFlag::CONTROL_L
    </autogen>
  </item>
</root>

Finally, from the System Preferences, open the KeyRemap4MacBook settings and click on the ReloadXML button:

The new option should appear. Activate it:

Step 3: create compose bindings

The last step is the creation of the actual bindings. I chose to import the rules from /usr/share/X11/locale/en_US.UTF-8/Compose on my Debian system.

Bob Kåres wrote a script that lets you convert X11 compose rules into Cocoa key bindings.

Either convert a Compose file of your own using Bob’s script, or download my DefaultKeyBinding.dict. Save it in ~/Library/KeyBindings/DefaultKeyBinding.dict.

Be careful: by default Bob’s script uses F13 instead of Shift-Ctrl-F13 so in DefaultKeyBinding.dict you need to change:

    "\UF710"

into:

    "^$\UF710"

If for some reason you decided to go for another combination, check out this article by Xah Lee to find out the proper syntax.

Step 4: restart all applications

And that’s it! Your Mac OS X system is now slightly more usable.

  • Posted: 2012-06-18 08:18 (Updated: 2015-09-05 19:38)
  • Author: sam
  • Categories: a11y osx tip
  • Comments (511)

Official IRC channel

The official IRC channel for Lol Engine developers and users is now #LOL on irc.oftc.net.

Anything related to the Lol Engine usage or development is on-topic, but also game development, rendering, maths, physics, though of course there are better channels for that.

You can access #LOL directly using the following link: irc://irc.oftc.net/#LOL.

LINK : fatal error LNK1104: cannot open file 'XAPID.lib'

Ever got a link error for a library that was referenced nowhere in your Visual Studio project or even in the final link.exe command line? Here's a hint: check the contents of static libraries, too. They may be pulling unexpected dependencies behind your back!

If the static library is part of your solution, here is another hint: check that the [Configuration Properties] >> [C/C++] >> [Code Generation] >> [Runtime Library] configuration values match across projects.

Beyond De Bruijn: fast binary logarithm of a 10-bit number

Recently I needed a method for retrieving the binary logarithm of a 10-bit number (for the curious, it was for the purpose of converting between 32-bit and 16-bit floating point numbers).

Computing the binary logarithm is equivalent to knowing the position of the highest order set bit. For instance, log2(0x1) is 0 and log2(0x100) is 8.

One well known method for fast binary logarithm is presented at Bit Twiddling Hacks. It is a two-step method where first all lower bits are set to 1 and then a De Bruijn-like sequence is used to perform a table lookup:

int fastlog2(uint32_t v)
{
    static const int MultiplyDeBruijnBitPosition[32] = 
    {
        0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
        8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
    };

    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;

    return MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27];
}

That is 12 integer operations and a table lookup.

Optimising

It should be obvious what the sequence of operations on v does: fill the integer with ones starting from the highest order bit. Here are a few examples of what happens to v at each step:

v v |= v >> 1 v |= v >> 2 v |= v >> 4 v |= v >> 8 v |= v >> 16
0x0001 0x0001 0x0001 0x0001 0x0001 0x0001
0x0002 0x0003 0x0003 0x0003 0x0003 0x0003
0x0003 0x0003 0x0003 0x0003 0x0003 0x0003
0x0004 0x0006 0x0007 0x0007 0x0007 0x0007
0x0100 0x0180 0x01e0 0x01fe 0x01ff 0x01ff
0x80000000 0xc0000000 0xf0000000 0xff000000 0xffff0000 0xffffffff

There is one obvious optimisation available: since the input is only 10-bit, the last shift operation v |= v >> 16 can be omitted because the final value was already reached.

int fastlog2(uint32_t v)
{
    static const int MultiplyDeBruijnBitPosition[32] = 
    {
        0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
        8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
    };

    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;

    return MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27];
}

10 instructions instead of 12. Not really amazing, but worth mentioning.

Optimising more?

Could we do better? Now the last line is v |= v >> 8; and it is only useful to propagate the 9th and 10th bits to positions 1 and 2. What happens if we omit that line? Let’s see:

  • For most values of v, the expected value is obtained.
  • For values of v with a highest order bit at 9th position, 0x1fe could be obtained instead of 0x1ff.
  • For values of v with a highest order bit at 10th position, one of 0x3fc, 0x3fd or 0x3fe could be obtained instead of 0x3ff.

The list of possible output values would therefore be 0x1, 0x3, 0x7, 0xf, 0x1f, 0x3f, 0x7f, 0xff, 0x1fe, 0x1ff, 0x3fc, 0x3fd, 0x3fe, 0x3ff. What happens to these values when multiplying them with the De Bruijn sequence? Let's see:

v v * 0x07C4ACDDU) >> 27
0x1 0
0x3 2
0x7 6
0xf 14
0x1f 30
0x3f 29
0x7f 27
0xff 23
0x1fe 15
0x1ff 16
0x3fc 30
0x3fd 31
0x3fe 0
0x3ff 1

Damn! Two values are colliding. It looks like we cannot omit the last line after all.

Beyond De Bruijn

Let’s give another try at the problem. Usually De Bruijn sequences are built using either nontrivial algorithms, or brute force. Maybe we could find another sequence that has no collision? Or a sequence that is not a De Bruijn sequence but that works for our problem?

Well, let’s just brute force!

(2 seconds later)

int fastlog2(uint32_t v)
{
    static const int MagicTable[16] = 
    {
        0, 1, 2, 8, -1, 3, 5, 9, 9, 7, 4, -1, 6, -1, -1, -1
    };

    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;

    return MagicTable[(uint32_t)(v * 0x5a1a1a2u) >> 28];
}

Down to 8 instructions instead of 12. And the lookup table is now half the size!

Conclusion

It is possible for multiply-and-shift techniques similar to the De Bruijn sequence algorithm to exist for a larger set of problems. Brute forcing the search is a totally valid method for 32-bit multiplication.

The code used for this article is included in the attached file.

Maths trick: doing fewer comparisons

Note: this is not an optimisation. It is just one more tool you should have in your toolbox when looking for optimisations. It may be useful.

This is the trick:

\[\min(x,y) = \dfrac{x + y - |x - y|}{2}\]
\[\max(x,y) = \dfrac{x + y + |x - y|}{2}\]

You can check for yourself that it is always true: when x > y, |x - y| is the same as x - y, etc.

What good is it for? There is often an implicit comparison in min or max. It might be interesting to replace it with a call to the branchless fabs.

Example usage

Consider the following code:

float a, b, c, d;
/* ... */
return (a > b) && (c > d);

That kind of code is often used eg. in collision checks, where a lot of tests can be done. This code does two comparisons. On some architectures, this means two branches. Not always something you want.

The test condition is equivalent to:

(a - b > 0) && (c - d > 0)

Now when are two given numbers both positive? That is if and only if the smallest is positive:

min(a - b, c - d) > 0

We may now use our trick:

(a - b) + (c - d) - |(a - b) - (c + d)| > 0

And so the code could be rewritten as such:

float a, b, c, d;
/* ... */
return (a - b) + (c - d) > fabsf((a - b) - (c - d));

We basically replaced the additional test with a call to fabsf and some additions/subtractions. It may be possible to reorganise the input data so that this second version performs better.

C++ trick: selectively restrict implicit conversions

TL;DR: given a class Foo with an implicit constructor from int, how to allow the implicit conversion in f(42); but not in g(42); where both f and g take a Foo const & argument?

Background

So I have this real class that performs numeric operations that I want use just like any other C++ numeric type. For instance, I can write the following:

float f = 15, g = 3.5;
int x = f / g;

If I decide that I need double precision, I can write:

double f = 15, g = 3.5;
int x = f / g;

And of course, using my real class for even higher precision:

real f = 15, g = 3.5;
int x = f / g;

I like that. I can just write code as usual, and when I need higher precision, I use real instead of double. It's transparent and convenient.

Implementation example

Here is a highly simplified example of a real class:

struct real
{
    inline real(double d) : m_value(d) {}
    inline operator int() const { return (int)m_value; }
    /* ... */
    long double m_value;
};

It is possible to write real f = 15 because of the implicit constructor. Actually, C++ constructors are always implicit unless specified otherwise.

It is possible to write int x = f / g because of the conversion operator.

So far, so good.

The problem with implicit promotion

Here is how fabs could be implemented:

real fabs(real const &r)
{
    return real(r.m_value < 0 ? -r.m_value : r.m_value);
}

But now we have a problem. A subtle problem. Consider the following code:

double x = fabs(-5.0);

What does this do? Well, it depends. It depends whether <cmath> was included or not. Because if <cmath> wasn’t included, then that code is going to automatically promote -5.0 to a real and call our custom function instead of the one provided by the math library! With no compile-time warning!

This is confusing. It should not happen. But it is a well known problem and there are several obvious workarounds:

  1. What most professional C++ programmers will tell you: use namespaces
  2. Mark the real(int) constructor explicit

The problem with 1. is that I am not a professional C++ programmer. I am a C programmer who uses C++. I use preprocessor macros and printf and memalign and goto. Try and stop me!

The problem with 2. is that I can no longer write real f = 15, I would need real f(15) or real f = real(15) instead. This is not acceptable, I want real to behave exactly like float and others, to the fullest extent of what the language allows.

Another solution

Fortunately, the C++ standard has a solution for us: “Implicit conversions will be performed [...] if the parameter type contains no template-parameters that participate in template argument deduction” (ISO/IEC 14882:1998, section 14.8.1.4). You cannot have implicit conversion and template argument deduction at the same time.

It means we just have to make fabs a template function! Which means making real a template class, too.

A quick way to fix real would be:

/* N is unused */
template<int N> struct real_base
{
    inline real_base(double d) : m_value(d) {}
    inline operator int() const { return (int)m_value; }
    /* ... */
    long double m_value;
};

typedef real_base<0> real;

The template argument is useless, unfortunately. It will just have to be here, forever. But who knows, you might find a use for it one day.

And to fix fabs:

/* A generic template declaration is needed */
template<int N> real_base<N> fabs(real_base<N> const &r);

/* Here we just add template<> to the previous version */
template<>
real fabs(real const &r)
{
    return real(r.m_value < 0 ? -r.m_value : r.m_value);
}

So, what happens with double x = fabs(-5.0); when we forget to include <cmath> now? Well, here is what GCC says:

In function ‘int main()’:
error: no matching function for call to ‘fabs(double)’
note: candidate is:
note: template<int N> real_base<N> fabs(const real_base<N>&)

It seems we’ve successfully managed to avoid the problematic implicit conversion, yet still allow it in places where it was useful!

So what is the rule? It’s simple: where implicit conversion should not be allowed, make the function a specialised template function.

  • Posted: 2012-02-08 22:14 (Updated: 2012-02-09 00:31)
  • Author: sam
  • Categories: code tip c++
  • Comments (408)

C/C++ trick: static string hash generation

I am always interested in having the compiler do more things for me, without giving away code clarity or performance for the convenience. Today a colleague linked me to Pope Kim's Compile-Time Hash String Generation article which is a perfect example of the things I like: hidden syntactic sugar that does useful things.

Inline hash function

The goal: for a given hash function, write something like HASH_STRING("funny_bone") in the code, and have the compiler directly replace it with the result, 0xf1c6fd7f.

The solution: inline the function and hope that the compiler will be clever enough.

#include <string.h>
#define HASH(str) generateHash(str, strlen(str))

inline unsigned int generateHash(const char *string, size_t len)
{
    unsigned int hash = 0;
    for(size_t i = 0; i < len; ++i)
        hash = 65599 * hash + string[i];
    return hash ^ (hash >> 16);
}

Unfortunately Pope ran into several very problematic issues:

  • requires heavy optimisation flags (/O2 with Visual Studio, -O3 with g++)
  • limited to 10-character strings with Visual Studio
  • limited to 17-character strings with g++

I could personally reproduce the g++ limitations. I believe they are more related to loop unrolling limits than to the actual string size, but they indeed make the technique unusable in practice.

Macro-based hash function

If you read my previous article about C/C++ preprocessor LUT generation, you may remember that it used preprocessor tricks to do loop unrolling.

Hence the following implementation:

#include <string.h>
#include <stdint.h>
#include <stdio.h>

#define H1(s,i,x)   (x*65599u+(uint8_t)s[(i)<strlen(s)?strlen(s)-1-(i):strlen(s)])
#define H4(s,i,x)   H1(s,i,H1(s,i+1,H1(s,i+2,H1(s,i+3,x))))
#define H16(s,i,x)  H4(s,i,H4(s,i+4,H4(s,i+8,H4(s,i+12,x))))
#define H64(s,i,x)  H16(s,i,H16(s,i+16,H16(s,i+32,H16(s,i+48,x))))
#define H256(s,i,x) H64(s,i,H64(s,i+64,H64(s,i+128,H64(s,i+192,x))))

#define HASH(s)    ((uint32_t)(H256(s,0,0)^(H256(s,0,0)>>16)))

It has the following properties:

  • works in C in addition to C++
  • strings are always optimised away by gcc or g++ (but not always the computation itself)
  • hash computation is optimised away by gcc or g++ even with -O, -O1 or -Os
  • string size limit is 256 characters (probably more than enough for most uses) and can be manually increased or decreased

The following code:

int main(void)
{
    printf("%08x\n", HASH("funny_bone"));
    printf("%08x\n", HASH("incredibly_large_string_that_gcc_groks_easily"));
}

Is (correctly) optimised to this with gcc -Os:

  ...
  movl    $-238617217, %esi
  movl    $.LC0, %edi
  xorl    %eax, %eax
  call    printf
  movl    $-453669173, %esi
  movl    $.LC0, %edi
  xorl    %eax, %eax
  call    printf
  ...

I haven't tested it with Visual Studio. Feedback from this compiler would be very appreciated!

  • Posted: 2012-01-12 18:05 (Updated: 2012-01-12 18:07)
  • Author: sam
  • Categories: code tip
  • Comments (412)

Announce: LolRemez 0.2 released

A new version of our high precision polynomial approximation solver, LolRemez 0.2, is available.

The changes, taking into account all the feedback users provided, are as follows:

  • significant performance and accuracy improvements thanks to various bugfixes and a better extrema finder for the error function.
  • user can now define accuracy of the final result.
  • exp, sin, cos and tan are now about 20% faster.
  • multiplying a real number by an integer power of two is now a virtually free operation.
  • fixed a rounding bug in the real number printing routine.

You can visit the software homepage to download LolRemez and, more importantly, the comprehensive documentation featuring a step-by-step tutorial.

Announce: LolRemez 0.1 released

In my previous article about the Remez exchange algorithm I said I was working on a Remez exchange toolkit for everyone to play with. Though it’s far from being full-featured, I already use it and I believe it is already extremely useful. So I decided to release LolRemez 0.1 to the masses.

You can visit the software homepage to download LolRemez and, more importantly, the comprehensive documentation featuring a step-by-step tutorial.

Better function approximations: Taylor vs. Remez

You may have once crossed this particular piece of magic:

$\sin(a) = a - \dfrac{a^3}{3!} + \dfrac{a^5}{5!} - \dfrac{a^7}{7!} + \dfrac{a^9}{9!} + \dots$

The right part is the Taylor series of sin around 0. It converges very quickly to the actual value of sin(a). This allows a computer to compute the sine of a number with arbitrary precision.

And when I say it’s magic, it’s because it is! Some functions, called the entire functions, can be computed everywhere using one single formula! Other functions may require a different formula for different intervals; they are the analytic functions, a superset of the entire functions. In general, Taylor series are an extremely powerful tool to compute the value of a given function with very high accuracy, because for several common functions such as sin, tan or exp the terms of the series are easy to compute and, when implemented on a computer, can even be stored in a table at compile time.

Approximating sin with Taylor series

This is how one would approximate sin using 7 terms of its Taylor series on the [-π/2, π/2] interval. The more terms, the better the precision, but we’ll stop at 7 for now:

static double taylorsin(double x)
{
    static const 
    double a0 =  1.0,
           a1 = -1.666666666666666666666666666666e-1,  /* -1/3! */
           a2 =  8.333333333333333333333333333333e-3,  /*  1/5! */
           a3 = -1.984126984126984126984126984126e-4,  /* -1/7! */
           a4 =  2.755731922398589065255731922398e-6,  /*  1/9! */
           a5 = -2.505210838544171877505210838544e-8,  /* -1/11! */
           a6 =  1.605904383682161459939237717015e-10; /*  1/13! */
    double x2 = x * x;
    return x * (a0 + x2 * (a1 + x2 * (a2 + x2
             * (a3 + x2 * (a4 + x2 * (a5 + x2 * a6))))));
}

And you may think…

/raw-attachment/blog/2011/12/14/understanding-motion-in-games/derp.png “Oh wow that is awesome! So simple for such a difficult function. Also, since I read your masterpiece about polynomial evaluation I know how to improve that function so that it is very fast!”

Well, actually, no.

/raw-attachment/blog/2011/12/14/understanding-motion-in-games/no-rage.png

If you are approximating a function over an interval using its Taylor series then either you or the person who taught you is a fucking idiot because a Taylor series approximates a function near a fucking point, not over a fucking interval, and if you don’t understand why it’s important then please read on because that shit is gonna blow your mind.

Error measurement

Let’s have a look at how much error our approximation introduces. The formula for the absolute error is simple:

$\text{E}(x) = \left|\sin(x) - \text{taylorsin}(x)\right|$

And this is how it looks like over our interval:

You can see that the error skyrockets near the edges of the [-π/2, π/2] interval.

/raw-attachment/blog/2011/12/14/understanding-motion-in-games/derp.png “Well the usual way to fix this is to split the interval in two or more parts, and use a different Taylor series for each interval.”

Oh, really? Well, let’s see the error on [-π/4, π/4] instead:

I see no difference! The error is indeed smaller, but again, it becomes extremely large at the edges of the interval. And before you start suggesting reducing the interval even more, here is the error on [-π/8, π/8] now:

I hope this makes it clear that:

  • the further from the centre of the interval, the larger the error
  • the error distribution is very unbalanced
  • the maximum error on [-π/2, π/2] is about 6.63e-10

And now I am going to show you why that maximum error value is pathetic.

A better approximation

Consider this new function:

static double minimaxsin(double x)
{
    static const
    double a0 =  1.0,
           a1 = -1.666666666640169148537065260055e-1,
           a2 =  8.333333316490113523036717102793e-3,
           a3 = -1.984126600659171392655484413285e-4,
           a4 =  2.755690114917374804474016589137e-6,
           a5 = -2.502845227292692953118686710787e-8,
           a6 =  1.538730635926417598443354215485e-10;
    double x2 = x * x;
    return x * (a0 + x2 * (a1 + x2 * (a2 + x2
             * (a3 + x2 * (a4 + x2 * (a5 + x2 * a6))))));
}

It doesn’t look very different, right? Right. The values a0 to a6 are slightly different, but the rest of the code is strictly the same.

Yet what a difference it makes! Look at this error curve:

That new function makes it obvious that:

  • the error distribution is better spread over the interval
  • the maximum error on [-π/2, π/2] is about 4.96e-14

Check that last figure again. The new maximum error isn’t 10% better, or maybe twice as good. It is more than ten thousand times smaller!!

The minimax polynomial

The above coefficients describe a minimax polynomial: that is, the polynomial that minimises a given error when approximating a given function. I will not go into the mathematical details, but just remember this: if the function is sufficiently well-suited (as sin, tan, exp etc. are), then the minimax polynomial can be found.

The problem? It’s hard to find. The most popular algorithm to find it is the Remez exchange algorithm, and few people really seem to understand how it works (or there would be a lot less Taylor series). I am not going to explain it right now. Usually you need professional math tools such as Maple or Mathematica if you want to compute a minimax polynomial. The Boost library is a notable exception, though.

But you saw the results, so stop using Taylor series. Spending some time finding the minimax polynomial is definitely worth it. This is why I am working on a Remez framework that I will make public and free for everyone to use, modify and do what the fuck they want. In the meantime, if you have functions to numerically approximate, or Taylor-based implementations that you would like to improve, let me know in the comments! This will be great use cases for me.

C/C++ trick: static lookup table generation

There are two major ways of using lookup tables (LUTs) in C/C++ code:

  • build them at runtime,
  • embed them in the code.

One major advantage of runtime initialisation is the choice between static initialisation (at program startup), or lazy initialisation (on demand) to save memory. Also, the generating code can be complex, or use information that is only available at runtime.

In the case of an embedded table, the generation cost is only at compile time, which can be very useful. Also, the compiler may take advantage of its early knowledge of the table contents to optimise code. However, quite often the content of embedded tables is abstruse and hardly useful to someone viewing the code. Usually this is due to the use of an external program for generation, sometimes in a completely different language. But the generation can also often be done using the C/C++ preprocessor.

Practical example

Consider the bit interleaving routine at Sean Eron Anderson's Bit Twiddling Hacks page (which, by the way, I recommend you read and bookmark). It uses the following LUT (shortened for brevity):

static const unsigned short MortonTable256[256] = 
{
  0x0000, 0x0001, 0x0004, 0x0005, 0x0010, 0x0011, 0x0014, 0x0015, 
  0x0040, 0x0041, 0x0044, 0x0045, 0x0050, 0x0051, 0x0054, 0x0055, 
  0x0100, 0x0101, 0x0104, 0x0105, 0x0110, 0x0111, 0x0114, 0x0115, 
  ... 32 lines in total ...
  0x5540, 0x5541, 0x5544, 0x5545, 0x5550, 0x5551, 0x5554, 0x5555
};

The MortonTable256 table has, as its name suggests, 256 elements. It was pregenerated by some external piece of code which probably looked like this:

for (int i = 0; i < 256; i++)
    MortonTable256[i] = (i & 1) | ((i & 2) << 1) | ((i & 4) << 2) | ((i & 8) << 3);

The problem with that external piece of code is that it is external. You cannot write it in this form and have it fill the table at compile time.

If you only take the output of this table, the information on how the table was created is lost. It makes it impractical to build another table that has, for instance, all values shifted one bit left. Even if such a table was created using a modified version of the above code, switching between the two tables would be a hassle unless both versions were kept between preprocessor tests.

Preprocessor iterator

Here is one way to get the best of both worlds. First, declare the following iterator macros. They can be declared somewhere in a global .h, maybe with more descriptive names:

#define S4(i)    S1((i)),   S1((i)+1),     S1((i)+2),     S1((i)+3)
#define S16(i)   S4((i)),   S4((i)+4),     S4((i)+8),     S4((i)+12)
#define S64(i)   S16((i)),  S16((i)+16),   S16((i)+32),   S16((i)+48)
#define S256(i)  S64((i)),  S64((i)+64),   S64((i)+128),  S64((i)+192)
#define S1024(i) S256((i)), S256((i)+256), S256((i)+512), S256((i)+768)

Their purpose is simple: calling eg. S16(i) will expand to S1(i), S1(i+1), …, S1(i+15). Similarly, S256(i) will call S1 with values from i to i + 255 times.

And this is how to use them in our example:

static const unsigned short MortonTable256[256] = 
{
#define S1(i) ((i & 1) | ((i & 2) << 1) | ((i & 4) << 2) | ((i & 8) << 3))
    S256(0)
#undef S1
};

That's it! The table will be built at compile time, and you get to keep the logic behind it.

A more complex example

Jeroen van der Zijp's fast half float conversions paper describes table-based methods to convert between 16-bit and 32-bit floating point values. The construction of one of the LUTs is as follows:

void generatetables(){
  for(unsigned int i=0; i<256; ++i){
    int e=i-127;
    if(e<-24){                  // Very small numbers map to zero
      basetable[i|0x000]=0x0000;
      basetable[i|0x100]=0x8000;
    } else if(e<-14){             // Small numbers map to denorms
      basetable[i|0x000]=(0x0400>>(-e-14));
      basetable[i|0x100]=(0x0400>>(-e-14)) | 0x8000;
    } else if(e<=15){             // Normal numbers just lose precision
      basetable[i|0x000]=((e+15)<<10);
      basetable[i|0x100]=((e+15)<<10) | 0x8000;
    } else if(e<128){             // Large numbers map to Infinity
      basetable[i|0x000]=0x7C00;
      basetable[i|0x100]=0xFC00;
    } else{                       // Infinity and NaN's stay Infinity and NaN's
      basetable[i|0x000]=0x7C00;
      basetable[i|0x100]=0xFC00;
    }
  }
}

And this is the compile-time version :

static uint16_t const basetable[512] =
{
#define S1(i) (((i) < 103) ? 0x0000 : \
               ((i) < 113) ? 0x0400 >> (0x1f & (113 - (i))) : \
               ((i) < 143) ? ((i) - 112) << 10 : 0x7c00)
    S256(0),
#undef S1
#define S1(i) (0x8000 | basetable[i])
    S256(0),
#undef S1
};

In this case the macro code is slightly bigger and was slightly rewritten, but is no more complicated than the original code. Note also the elegant reuse of previous values in the second half of the table.

This trick is certainly not new, but since I have found practical uses for it, I thought you may find it useful, too.

  • Posted: 2011-12-20 22:21 (Updated: 2011-12-20 22:26)
  • Author: sam
  • Categories: code tip
  • Comments (548)

Understanding basic motion calculations in games: Euler vs. Verlet

During the past month, I have found myself in the position of having to explain the contents of this article to six different persons, either at work or over the Internet. Though there are a lot of articles on the subject, it’s still as if almost everyone gets it wrong. I was still polishing this article when I had the opportunity to explain it a seventh time.

And two days ago a coworker told me the source code of a certain framework disagreed with me… The kind of framework that probably has three NDAs preventing me from even thinking about it.

Well that framework got it wrong, too. So now I’m mad at the entire world for no rational reason other than the ever occurring realisation of the amount of wrong out there, and this article is but a catharsis to deal with my uncontrollable rage.

A simple example

Imagine a particle with position Pos and velocity Vel affected by acceleration Accel. Let’s say for the moment that the acceleration is constant. This is the case when only gravity is present.

A typical game engine loop will update position with regards to a timestep (often the duration of a frame) using the following method, known as Euler integration:

Particle::Update(float dt)
{
    Accel = vec3(0, 0, -9.81); /* Constant acceleration: gravity */
    Vel = Vel + Accel * dt;    /* New, timestep-corrected velocity */
    Pos = Pos + Vel * dt;      /* New, timestep-corrected position */
}

This comes directly from the definition of acceleration:

\[a(t) = \frac{\mathrm{d}}{\mathrm{d}t}v(t)\]
\[v(t) = \frac{\mathrm{d}}{\mathrm{d}t}p(t)\]

Putting these two differential equations into Euler integration gives us the above code.

Measuring accuracy

Typical particle trajectories would look a bit like this:

These are three runs of the above simulation with the same initial values.

  • once with maximum accuracy,
  • once at 60 frames per second,
  • once at 30 frames per second.

You can notice the slight inaccuracy in the trajectories.

You may think…

“Oh, it could be worse; it’s just the expected inaccuracy with different framerate values.”

Well, no.

Just… no.

If you are updating positions this way and you do not have a really good reason for doing so then either you or the person who taught you is a fucking idiot and should not have been allowed to write so-called physics code in the first place and I most certainly hope to humbly bestow enlightenment upon you in the form of a massive cluebat and don’t you dare stop reading this sentence before I’m finished.

Why this is wrong

When doing kinematics, computing position from acceleration is an integration process. First you integrate acceleration with respect to time to get velocity, then you integrate velocity to get position.

\[v(t) = \int_0^t a(t)\,\mathrm{d}t\]
\[p(t) = \int_0^t v(t)\,\mathrm{d}t\]

The integral of a function can be seen as the area below its curve. So, how do you properly get the integral of our velocity between t and t + dt, ie. the green area below?

It’s not by doing new_velocity * dt (left image).

It’s not by doing old_velocity * dt either (middle image).

It’s obviously by doing (old_velocity + new_velocity) * 0.5 * dt (right image).

And now for the correct code

This is what the update method should look like. It’s called Velocity Verlet integration (not strictly the same as Verlet integration, but with a similar error order) and it always gives the perfect, exact position of the particle in the case of constant acceleration, even with the nastiest framerate you can think of. Even at two frames per second.

Particle::Update(float dt)
{
    Accel = vec3(0, 0, -9.81);
    vec3 OldVel = Vel;
    Vel = Vel + Accel * dt;
    Pos = Pos + (OldVel + Vel) * 0.5 * dt;
}

And the resulting trajectories at different framerates:

Further readings

“Oh wow thank you. But what if acceleration is not constant, like in real life?”

Well I am glad you asked.

Euler integration and Verlet integration are part of a family of iterative methods known as the Runge-Kutta methods, respectively of first order and second order. There are many more for you to discover and study.

  • Richard Lord did this nice and instructive animated presentation about several integration methods.
  • Glenn Fiedler also explains in this article why idiots use Euler, and provides a nice introduction to RK4 together with source code.
  • Florian Boesch did a thorough coverage of various integration methods for the specific application of gravitation (it is one of the rare cases where Euler seems to actually perform better).

In practice, Verlet will still only give you an approximation of your particle’s position. But it will almost always be a much better approximation than Euler. If you need even more accuracy, look at the fourth-order Runge-Kutta (RK4) method. Your physics will suck a lot less, I guarantee it.

Acknowledgements

I would like to thank everyone cited in this article, explicitly or implicitly, as well as the commenters below who spotted mistakes and provided corrections or improvements.

GLSL code snippet: choosing from 4 vectors by Z value

There aren’t many low-level GLSL optimisation resources out there, so I decided that I would share my thoughts when working on some specific parts of my code.

The basic, complete shader function

This one time I had four vec3 vectors, with an xy texture coordinate, and a weight stored in z. The code to compute the final pixel value was:

vec4 getColor(vec3 a, vec3 b, vec3 c, vec3 d)
{
    vec4 pa = texture2D(tex, a.xy) * a.z;
    vec4 pb = texture2D(tex, b.xy) * b.z;
    vec4 pc = texture2D(tex, c.xy) * c.z;
    vec4 pd = texture2D(tex, d.xy) * d.z;
    return (pa + pb + pc + pd) / (a.z + b.z + c.z + d.z);
}

That is four texture lookups, which is expensive.

The lightweight version for coarse levels of detail

If I wanted a more lightweight fragment shader, for instance when implementing variable levels of shader complexity, I would want to do only one texture lookup, and use the vector with the largest weight:

vec4 getColorFast(vec3 a, vec3 b, vec3 c, vec3 d)
{
    if (a.z < c.z) // These two tests are
        a = c;     // likely to be run in
    if (b.z < d.z) // parallel because they use
        b = d;     // independent data.
    if (a.z < b.z)
        a = b;
    return texture2D(tex, a.xy);
}

Only one texture lookup, but three branches. Branches are expensive and should be avoided.

Fortunately, GLSL provides step() and mix() (in HLSL or Cg, step() and lerp()) that let us do things similar to fsel on the PowerPC, or vec_sel in AltiVec: a branch-free select.

vec4 getColorFaster(vec3 a, vec3 b, vec3 c, vec3 d)
{
    a = mix(a, c, step(a.z, c.z)); // Again, potentially good
    b = mix(b, d, step(b.z, d.z)); // scheduling between these lines
    a = mix(a, b, step(a.z, b.z));
    return texture2D(tex, a.xy);
}

Excellent! Only six instructions in addition to the texture lookup.

But if you are familiar with SSE or AltiVec-style SIMD programming on the CPU, you will know this is not the usual way to do. Rather than 4 vectors of 3 elements, SIMD programming prefers to work in parallel on 3 vectors of 4 X, Y and Z components:

vec4 getColorShuffled(vec4 allx, vec4 ally, vec4 allz)
{
    /* Now what do we do here? */
}

One nice thing to realise is that the equivalent of our previous step(a.z, c.z) and step(b.z, d.z) tests can be done in parallel:

vec4 getColorShuffled(vec4 allx, vec4 ally, vec4 allz)
{
    // compare a.z >? c.z and b.z >? d.z in parallel
    vec2 t = step(vec2(allz[0], allz[2]), vec2(allz[1], allz[3]));
    // choose between a and c using t[0], between b and d using t[1]
    vec2 twox = mix(vec2(allx[0], allx[2]), vec2(allx[1], allx[3]), t);
    vec2 twoy = mix(vec2(ally[0], ally[2]), vec2(ally[1], ally[3]), t);
    vec2 twoz = mix(vec2(allz[0], allz[2]), vec2(allz[1], allz[3]), t);
    // compare a.z and b.z
    float s = step(twoz[0], twoz[1]);
    // now choose between a and b using s
    vec2 best = vec2(mix(twox[0], twox[1], t2), mix(twoy[0], twoy[1], s));
    return texture2D(tex, best);
}

Wow, that’s a bit complex. And even if we’re doing two calls to step() instead of three, there are now five calls to mix() instead of three. Fortunately, thanks to swizzling, we can combine most of these calls to mix():

vec4 getColorShuffledFast(vec4 allx, vec4 ally, vec4 allz)
{
    vec2 t = step(allz.ag, allz.rb);
    vec4 twoxy = mix(vec4(allx.ag, ally.ag), vec4(allx.rb, ally.rb), t.xyxy);
    vec2 twoz  = mix(allz.ag, allz.rb, t);
    float t2 = step(twoz.a, twoz.r);
    vec2 best = mix(twoxy.ag, twoxy.rb, t2);
    return texture2D(tex, best);
}

That’s it! Only three mix() and two step() instructions. Quite a few swizzles, but these are extremely cheap on modern GPUs.

Afterthoughts

The above transformation was at the “cost” of a big data layout change known as array of structures to structure of arrays. When working in parallel on similar data, it is very often a good idea, and the GPU was no exception here.

This was actually a life saver when trying to get a fallback version of a shader to work on an i915 card, where mix and step must be emulated using ALU instructions, up to a maximum of 64. The result can be seen in this NaCl plugin.

  • Posted: 2011-11-30 23:53 (Updated: 2011-12-04 11:53)
  • Author: sam
  • Categories: glsl optim
  • Comments (35)

Playing with the CPU pipeline

This article will show how basic knowledge of a modern CPU’s instruction pipeline can help micro-optimise code at very little cost, using a real world example: the approximation of a trigonometric function. All this without necessarily having to look at lines of assembly code.

The code used for this article is included in the attached file.

Evaluating polynomials

Who needs polynomials anyway? We’re writing games, not a computer algebra system, after all. But wait! Taylor series are an excellent mathematical tool for approximating certain classes of functions. For instance, this is the Taylor series of sin(x) near x = 0:

\[\sin(x) = x - \dfrac{x^3}{3!} + \dfrac{x^5}{5!} - \dfrac{x^7}{7!} + \dfrac{x^9}{9!} - \dfrac{x^{11}}{11!} + \dfrac{x^{13}}{13!} - \dfrac{x^{15}}{15!} + O(x^{16})\]

Truncating the series at the 15th power will compute sin(x) with an absolute error no greater than 1e-11 in the range [-π/2; π/2], and 2e-16 in the range [-π/4; π/4].

However, a better approximation known as the minimax polynomial (probably featured in an upcoming article) will give a maximum absolute error of about 2e-16 on the whole [-π/2; π/2] range:

static double a0 = +1.0;
static double a1 = -1.666666666666580809419428987894207e-1;
static double a2 = +8.333333333262716094425037738346873e-3;
static double a3 = -1.984126982005911439283646346964929e-4;
static double a4 = +2.755731607338689220657382272783309e-6;
static double a5 = -2.505185130214293595900283001271652e-8;
static double a6 = +1.604729591825977403374012010065495e-10;
static double a7 = -7.364589573262279913270651228486670e-13;

double sin1(double x)
{
    return a0 * x
         + a1 * x * x * x
         + a2 * x * x * x * x * x
         + a3 * x * x * x * x * x * x * x
         + a4 * x * x * x * x * x * x * x * x * x
         + a5 * x * x * x * x * x * x * x * x * x * x * x
         + a6 * x * x * x * x * x * x * x * x * x * x * x * x * x
         + a7 * x * x * x * x * x * x * x * x * x * x * x * x * x * x * x;
}

That is 64 multiplications and 7 additions, though compiler options such as GCC’s -ffast-math will help factor the expression in order to perform fewer operations.

It is possible to help the CPU by noticing that a term such as x^9 can be computed in only one operation if x^2 and x^7 are already known, leading to the following code:

double sin2(double x)
{
    double ret, y = x, x2 = x * x;
    ret = a0 * y; y *= x2;
    ret += a1 * y; y *= x2;
    ret += a2 * y; y *= x2;
    ret += a3 * y; y *= x2;
    ret += a4 * y; y *= x2;
    ret += a5 * y; y *= x2;
    ret += a6 * y; y *= x2;
    ret += a7 * y;
    return ret;
}

That is now only 16 multiplications and 7 additions. But it is possible to do even better using the Horner form of a polynomial evaluation:

\[\sum_{i=0}^n a_i x^i = a_0 + x * (a_1 + x * (a_2 + x * (\dots + x * a_n)))\dots)\]

Leading to the following code:

double sin3(double x)
{
    double x2 = x * x;
    return x * (a0 + x2 * (a1 + x2 * (a2 + x2 * (a3 + x2 * (a4 + x2 * (a5 + x2 * (a6 + x2 * a7)))))));
}

We are down to 9 multiplications and 7 additions. There is probably no way to be faster, is there? Let’s see…

Timings

Here are the timings in nanoseconds for the above code, compared with the glibc’s sin() function. The test CPU is an Intel® Core™ i7-2620M CPU at 2.70GHz. The functions were compiled using -O3 -ffast-math:

function sin sin1 sin2 sin3
nanoseconds per call 22.518 16.406 16.658 25.276

Wait, what? Our superbly elegant function, performing only 9 multiplications, is actually slower than the 64-multiplication version? Which itself is as fast as the 16-multiplication one? Surely we overlooked something.

That’s right. We ignored the CPU pipeline.

The instruction pipeline

In order to execute an instruction, such as “add A and B into C”, a CPU needs to do at least the following:

  • fetch the instruction (ie. read it from the program’s memory)
  • decode the instruction
  • read the instruction’s operands (ie. A and B)
  • execute the instruction
  • write the result in memory or in registers (in our case, C)

On a modern Intel® CPU, the execution step only accounts for 1/10 or even 1/16 of the total execution time. The idea behind pipelining is simple: while executing an instruction, the CPU can often already read the operands for the next one.

But there is a problem with this strategy: if the next instruction depends on the result of the current one, the CPU cannot read the next operands yet. This is called a read-after-write hazard, and most usually causes a pipeline stall: the CPU just does nothing until it can carry on.

For the sake of simplicity, imagine the CPU’s pipeline depth is 3. At a given time, it can fetch, execute and finish one instruction:

instruction is being fetched, executed or finished
instruction could start, but needs to wait for the result of a previous instruction

This is how the CPU would execute A = (a + b) * (c + d):

time → total: 7
1 B = a + b
2 C = c + d
3 A = B * C

The c + d operation can be started very early because it does not depend on the result of a + b. This is called instruction-level parallelism. However, the final B * C operation needs to wait for all previous instructions to finish.

Since every operation in sin3() depends on the previous one, this is how it would execute that function:

time → total: 48
1 x2 = x * x
2 A = a7 * x2
3 A += a6
4 A *= x2
5 A += a5
6 A *= x2
7 A += a4
8 A *= x2
9 A += a3
10 A *= x2
11 A += a2
12 A *= x2
13 A += a1
14 A *= x2
15 A += a0
16 A *= x

These 9 multiplications and 7 additions are done in 48 units of time. No instruction-level parallelism is possible because each instruction needs to wait for the previous one to finish.

The secret behind sin2()’s performance is that the large number of independent operations allows the compiler to reorganise the computation so that the instructions can be scheduled in a much more efficient way. This is roughly how GCC compiled it:

time → total: 30
1 x2 = x * x
2 A = a7 * x2
3 x3 = x2 * x
4 A += a6
5 B = a1 * x3
6 x5 = x3 * x2
7 A *= x2
8 C = a2 * x5
9 B += x
10 x7 = x5 * x2
11 A += a5
12 D = a3 * x7
13 B += C
14 x9 = x7 * x2
15 B += D
16 E = a4 * x9
17 x11 = x9 * x2
18 B += E
19 A *= x11
20 A += B

These 13 multiplications and 7 additions are executed in 30 units of time instead of 48 for the previous version. The compiler has been rather clever here: the number of ○’s is kept small.

Note that 30 / 48 = 0.625, and the ratio between sin2 and sin3’s timings is 16.658 / 25.276 = 0.659. Reality matches theory pretty well!

Going further

We have seen that increasing the number of operations in order to break dependencies between CPU instructions allowed to help the compiler perform better optimisations taking advantage of the CPU pipeline. But that was at the cost of 40% more multiplications. Maybe there is a way to improve the scheduling without adding so many instructions?

Luckily there are other ways to evaluate a polynomial.

Even-Odd form and similar schemes

Consider our 8th order polynomial:

\[P(x) = a_0 x + a_1 x^3 + a_2 x^5 + a_3 x^7 + a_4 x^9 + a_5 x^{11} + a_6 x^{13} + a_7 x^{15}\]

Separating the odd and even coefficients, it can be rewritten as:

\[P(x) = x \left((a_0 + a_2 x^4 + a_4 x^8 + a_6 x^{12}) + x^2 (a_1 + a_3 x^4 + a_5 x^8 + a_7 x^{12})\right)\]

Which using Horner’s form yields to:

\[P(x) = x \left((a_0 + x^4 (a_2 + x^4 (a_4 + x^4 + a_6))) + x^2 (a_1 + x^4 (a_3 + x^4 (a_5 + x^4 a_7)))\right)\]

This polynomial evaluation scheme is called the Even-Odd scheme. It only has 9 multiplications and 7 additions (only one multiplication more than the optimal case). It results in the following C code:

double sin4(double x)
{
    double x2 = x * x;
    double x4 = x2 * x2;
    double A = a0 + x4 * (a2 + x4 * (a4 + x4 * a6));
    double B = a1 + x4 * (a3 + x4 * (a5 + x4 * a7));
    return x * (A + x2 * B);
}

And this is the expected scheduling:

time → total: 33
1 x2 = x * x
2 x4 = x2 * x2
3 B = a7 * x4
4 A = a6 * x4
5 B += a5
6 A += a4
7 B *= x4
8 A *= x4
9 B += a3
10 A += a2
11 B *= x4
12 A *= x4
13 B += a1
14 A += a0
15 B *= x2
16 A += B
17 A *= x

Still not good enough, but we’re certainly onto something here. Let’s try another decomposition for the polynomial:

\[P(x) = x \left((a_0 + a_3 x^6 + a_6 x^{12}) + x^2 (a_1 + a_4 x^6 + a_7 x^{12}) + x^4(a_2 + a_5 x^6)\right)\]

And using Horner’s form again:

\[P(x) = x \left((a_0 + x^6 (a_3 + x^6 a_6) + x^2 (a_1 + x^6 (a_4 + x^6 a_7)) + x^4(a_2 + x^6 a_5)\right)\]

Resulting in the following code:

double sin5(double x)
{
    double x2 = x * x;
    double x4 = x2 * x2;
    double x6 = x4 * x2;
    double A = a0 + x6 * (a3 + x6 * a6);
    double B = a1 + x6 * (a4 + x6 * a7);
    double C = a2 + x6 * a5;
    return x * (A + x2 * B + x4 * C);
}

And the following scheduling:

time → total: 31
1 x2 = x * x
2 x4 = x2 * x2
3 x6 = x4 * x2
4 B = x6 * a7
5 A = x6 * a6
6 C = x6 * a5
7 B += a4
8 A += a3
9 C += a2
10 B *= x6
11 A *= x6
12 C *= x4
13 B += a1
14 A += a0
15 B *= x2
16 A += C
17 A += B
18 A *= x

One more instruction and two units of time better. That’s slightly better, but still not as good as we would like. One problem is that a lot of time is lost waiting for the value x6 to be ready. We need to find computations to do in the meantime to avoid pipeline stalls.

High-Low form

Instead of splitting the polynomial into its even and odd coefficients, we split it into its high and low coefficients:

\[P(x) = x \left((a_0 + a_1 x^2 + a_2 x^4 + a_3 x^6) + x^8 (a_4 + a_5 x^2 + a_6 x^4 + a_7 x^6)\right)\]

And again using Horner’s form:

\[P(x) = x \left((a_0 + x^2 (a_1 + x^2 (a_2 + x^2 a_3))) + x^8 (a_4 + x^2 (a_5 + x^2 (a_6 + x^2 a_7)))\right)\]

The corresponding code is now:

double sin6(double x)
{
    double x2 = x * x;
    double x4 = x2 * x2;
    double x8 = x4 * x4;
    double A = a0 + x2 * (a1 + x2 * (a2 + x2 * a3));
    double B = a4 + x2 * (a5 + x2 * (a6 + x2 * a7));
    return x * (A + x8 * B);
}

And the expected scheduling:

time → total: 30
1 x2 = x * x
2 B = x2 * a7
3 A = x2 * a3
4 x4 = x2 * x2
5 B += a6
6 A += a2
7 x8 = x4 * x4
8 B *= x2
9 A *= x2
10 B += a5
11 A += a1
12 B *= x2
13 A *= x2
14 B += a4
15 A += a0
16 B *= x8
17 A += B
18 A *= x

Finally! We now schedule as well as GCC, and with 11 multiplications instead of 13. Still no real performance gain, though.

Pushing the limits

Can we do better? Probably. Remember that each ○ in the above table is a pipeline stall, and any instruction we would insert there would be basically free.

Note the last instruction, A *= x. It causes a stall because it needs to wait for the final value of A, but it would not be necessary if A and B had been multiplied by x beforehands.

Here is a way to do it (bold instructions indicate a new instruction or a modified one):

time → total: 27
1 x2 = x * x
2 B = x2 * a7
3 A = x2 * a3
4 x4 = x2 * x2
5 B += a6
6 A += a2
7 x8 = x4 * x4
8 B *= x2
9 A *= x2
10 x3 = x2 * x
11 B += a5
12 A += a1
13 C = a0 * x
14 B *= x2
15 A *= x3
16 x9 = x8 * x
17 B += a4
18 A += C
19 B *= x9
20 A += B

Excellent! Just as many instructions as GCC, but now with fewer pipeline stalls. I don’t know whether this scheduling is optimal for the (incorrect) assumption of a 3-stage pipeline, but it does look pretty good. Also, loading a0, a1 etc. from memory hasn't been covered for the sake of simplicity.

Anyway, we just need to write the code corresponding to this behaviour, and hope the compiler understands what we need:

double sin7(double x)
{
    double x2 = x * x;
    double x3 = x2 * x;
    double x4 = x2 * x2;
    double x8 = x4 * x4;
    double x9 = x8 * x;
    double A = x3 * (a1 + x2 * (a2 + x2 * a3));
    double B = a4 + x2 * (a5 + x2 * (a6 + x2 * a7));
    double C = a0 * x;
    return A + C + x9 * B;
}

Conclusion

It’s time to check the results! Here they are, for all the functions covered in this article:

function sin sin1 sin2 sin3 sin4 sin5 sin6 sin7
nanoseconds per call 22.518 16.406 16.658 25.276 18.666 18.582 16.366 17.470

Damn. All these efforts to understand and refactor a function, and our best effort actually performs amongst the worst!

What did we miss? Actually, this time, nothing. The problem is that GCC didn't understand what we were trying to say in sin7() and proceeded with its own optimisation ideas. Compiling with -O3 instead of -O3 -ffast-math gives a totally different set of timings:

function sin sin1 sin2 sin3 sin4 sin5 sin6 sin7
nanoseconds per call 22.497 30.250 19.865 25.279 18.587 18.958 16.362 15.891

There. We win eventually!

There is a way to still use -ffast-math yet prevent GCC from trying to be too clever. This might be preferable because we do not want to lose the benefits of -ffast-math in other places. By using an architecture-specific assembly construct, we can mark temporary variables as used, effectively telling GCC that the variable needs to be really computed and not optimised away:

double sin7(double x)
{
    double x2 = x * x;
    double x3 = x2 * x;
    double x4 = x2 * x2;
    double x8 = x4 * x4;
    double x9 = x8 * x;
#if defined __x86_64__
    __asm__("" : "+x" (x3), "+x" (x9));
#elif defined __powerpc__ || defined __powerpc64__
    __asm__("" : "+f" (x3), "+f" (x9));
#else
    __asm__("" : "+m" (x3), "+m" (x9)); /* Out of luck :-( */
#endif
    double A = x3 * (a1 + x2 * (a2 + x2 * a3));
    double B = a4 + x2 * (a5 + x2 * (a6 + x2 * a7));
    double C = a0 * x;
    return A + C + x9 * B;
}

This works on the x86_64 architecture, where "+x" indicates the SSE registers commonly used for floating point calculations, and on the PowerPC, where "+f" can be used. This approach is not portable and it is not clear what should be used on other platforms. Using "+m" is generic but often means a useless store into memory; however, on x86 it is still a noticeable gain.

And our final results, this time with the full -O3 -ffast-math optimisation flags:

function sin sin1 sin2 sin3 sin4 sin5 sin6 sin7
nanoseconds per call 22.522 16.411 16.663 25.277 18.628 18.588 16.365 15.617

The code used for this article is included in the attached file.

  • Posted: 2011-09-17 03:58 (Updated: 2014-02-04 01:30)
  • Author: sam
  • Categories: optim code
  • Comments (157)

Blog comments fixed

Posting comments was broken due to a misconfigured spam filter. Things are back to normal. That is, no one posts comments, but now they could!

Fuck you, Microsoft: reloading projects in Visual Studio

I usually develop using the default Release configuration setting in Visual Studio. It generates faster code and smaller binaries. It’s a perfectly sane thing to do, and most people do the same.

Sometimes, however, something wrong happens and I need to switch to Debug mode:

Then, in order to debug what went wrong, I open several source files, set a lot of breakpoints, study the program flow:

Quite often, I wonder: “did my coworkers commit any code that might be relevant to my problem?” and I synchronise my tree:

git pull --rebase

Or:

p4 get

This is where the nightmare begins.

If Core.vcproj was modified, the following modal dialog appears:

I click Reload. Then, if Engine.vcproj was modified, the following modal dialog appears:

Can you see where this is going? For each of the 50 projects in my solution that were modified, a modal dialog appears and I have no way to say “Yes to All”. Each and every single dialog appears.

When projects are finally reloaded, my tab line looks like this:

Fucking Visual Studio closed all the open tabs from the projects it reloaded! I have no way to reopen them as they were.

Anyway. I press F7 to rebuild the solution, and go take a drink, or switch to another task. A build takes several minutes.

When I come back, I notice this:

Fucking Visual Studio automatically switched my configuration mode back to Release! I just lost minutes of work because I needed a debug build, not a release build. And that tiny check box becomes another thing I need to constantly check in case the software attempts to change it behind my back.

So fuck you, Microsoft, for failing to handle project reloads in even the most slightly user-friendly way. And I shall no buy the “it is not a trivial thing to do” argument. When I close Visual Studio before syncing the tree, then open it again afterwards, I get no annoying avalanche of modal dialogs, my settings are in the expected configuration, and previously opened files are still here in tabs. I would totally do it if it didn't take several minutes to close and reopen the memory hog.

iTunes Store parasites

So I was looking whether the name Monsterz was already taken on the iTunes app store. Unfortunately, it was: synx.us’s Monsterz has been there for a while. It’s a rather poor game where “the aim of the game is to push as many tiles off the screen as possible within the time limit”, but well, that’s the rule: first-come, first-served.

However, I had a look at another linked application by synx.us, and discovered Baseballz, which looked quite similar. Of course it looked similar: in Baseballz, “the aim of the game is to push as many tiles off the screen as possible within the time limit”. Oh well, why not? Who would call a game Baseballz anyway?

Then I found Horrorz, where “the aim of the game is to push as many tiles off the screen as possible within the time limit”. In Soccerz, “the aim of the game is to push as many tiles off the screen as possible within the time limit”. Then came Flagz, Golfz, Razing and Eazter. Eazter? Seriously, what the fuck? And there is more! Tenniz, Valentines, St Patrick’s Day, Blox, Jox, Bikez and fucking Footballz. All these games have the exact same gameplay and even the same title screen text.

Congratulations, Apple. You won’t allow an Obama trampoline jumping app yet fifteen times the same fucking application from obvious namespace parasites is OK.

  • Posted: 2011-03-29 00:54 (Updated: 2011-03-29 00:57)
  • Author: sam
  • Categories: rant iphone
  • Comments (105)

Understanding fast float/integer conversions

If you are interested in micro-optimisation and have never studied IEEE-754 float representation, I suggest you have a look at the beast. It may give you ideas for interesting bitlevel operations. This article will cover the specific topic of conversions between integers and floats.

Note: unless you are coding for antique or very specific architectures such as the PDP-11, you may assume that the floating point storage endianness and the integer endianness match. The code presented here will therefore work flawlessly on modern CPU architectures such as x86, amd64, PowerPC or even ARM.

Introduction

Here are a few floating point values and their bitlevel representation. Notice how the different values affect the sign, exponent and mantissa fields:

sign exponent mantissa
1.0f 0 01111111 0000000 00000000 00000000
-1.0f 1 01111111 0000000 00000000 00000000
0.5f 0 01111110 0000000 00000000 00000000
0.25f 0 01111101 0000000 00000000 00000000
1.0f + 0.5f 0 01111111 1000000 00000000 00000000
1.0f + 0.25f 0 01111111 0100000 00000000 00000000

The core idea behind this article is the manipulation of the last field, the mantissa.

Byte to float conversion

A classical byte (0 - 255) to float (0.0f - 1.0f) conversion function is shown here:

float u8tofloat(uint8_t x)
{
    return (float)x * (1.0f / 255.0f);
}

This looks very simple: one conversion (fild on x86) and one multiplication (fmul on x86). However, the fild instruction has a latency such that the conversion may have a severe impact on performance.

But let’s look at these interesting floating point values:

sign exponent mantissa
32768.0f 0 10001110 0000000 00000000 00000000
32768.0f + 1.0f/256.0f 0 10001110 0000000 00000000 00000001
32768.0f + 2.0f/256.0f 0 10001110 0000000 00000000 00000010
32768.0f + 255.0f/256.0f 0 10001110 0000000 00000000 11111111

Notice the last eight bits? They look almost exactly like the input byte expected by u8tofloat. Taking advantage of the binary representation allows us to write the following conversion function:

float u8tofloat_trick(uint8_t x)
{
    union { float f; uint32_t i; } u; u.f = 32768.0f; u.i |= x;
    return u.f - 32768.0f;
}

When used in a CPU-intensive loop, this method can be up to twice as fast as the previous implementation, for instance on the amd64 architecture. On the x86 architecture, the difference is far less noticeable.

You probably noticed that the output range is 0.0f - 255.0f/256.0f instead of 0.0f - 1.0f. This may be preferred in some cases when the value is supposed to wrap around. However, colour coordinates will require exact 0.0f - 1.0f bounds. This is easily fixed with an additional multiplication:

float u8tofloat_trick2(uint8_t x)
{
    union { float f; uint32_t i; } u; u.f = 32768.0f; u.i |= x;
    return (u.f - 32768.0f) * (256.0f / 255.0f);
}

This can still be up to twice as fast than the original integer to float cast.

Short to float conversion

The usual way to convert a 16-bit integer to a float will be:

float u16tofloat(uint16_t x)
{
    return (float)x * (1.0f / 65535.0f);
}

Again, careful observation of the following floats will be useful:

sign exponent mantissa
16777216.0f 0 10010111 0000000 00000000 00000000
16777216.0f + 1.0f/65536.0f 0 10010111 0000000 00000000 00000001
16777216.0f + 2.0f/65536.0f 0 10010111 0000000 00000000 00000010
16777216.0f + 65535.0f/65536.0f 0 10010111 0000000 11111111 11111111

And the resulting conversion method:

float u16tofloat_trick(uint16_t x)
{
    union { float f; uint32_t i; } u; u.f = 16777216.0f; u.i |= x;
    return u.f - 16777216.0f; // optionally: (u.f - 16777216.0f) * (65536.0f / 65535.0f)
}

However, due to the size of the input data, the performance gain here can be much less visible. Be sure to properly benchmark.

Int to float conversion

The above techniques cannot be directly applied to 32-bit integers because floats only have a 23-bit mantissa. Several methods are possible:

  • Use the double type instead of float. They have a 52-bit mantissa.
  • Reduce the input int precision to 23 bits.

Float to int conversion

Finally, the exact same technique can be used for the inverse conversion. This is the naive implementation:

static inline uint8_t u8fromfloat(float x)
{
    return (int)(x * 255.0f);
}

Clamping is left as an exercise to the reader. Also note that a value such as 255.99999f will ensure better distribution and avoid singling out the 1.0f value.

And our now familiar bitlevel trick:

static inline uint8_t u8fromfloat_trick(float x)
{
    union { float f; uint32_t i; } u;
    u.f = 32768.0f + x * (255.0f / 256.0f);
    return (uint8_t)u.i;
}

Unfortunately, this is usually a performance hit on amd64. However, on x86, it is up to three time as fast as the original. Choose wisely!

The LUT strategy

Some will point out that using a lookup table is much faster.

float lut[256];
void fill_lut()
{
    for (int n = 0; n < 256; n++) lut[n] = (float)n / 255.0f;
}
float u8tofloat_lut(uint8_t x)
{
    return lut[x];
}

This is indeed faster in many cases and should not be overlooked. However, the following should be taken into account:

  • LUTs are fast, but if unlucky, the cache may get in your way and cause performance issues
  • the LUT approach is actually almost always slower with 16-bit input, because the size of the table starts messing with the cache
  • do not underestimate the time needed to fill the LUT, especially if different conversions need to be performed, requiring several LUTs
  • LUTs do not mix well with SIMD instructions
  • obviously, this method doesn’t work with float to int conversions

Last warnings

Many programmers will be tempted to write shorter code such as:

float u8tofloat_INVALID(uint8_t x)
{
    // NO! DON’T DO THIS! EVER!
    float f = 32768.0f; *(uint32_t *)&f |= x;
    return f - 32768.0f;
}

Do not do this, ever! I guarantee that this will break in very nasty and unexpected places. Strict C and C++ aliasing rules make it illegal to have a pointer to a float also be a pointer to an integer. The only legal way to do this is to use a union (actually, this is still not legal by the C++ standard but most real-life compilers allow this type punning through documented extensions).

Finally, one last, obvious tip: always measure the effects of an optimisation before deciding to use it!

Fuck you, Microsoft: the environment variable windows

Why, oh why, is the environment variable window (accessible from the system preferences) such an atrocious experience, beyond the limits of human pain tolerance?

Why does the window not have a fucking resize handle? Why do I have to click on scrollbar handles smaller than my fucking mouse pointer to browse my environment variables? Why is there no way to search or replace strings? Why is the intern who wrote this GUI probably a top executive now?

Build and run Android NDK applications without Eclipse

If you already have a development environment and do not wish to use Eclipse, you can easily build and run your NDK application from makefiles or the command line.

First of all, you need to set the ANDROID_NDK_ROOT environment variable and ensure the SDK and NDK binary directories are in PATH. Here are my definitions:

ANDROID_NDK_ROOT=/home/sam/android/android-ndk-r8
PATH="$PATH:$ANDROID_NDK_ROOT"
PATH="$PATH:/home/sam/android/android-sdk-linux_x86/platform-tools"
PATH="$PATH:/home/sam/android/android-sdk-linux_x86/tools"

This is best defined in one of your shell’s startup scripts such as .zshenv.

Build and install package

Now, whenever you are in an NDK project’s directory, build the project using:

ndk-build && ant release

And to upload it to the emulator or to a connected device:

ant release install

That’s all! Those two simple commands can easily be launched from your preferred development environment.

Update: ant compile no longer exists in recent SDKs; replaced with ant release.

Run package

You can use adb to run any application remotely. For instance:

adb shell am start -a android.intent.action.MAIN -n $PACKAGENAME/.$ACTIVITYNAME

Both package name and activity name can be found in your AndroidManifest.xml.

Fuck you, Microsoft: near and far macros

If you target the Windows platform, chances are that your code will have this:

#include <windows.h>

Which in turns includes <windef.h>, which unconditionally defines the following macros:

#define far
#define near

Right. Because there’s no chance in hell that, writing 3D code for Windows, someone’s gonna name any of their variables near or far. Never happens. Never will.

Fuck you, Microsoft, for not even providing a way to disable that monstrosity with a global macro such as NOFUCKINGMACROSFROMTHEEIGHTIES but instead requiring me to #undef those macros after each inclusion of <windows.h>. And it’s not like you don’t know how to do that, because you provide NOMINMAX which deactivates your min() and max() macros in the same fucking file. Fuck you for silently breaking code that compiles cleanly on every platform, Mac OS X, Android or the Playstation.

I refuse to be swayed by your terror tactics and name my variables m_fNearPlaneClipDistance or whatever deranged mind decides is better. My near and far values are called near and far, because I love this naming scheme, and if you don’t, fuck you and your fat wife.

Load PNGs from assets using Android NDK

Many developers appear to embed libpng with their NDK project in order to decode PNGs. While libpng does offer great flexibility, the amount of code necessary to decode an image is surprisingly high, and the additional work needed to maintain a libpng build means that most of the time, using the system’s decoding routines is perfectly reasonable.

But wait, isn’t the NDK for C++ development only? True, but usually we are still running in a virtual machine that has access to a large panel of high-level utility libraries. This article actually demonstrates a broader, useful technique I call return-to-JVM that you can use for other purposes than simply PNG loading.

I suggest putting your PNG files in the assets directory of your application, so that they can be accessed by path.

First, let’s decide of a Java class and object that will act as a PNG factory and manager for us. Let’s call it PngManager:

import android.content.res.AssetManager;

public class PngManager
{
    private AssetManager amgr;

    public Bitmap open(String path)
    {
        try
        {
            return BitmapFactory.decodeStream(amgr.open(path));
        }
        catch (Exception e) { }
        return null;
    }

    public int getWidth(Bitmap bmp) { return bmp.getWidth(); }
    public int getHeight(Bitmap bmp) { return bmp.getHeight(); }

    public void getPixels(Bitmap bmp, int[] pixels)
    {
        int w = bmp.getWidth();
        int h = bmp.getHeight();
        bmp.getPixels(pixels, 0, w, 0, 0, w, h);
    }

    public void close(Bitmap bmp)
    {
        bmp.recycle();
    }
}

Now to load the PNG from the C++ part of the program, use the following code:

jobject g_pngmgr;
JNIEnv *g_env;

/* ... */

char const *path = "images/myimage.png";

jclass cls = g_env->GetObjectClass(g_pngmgr);
jmethodID mid;

/* Ask the PNG manager for a bitmap */
mid = g_env->GetMethodID(cls, "open",
                         "(Ljava/lang/String;)Landroid/graphics/Bitmap;");
jstring name = g_env->NewStringUTF(path);
jobject png = g_env->CallObjectMethod(g_pngmgr, mid, name);
g_env->DeleteLocalRef(name);
g_env->NewGlobalRef(png);

/* Get image dimensions */
mid = g_env->GetMethodID(cls, "getWidth", "(Landroid/graphics/Bitmap;)I");
int width = g_env->CallIntMethod(g_pngmgr, mid, png);
mid = g_env->GetMethodID(cls, "getHeight", "(Landroid/graphics/Bitmap;)I");
int height = g_env->CallIntMethod(g_pngmgr, mid, png);

/* Get pixels */
jintArray array = g_env->NewIntArray(width * height);
g_env->NewGlobalRef(array);
mid = g_env->GetMethodID(cls, "getPixels", "(Landroid/graphics/Bitmap;[I)V");
g_env->CallVoidMethod(g_pngmgr, mid, png, array);

jint *pixels = g_env->GetIntArrayElements(array, 0);

Now do anything you want with the pixels, for instance bind them to a texture.

And to release the bitmap when finished:

g_env->ReleaseIntArrayElements(array, pixels, 0);
g_env->DeleteGlobalRef(array);

/* Free image */
mid = g_env->GetMethodID(cls, "close", "(Landroid/graphics/Bitmap;)V");
g_env->CallVoidMethod(g_pngmgr, mid, png);
g_env->DeleteGlobalRef(png);

This will not work out of the box. There are a few last things to do, which will hugely depend on your global application architecture and are thus left as an exercise to the reader:

  • Store an AssetManager object in PngManager::amgr before the first call to open() is made (for instance by calling Activity::getAssets() upon application initialisation).
  • Store in g_env a valid JNIEnv * value (the JNI environment is the first argument to all JNI methods), either by remembering it or by using jvm->AttachCurrentThread().
  • Store in g_pngmgr a valid jobject handle to a PngManager instance (for instance by calling a JNI method with the instance as an argument).
  • Error checking was totally omitted from the code for the sake of clarity.
  • Some of the dynamically retrieved variables could benefit from being cached.

I hope this can prove helpful!

For a C++-only solution to this problem, see Load pngs from assets in NDK by Bill Hsu.

The Lol Engine Blog

This blog will contain various articles about the development of the Lol Engine, game development and development in general.

Latest posts

  • Posted: 2011-03-02 02:01 (Updated: 2011-03-22 14:35)
  • Author: sam
  • Categories: blog
  • Comments (1534)