Understanding fast float/integer conversions

If you are interested in micro-optimisation and have never studied IEEE-754 float representation, I suggest you have a look at the beast. It may give you ideas for interesting bitlevel operations. This article will cover the specific topic of conversions between integers and floats.

Note: unless you are coding for antique or very specific architectures such as the PDP-11, you may assume that the floating point storage endianness and the integer endianness match. The code presented here will therefore work flawlessly on modern CPU architectures such as x86, amd64, PowerPC or even ARM.

Introduction

Here are a few floating point values and their bitlevel representation. Notice how the different values affect the sign, exponent and mantissa fields:

sign exponent mantissa
1.0f 0 01111111 0000000 00000000 00000000
-1.0f 1 01111111 0000000 00000000 00000000
0.5f 0 01111110 0000000 00000000 00000000
0.25f 0 01111101 0000000 00000000 00000000
1.0f + 0.5f 0 01111111 1000000 00000000 00000000
1.0f + 0.25f 0 01111111 0100000 00000000 00000000

The core idea behind this article is the manipulation of the last field, the mantissa.

Byte to float conversion

A classical byte (0 - 255) to float (0.0f - 1.0f) conversion function is shown here:

float u8tofloat(uint8_t x)
{
    return (float)x * (1.0f / 255.0f);
}

This looks very simple: one conversion (fild on x86) and one multiplication (fmul on x86). However, the fild instruction has a latency such that the conversion may have a severe impact on performance.

But let’s look at these interesting floating point values:

sign exponent mantissa
32768.0f 0 10001110 0000000 00000000 00000000
32768.0f + 1.0f/256.0f 0 10001110 0000000 00000000 00000001
32768.0f + 2.0f/256.0f 0 10001110 0000000 00000000 00000010
32768.0f + 255.0f/256.0f 0 10001110 0000000 00000000 11111111

Notice the last eight bits? They look almost exactly like the input byte expected by u8tofloat. Taking advantage of the binary representation allows us to write the following conversion function:

float u8tofloat_trick(uint8_t x)
{
    union { float f; uint32_t i; } u; u.f = 32768.0f; u.i |= x;
    return u.f - 32768.0f;
}

When used in a CPU-intensive loop, this method can be up to twice as fast as the previous implementation, for instance on the amd64 architecture. On the x86 architecture, the difference is far less noticeable.

You probably noticed that the output range is 0.0f - 255.0f/256.0f instead of 0.0f - 1.0f. This may be preferred in some cases when the value is supposed to wrap around. However, colour coordinates will require exact 0.0f - 1.0f bounds. This is easily fixed with an additional multiplication:

float u8tofloat_trick2(uint8_t x)
{
    union { float f; uint32_t i; } u; u.f = 32768.0f; u.i |= x;
    return (u.f - 32768.0f) * (256.0f / 255.0f);
}

This can still be up to twice as fast as the original integer to float cast.

Short to float conversion

The usual way to convert a 16-bit integer to a float is:

float u16tofloat(uint16_t x)
{
    return (float)x * (1.0f / 65535.0f);
}

Again, careful observation of the following floats will be useful. This time the bias is 128.0f (i.e. 2^7), whose mantissa step is 2^(7-23) = 1/65536:

sign exponent mantissa
128.0f 0 10000110 0000000 00000000 00000000
128.0f + 1.0f/65536.0f 0 10000110 0000000 00000000 00000001
128.0f + 2.0f/65536.0f 0 10000110 0000000 00000000 00000010
128.0f + 65535.0f/65536.0f 0 10000110 0000000 11111111 11111111

And the resulting conversion method:

float u16tofloat_trick(uint16_t x)
{
    union { float f; uint32_t i; } u; u.f = 128.0f; u.i |= x;
    return u.f - 128.0f; // optionally: (u.f - 128.0f) * (65536.0f / 65535.0f)
}

However, due to the size of the input data, the performance gain here can be much less visible. Be sure to properly benchmark.

Int to float conversion

The above techniques cannot be directly applied to 32-bit integers because floats only have a 23-bit mantissa. Several methods are possible:

  • Use the double type instead of float. They have a 52-bit mantissa.
  • Reduce the input int precision to 23 bits.

Float to int conversion

Finally, the exact same technique can be used for the inverse conversion. This is the naive implementation:

static inline uint8_t u8fromfloat(float x)
{
    return (int)(x * 255.0f);
}

Clamping is left as an exercise to the reader. Also note that using a multiplier such as 255.99999f instead of 255.0f gives a more even distribution across the 256 output values and avoids singling out the 1.0f value.

And our now familiar bitlevel trick:

static inline uint8_t u8fromfloat_trick(float x)
{
    union { float f; uint32_t i; } u;
    u.f = 32768.0f + x * (255.0f / 256.0f);
    return (uint8_t)u.i;
}

Unfortunately, this is usually a performance hit on amd64. However, on x86, it is up to three times as fast as the original. Choose wisely!

The LUT strategy

Some will point out that using a lookup table is much faster.

float lut[256];
void fill_lut()
{
    for (int n = 0; n < 256; n++) lut[n] = (float)n / 255.0f;
}
float u8tofloat_lut(uint8_t x)
{
    return lut[x];
}

This is indeed faster in many cases and should not be overlooked. However, the following should be taken into account:

  • LUTs are fast, but if unlucky, the cache may get in your way and cause performance issues
  • the LUT approach is actually almost always slower with 16-bit input, because the size of the table starts messing with the cache
  • do not underestimate the time needed to fill the LUT, especially if different conversions need to be performed, requiring several LUTs
  • LUTs do not mix well with SIMD instructions
  • obviously, this method doesn’t work with float to int conversions

Last warnings

Many programmers will be tempted to write shorter code such as:

float u8tofloat_INVALID(uint8_t x)
{
    // NO! DON’T DO THIS! EVER!
    float f = 32768.0f; *(uint32_t *)&f |= x;
    return f - 32768.0f;
}

Do not do this, ever! I guarantee that this will break in very nasty and unexpected places. Strict C and C++ aliasing rules make it illegal for a pointer to a float to also be a pointer to an integer, and optimising compilers will exploit that assumption. The union is the idiomatic way out in C: C99 (TC3) explicitly allows reading a union member other than the one last written. In C++ such type punning is still technically undefined, but most real-life compilers support it as a documented extension; memcpy is the fully portable alternative.

Finally, one last, obvious tip: always measure the effects of an optimisation before deciding to use it!

  • Posted: 2011-03-20 18:55 (Updated: 2012-02-26 17:55)
  • Author: sam
  • Categories: optim code tip

Comments

1. johan.boule@retropaganda.info -- 2011-07-04 21:46

I've tried to measure the performance gain with the following program, compiled with gcc 4.4.3 on an intel core2 quad (4 cores), linux x86_64. This is the output i get with various compiler flags:

	# g++ -std=gnu++0x -march=native -O3 -funroll-loops -fopenmp int_to_float_test.cpp && ./a.out
	no conversion: sum: 4.29497e+09, time: 0.777653s
	specialized  : sum: 4.29497e+09, time: 0.792678s
	standard     : sum: 4.29497e+09, time: 0.793893s
	# g++ -std=gnu++0x -march=native -O3 -fopenmp int_to_float_test.cpp && ./a.out
	no conversion: sum: 4.29497e+09, time: 0.797142s
	specialized  : sum: 4.29497e+09, time: 1.11614s
	standard     : sum: 4.29497e+09, time: 0.788671s
	a.out: int_to_float_test.cpp:65: int main(): Assertion `t2 - t1 < t3 - t2' failed.
	# g++ -std=gnu++0x -march=native -O3 -funroll-loops int_to_float_test.cpp && ./a.out
	no conversion: sum: 4.29497e+09, time: 3.06334s
	specialized  : sum: 4.29497e+09, time: 3.16423s
	standard     : sum: 4.29497e+09, time: 3.14101s
	a.out: int_to_float_test.cpp:65: int main(): Assertion `t2 - t1 < t3 - t2' failed.
	# g++ -std=gnu++0x -march=native -O3 int_to_float_test.cpp && ./a.out
	no conversion: sum: 4.29497e+09, time: 3.17681s
	specialized  : sum: 4.29497e+09, time: 4.46758s
	standard     : sum: 4.29497e+09, time: 3.14915s
	a.out: int_to_float_test.cpp:65: int main(): Assertion `t2 - t1 < t3 - t2' failed.

The test program:

#include <chrono>
#include <cstdint>
#include <limits>
#include <iostream>
#undef NDEBUG
#include <cassert>
float inline uint8_t_to_float(uint8_t x) {
        union { float f; uint32_t i; } u; u.f = 32768.0f; u.i |= x;
        return (u.f - 32768.0f) * 256.0f ;
}
int main() {
        typedef std::uint8_t int_type;
        typedef float real;
        typedef std::chrono::high_resolution_clock clock;
        std::size_t const iterations = 10000000;
        typedef std::numeric_limits<int_type> limits;
        clock::time_point const t0 = clock::now();
        // without conversion
        real r0(0);
        #pragma omp parallel for
        for(std::size_t count = 0; count < iterations; ++count)
                for(real r = limits::min(); r < limits::max(); ++r)
                        r0 += r;
        clock::time_point const t1 = clock::now();
        // specialized conversion
        real r1(0);
        #pragma omp parallel for
        for(std::size_t count = 0; count < iterations; ++count)
                for(int_type i = limits::min(); i < limits::max(); ++i)
                        r1 += uint8_t_to_float(i);
        clock::time_point const t2 = clock::now();
        // standard conversion
        real r2(0);
        #pragma omp parallel for
        for(std::size_t count = 0; count < iterations; ++count)
                for(int_type i = limits::min(); i < limits::max(); ++i)
                        r2 += i;
        clock::time_point const t3 = clock::now();
        std::cout <<
                "no conversion: sum: " << r0 << ", time: " << std::chrono::nanoseconds(t1 - t0).count() * 1e-9 << "s\n"
                "specialized  : sum: " << r1 << ", time: " << std::chrono::nanoseconds(t2 - t1).count() * 1e-9 << "s\n"
                "standard     : sum: " << r2 << ", time: " << std::chrono::nanoseconds(t3 - t2).count() * 1e-9 << "s\n";
        assert(r0 == r1 && r1 == r2);
        assert(t1 - t0 < t2 - t1);
        assert(t2 - t1 < t3 - t2);
}
2. sam -- 2011-07-10 13:16

@Johan: this is interesting indeed, and clearly proves the need to benchmark on many platforms and with many compiler flags. I have several comments:

  • The use of #pragma omp seems invalid here, at least with the versions of GCC I could test (4.4 and 4.6). I tried with iterations = 100 and all of r0, r1 and r2 had invalid values.
  • Using 32768.0f and 256.0f as conversion constants is not optimal; you could use 1.0f and 8388608.0f (because loading 1.0f in a floating point register is cheap) or, even better, 8388608.0f and 1.0f (thus avoiding the final multiplication).
  • I think using the same variable for the loop index and the value to convert gives the compiler too many opportunities to optimise; reading the value from a table seems worth looking at, too.

Here are my timings for another test program, first on a Core-i7 2600 @3.40 GHz:

# g++ -std=gnu++0x -march=native -O3 u8tofloat.cpp && ./a.out
specialized  : sum: 4.29497e+09, time: 0.224297s
standard     : sum: 4.29497e+09, time: 0.297493s
# g++ -std=gnu++0x -march=native -O3 -funroll-loops u8tofloat.cpp && ./a.out
specialized  : sum: 4.29497e+09, time: 0.223372s
standard     : sum: 4.29497e+09, time: 0.223082s

And this is on an Opteron 275:

# g++ -std=gnu++0x -march=native -O3 u8tofloat.cpp && ./a.out
specialized  : sum: 4.29497e+09, time: 0.511495s
standard     : sum: 4.29497e+09, time: 0.507289s
# g++ -std=gnu++0x -march=native -O3 -funroll-loops u8tofloat.cpp && ./a.out
specialized  : sum: 4.29497e+09, time: 0.484214s
standard     : sum: 4.29497e+09, time: 0.517303s

The results are mixed. On a Core i7 there is a solid 20 to 30% gain, but it disappears when -funroll-loops is used. On the Opteron the specialized method is slightly slower, but becomes faster when -funroll-loops is used.

So, no clear winner. It will depend on the situation.

And this is the code, adapted from yours:

#include <chrono>
#include <cstdint>
#include <iostream>
#undef NDEBUG
#include <cassert>
float inline uint8_t_to_float(uint8_t x) {
        union { float f; uint32_t i; } u = { 8388608.0f }; u.i |= x;
        return u.f - 8388608.0f;
}
int main() {
        typedef std::uint8_t int_type;
        typedef float real;
        typedef std::chrono::high_resolution_clock clock;
        std::size_t const datasize = 200000;
        std::size_t const iterations = 1000;
        int_type *data = new int_type[datasize];
        for (std::size_t i = 0; i < datasize; ++i)
                data[i] = static_cast<int_type>((i * 17) ^ 101);
        clock::time_point const t0 = clock::now();
        // specialized conversion
        real r1(0);
        #pragma omp parallel for
        for(std::size_t count = 0; count < iterations; ++count)
                for (std::size_t i = 0; i < datasize; ++i)
                        r1 += uint8_t_to_float(data[i]);
        clock::time_point const t1 = clock::now();
        // standard conversion
        real r2(0);
        #pragma omp parallel for
        for(std::size_t count = 0; count < iterations; ++count)
                for (std::size_t i = 0; i < datasize; ++i)
                        r2 += static_cast<real>(data[i]);
        clock::time_point const t2 = clock::now();
        std::cout <<
                "specialized  : sum: " << r1 << ", time: " << std::chrono::nanoseconds(t1 - t0).count() * 1e-9 << "s\n"
                "standard     : sum: " << r2 << ", time: " << std::chrono::nanoseconds(t2 - t1).count() * 1e-9 << "s\n";
        assert(r1 == r2);
        delete[] data;
}
3. anonymous -- 2012-04-13 10:38

The union trick, though still supported by almost every compiler, is also invalid with regard to the strict aliasing rules!

The standard says that only the union member that was initialised may be accessed. The only guaranteed way to copy from int to float and float to int is to memcpy through a cast of the variable's address to (char *), as that is the only defined behaviour.

That said, the union works, and will probably keep working in the future, so who cares; just keep in mind that PPC will not be very happy on the performance side with this kind of trick, and you will suffer from heavy LHS (load-hit-store) issues.

4. sam -- 2012-04-15 01:42

@anonymous: the C standard (C99, TC3, section 6.5.2.3) explicitly says it's legal regardless of the aliasing rules. You are right that the union is not legally supposed to work in C++ (hence the need for reinterpret_cast).

5. john -- 2015-07-08 11:47

It's easy to extend the code to convert signed integers to float and double. Here are two very fast examples using gcc x86-64 assembler (Linux, AT&T syntax).

converting signed integers to float (-2^23 < i < +2^23)

        mov      %edi, %eax                // EDI, EAX = signed integer
        mov      $0x4B000000, %ecx         // ECX = 8388608.0f (2^23)
        cdq                                // EDX = signbits
        neg      %edi                      // EDI = -integer
        and      $0x80000000, %edx         // EDX = sign
        cmovne   %edi, %eax                // EAX = mantissa
        or       %edx, %ecx                // ECX = sign | exponent (+-)2^23
        movd     %ecx, %xmm1               // XMM1 = (+-)8388608.0f
        or       %ecx, %eax                // EAX = transition float
        movd     %eax, %xmm0
        subss    %xmm1, %xmm0              // XMM0 = float value
        ...

converting signed integers to double (-2^52 < i < +2^52)

        mov      %rdi, %rax                // RDI, RAX = signed integer
        mov      $0x4330000000000000, %rcx // RCX = 4503599627370496.0 (2^52)
        mov      $0x8000000000000000, %rsi // RSI = sign mask
        cqo                                // RDX = signbits
        neg      %rdi                      // RDI = -integer
        and      %rsi, %rdx                // RDX = sign
        cmovne   %rdi, %rax                // RAX = mantissa
        or       %rdx, %rcx                // RCX = sign | exponent (+-)2^52
        movq     %rcx, %xmm1               // XMM1 = (+-)4503599627370496.0
        or       %rcx, %rax                // RAX = transition double
        movq     %rax, %xmm0
        subsd    %xmm1, %xmm0              // XMM0 = double value
        ...
