Understanding fast float/integer conversions

If you are interested in micro-optimisation and have never studied IEEE-754 float representation, I suggest you have a look at the beast. It may give you ideas for interesting bitlevel operations. This article will cover the specific topic of conversions between integers and floats.

Note: unless you are coding for antique or very specific architectures such as the PDP-11, you may assume that the floating point storage endianness and the integer endianness match. The code presented here will therefore work flawlessly on modern CPU architectures such as x86, amd64, PowerPC or even ARM.


Here are a few floating point values and their bitlevel representation. Notice how the different values affect the sign, exponent and mantissa fields:

sign exponent mantissa
1.0f 0 01111111 0000000 00000000 00000000
-1.0f 1 01111111 0000000 00000000 00000000
0.5f 0 01111110 0000000 00000000 00000000
0.25f 0 01111101 0000000 00000000 00000000
1.0f + 0.5f 0 01111111 1000000 00000000 00000000
1.0f + 0.25f 0 01111111 0100000 00000000 00000000

The core idea behind this article is the manipulation of the last field, the mantissa.

Byte to float conversion

A classical byte (0 - 255) to float (0.0f - 1.0f) conversion function is shown here:

float u8tofloat(uint8_t x)
    return (float)x * (1.0f / 255.0f);

This looks very simple: one conversion (fild on x86) and one multiplication (fmul on x86). However, the fild instruction has a latency such that the conversion may have a severe impact on performance.

But let’s look at these interesting floating point values:

sign exponent mantissa
32768.0f 0 10001110 0000000 00000000 00000000
32768.0f + 1.0f/256.0f 0 10001110 0000000 00000000 00000001
32768.0f + 2.0f/256.0f 0 10001110 0000000 00000000 00000010
32768.0f + 255.0f/256.0f 0 10001110 0000000 00000000 11111111

Notice the last eight bits? They look almost exactly like the input byte expected by u8tofloat. Taking advantage of the binary representation allows us to write the following conversion function:

float u8tofloat_trick(uint8_t x)
    union { float f; uint32_t i; } u; u.f = 32768.0f; u.i |= x;
    return u.f - 32768.0f;

When used in a CPU-intensive loop, this method can be up to twice as fast as the previous implementation, for instance on the amd64 architecture. On the x86 architecture, the difference is far less noticeable.

You probably noticed that the output range is 0.0f - 255.0f/256.0f instead of 0.0f - 1.0f. This may be preferred in some cases when the value is supposed to wrap around. However, colour coordinates will require exact 0.0f - 1.0f bounds. This is easily fixed with an additional multiplication:

float u8tofloat_trick2(uint8_t x)
    union { float f; uint32_t i; } u; u.f = 32768.0f; u.i |= x;
    return (u.f - 32768.0f) * (256.0f / 255.0f);

This can still be up to twice as fast than the original integer to float cast.

Short to float conversion

The usual way to convert a 16-bit integer to a float will be:

float u16tofloat(uint16_t x)
    return (float)x * (1.0f / 65535.0f);

Again, careful observation of the following floats will be useful:

sign exponent mantissa
16777216.0f 0 10010111 0000000 00000000 00000000
16777216.0f + 1.0f/65536.0f 0 10010111 0000000 00000000 00000001
16777216.0f + 2.0f/65536.0f 0 10010111 0000000 00000000 00000010
16777216.0f + 65535.0f/65536.0f 0 10010111 0000000 11111111 11111111

And the resulting conversion method:

float u16tofloat_trick(uint16_t x)
    union { float f; uint32_t i; } u; u.f = 16777216.0f; u.i |= x;
    return u.f - 16777216.0f; // optionally: (u.f - 16777216.0f) * (65536.0f / 65535.0f)

However, due to the size of the input data, the performance gain here can be much less visible. Be sure to properly benchmark.

Int to float conversion

The above techniques cannot be directly applied to 32-bit integers because floats only have a 23-bit mantissa. Several methods are possible:

  • Use the double type instead of float. They have a 52-bit mantissa.
  • Reduce the input int precision to 23 bits.

Float to int conversion

Finally, the exact same technique can be used for the inverse conversion. This is the naive implementation:

static inline uint8_t u8fromfloat(float x)
    return (int)(x * 255.0f);

Clamping is left as an exercise to the reader. Also note that a value such as 255.99999f will ensure better distribution and avoid singling out the 1.0f value.

And our now familiar bitlevel trick:

static inline uint8_t u8fromfloat_trick(float x)
    union { float f; uint32_t i; } u;
    u.f = 32768.0f + x * (255.0f / 256.0f);
    return (uint8_t)u.i;

Unfortunately, this is usually a performance hit on amd64. However, on x86, it is up to three time as fast as the original. Choose wisely!

The LUT strategy

Some will point out that using a lookup table is much faster.

float lut[256];
void fill_lut()
    for (int n = 0; n < 256; n++) lut[n] = (float)n / 255.0f;
float u8tofloat_lut(uint8_t x)
    return lut[x];

This is indeed faster in many cases and should not be overlooked. However, the following should be taken into account:

  • LUTs are fast, but if unlucky, the cache may get in your way and cause performance issues
  • the LUT approach is actually almost always slower with 16-bit input, because the size of the table starts messing with the cache
  • do not underestimate the time needed to fill the LUT, especially if different conversions need to be performed, requiring several LUTs
  • LUTs do not mix well with SIMD instructions
  • obviously, this method doesn’t work with float to int conversions

Last warnings

Many programmers will be tempted to write shorter code such as:

float u8tofloat_INVALID(uint8_t x)
    float f = 32768.0f; *(uint32_t *)&f |= x;
    return f - 32768.0f;

Do not do this, ever! I guarantee that this will break in very nasty and unexpected places. Strict C and C++ aliasing rules make it illegal to have a pointer to a float also be a pointer to an integer. The only legal way to do this is to use a union (actually, this is still not legal by the C++ standard but most real-life compilers allow this type punning through documented extensions).

Finally, one last, obvious tip: always measure the effects of an optimisation before deciding to use it!

  • Posted: 2011-03-20 18:55 (Updated: 2012-02-26 17:55)
  • Author: sam
  • Categories: optim code tip


1. johan.boule@retropaganda.info -- 2011-07-04 21:46

I've tried to measure the performance gain with the following program, compiled with gcc 4.4.3 on an intel core2 quad (4 cores), linux x86_64. This is the output i get with various compiler flags:

	# g++ -std=gnu++0x -march=native -O3 -funroll-loops -fopenmp int_to_float_test.cpp && ./a.out
	no conversion: sum: 4.29497e+09, time: 0.777653s
	specialized  : sum: 4.29497e+09, time: 0.792678s
	standard     : sum: 4.29497e+09, time: 0.793893s
	# g++ -std=gnu++0x -march=native -O3 -fopenmp int_to_float_test.cpp && ./a.out
	no conversion: sum: 4.29497e+09, time: 0.797142s
	specialized  : sum: 4.29497e+09, time: 1.11614s
	standard     : sum: 4.29497e+09, time: 0.788671s
	a.out: int_to_float_test.cpp:65: int main(): Assertion `t2 - t1 < t3 - t2' failed.
	# g++ -std=gnu++0x -march=native -O3 -funroll-loops int_to_float_test.cpp && ./a.out
	no conversion: sum: 4.29497e+09, time: 3.06334s
	specialized  : sum: 4.29497e+09, time: 3.16423s
	standard     : sum: 4.29497e+09, time: 3.14101s
	a.out: int_to_float_test.cpp:65: int main(): Assertion `t2 - t1 < t3 - t2' failed.
	# g++ -std=gnu++0x -march=native -O3 int_to_float_test.cpp && ./a.out
	no conversion: sum: 4.29497e+09, time: 3.17681s
	specialized  : sum: 4.29497e+09, time: 4.46758s
	standard     : sum: 4.29497e+09, time: 3.14915s
	a.out: int_to_float_test.cpp:65: int main(): Assertion `t2 - t1 < t3 - t2' failed.

The test program:

#include <chrono>
#include <limits>
#include <iostream>
#undef NDEBUG
#include <cassert>
float inline uint8_t_to_float(uint8_t x) {
        union { float f; uint32_t i; } u; u.f = 32768.0f; u.i |= x;
        return (u.f - 32768.0f) * 256.0f ;
int main() {
        typedef std::uint8_t int_type;
        typedef float real;
        typedef std::chrono::high_resolution_clock clock;
        std::size_t const iterations = 10000000;
        typedef std::numeric_limits<int_type> limits;
        clock::time_point const t0 = clock::now();
        // without conversion
        real r0(0);
        #pragma omp parallel for
        for(std::size_t count = 0; count < iterations; ++count)
                for(real r = limits::min(); r < limits::max(); ++r)
                        r0 += r;
        clock::time_point const t1 = clock::now();
        // specialized conversion
        real r1(0);
        #pragma omp parallel for
        for(std::size_t count = 0; count < iterations; ++count)
                for(int_type i = limits::min(); i < limits::max(); ++i)
                        r1 += uint8_t_to_float(i);
        clock::time_point const t2 = clock::now();
        // standard conversion
        real r2(0);
        #pragma omp parallel for
        for(std::size_t count = 0; count < iterations; ++count)
                for(int_type i = limits::min(); i < limits::max(); ++i)
                        r2 += i;
        clock::time_point const t3 = clock::now();
        std::cout <<
                "no conversion: sum: " << r0 << ", time: " << std::chrono::nanoseconds(t1 - t0).count() * 1e-9 << "s\n"
                "specialized  : sum: " << r1 << ", time: " << std::chrono::nanoseconds(t2 - t1).count() * 1e-9 << "s\n"
                "standard     : sum: " << r2 << ", time: " << std::chrono::nanoseconds(t3 - t2).count() * 1e-9 << "s\n";
        assert(r0 == r1 && r1 == r2);
        assert(t1 - t0 < t2 - t1);
        assert(t2 - t1 < t3 - t2);
2. sam -- 2011-07-10 13:16

@Johan: this is interesting indeed, and clearly proves the need to benchmark on many platforms and with many compiler flags. I have several comments:

  • The use of #pragma omp seems invalid here, at least with the versions of GCC I could test (4.4 and 4.6). I tried with iterations = 100 and all of r0, r1 and r2 had invalid values.
  • Using 32768.0f and 256.0f as conversion constants is not optimal; you could use 1.0f and 8388608.0f (because loading 1.0f in a floating point register is cheap) or, even better, 8388608.0f and 1.0f (thus avoiding the final multiplication).
  • I think using the same variable for the loop index and the value to convert gives the compiler too many opportunities to optimise; reading the value from a table seems worth looking at, too.

Here are my timings for another test program, first on a Core-i7 2600 @3.40 GHz:

# g++ -std=gnu++0x -march=native -O3 u8tofloat.cpp && ./a.out
specialized  : sum: 4.29497e+09, time: 0.224297s
standard     : sum: 4.29497e+09, time: 0.297493s
# g++ -std=gnu++0x -march=native -O3 -funroll-loops u8tofloat.cpp && ./a.out
specialized  : sum: 4.29497e+09, time: 0.223372s
standard     : sum: 4.29497e+09, time: 0.223082s

And this is on an Opteron 275:

# g++ -std=gnu++0x -march=native -O3 u8tofloat.cpp && ./a.out
specialized  : sum: 4.29497e+09, time: 0.511495s
standard     : sum: 4.29497e+09, time: 0.507289s
# g++ -std=gnu++0x -march=native -O3 -funroll-loops u8tofloat.cpp && ./a.out
specialized  : sum: 4.29497e+09, time: 0.484214s
standard     : sum: 4.29497e+09, time: 0.517303s

The results are mixed. On a Core i7 there is a solid 20 to 30% gain, but it disappears when -funroll-loops is used. On the Opteron the specialized method is slightly slower, but becomes faster when -funroll-loops is used.

So, no clear winner. It will depend the situation.

And this is the code, adapted from yours:

#include <chrono>
#include <iostream>
#undef NDEBUG
#include <cassert>
float inline uint8_t_to_float(uint8_t x) {
        union { float f; uint32_t i; } u = { 8388608.0f }; u.i |= x;
        return u.f - 8388608.0f;
int main() {
        typedef std::uint8_t int_type;
        typedef float real;
        typedef std::chrono::high_resolution_clock clock;
        std::size_t const datasize = 200000;
        std::size_t const iterations = 1000;
        int_type *data = new int_type[datasize];
        for (std::size_t i = 0; i < datasize; ++i)
                data[i] = static_cast<int_type>((i * 17) ^ 101);
        clock::time_point const t0 = clock::now();
        // specialized conversion
        real r1(0);
        #pragma omp parallel for
        for(std::size_t count = 0; count < iterations; ++count)
                for (std::size_t i = 0; i < datasize; ++i)
                        r1 += uint8_t_to_float(data[i]);
        clock::time_point const t1 = clock::now();
        // standard conversion
        real r2(0);
        #pragma omp parallel for
        for(std::size_t count = 0; count < iterations; ++count)
                for (std::size_t i = 0; i < datasize; ++i)
                        r2 += static_cast<real>(data[i]);
        clock::time_point const t2 = clock::now();
        std::cout <<
                "specialized  : sum: " << r1 << ", time: " << std::chrono::nanoseconds(t1 - t0).count() * 1e-9 << "s\n"
                "standard     : sum: " << r2 << ", time: " << std::chrono::nanoseconds(t2 - t1).count() * 1e-9 << "s\n";
        assert(r1 == r2);
        delete data;
3. anonymous -- 2012-04-13 10:38

The union trick, still supports by almost every compiler is also invalid in regards to the strict aliasing rules !

The standard say that only the union initialised member can be accessed. The only valid way to copy from int to float and float to int is to memcpy by casting the variable adress to (char*), as it is the only guarantee behavior.

Btw, union works, and will probably works in futur for sure, so who care, just have in mind that PPC will not be very happy on the performance side with that kind of tricks and you will suffers from heavy LHS issues

4. sam -- 2012-04-15 01:42

@anonymous: the C standard (C99, TC3, section explicitly says it's legal regardless of the aliasing rules. You are right that the union is not legally supposed to work in C++ (hence the need for reinterpret_cast).

5. john -- 2015-07-08 11:47

It's easy to extend the code to convert signed integers to float and double. Here are two very fast examples using gcc x86-64 assembler (Linux, AT&T syntax).

converting signed integers to float (-223 < i < +223)

        mov      %edi, %eax                // EDI, EAX = signed integer
        mov      $0x4B000000, %ecx         // ECX = 8388608.0f (2^23)
        cdq                                // EDX = signbits
        neg      %edi                      // EDI = -integer
        and      $0x80000000, %edx         // EDX = sign
        cmovne   %edi, %eax                // EAX = mantissa
        or       %edx, %ecx                // ECX = sign | exponent (+-)2^23
        movd     %ecx, %xmm1               // XMM1 = (+-)8388608.0f
        or       %ecx, %eax                // EAX = transition float
        movd     %eax, %xmm0
        subss    %xmm1, %xmm0              // XMM0 = float value

converting signed integers to double (-252 < i < +252)

        mov      %rdi, %rax                // RDI, RAX = signed integer
        mov      $0x4330000000000000, %rcx // RCX = 4503599627370496.0 (2^52)
        mov      $0x8000000000000000, %rsi // RSI = sign mask
        cqo                                // RDX = signbits
        neg      %rdi                      // RDI = -integer
        and      %rsi, %rdx                // RDX = sign
        cmovne   %rdi, %rax                // RAX = mantissa
        or       %rdx, %rcx                // RCX = sign | exponent (+-)2^52
        movq     %rcx, %xmm1               // XMM1 = (+-)4503599627370496.0
        or       %rcx, %rax                // RAX = transition double
        movq     %rax, %xmm0
        subsd    %xmm1, %xmm0              // XMM0 = double value
360. hillarydavidson -- 2017-12-19 11:22

Just to show off my unique personality, I tried new and latest fashions as well as keep myself in touch with latest updates. For all kinds of news I just take a glance on ​https://www.dissertationwritinguk.co.uk/write-my-dissertation

these news and enjoy uniqueness.

361. susanrichard -- 2017-12-29 12:08

Naturally, a tyke procedure acquires the earth factors of its parent procedure. Projects began by the order processor acquire the summon processor's condition factors. To indicate an alternate situation for a kid procedure, make another condition square and pass a pointer to it as a parameter to the CreateProcess work. For more information please visit to our website https://www.assignmenthelperuk.co.uk

362. anonymous -- 2018-01-16 06:47

I am happy to be on this website, you have posted very nice tutorial which i like reading. I have learned a lot from this post which will help me in future. Purchase gym flooring from http://www.gymflooringco.co.uk/gym-flooring.html

363. anonymous -- 2018-01-22 08:33

Maybe this issue ought to be a clue to chip architects to discover what software engineers require before planning processors. Were the Pentium architects ignorant that the most usually utilized programming dialects change over buoys to ints by truncation? What's more, if they knew, why for heaven&#39;s purpose didn&#39;t they incorporate a guideline to do this paying little respect to the present adjusting mode? https://www.assignmentglobe.com/buy-assignment

364. Dubai Offshore Company Formation -- 2018-01-25 05:47

This is great information for students. This article is very helpful i really like this blog thanks. https://www.uaefreezone.xyz/ I also have some information relevant for online dissertation help.

365. Help With Psychology Assignments -- 2018-01-25 05:48

I am so happy to read this. https://www.psychologyassignments.com/ This is the kind of manual that needs to be given and not the random misinformation that's at the other blogs.

366. Eleston -- 2018-01-29 08:00

I am very happy to visit in the blog. I am reading the all conversation increasing the knowledge. I am appreciate in the work.keep it up. http://garagerackingco.uk/

367. aadilkhan9409@gmail.com -- 2018-01-29 08:00

Thanks for a sharing it.I found a lot of interesting information here. http://freerobux.spruz.com/.A really good post very thankful and hopeful that you will many more posts like this one

368. Harry -- 2018-01-29 12:42

I am very happy to visit in the post. In the post provided the important information for the everyone new visitor. I would like to say that the article is really very good. http://gymmatscompany.co.uk/gym-mats.html

369. Thesis Writing Service -- 2018-02-01 06:48

This is really a great stuff for sharing. https://www.dissertationinc.com/

Keep it up .Thanks for sharing.

370. anonymous -- 2018-02-05 11:44

Excellent read, Positive site, where did u come up with the information on this posting? I have read a few of the on your website now, and I really like your style. Thanks a million and please keep up the effective work https://www.assignmentsolution.co.uk/write-my-assignment

371. Clark -- 2018-02-10 07:06

I am very impress to seeing the some different coding of here. I am glad to know that the article is really very good. Thanks for sharing in the information. keep it up. http://www.rollstickersco.com.au/

372. Kolkata Escorts -- 2018-02-20 20:58

Beauti Queen amazing kolkata escorts agency alwyas ready for you at your doorstep. Beauti Queen provide you high profile independent girl in kolkata, escorts service in kolkata, kolkata escorts. http://www.kolkataqueen.com/

373. Online Marketing Homework Help -- 2018-02-21 12:53

https://www.marketingassignmentz.com/ I personally like your post; you have shared good insights and experiences. Keep it up.

374. Kolkata Escorts -- 2018-02-23 21:33

Renu Das Kolkata Escorts Services has gorgeous females provides Independent Escorts Service in Kolkata call girls at 100% satisfaction with VIP models. Provided Kolkata escorts at our agency are professional in nature and are eager to serve you at your place. <a href="http://www.renudas.in/">Kolkata Escorts</a> <a href="http://www.renudas.in/">Escorts in Kolkata</a> http://www.renudas.in

375. Delhi Escorts -- 2018-03-03 19:34

Riya Chaudhary is an Independent Delhi Escort girl, who offers dazzling hot Delhi Escorts Service, incall or outcall services both to complete your dreams in real entertainment. We offers high class Independent Escorts in Delhi at Riya Chaudhary Delhi Escorts service agency for elite peoples in the city to avail ultimate satisfaction. you can spend your beautiful moments. <a href="http://www.riyachaudhary.in/">Delhi Escorts</a> <a href="http://www.riyachaudhary.in/">Independent Escorts in Delhi</a> http://www.riyachaudhary.in

376. paulg.kasper12@gmail.com -- 2018-03-09 07:44

I loved the way you discuss the topic great work thanks for the share Your informative post. https://dumbbellsaudit.com

Add New Comment