Beyond De Bruijn: fast binary logarithm of a 10-bit number
Recently I needed a method for retrieving the binary logarithm of a 10-bit number (for the curious, it was for the purpose of converting between 32-bit and 16-bit floating point numbers).
Computing the binary logarithm is equivalent to knowing the position of the highest order set bit. For instance, log2(0x1) is 0 and log2(0x100) is 8.
One well known method for fast binary logarithm is presented at Bit Twiddling Hacks. It is a two-step method where first all lower bits are set to 1 and then a De Bruijn-like sequence is used to perform a table lookup:
int fastlog2(uint32_t v) { static const int MultiplyDeBruijnBitPosition[32] = { 0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31 }; v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; v |= v >> 16; return MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27]; }
That is 12 integer operations and a table lookup.
Optimising
It should be obvious what the sequence of operations on v does: fill the integer with ones starting from the highest order bit. Here are a few examples of what happens to v at each step:
v | v |= v >> 1 | v |= v >> 2 | v |= v >> 4 | v |= v >> 8 | v |= v >> 16 |
0x0001 | 0x0001 | 0x0001 | 0x0001 | 0x0001 | 0x0001 |
0x0002 | 0x0003 | 0x0003 | 0x0003 | 0x0003 | 0x0003 |
0x0003 | 0x0003 | 0x0003 | 0x0003 | 0x0003 | 0x0003 |
0x0004 | 0x0006 | 0x0007 | 0x0007 | 0x0007 | 0x0007 |
0x0100 | 0x0180 | 0x01e0 | 0x01fe | 0x01ff | 0x01ff |
0x80000000 | 0xc0000000 | 0xf0000000 | 0xff000000 | 0xffff0000 | 0xffffffff |
There is one obvious optimisation available: since the input is only 10-bit, the last shift operation v |= v >> 16 can be omitted because the final value was already reached.
int fastlog2(uint32_t v) { static const int MultiplyDeBruijnBitPosition[32] = { 0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31 }; v |= v >> 1; v |= v >> 2; v |= v >> 4; v |= v >> 8; return MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27]; }
10 instructions instead of 12. Not really amazing, but worth mentioning.
Optimising more?
Could we do better? Now the last line is v |= v >> 8; and it is only useful to propagate the 9th and 10th bits to positions 1 and 2. What happens if we omit that line? Let’s see:
- For most values of v, the expected value is obtained.
- For values of v with a highest order bit at 9th position, 0x1fe could be obtained instead of 0x1ff.
- For values of v with a highest order bit at 10th position, one of 0x3fc, 0x3fd or 0x3fe could be obtained instead of 0x3ff.
The list of possible output values would therefore be 0x1, 0x3, 0x7, 0xf, 0x1f, 0x3f, 0x7f, 0xff, 0x1fe, 0x1ff, 0x3fc, 0x3fd, 0x3fe, 0x3ff. What happens to these values when multiplying them with the De Bruijn sequence? Let's see:
v | v * 0x07C4ACDDU) >> 27 |
0x1 | 0 |
0x3 | 2 |
0x7 | 6 |
0xf | 14 |
0x1f | 30 |
0x3f | 29 |
0x7f | 27 |
0xff | 23 |
0x1fe | 15 |
0x1ff | 16 |
0x3fc | 30 |
0x3fd | 31 |
0x3fe | 0 |
0x3ff | 1 |
Damn! Two values are colliding. It looks like we cannot omit the last line after all.
Beyond De Bruijn
Let’s give another try at the problem. Usually De Bruijn sequences are built using either nontrivial algorithms, or brute force. Maybe we could find another sequence that has no collision? Or a sequence that is not a De Bruijn sequence but that works for our problem?
Well, let’s just brute force!
(2 seconds later)
int fastlog2(uint32_t v) { static const int MagicTable[16] = { 0, 1, 2, 8, -1, 3, 5, 9, 9, 7, 4, -1, 6, -1, -1, -1 }; v |= v >> 1; v |= v >> 2; v |= v >> 4; return MagicTable[(uint32_t)(v * 0x5a1a1a2u) >> 28]; }
Down to 8 instructions instead of 12. And the lookup table is now half the size!
Conclusion
It is possible for multiply-and-shift techniques similar to the De Bruijn sequence algorithm to exist for a larger set of problems. Brute forcing the search is a totally valid method for 32-bit multiplication.
The code used for this article is included in the attached file.