Beyond De Bruijn: fast binary logarithm of a 10bit number
Recently I needed a method for retrieving the binary logarithm of a 10bit number (for the curious, it was for the purpose of converting between 32bit and 16bit floating point numbers).
Computing the binary logarithm is equivalent to knowing the position of the highest order set bit. For instance, log2(0x1)
is 0 and log2(0x100)
is 8.
One well known method for fast binary logarithm is presented at Bit Twiddling Hacks. It is a twostep method where first all lower bits are set to 1 and then a De Bruijnlike sequence is used to perform a table lookup:
int fastlog2(uint32_t v) { static const int MultiplyDeBruijnBitPosition[32] = { 0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31 }; v = v >> 1; v = v >> 2; v = v >> 4; v = v >> 8; v = v >> 16; return MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27]; }
That is 12 integer operations and a table lookup.
Optimising
It should be obvious what the sequence of operations on v
does: fill the integer with ones starting from the highest order bit. Here are a few examples of what happens to v
at each step:
v  v = v >> 1  v = v >> 2  v = v >> 4  v = v >> 8  v = v >> 16

0x0001  0x0001  0x0001  0x0001  0x0001  0x0001 
0x0002  0x0003  0x0003  0x0003  0x0003  0x0003 
0x0003  0x0003  0x0003  0x0003  0x0003  0x0003 
0x0004  0x0006  0x0007  0x0007  0x0007  0x0007 
0x0100  0x0180  0x01e0  0x01fe  0x01ff  0x01ff 
0x80000000  0xc0000000  0xf0000000  0xff000000  0xffff0000  0xffffffff 
There is one obvious optimisation available: since the input is only 10bit, the last shift operation v = v >> 16
can be omitted because the final value was already reached.
int fastlog2(uint32_t v) { static const int MultiplyDeBruijnBitPosition[32] = { 0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30, 8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31 }; v = v >> 1; v = v >> 2; v = v >> 4; v = v >> 8; return MultiplyDeBruijnBitPosition[(uint32_t)(v * 0x07C4ACDDU) >> 27]; }
10 instructions instead of 12. Not really amazing, but worth mentioning.
Optimising more?
Could we do better? Now the last line is v = v >> 8;
and it is only useful to propagate the 9th and 10th bits to positions 1 and 2. What happens if we omit that line? Let’s see:
 For most values of
v
, the expected value is obtained.  For values of
v
with a highest order bit at 9th position,0x1fe
could be obtained instead of0x1ff
.  For values of
v
with a highest order bit at 10th position, one of0x3fc
,0x3fd
or0x3fe
could be obtained instead of0x3ff
.
The list of possible output values would therefore be 0x1
, 0x3
, 0x7
, 0xf
, 0x1f
, 0x3f
, 0x7f
, 0xff
, 0x1fe
, 0x1ff
, 0x3fc
, 0x3fd
, 0x3fe
, 0x3ff
. What happens to these values when multiplying them with the De Bruijn sequence? Let's see:
v  v * 0x07C4ACDDU) >> 27

0x1  0 
0x3  2 
0x7  6 
0xf  14 
0x1f  30 
0x3f  29 
0x7f  27 
0xff  23 
0x1fe  15 
0x1ff  16 
0x3fc  30 
0x3fd  31 
0x3fe  0 
0x3ff  1 
Damn! Two values are colliding. It looks like we cannot omit the last line after all.
Beyond De Bruijn
Let’s give another try at the problem. Usually De Bruijn sequences are built using either nontrivial algorithms, or brute force. Maybe we could find another sequence that has no collision? Or a sequence that is not a De Bruijn sequence but that works for our problem?
Well, let’s just brute force!
(2 seconds later)
int fastlog2(uint32_t v) { static const int MagicTable[16] = { 0, 1, 2, 8, 1, 3, 5, 9, 9, 7, 4, 1, 6, 1, 1, 1 }; v = v >> 1; v = v >> 2; v = v >> 4; return MagicTable[(uint32_t)(v * 0x5a1a1a2u) >> 28]; }
Down to 8 instructions instead of 12. And the lookup table is now half the size!
Conclusion
It is possible for multiplyandshift techniques similar to the De Bruijn sequence algorithm to exist for a larger set of problems. Brute forcing the search is a totally valid method for 32bit multiplication.
The code used for this article is included in the attached file.
Attachments (1)
 debruijn.cpp (2.1 KB)  added by 5 years ago.
Download all attachments as: .zip
Comments
Not to spoil the fun, but bsr can do what you are doing much faster. GCC even has a builtin for it namely, builtin_clz.
@anonymous: Yes, you are right. I hope you will agree to the pedagogical value of the example nonetheless.
Actually, on Intel Atom and AMD K8, BSR is very slow and the example may be faster in inner loops. However, even on such processors, BSR may be better because less cache space (instruction as well as data) is used.
iUQr2d Thank you, I ave been seeking for details about this subject for ages and yours is the best I ave found so far.
Hello. And Bye.
Hello. And Bye.
NYZj6K Really informative blog article. Really Cool.