Cute bitwise math tricks
Jul. 13th, 2006 08:18 pm
Some months ago I found a couple of brilliant pages (1, 2) on dealing with graphics data. By which I mean, four-component vectors packed into machine words. E.g. 0xAARRGGBB or struct { uint8_t r, g, b, a; }.
Applications like Photoshop or OC are really just a slick interface over all these vectors. Since there are naturally quite a lot of them (tens of megabytes' worth), the authors have invested a lot of effort in speeding up the math.
This is the lovely part. You don't have to know calculus, but you do need a twisted mind. For example, these two snippets add and subtract vectors, limiting the results to byte range. So annoying to do componentwise, but look!
Saturating add (modified; the original page did 16 bits):

```c
sum      = x + y;
low_bits = (x ^ y) & 0x010101;
carries  = (sum - low_bits) & 0x01010100;
modulo   = sum - carries;
clamp    = carries - (carries >> 8);
result   = modulo | clamp;
```

Saturating subtract:

```c
diff     = (x - y) + 0x01010100;
low_bits = (x ^ y) & 0x01010100;
borrows  = (diff - low_bits) & 0x01010100;
modulo   = diff - borrows;
clamp    = borrows - (borrows >> 8);
result   = modulo & clamp;
```

They compile to 12..14 Intel instructions, adding in parallel and saving up to three expensive branches. Some manual assembler can even fix the carry problem (on the add) and do the full vector.
Here's a premultiplied alpha blend, resembling code from the linked page. They are computing destC = destC*(255 - srcA)/255 + srcC. The first is approximate, and the second (in a not-so-obvious way) exact:
Approximate:

```c
dest_ag = (dest & 0xff00ff00) >> 8;
dest_rb = dest & 0x00ff00ff;
alpha   = 0x100 - (src >> 24);
dest_ag *= alpha;
dest_rb *= alpha;
dest_ag = dest_ag & 0xff00ff00;
dest_rb = (dest_rb >> 8) & 0x00ff00ff;
dest    = dest_ag + dest_rb + src;
```

Exact:

```c
dest_ag = (dest & 0xff00ff00) >> 8;
dest_rb = dest & 0x00ff00ff;
alpha   = 0xff - (src >> 24);
dest_ag = alpha * dest_ag + 0x00800080;
dest_rb = alpha * dest_rb + 0x00800080;
dest_ag = ((dest_ag >> 8) & 0x00ff00ff) + dest_ag;
dest_rb = ((dest_rb >> 8) & 0x00ff00ff) + dest_rb;
dest_ag = dest_ag & 0xff00ff00;
dest_rb = (dest_rb >> 8) & 0x00ff00ff;
dest    = dest_ag + dest_rb + src;
```

This is based on the principle that 4 * 1020304 = 4 * (1000000 + 20000 + 300 + 4) = 4081216. The coefficient multiplies the digits independently, preserving their place values. So while it looks complicated, it saves two multiplies and (in the second case) four (!) divides. The remaining logic is easily parallelized on its own, a cinch for modern compilers and CPUs.
Finally, here's one I noticed! It's just a componentwise product, treating the bytes as unit fractions:
```c
dest_ag = (dest & 0xff00ff00) >> 8;
dest_rb = dest & 0x00ff00ff;
src_ag  = (src & 0xff00ff00) >> 8;
src_rb  = src & 0x00ff00ff;
dest_ag *= src_ag + 0x00010001;
dest_rb *= src_rb + 0x00010001;
dest_ag = (dest_ag >> 16 & 0xff000000) + (dest_ag & 0x0000ff00);
dest_rb = (dest_rb >> 24 & 0x00ff0000) + (dest_rb >> 8 & 0x000000ff);
dest = dest_ag + dest_rb;
```

This one's approximate, and needs a 32x32->64-bit multiply (usually an inline library function or some assembler). Though 0*0 = 0, and 255*255 = 255, as intended. It's actually not so obvious why this works. The short story is that it's based on the expansion of (a + b)(x + y)...
(2^16*b + a) * (2^16*y + x) = 2^32*b*y + 2^16*(b*x + a*y) + a*x
The cool part about this is if you're using Intel asm, b*y and a*x wind up in separate registers! You just have to mask off the products. There's only one problem, and that is the middle 2^16 "garbage" term above. It can be as large as 17 bits, and it shows up in the middle of the result. The low 16 bits are no problem, since they can be masked. But the 17th overlaps the b*y product...
Well, the factors here are 8 bits, so the products fit in 16 and you don't have to worry about the low halves. The only bad thing that might happen is the garbage bits carrying over and incrementing the high half of b*y, if b*y has no zero bits to absorb the carry. But b*y is at most 0xff * 0xff = 0xfe01, which has the zero bits, so it can't overflow.
Still, if b*y = 0x100*N + 0xff, the product will ripple over and increment the N portion we cared about. When does that happen? Kind of complicated:
- Think of it this way. b and y must obviously be factors of 0x100*N + 0xff (there are quite a number of these);
- Then the garbage, 2^16*(b*x + a*y), has to be large enough (b*x + a*y >= 0x10000) to spill (when b*y < 0x4000, this won't happen);
- None of a/b/x/y ever go over 0x100, so this also weeds out a ton of factor pairs.
So I logged onto Chaotic, wrote some MUSHcode, and worked it out. Turns out, there are 63 byte-sized pairs giving b*y = 0x100*N + 0xff, but most of them fail the second point. Only about 26 combinations (0.00000060536% of 2^32!) should even touch the bits I want.
Pretty nice. I tested it (18 instructions) displaying OC drawings; it's fast and looks great. The only thing I still need is a real benchmark; I've never really seen how the speed compares between all the different algorithms. It seems like it'd be most useful on the old Pentiums, and on RISC platforms that prefer word-sized loads.
(I realise this is totally irrelevant since PCs nowadays have fancy processors with the cheatish SIMD instructions that do it all for you. Or, better yet, you can run this sort of integer code in parallel with the SIMD stuff.)