Cute bitwise math tricks
Jul. 13th, 2006 08:18 pmSome months ago I found a couple of brilliant pages (1, 2) on dealing with graphics data. By which I mean, four-component vectors packed into machine words. E.g. 0xAARRGGBB or struct { uint8_t r, g, b, a; }.
Applications like Photoshop or OC, are really just a slick interface over all these vectors. Since there are naturally quite a lot (e.g. tens of megabytes) of them, the authors have invested quite a lot of effort to speed up the math.
This is the lovely part. You don't have to know calculus, but you do need a twisted mind. For example, these two snippets add and subtract vectors, limiting the results to byte range. So annoying to do componentwise, but look!
They compile to 12..14 intel instructions, adding in parallel and saving up to three expensive branches. Some manual assembler can even fix the carry problem (on the add) and do the full vector.
Here's a premultiplied alpha blend, resembling code from the linked page. They are computing destC = destC*(255-srcA)/255 + srcC. The first is approximate, and the second (in a not so obvious way) exact:
This is based on the principle that 4 * 1020304 = 4 * (1000000 + 20000 + 300 + 4) = 4081216. The coefficient multiplies digits independently, preserving their place values. So while it looks complicated, it saves two multiplies and (in the second case) four (!) divides. The remaining logic is easily parallelized on its own, a cinch for modern compilers and CPUs.
Finally, here's one I noticed! It's just a componentwise product, treating the bytes as unit fractions:
This one's approximate, and needs a 32x32->64bit multiply (usually an inline library function or some assembler). Though 0*0 = 0, and 255*255 = 255, as intended. It's actually not so obvious why this works. The short story is that it's based on the expansion of (a + b)(x + y)...
(216*b + a) * (216*y + x) = 232*b*y + 216*(b*x + a*y) + a*x
( explanation )
Applications like Photoshop or OC, are really just a slick interface over all these vectors. Since there are naturally quite a lot (e.g. tens of megabytes) of them, the authors have invested quite a lot of effort to speed up the math.
This is the lovely part. You don't have to know calculus, but you do need a twisted mind. For example, these two snippets add and subtract vectors, limiting the results to byte range. So annoying to do componentwise, but look!
sum = x + y; low_bits = (x ^ y) & 0x010101; carries = (sum - low_bits) & 0x01010100; modulo = sum - carries; clamp = carries - (carries >> 8); result = modulo | clamp; (modified, the original page did 16 bits) | diff = (x - y) + 0x01010100; low_bits = (x ^ y) & 0x01010100; borrows = (diff - low_bits) & 0x01010100; modulo = diff - borrows; clamp = borrows - (borrows >> 8); result = modulo & clamp; |
Here's a premultiplied alpha blend, resembling code from the linked page. They are computing destC = destC*(255-srcA)/255 + srcC. The first is approximate, and the second (in a not so obvious way) exact:
dest_ag = (dest & 0xff00ff00) >> 8; dest_rb = dest & 0x00ff00ff; alpha = 0x100 - (src >> 24); dest_ag *= alpha; dest_rb *= alpha; dest_ag = dest_ag & 0xff00ff00; dest_rb = (dest_rb >> 8) & 0x00ff00ff; dest = dest_ag + dest_rb + src; | dest_ag = (dest & 0xff00ff00) >> 8; dest_rb = dest & 0x00ff00ff; alpha = 0xff - (src >> 24); dest_ag = alpha * dest_ag + 0x00800080; dest_rb = alpha * dest_rb + 0x00800080; dest_ag = ((dest_ag >> 8) & 0x00ff00ff) + dest_ag; dest_rb = ((dest_rb >> 8) & 0x00ff00ff) + dest_rb; dest_ag = dest_ag & 0xff00ff00; dest_rb = (dest_rb >> 8) & 0x00ff00ff; dest = dest_ag + dest_rb + src; |
Finally, here's one I noticed! It's just a componentwise product, treating the bytes as unit fractions:
dest_ag = (dest & 0xff00ff00) >> 8; dest_rb = dest & 0x00ff00ff; src_ag = (src & 0xff00ff00) >> 8; src_rb = src & 0x00ff00ff; dest_ag *= src_ag + 0x00010001; dest_rb *= src_rb + 0x00010001; dest_ag = (dest_ag >> 16 & 0xff000000) + (dest_ag & 0x0000ff00); dest_rb = (dest_rb >> 24 & 0x00ff0000) + (dest_rb >> 8 & 0x000000ff); dest = dest_ag + dest_rb; |
(216*b + a) * (216*y + x) = 232*b*y + 216*(b*x + a*y) + a*x
( explanation )