Jul. 13th, 2006

talarubi: (Default)
Some months ago I found a couple of brilliant pages (1, 2) on dealing with graphics data. By which I mean, four-component vectors packed into machine words. E.g. 0xAARRGGBB or struct { uint8_t r, g, b, a; }.

Applications like Photoshop or OC, are really just a slick interface over all these vectors. Since there are naturally quite a lot (e.g. tens of megabytes) of them, the authors have invested quite a lot of effort to speed up the math.

This is the lovely part. You don't have to know calculus, but you do need a twisted mind. For example, these two snippets add and subtract vectors, limiting the results to byte range. So annoying to do componentwise, but look!
sum = x + y;
low_bits = (x ^ y) & 0x010101;
carries = (sum - low_bits) & 0x01010100;
modulo = sum - carries;
clamp = carries - (carries >> 8);
result = modulo | clamp;


(modified, the original page did 16 bits)
diff = (x - y) + 0x01010100;
low_bits = (x ^ y) & 0x01010100;
borrows = (diff - low_bits) & 0x01010100;
modulo = diff - borrows;
clamp = borrows - (borrows >> 8);
result = modulo & clamp;
They compile to 12..14 intel instructions, adding in parallel and saving up to three expensive branches. Some manual assembler can even fix the carry problem (on the add) and do the full vector.

Here's a premultiplied alpha blend, resembling code from the linked page. They are computing destC = destC*(255-srcA)/255 + srcC. The first is approximate, and the second (in a not so obvious way) exact:
dest_ag = (dest & 0xff00ff00) >> 8;
dest_rb = dest & 0x00ff00ff;
alpha = 0x100 - (src >> 24);
dest_ag *= alpha;
dest_rb *= alpha;
dest_ag = dest_ag & 0xff00ff00;
dest_rb = (dest_rb >> 8) & 0x00ff00ff;
dest = dest_ag + dest_rb + src;
dest_ag = (dest & 0xff00ff00) >> 8;
dest_rb = dest & 0x00ff00ff;
alpha = 0xff - (src >> 24);
dest_ag = alpha * dest_ag + 0x00800080;
dest_rb = alpha * dest_rb + 0x00800080;
dest_ag = ((dest_ag >> 8) & 0x00ff00ff) + dest_ag;
dest_rb = ((dest_rb >> 8) & 0x00ff00ff) + dest_rb;
dest_ag = dest_ag & 0xff00ff00;
dest_rb = (dest_rb >> 8) & 0x00ff00ff;
dest = dest_ag + dest_rb + src;
This is based on the principle that 4 * 1020304 = 4 * (1000000 + 20000 + 300 + 4) = 4081216. The coefficient multiplies digits independently, preserving their place values. So while it looks complicated, it saves two multiplies and (in the second case) four (!) divides. The remaining logic is easily parallelized on its own, a cinch for modern compilers and CPUs.

Finally, here's one I noticed! It's just a componentwise product, treating the bytes as unit fractions:
dest_ag = (dest & 0xff00ff00) >> 8;
dest_rb = dest & 0x00ff00ff;
src_ag = (src & 0xff00ff00) >> 8;
src_rb = src & 0x00ff00ff;
dest_ag *= src_ag + 0x00010001;
dest_rb *= src_rb + 0x00010001;
dest_ag = (dest_ag >> 16 & 0xff000000) + (dest_ag & 0x0000ff00);
dest_rb = (dest_rb >> 24 & 0x00ff0000) + (dest_rb >> 8 & 0x000000ff);
dest = dest_ag + dest_rb;
This one's approximate, and needs a 32x32->64bit multiply (usually an inline library function or some assembler). Though 0*0 = 0, and 255*255 = 255, as intended. It's actually not so obvious why this works. The short story is that it's based on the expansion of (a + b)(x + y)...

(216*b + a) * (216*y + x) = 232*b*y + 216*(b*x + a*y) + a*x

explanation )

Profile

talarubi: (Default)
Talarubi

January 2007

S M T W T F S
 123456
78910111213
14151617181920
21222324252627
28293031   

Style Credit

Expand Cut Tags

No cut tags
Page generated Jun. 11th, 2025 08:37 am
Powered by Dreamwidth Studios