Argh code bleaugh.
Jul. 29th, 2005 09:34 pm
So it has come to my attention that graphics code has two requirements:
- Needs to be flexible.
- Needs to be fast.
Okay, fine. I have a number of primitives. AlphaBlend, HorizLine (for polygons), TransformSpan, FinishAccumRow (part of the box-interpolated affine transform integral), and all that great stuff. Currently, a lot of these do alpha blending. There is even minor structural support for fallback in case someone doesn't have MMX. (Never mind that a first-generation Pentium simply isn't fast enough to do what I want...)
But what about other blending modes? If I made a full set of primitives for every one of them, that would really add up!
Is it more reasonable to pass in MyFunkyBlenderPtr and call it on every pixel? On a K7, call/ret turnaround is at least 9 cycles. Multiply by a reasonable 1 Mpixel per frame and you get 9M cycles / 700MHz on my machine... about 13ms. That's 13 of the 33ms available per frame at 30fps! In practice it is not really this bad, since K7/P3 can predict call/ret properly. This would still place restrictions on loop unrolling and register usage, though.
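For concreteness, here is roughly what the per-pixel callback looks like in plain C, plus the back-of-envelope arithmetic from above. This is only a sketch: the `BlendFn` signature, the 0xAARRGGBB pixel layout, and the saturating-add blender are assumptions made up for illustration, not the real primitives.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical blender signature: combine one source pixel with one
   destination pixel (32 bits, four 8-bit channels) and return the result. */
typedef uint32_t (*BlendFn)(uint32_t src, uint32_t dst);

/* Example blender (an assumption, not an existing primitive):
   per-channel saturating add. */
static uint32_t blend_add(uint32_t src, uint32_t dst)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t sum = ((src >> shift) & 0xFFu) + ((dst >> shift) & 0xFFu);
        if (sum > 0xFFu)
            sum = 0xFFu;
        out |= sum << shift;
    }
    return out;
}

/* Span primitive that pays one indirect call per pixel. */
static void blend_span_per_pixel(uint32_t *dst, const uint32_t *src,
                                 size_t count, BlendFn blend)
{
    for (size_t i = 0; i < count; ++i)
        dst[i] = blend(src[i], dst[i]);
}

/* Cost of the call overhead alone, in milliseconds. */
static double call_overhead_ms(double pixels, double cycles_per_call,
                               double clock_hz)
{
    return pixels * cycles_per_call / clock_hz * 1000.0;
}
```

With the numbers above, `call_overhead_ms(1e6, 9.0, 700e6)` comes out a bit under 13, before the blender has done any actual work.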
Of course it's possible to blend after the fact: today's caches are plenty large enough to hold a scratch scanline or two, so I could render into that and then call MyFunkyBlenderPtr once on the whole span. I expect the loop and copy overhead would be equivalent to the procedure calls, or a little under. Maybe less, since the shorter loops can be unrolled. I should profile this...
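The scratch-scanline version might look something like this. Again a sketch under assumptions: the `SpanSourceFn` and `SpanBlendFn` signatures, the 2048-pixel scratch size, and the multiply blender are all hypothetical, chosen only to show one indirect call per span instead of one per pixel.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed scratch size: 2048 32-bit pixels = 8 KB, small enough for L1. */
#define SCRATCH_PIXELS ((size_t)2048)

/* Hypothetical interfaces: a source fills `count` pixels into `out`
   (`user` carries whatever state it needs); a blender combines a whole
   span of source pixels into the destination. */
typedef void (*SpanSourceFn)(uint32_t *out, size_t count, void *user);
typedef void (*SpanBlendFn)(uint32_t *dst, const uint32_t *src, size_t count);

/* Example source: solid fill with the color pointed to by `user`. */
static void source_solid(uint32_t *out, size_t count, void *user)
{
    uint32_t color = *(const uint32_t *)user;
    for (size_t i = 0; i < count; ++i)
        out[i] = color;
}

/* Example blender: per-channel multiply with rounding. */
static void scanline_multiply(uint32_t *dst, const uint32_t *src, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        uint32_t out = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            uint32_t s = (src[i] >> shift) & 0xFFu;
            uint32_t d = (dst[i] >> shift) & 0xFFu;
            out |= ((s * d + 127u) / 255u) << shift;
        }
        dst[i] = out;
    }
}

/* Render into the scratch buffer, then blend it in: two indirect calls
   per chunk rather than one per pixel. */
static void render_then_blend(uint32_t *dst, size_t count,
                              SpanSourceFn source, void *user,
                              SpanBlendFn blend)
{
    uint32_t scratch[SCRATCH_PIXELS];
    while (count > 0) {
        size_t n = count < SCRATCH_PIXELS ? count : SCRATCH_PIXELS;
        source(scratch, n, user);
        blend(dst, scratch, n);
        dst += n;
        count -= n;
    }
}
```

The price is an extra write and read of every pixel through the scratch buffer, which is exactly the overhead that wants profiling.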
The fanciest option is similar to what FFTW uses. Write a specialized compiler and feed it whatever algorithms and specifications you want. Sometimes a good system even outperforms hand-tuned code.
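A far more modest cousin of that idea, and to be clear not how FFTW itself works, is to let the C preprocessor stamp out one specialized loop per blend mode from a single template. The macro names and the three formulas below are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-channel blend formulas on 0..255 values (illustrative choices). */
#define BLEND_ADD(s, d)      ((s) + (d) > 255u ? 255u : (s) + (d))
#define BLEND_MULTIPLY(s, d) (((s) * (d) + 127u) / 255u)
#define BLEND_SCREEN(s, d) \
    (255u - ((255u - (s)) * (255u - (d)) + 127u) / 255u)

/* Template: defines a span blender called `name` whose inner loop has
   FORMULA expanded inline, so there is no call per pixel. */
#define DEFINE_SPAN_BLEND(name, FORMULA)                                \
    static void name(uint32_t *dst, const uint32_t *src, size_t count)  \
    {                                                                   \
        for (size_t i = 0; i < count; ++i) {                            \
            uint32_t out = 0;                                           \
            for (int shift = 0; shift < 32; shift += 8) {               \
                uint32_t s = (src[i] >> shift) & 0xFFu;                 \
                uint32_t d = (dst[i] >> shift) & 0xFFu;                 \
                out |= (uint32_t)(FORMULA(s, d)) << shift;              \
            }                                                           \
            dst[i] = out;                                               \
        }                                                               \
    }

/* One line per blend mode instead of one hand-written primitive each. */
DEFINE_SPAN_BLEND(gen_span_add, BLEND_ADD)
DEFINE_SPAN_BLEND(gen_span_multiply, BLEND_MULTIPLY)
DEFINE_SPAN_BLEND(gen_span_screen, BLEND_SCREEN)
```

This only covers the "lots of near-identical loops" half of the problem. A real generator could also pick unrolling and instruction scheduling per target, which a macro cannot.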
Oh, and finally there's the issue of rendering to 16- and 24-bit displays. This needs more routines as well, but since those formats aren't good for internal use, the conversion may as well be a postprocess.
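As a postprocess, the 16-bit case could be as small as the following. It assumes the internal format is 32-bit 0xAARRGGBB and that the display wants RGB565. Both are assumptions (a 555 display would need different shifts), and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack one 0xAARRGGBB pixel into RGB565 by truncating each channel
   (alpha is dropped, no dithering). */
static uint16_t pack_rgb565(uint32_t argb)
{
    uint32_t r = (argb >> 16) & 0xFFu;
    uint32_t g = (argb >> 8) & 0xFFu;
    uint32_t b = argb & 0xFFu;
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Convert a finished scanline from the internal format to the display's. */
static void convert_span_to_rgb565(uint16_t *out, const uint32_t *in,
                                   size_t count)
{
    for (size_t i = 0; i < count; ++i)
        out[i] = pack_rgb565(in[i]);
}
```

Running this once per finished scanline keeps all the blending primitives in a single internal format.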
Hm. Just ranting...