**Update Oct 27, 2010:** Thanks to Shai Rubinshtein (see comments) for pointing out an issue with AALine(), source code updated below and in link.

Some (long) time ago I decided to port Fantasia Painter to Windows Phone 7. When I ran it on an actual device (my friend’s, no I don’t have it yet unfortunately) last week, updating the screen while painting had about 0.5 second lag. This lag is quite unacceptable for a painting app. Here’s how I fixed it.

Turns out, WriteableBitmap.Render() and Invalidate() incur some significant performance cost. I decided to replace the .Render(new Line()) with my own line-drawing function.

I ended up creating my own anti-aliased and alpha blending function, that has similar quality to Render(new Line()) and runs 4.8 times faster. Here’s a proof that it works :

## Contents of This Post

- Line Drawing Algorithms
- Optimizing for Windows Phone 7
- The Actual Optimizations and Source Code

### Line Drawing Algorithms

At first, I decided to use the basic aliased line that I had (now part of http://writeablebitmapex.codeplex.com/.) It turned out pretty ugly – the small spikes on the Furball brush became fat and aliased lines. At least it was running amazingly fast

After a bit of research, I found out Wu’s and Gupta-Sproull’s anti-aliasing algorithms. Wu’s algorithm is fast, but the lines were too thin for my liking, the drawings didn’t look as “natural”.

You can look at a comparison pictures between the various algorithms here: http://igraphicsgroup.com/blog/2009/04/looking_at_different_types_of.html . On the comparison site: AGG (http://www.antigrain.com/demo/index.html) is an open source library that yields the best results in my opinion, but is slower and unfortunately licensed under GPL, which makes it unusable for me since I prefer to license my code under MS-PL which is OK for commercial use.

Here’s the speed comparison of the various algorithms, as tested on desktop PC. I have run only some of them on the device as well…will publish more results when I run all of them.

Note that I have not finished optimizing Wu’s line algorithm. Also, DrawLineAlphaBlendOnOpaque is just an aliased line with alpha blending.

I decided to optimize the Gupta-Sproull algorithm and see how that works. Here are some sites I looked at:

- http://courses.engr.illinois.edu/ece390/archive/archive-f2000/mp/mp4/anti.html. Note the source code has a bug – it’s initializing uend to u+2*du instead of u+du, which makes some lines twice as long and would probably throw out of bounds exception in case you’re drawing lines close to the end of the pixels[] array.
- http://www.mat.univie.ac.at/~kriegl/Skripten/CG/node35.html
- http://gurge.com/blog/2006/11/29/gupta-sproull-antialiased-lines/
- Very good explanation of Gupta-Sproull http://www.cs.unc.edu/~pmerrell/comp575/Anti-Aliasing.ppt

The algorithm I used is based on http://courses.engr.illinois.edu/ece390/archive/archive-f2000/mp/mp4/anti.html#algo, with some tweaks and fixes.

### Optimizing for Windows Phone 7

If you go and look at the first wave of Windows 7 Phone phones here http://windowsphone7.com/ you’ll notice most all of them (at the time of this writing) feature Qualcomm Snapdragon chipset QSD8250 series. From this overview: http://www.pdadb.net/index.php?m=cpu&id=a8250&c=qualcomm_snapdragon_qsd8250, we can see that the processor is 32-bit with ARMv7 instruction set.

It’s a RISC processor with lots of registers. This tells us that we should try putting functions inline vs calling them – e.g. the AlphaBlendNormalOnPremultiplied() function below is a great candidate. We should also try to use int32’s mostly.

Here’s another overview of the Snapdragon chipset: http://www.arm.com/markets/mobile/qualcomm-snapdragon-chipset.php.

### The Actual Optimizations and Source Code

You can pick up the current source containing all functions (not well documented since work in progress, except the header of AALine()) mentioned above here: http://nokola.com/sources/DrawingHelper.zip

**Note on premultiplied values: **I had to take special care when alpha blending with WriteableBitmap pixels, since those are alpha-premultiplied. When using the source, make sure you pass the correct values (premultiplied or not depending on function.).

List of optimizations to the Gupta-Sproull algorithm – code below:

- Making sure that the distance is within the range 0..1, saving significant amount of multiplies in the alpha blending function AlphaBlendNormalOnPremultiplied()
- Since the alpha changes often, I changed the alpha blend function to blend “normal” (non-premultiplied) values, instead of converting the values to pre-multiplied format before calling the function (this saves ~6 or so multiplications per pixel)
- Changed distance to range 0..1023 and moved all floating point calculations outside the drawing loop. The distance computations in the loop are all fixed-point int arithmetic.
- Multiplied alpha to the “inc” variables before the loop, so we don’t have to multiply it by the distance in the AlphaBlendNormalOnPremultiplied(). Effectively moving the alpha*distance before the loop and replacing the multiplication with additions in the loop.
- Replacing all multiplications in the loop with additions, by having all distance increments be multiplied just before the loop
- In the blend function: replacing the high-quality RGB blending with a little bit lower quality (no visible difference whatsoever – maybe pixels are off by one). Note that alpha blending is still using the high-quality computation (value * 0x8081) >> 23 is the same as value / 255. This is optimized to value >> 8 which is the same as value / 256 for RGB.
- Updated blend function to use uint instead of int (tests on desktop showed 5% faster)
- Replaced the (g >> 8) << 8 with g & 0xFFFFFF00 in the blending function.
- Combined the computation of R and B values into one when blending. This saves a lot of computations per pixel!:

Disclaimer: I still have to test this on the phone to ensure I didn’t bug with the compiler optimizations, even though it seems it will yield much better results

Update Oct 18, 2010:Yeah it works great on actual device!

- Last but not least: unrolling (copy-pasting 3 times) the AlphaBlendNormalOnPremultiplied function in the loop. This is not shown below for simplicity (it is implemented here: http://nokola.com/sources/DrawingHelper.zip). It improves speed by 20%

Potential future optimizations: getting rid of the floating point arithmetic completely (takes 0.2% CPU time now so probably not a big deal); improving speed of alpha blending further (lookup tables?)

/// <summary>

/// Draws an antialiased line, using an optimized version of Gupta-Sproull algorithm

/// </summary>

/// <param name="pixels">Pixels from a WriteableBitmap to draw to (premultiplied alpha format)</param>

/// <param name="pixelWidth">Width of the bitmap</param>

/// <param name="pixelHeight">Height of the bitmap</param>

/// <param name="x0">Start X</param>

/// <param name="y0">Start Y</param>

/// <param name="x1">End X</param>

/// <param name="y1">End Y</param>

/// <param name="sa">Opacity of the line (0..255)</param>

/// <param name="srb">Non-premultiplied red and blue component in the format 0x00rr00bb</param>

/// <param name="sg">Green component (0..255)</param>

public static void AALine(int[] pixels, int pixelWidth, int pixelHeight, int x0, int y0, int x1, int y1, int sa, uint srb, uint sg)

{

if ((x0 == x1) && (y0 == y1)) return; // edge case causing invDFloat to overflow, found by Shai Rubinshtein

if (x0 < 1) x0 = 1;

if (x0 > pixelWidth - 2) x0 = pixelWidth - 2;

if (y0 < 1) y0 = 1;

if (y0 > pixelHeight - 2) y0 = pixelHeight - 2;

if (x1 < 1) x1 = 1;

if (x1 > pixelWidth - 2) x1 = pixelWidth - 2;

if (y1 < 1) y1 = 1;

if (y1 > pixelHeight - 2) y1 = pixelHeight - 2;

int addr = y0 * pixelWidth + x0;

int dx = x1 - x0;

int dy = y1 - y0;

int du;

int dv;

int u;

int v;

int uincr;

int vincr;

// By switching to (u,v), we combine all eight octants

int adx = dx, ady = dy;

if (dx < 0) adx = -dx;

if (dy < 0) ady = -dy;

if (adx > ady)

{

du = adx;

dv = ady;

u = x1;

v = y1;

uincr = 1;

vincr = pixelWidth;

if (dx < 0) uincr = -uincr;

if (dy < 0) vincr = -vincr;

}

else

{

du = ady;

dv = adx;

u = y1;

v = x1;

uincr = pixelWidth;

vincr = 1;

if (dy < 0) uincr = -uincr;

if (dx < 0) vincr = -vincr;

}

int uend = u + du;

int d = (dv << 1) - du; // Initial value as in Bresenham's

int incrS = dv << 1; // Δd for straight increments

int incrD = (dv - du) << 1; // Δd for diagonal increments

double invDFloat = 1.0 / (4.0 * Math.Sqrt(du * du + dv * dv)); // Precomputed inverse denominator

double invD2duFloat = 0.75 - 2.0 * (du * invDFloat); // Precomputed constant

const int PRECISION_SHIFT = 10; // result distance should be from 0 to 1 << PRECISION_SHIFT, mapping to a range of 0..1

const int PRECISION_MULTIPLIER = 1 << PRECISION_SHIFT;

int invD = (int)(invDFloat * PRECISION_MULTIPLIER);

int invD2du = (int)(invD2duFloat * PRECISION_MULTIPLIER * sa);

int ZeroDot75 = (int)(0.75 * PRECISION_MULTIPLIER * sa);

int invDMulAlpha = invD * sa;

int duMulInvD = du * invDMulAlpha; // used to help optimize twovdu * invD

int dMulInvD = d * invDMulAlpha; // used to help optimize twovdu * invD

//int twovdu = 0; // Numerator of distance; starts at 0

int twovduMulInvD = 0; // since twovdu == 0

int incrSMulInvD = incrS * invDMulAlpha;

int incrDMulInvD = incrD * invDMulAlpha;

do

{

AlphaBlendNormalOnPremultiplied(pixels, addr, (ZeroDot75 - twovduMulInvD) >> PRECISION_SHIFT, srb, sg);

AlphaBlendNormalOnPremultiplied(pixels, addr + vincr, (invD2du + twovduMulInvD) >> PRECISION_SHIFT, srb, sg);

AlphaBlendNormalOnPremultiplied(pixels, addr - vincr, (invD2du - twovduMulInvD) >> PRECISION_SHIFT, srb, sg);

if (d < 0)

{

// choose straight (u direction)

twovduMulInvD = dMulInvD + duMulInvD;

d += incrS;

dMulInvD += incrSMulInvD;

}

else

{

// choose diagonal (u+v direction)

twovduMulInvD = dMulInvD - duMulInvD;

d += incrD;

dMulInvD += incrDMulInvD;

v++;

addr += vincr;

}

u++;

addr += uincr;

} while (u < uend);

}

/// <summary>

/// Blends a specific source color on top of a destination premultiplied color

/// </summary>

/// <param name="pixels">Array containing destination color</param>

/// <param name="index">Index of destination pixel</param>

/// <param name="sa">Source alpha (0..255)</param>

/// <param name="srb">Source non-premultiplied red and blue component in the format 0x00rr00bb</param>

/// <param name="sg">Source green component (0..255)</param>

private static void AlphaBlendNormalOnPremultiplied(int[] pixels, int index, int sa, uint srb, uint sg)

{

uint destPixel = (uint)pixels[index];

uint da, dg, drb;

da = (destPixel >> 24);

dg = ((destPixel >> 8) & 0xff);

drb = destPixel & 0x00FF00FF;

// blend with high-quality alpha and lower quality but faster 1-off RGBs

pixels[index] = (int)(

((sa + ((da * (255 - sa) * 0x8081) >> 23)) << 24) | // aplha

(((sg - dg) * sa + (dg << 8)) & 0xFFFFFF00) | // green

(((((srb - drb) * sa) >> 8) + drb) & 0x00FF00FF) // red and blue

);

//dr = ((destPixel >> 16) & 0xff);

//db = ((destPixel) & 0xff);

//uint srb = (uint)((sr << 16) | sb);

//pixels[index] = (int)(

// ((sa + ((da * (255 - sa) * 0x8081) >> 23)) << 24) | // alpha

// (((((sr - dr) * sa) >> 8) + dr) << 16) | // red

// ( ((sg - dg) * sa + (dg << 8)) & 0xFFFFFF00 ) | // green

// ( (((sb - db) * sa) >> 8) + db ) ); // blue

}

Hope you like it! Please comment!