Shooyoo :Mobile Games and Entertainment Information Portal

Search: PositionHOME > Operator > Mandelbrot and 16-bit fixed point multiplies ( Part I )
0 Vote

Mandelbrot and 16-bit fixed point multiplies ( Part I )

  Post at : 2008-11-07 06:18:14   View:4  Zoom:【B M S】  

I recently had the opportunity to work on and help optimize a benchmark that uses fixed-point math to carry out an iterative calculation in a loop. The function of the benchmark is to calculate a Mandelbrot fractal in memory and report the time it takes to 'draw' the fractal as a rate. There are several different codepaths inside this benchmark, each implemented to take maximum advantage of different SIMD instruction sets, such as SSE2, SSSE3 and SSE4.1. The SSSE3 and SSE4.1 versions of this routine were approximately 2.7x faster than the SSE2 version. AMD processors support SSE2, SSE3 and SSE4a, so I wanted to investigate what could be done to optimize the SSE2 version of the function.
After I had a chance to visually inspect the various codepaths, it became obvious that the reason the SSSE3 and SSE4.1 routines had such a significant performance lead was due to the PMULHRSW instruction. There is not much literature available on this SSSE3 opcode, but it is defined as 'Packed Multiply High with Round and Scale' and is an instruction designed for fixed-point math. It operates on packed integer data; multiplying two packed 16-bit source words and producing a packed 16-bit destination word. The 16 bits that this instruction chooses to place in the destination register is a little unusual, as illustrated by Figure 1 below. When two 16 bit values are multiplied together, the result is a 32 bit value. However, in order to get 32 bits to fit in a 16 bit result, some bits have got to go, and the bits that PMULHRSW chooses to keep are significantly different than PMULHW or PMULLW. The red squares in the figure below represent bits PMULHRSW truncates, and the green bits are written as the result of the multiply.
Figure 1: Bit selection of PMULHRSW
The 31st bit is a redundant sign bit, so it gets truncated; this is an effect of the two 16-bit sources being signed inputs. Bits 30-15 are the next 16 most significant consecutive bits, and the rest of the least significant bits are truncated. For good measure, the most significant 14th truncated bit is rounded by adding a one before being truncated; this is where the 'round' comes from in the definition of the instruction and makes sense only for fixed point numbers, as this increases the accuracy of fractional bits. Since the most significant sign bit is truncated, the answer written to the destination register is logically left shifted by 1 (the 30th bit is now the most significant bit of the 16 bit result), which in effect is multiplying the result by two; this is where the 'scale' comes from in the definition of the instruction.
This particular Mandelbrot benchmark was originally written to operate on data in a 4.12 fixed point format. For those who feel a little rusty, this Book of Hook page provides a simple review of fixed point math. The zoom of the Mandelbrot includes the real x-axis number range from -2.25 to +.75, and the imaginary y-axis number range from -1.25 to +1.25, which with signed 4.12 numbers leaves plenty of slack. The inner-loop of the Mandelbrot algorithm is a sequence of mul's and add's of complex numbers. A Mandelbrot white paper describing how to calculate the Mandelbrot algorithm can be found following the link. Also, Mike Wall has an article on performance optimization in windows in which he uses a Mandelbrot sample for his explanation; full source code available. For the SSSE3 and SSE4.1 implementation, PMULHRSW was used to multiply these 4.12 fixed point numbers; two 4.12 numbers multiplied together creates an 8.24 32-bit number, and using the bit selection technique of PMULHRSW as illustrated in Figure 2, a rounded 7.9 fixed point number is written out as the packed result. The least significant fractional bit represents 2-9, which provides enough precision to render a faithful representation of the Mandelbrot set. Eventually, this product has to be left shifted by 3 bits to get back to the original 4.12 to continue the iterations of packed mul's and add's.
Figure 2: 4.12 PMULHRSW multiply; W=Whole, F=Fractional bit
This post gave the background of the optimization problem and described the operation of the PMULHRSW opcode. In Part II of my discussion, I will describe the technique I used to optimize the Mandelbrot fixed-point multiply for SSE2.
Kent Knox
Member of AMD Technical Staff
Kent Knox is a Member of Technical Staff in Solutions Enablement Engineering at AMD. His postings are his own opinions and may not represent AMD's positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.


Make a comment
Name:Website:

Links

RSS - SiteMap - Top