max word value SSE2
Not sure about everyone, but lots of programmers from time to time got task that on one of stage require to find maximum from unsorted array/memory/… of data. And the best
i was working on detecting microphone sound level. So i received buffer of 16 bit samples and to detect level i had to find max absolute value. The simplest way just go through buffer and found max value:
__int16 find_abs_max(__int16 * data, unsigned size)
{
__int16 max = 0;
for (unsigned i = 0; i < size; ++i)
{
if (abs(data[i]) > max)
max = abs(data[i]);
}
return max;
}
what can we do to increase performance? sure we can sort data and we will have max value, but sometimes especially when we have huge array of data fast sort algorithm wont help for example some sort algorithms require one more additional buffer with same size. if algorithm does not require additional buffer can happen that we cannot modify current data buffer as it should be passed somewhere as in my case…
So if you not satisfied performance you could try SIMD as I did.
__int16 find_abs_max_sse(void * data, unsigned size)
{
__declspec(align(16)) __int16 return_buf[8];
__int16 * dst = &return_buf[0];
__int32 _size = size >> 4;
__asm
{
mov ecx, _size
mov esi, data
mov edi, dst
pxor xmm0, xmm0
pxor xmm6, xmm6
pxor xmm7, xmm7
NEXT:
movupd xmm1, [esi]
pmaxsw xmm6, xmm1
pminsw xmm7, xmm1
add esi, 16
dec ecx
or ecx, ecx
jnz NEXT
psubw xmm0, xmm7
pmaxsw xmm6, xmm7
movdqa [edi], xmm6
};
__int16 max = 0;
for (unsigned char i = 0; i < 8; ++i)
{
if (return_buf[i] > max)
max = return_buf[i];
}
return max;
}
measurement showed that SSE implementation 20 times faster (!!!)
if we dont need max by absolute value it should twice faster!
P.S. my SSE implementation work 100% correct only if buffer size in bytes dividable by 16 if its not true then we skip couple samples at the end, but its not difficult to fix.
Filed under: Programming
October 19th, 2010 at 6:28 pm
“buffer size in bytes dividable by 16″ - usualy it’s called a “Word aligned buffer” ;))
October 19th, 2010 at 7:39 pm
Denis, i think 16 bytes aligned size sounds better
But this algorithm just skip couple last bytes, its not a big deal in most cases.
P.S. skype me