[openal] [PATCH V2] Add some mixer SSE2/4.1 optimisations

Timothy Arceri t_arceri at yahoo.com.au
Tue Jun 3 09:28:20 EDT 2014

On Tue, 2014-06-03 at 03:09 -0700, Chris Robinson wrote:
> On 06/03/2014 12:33 AM, Timothy Arceri wrote:
> > +        pos_arr[0] = _mm_cvtsi128_si32(pos4);
> > +        pos_arr[1] = _mm_cvtsi128_si32(_mm_shuffle_epi32(pos4, _MM_SHUFFLE(1,1,1,1)));
> > +        pos_arr[2] = _mm_cvtsi128_si32(_mm_shuffle_epi32(pos4, _MM_SHUFFLE(2,2,2,2)));
> > +        pos_arr[3] = _mm_cvtsi128_si32(_mm_shuffle_epi32(pos4, _MM_SHUFFLE(3,3,3,3)));
> I'm quite surprised SSE2 doesn't seem to have an _mm_store_epi32 method. 
> Or is it somewhere I'm not seeing? I wonder if it would be worth doing
> _mm_store_ps((float*)pos_arr, _mm_castsi128_ps(pos4))
> despite the ugly aliasing (assuming it would work, of course).

I was surprised that no version of SSE seems to have an _mm_store_epi32
(well, I couldn't find one anyway).

Yes, that does seem to work (at least in my test), and it also seems to
perform much better. My SSE2 resample code was taking around 4.45% of
CPU; with this change it's down to 2.22%. For reference, the C code is at
6.23% and SSE4.1 is at 1.5%.
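
For anyone following along, here is a minimal sketch of the store variants being
discussed. It is not the code from the patch: the function names are made up for
illustration. It also assumes the SSE2 integer stores _mm_store_si128 and
_mm_storeu_si128 would be acceptable here, since they write all four 32-bit lanes
in one instruction without the cast to float. Whether they perform the same as the
_mm_store_ps version in the mixer has not been measured.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Lane-by-lane extraction, as in the quoted patch: one shuffle plus one
 * scalar move per element. (Hypothetical helper name.) */
static inline void store_lanes_shuffle(int32_t dst[4], __m128i v)
{
    dst[0] = _mm_cvtsi128_si32(v);
    dst[1] = _mm_cvtsi128_si32(_mm_shuffle_epi32(v, _MM_SHUFFLE(1,1,1,1)));
    dst[2] = _mm_cvtsi128_si32(_mm_shuffle_epi32(v, _MM_SHUFFLE(2,2,2,2)));
    dst[3] = _mm_cvtsi128_si32(_mm_shuffle_epi32(v, _MM_SHUFFLE(3,3,3,3)));
}

/* Single 128-bit integer store. dst must be 16-byte aligned, which the
 * assert checks in debug builds. (Hypothetical helper name.) */
static inline void store_lanes_aligned(int32_t dst[4], __m128i v)
{
    assert(((uintptr_t)dst & 15) == 0);
    _mm_store_si128((__m128i *)dst, v);
}

/* Same store without the alignment requirement. (Hypothetical helper name.) */
static inline void store_lanes_unaligned(int32_t dst[4], __m128i v)
{
    _mm_storeu_si128((__m128i *)dst, v);
}
```

The aligned form needs pos_arr to be declared with 16-byte alignment; the
unaligned form works on any int array.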

For the record, before your patch to avoid the loop in MixSource and my
patch, OpenAL was reported by callgrind to be using 14.20% of CPU. With
my patch running SSE4.1 it's down to 6.66%, and with your update to the
SSE2 code it's down to 7.32% (both results include the loop avoidance).

So a 48%-53% improvement, which is pretty impressive. I was hoping this
drop in CPU usage might help increase frame rates for the Linux OpenGL
drivers, as some of the open drivers are limited by CPU, but this didn't
turn out to be the case (at least with my test GPU).
