[openal] Panning, Ambisonics, and HRTF

Sat Sep 13 21:51:38 EDT 2014

Hey guys.

I'm not really sure who's here that could help with this (CC'ing Richard 
just in case he doesn't see it on the ML), but here's something I've 
been up to lately.

Over the past few days, I've been digging into ambisonics b-format 
audio. For the longest time I couldn't figure out how it worked, but I 
think I've finally got a handle on its structure and how to play it, 
thanks in part to

http://www.blueripplesound.com/b-format
and
http://www.blueripplesound.com/decoding

Granted, that's a simple decoding method and there are probably better 
ones available, but it works. I've made a simple program that can 
decode first- and second-order .amb files to a 5.1 stream. It would be 
nice to support them directly with the AL_EXT_BFORMAT extension, and 
have the decoder work with the actual speaker configuration.

However, I'm a bit confused over how the coefficients were calculated. I 
think I can see the basic methodology:

channel[c].w_coeff = 0.7071
channel[c].x_coeff = channel[c].dir.x
channel[c].y_coeff = channel[c].dir.y
channel[c].z_coeff = channel[c].dir.z
channel[c].r_coeff = 0.7071 * (3 * channel[c].dir.z*channel[c].dir.z - 1)
channel[c].s_coeff = 2 * channel[c].dir.z * channel[c].dir.x
...etc...

But there's extra attenuation applied at some point in the calculations 
based on the number of output speakers. I can't figure out where the 
attenuation values actually come from or how they get applied.

Also for 5.0, I can't figure out how the front-center speaker is 
factored into it. It seems to "steal" a little power from the other 
front speakers (I imagine it does this because the front-center speaker 
is not treated as a normal speaker), but again the scale is a mystery.

Figuring this out is important if I'm going to get OpenAL Soft's 6.1 and 
7.1 output to work with it, and/or allow users to manually tweak speaker 
positions (as they can currently). Plus, I try to avoid using "magic" 
values whose origin I don't know.

Relatedly, this gave me ideas on a different way to pan sounds within 
OpenAL Soft's current pipeline. Taking the idea that a panned mono sound 
could basically be encoded into b-format like:

w[s] = sample[s] * 0.7071
x[s] = sample[s] * pos.x
y[s] = sample[s] * pos.y
...etc...

and get rendered like:

out[s*num_channels + c] = channel[c].w_coeff*w[s] +
                           channel[c].x_coeff*x[s] +
                           channel[c].y_coeff*y[s] + ...

That can be altered to produce per-channel gains:

Gain[c] = channel[c].w_coeff*0.7071 + channel[c].x_coeff*pos.x +
           channel[c].y_coeff*pos.y + ...

and get mixed normally:

out[s*num_channels + c] += sample[s] * Gain[c];

This works. I have some uncommitted code locally that mixes this way to 
5.1 using complete second-order coefficients (taking care of the 
coordinate differences, obviously).
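
To make the idea concrete, here is a minimal first-order sketch of the 
gain-based mixing described above. This is not the uncommitted code I 
mentioned, just an illustration: the names (Vec3, ChannelCoeffs, 
calc_gains, mix_mono) are hypothetical, it stops at W/X/Y/Z, and it 
assumes 'pos' is a unit direction already converted to the same 
coordinate system as the coefficients:

```c
#include <stddef.h>

typedef struct { float x, y, z; } Vec3;

/* First-order decoder coefficients for one output channel. */
typedef struct { float w, x, y, z; } ChannelCoeffs;

/* Collapses "encode to b-format, then decode" into one gain per
 * output channel for a mono source at unit direction 'pos'. */
static void calc_gains(float *gains, const ChannelCoeffs *channel,
                       size_t num_channels, Vec3 pos)
{
    for(size_t c = 0; c < num_channels; c++)
        gains[c] = channel[c].w*0.7071f + channel[c].x*pos.x +
                   channel[c].y*pos.y + channel[c].z*pos.z;
}

/* Mixes a mono buffer into an interleaved output buffer using the
 * precomputed per-channel gains. 'out' is accumulated into, not
 * overwritten. */
static void mix_mono(float *out, size_t num_channels,
                     const float *sample, size_t num_samples,
                     const float *gains)
{
    for(size_t s = 0; s < num_samples; s++)
    {
        for(size_t c = 0; c < num_channels; c++)
            out[s*num_channels + c] += sample[s] * gains[c];
    }
}
```

The gains only need recalculating when the source or listener moves, so 
the per-sample cost is the same as the current panning method.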

This has some interesting implications. The first is that a sound would 
not be rendered as just a single point between the two nearest speakers, 
but instead be spread out a bit more. For 5.1 (and 6.1 and 7.1), it also 
means 3D sounds would not contribute that much to the front-center 
speaker, relying more on the front-left/right speakers instead and 
leaving the center more open for the AL_EFFECT_DEDICATED_DIALOGUE 
effect. I'm not sure if all this is actually better or not, compared to 
the current method.

But it also means it could be very easily extended to include speaker 
verticality, supporting something like 3D7.1 or 8-channel cube, with the 
proper coefficients. And even with 2D surround sound, it feels nicer 
mathematically when it comes to sounds that move up and down (but of 
course, having nice math does not mean good sound quality).

There's one other major issue with implementing b-format ambisonics, 
aside from calculating the coefficients. With HRTF, you're not really 
dealing with discrete output channels you mix directly to. The input 
samples get delayed and then filtered before mixing together. The main 
problem I see with feeding the b-format samples through HRTF is that the 
different axes could have different mixing delays associated with them, 
which would mess up the cancellation effect the omnidirectional feed 
provides for the panned samples.
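
A toy example of what I mean, using pure integer sample delays and no 
filtering at all (so nothing like a real HRTF). The function names are 
made up, and any coefficients fed to it are just test values, not real 
decoder or HRTF data:

```c
#include <stddef.h>
#include <string.h>

/* Adds gain*in[n] into out[n + delay] for every sample that fits. */
static void mix_delayed(float *out, size_t out_len, const float *in,
                        size_t in_len, size_t delay, float gain)
{
    for(size_t n = 0; n < in_len && n + delay < out_len; n++)
        out[n + delay] += in[n] * gain;
}

/* Renders one output from the W and X feeds, where each feed gets
 * its own delay. 'out' is cleared first. */
static void render_wx(float *out, size_t out_len,
                      const float *w, const float *x, size_t len,
                      float w_coeff, float x_coeff,
                      size_t w_delay, size_t x_delay)
{
    memset(out, 0, out_len * sizeof(float));
    mix_delayed(out, out_len, w, len, w_delay, w_coeff);
    mix_delayed(out, out_len, x, len, x_delay, x_coeff);
}
```

Take a single impulse encoded from directly behind (w = 0.7071, 
x = -1.0) and pick w_coeff = 1.4142, x_coeff = 1.0 so the two terms are 
meant to cancel. With both delays set to 2, the output is (nearly) 
silent. With w_delay = 2 and x_delay = 3, you instead get a +1 at sample 
2 followed by a -1 at sample 3, i.e. no cancellation at all.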

There are a few ways I'm thinking of that could maybe fix this, but I'm 
not sure what would actually work or work well (e.g. decoding to an 
8-channel cube, or taking averages of the HRTF's coefficients, or 
premixing the w component with x, y, z, etc).

Anyway, thoughts and ideas are welcome. :)