[openal] Panning, Ambisonics, and HRTF [also, TEEM]

Sun Sep 14 05:13:24 EDT 2014

Thanks Chris - I've reattached myself to the list to respond. Sorry if I've missed any posts from others in the interim. Also, apologies if this comes through twice (I used the wrong email initially).

Excellent - it sounds as if you might be getting the Ambisonic Bug! I'd thoroughly recommend it - IMHO it's definitely the best way to capture a 3D audio scene if you don't need interactivity (or can mix it on the fly!). If you want to find out more, the "sursound" mailing list is the place to hang out/ask questions etc. And take a peek at our (Blue Ripple Sound's) new state-of-the-art studio products for mixing in Reaper in 3D using Third Order Ambisonics (TOA). The starter pack (TOA Core) is free (though the others definitely aren't).

What you're describing below is essentially how the Rapture3D OpenAL driver works. You might be interested in a paper on the topic I wrote for an AES Games conference a few years back, which seems directly relevant: "Building An OpenAL Implementation Using Ambisonics".

Your puzzlement about 5.0 coefficients and HRTF handling is very understandable. Be aware that you're standing on the edge of the Ambisonic Rabbit Hole, peering down. Take another step - it'll be fun ;-) Ambisonic decoding (AKA rendering) is the REALLY hard bit of work with Ambisonics. You've clearly worked this out already, but for the benefit of others, this is the part when you take the "B-Format" internal representation, which describes an abstract soundfield that is independent of any specific speaker layout, and process it to produce feeds for a specific speaker layout (or headphones) with the intent to produce an approximation to the abstract soundfield in the room (or ears).
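
As a rough illustration of that decoding step, here is a minimal Python sketch of a naive first-order decode. It assumes the simple coefficients quoted further down (0.7071 on W, the speaker's unit direction vector on X/Y/Z), a hypothetical square layout, and the usual Ambisonic axes (x forward, y left, z up). The function names are made up for the example, and no per-layout attenuation is applied. It is not how a tuned decoder derives its coefficients.

```python
import math

W_COEFF = 0.7071  # W weight taken from the quoted message (about 1/sqrt(2))

def speaker_direction(azimuth_deg):
    """Unit vector for a horizontal speaker; x is forward, y is left (assumed)."""
    a = math.radians(azimuth_deg)
    return (math.cos(a), math.sin(a), 0.0)

def decode_frame(bformat, speaker_dirs):
    """Turn one first-order B-format frame (W, X, Y, Z) into speaker feeds."""
    w, x, y, z = bformat
    return [W_COEFF * w + dx * x + dy * y + dz * z
            for (dx, dy, dz) in speaker_dirs]

# Hypothetical square layout: front-left, back-left, back-right, front-right.
SQUARE = [speaker_direction(a) for a in (45, 135, 225, 315)]

# A unit sample encoded straight ahead: W = 0.7071, X = 1, Y = 0, Z = 0.
feeds = decode_frame((0.7071, 1.0, 0.0, 0.0), SQUARE)
```

With this layout the two front speakers receive equal positive feeds and the two rear speakers receive a small negative (phase-inverted) feed, which is typical of this kind of basic decode.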

Blue Ripple Sound was actually set up after I'd spent 10+ years slightly(?) obsessed by Higher Order Ambisonic decoding and felt happy that I'd cracked it sufficiently to try releasing a commercial product, and games happened to be a good vehicle (because mixing happens live, so the user never has to worry about the horrid B-Format). We don't publicise exactly how our decoder calculates its coefficients, but I can tell you about the decoder coefficients for 5.0 that are on the Blue Ripple website as a reference decoder.

The decoders on that page are the same as the ones in Csound's bformdec1 decoder. These in turn were assembled by myself, Fons Adriaensen and Bruce Wiggins (you'll see these names again if you get into practical Ambisonics). The 5.0 decoder is one of Bruce's, and if you want to find out how it was derived you need to read "The Design and Optimisation of Surround Sound Decoders Using Heuristic Methods" by Bruce and others (can't remember exactly where that was published). This definitely isn't the only way to derive decoders for irregular layouts, particularly at higher orders with more coefficients (go on, just another step...) and there isn't a "right answer" as such because it depends on your design criteria.

For the mathematical underpinnings, my bible is "Theoretical Acoustics" by Morse and Ingard, but this is a pretty fierce text and uses some unusual conventions. There are probably easier texts out there (that said, please let me know if you find them). It's also worth reading everything ever published by Michael Gerzon.

On HRTF decoders, there's a brief discussion in my paper above, but the devil is in the detail...

Foof. Enuf for now!

BBBBbut... This reminds me! You folk might be interested in another project we've been working on which might have some overlap with what goes on here. You are probably aware of technologies like Dolby Atmos which use "object" based audio for cinema (also, see MPEG-H and DTS MDA). This is where multiple mono tracks are spatialized in 3D (sound familiar?), along with multichannel beds for everything else. The basic shape is similar to OpenAL's, but it differs in that the formats are linear and not interactive - everything is on the same timeline and there is no distance modelling, Doppler/SRC and so on to worry about, as this has hopefully been handled in the studio.

There is standards work happening in SMPTE (cinema) and MPEG (domestic) around this, plus some done already at EBU, but it's not clear yet if any of this will result in an open / patent-free interface or protocol for interchange. Because of that, we've published a prototype software interface for this sort of thing and some thoughts at http://www.blueripplesound.com/teem and we have an associated SDK which we may open at some point. The interface has actually moved on (slightly) since we put it on the website, but we'd love to know what you all think about it. In particular, if any experts on lossy compression want to get involved that would be particularly exciting!

Many thanks,

--Richard

> -----Original Message-----
> From: Chris Robinson [mailto:chris.kcat at gmail.com]
> Sent: 14 September 2014 02:52
> To: openal mailing list
> Cc: Richard Furse
> Subject: Panning, Ambisonics, and HRTF
> 
> Hey guys.
> 
> I'm not really sure who's here that could help with this (CC'ing Richard
> just in case he doesn't see it on the ML), but here's something I've
> been up to lately.
> 
> 
> Over the past few days, I've been digging into ambisonics b-format
> audio. For the longest time I couldn't figure out how it worked, but I
> think I've finally got a handle on its structure and how to play them,
> thanks in part to
> 
> http://www.blueripplesound.com/b-format
> and
> http://www.blueripplesound.com/decoding
> 
> Granted, that's admittedly a simple decoding method and there's probably
> better ones available, but it works. I've made a simple program that can
> decode first- and second-order .amb files to a 5.1 stream. It would be
> nice to support them directly with the AL_EXT_BFORMAT extension, and
> work with the actual speaker configuration.
> 
> However, I'm a bit confused over how the coefficients were calculated. I
> think I can see the basic methodology:
> 
> channel[c].w_coeff = 0.7071
> channel[c].x_coeff = channel[c].dir.x
> channel[c].y_coeff = channel[c].dir.y
> channel[c].z_coeff = channel[c].dir.z
> channel[c].r_coeff = 0.7071 * (3 * channel[c].dir.z^2 - 1)
> channel[c].s_coeff = 2 * channel[c].dir.z * channel[c].dir.x
> ...etc...
> 
> But there's extra attenuation applied at some point in the calculations
> based on the number of output speakers. I can't figure out where the
> attenuation values actually come from or how they get applied.
> 
> Also for 5.0, I can't figure out how the front-center speaker is
> factored into it. It seems to "steal" a little power from the other
> front speakers (I imagine it does this because the front-center speaker
> is not treated as a normal speaker), but again the scale is a mystery.
> 
> Figuring this out is important if I'm going to get OpenAL Soft's 6.1 and
> 7.1 output to work with it, and/or allow users to manually tweak speaker
> positions (as they can currently). Plus I try to avoid using "magic"
> values that I don't know where they came from.
> 
> 
> Relatedly, this gave me ideas on a different way to pan sounds within
> OpenAL Soft's current pipeline. Taking the idea that a panned mono sound
> could basically be encoded into b-format like:
> 
> w[s] = sample[s] * 0.7071
> x[s] = sample[s] * pos.x
> y[s] = sample[s] * pos.y
> ...etc...
> 
> and get rendered like:
> 
> out[s*num_channels + c] = channel[c].w_coeff*w[s] +
>                            channel[c].x_coeff*x[s] +
>                            channel[c].y_coeff*y[s] + ...
> 
> That can be altered to produce per-channel gains:
> 
> Gain[c] = channel[c].w_coeff*0.7071 + channel[c].x_coeff*pos.x +
>            channel[c].y_coeff*pos.y + ...
> 
> and get mixed normally:
> 
> out[s*num_channels + c] += sample[s] * Gain[c];
> 
> This works. I have some uncommitted code locally that mixes this way to
> 5.1 using complete second-order coefficients (taking care of the
> coordinate differences, obviously).
> 
> This has some interesting implications. First being that a sound would
> not be rendered as just a single point between the two nearest speakers,
> but instead be spread out a bit more. For 5.1 (and 6.1 and 7.1), it also
> means 3D sounds would not contribute that much to the front-center
> speaker, more using the front-left/right speakers instead and leaving
> the center more open for the AL_EFFECT_DEDICATED_DIALOGUE effect. I'm
> not sure if this all is actually better or not, compared to the current
> method.
> 
> But it also means it could be very easily extended to include speaker
> verticality, supporting something like 3D7.1 or 8-channel cube, with the
> proper coefficients. And even with 2D surround sound, it feels nicer
> mathematically when it comes to sounds that move up and down (but of
> course, having nice math does not mean good sound quality).
> 
> 
> There's one other major issue with implementing b-format ambisonics,
> aside from calculating the coefficients. With HRTF, you're not really
> dealing with discrete output channels you mix directly to. The input
> samples get delayed and then filtered before mixing together. The main
> problem I see with feeding the b-format samples through HRTF is that the
> different axes could have different mixing delays associated with them,
> which would mess up the cancellation effect the omnidirectional feed
> provides for the panned samples.
> 
> There's a few ways I'm thinking of that could maybe fix this, but I'm
> not sure what would actually work or work well (e.g. decoding to an
> 8-channel cube, or taking averages of the hrtf's coefficients, or
> premixing the w component with x, y, z, etc).
> 
> 
> Anyway, thoughts and ideas are welcome. :)
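
The per-channel gain idea in the quoted message can be sketched in a few lines of Python. This is only an illustration, restricted to first order and assuming the same simple coefficients as the message (0.7071 on W, unit direction vectors on X/Y/Z), with no per-layout attenuation. The function names and the square layout are hypothetical, and the axes are assumed to be x forward, y left, z up.

```python
W_SCALE = 0.7071  # W-channel scale used in the quoted message (about 1/sqrt(2))

def encode_first_order(pos):
    """B-format weights (W, X, Y, Z) for a mono source at unit vector pos."""
    x, y, z = pos
    return (W_SCALE, x, y, z)

def decoder_row(speaker_dir):
    """Naive decoder coefficients for one speaker at unit vector speaker_dir."""
    x, y, z = speaker_dir
    return (W_SCALE, x, y, z)

def panning_gains(pos, speaker_dirs):
    """Collapse encode + decode into one gain per output channel."""
    enc = encode_first_order(pos)
    return [sum(c * e for c, e in zip(decoder_row(d), enc))
            for d in speaker_dirs]

def mix_mono(samples, gains):
    """Interleaved mix: out[s*num_channels + c] += sample[s] * gain[c]."""
    n = len(gains)
    out = [0.0] * (len(samples) * n)
    for s, sample in enumerate(samples):
        for c, g in enumerate(gains):
            out[s * n + c] += sample * g
    return out

# Hypothetical square layout: front-left, back-left, back-right, front-right.
H = 0.7071
SQUARE = [(H, H, 0.0), (-H, H, 0.0), (-H, -H, 0.0), (H, -H, 0.0)]

# Source placed exactly in the front-left speaker's direction.
gains = panning_gains((H, H, 0.0), SQUARE)
out = mix_mono([1.0, 0.5], gains)
```

Even with the source sitting exactly on one speaker, the neighbouring speakers still get a positive gain and the opposite speaker gets a negative one, which matches the observation that a sound is spread out rather than rendered between only the two nearest speakers.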