Multichannel & Immersive Formats

A spatial recording or mix has to be stored, transmitted and finally played back on some unknown collection of loudspeakers or headphones. The format is the contract that makes this possible: it defines what the numbers in the file mean, how many there are, and what a playback system is expected to do with them. Get the format right and the same content survives the journey from a dubbing stage to a phone; get it wrong and a carefully placed overhead effect ends up in the wrong speaker — or nowhere.

This chapter builds the vocabulary of formats from first principles. We start from the three fundamental ways of representing a spatial scene, then descend into the concrete layouts, metadata schemes, file carriers and codecs you will meet in practice. Throughout, the goal is to understand why each format exists and what it trades away, so that the later technique chapters — amplitude panning, Ambisonics, binaural — have a solid container to live in. If you have not yet read Spatial Psychoacoustics, it explains why the geometries below are arranged the way they are; this page explains how they are encoded.

1. The three representations: channel, object, scene

Every spatial audio system, no matter how exotic, encodes the sound field in one of three ways. The distinction is not cosmetic — it determines whether a mix is tied to a specific loudspeaker layout, and where the work of placing sound in space happens (at production time, or at playback time).

Channel-based. Each stored signal corresponds to one loudspeaker at a fixed, agreed-upon position. "Front left" is a channel; the file says "send these samples to the speaker that lives at $30^\circ$ to the left of centre." The spatial information is baked in at mix time. Stereo, 5.1 and 7.1.4 are all channel-based. The representation is simple and robust (playback is just routing) but rigid: a 7.1.4 mix assumes you own a 7.1.4 rig, and if you do not, something has to downmix it, discarding information.

Object-based. Each sound is stored as an audio object: a mono (or multichannel) signal plus positional metadata describing where it should be, for example $(x, y, z)$ coordinates or azimuth/elevation/distance, evolving over time. The signal is layout-agnostic. A renderer in the playback device reads the metadata and computes, in real time, the gains for whatever speakers are actually present. The spatial decision is made at production time (where should this be?) but the realization happens at playback time (which speakers achieve that?). This is the model behind Dolby Atmos and MPEG-H. It is flexible and future-proof, at the cost of a renderer and richer metadata.

Scene-based. The sound field itself is encoded, independent of both sources and speakers, as a set of spatial basis functions. Ambisonics is the canonical example: it stores a spherical-harmonic expansion of the pressure field at a point. There are no "channels for speakers" and no per-object metadata — just a mathematical description of the field, which a decoder projects onto any loudspeaker array (or onto a head, for binaural). Scene-based audio rotates trivially (crucial for VR head-tracking) and scales smoothly with order, but its spatial resolution is fixed by that order, and dense, pinpoint localization needs high orders and many channels.

A useful way to see the three is by where the geometry lives:

| Representation | Geometry lives in | Tied to a speaker layout? | Rendered at | Typical use |
| --- | --- | --- | --- | --- |
| Channel | The signals themselves | Yes | Production | Broadcast, cinema delivery, music |
| Object | Metadata, per source | No | Playback | Cinema, streaming, games |
| Scene | Spatial basis functions | No | Playback (decode) | VR/AR, ambient capture, interchange |

Key takeaway

These three representations are not mutually exclusive. Modern immersive systems are hybrid: a bed of channel-based audio (the stable ambience and music) plus a set of objects (the things that move), and sometimes an Ambisonic scene component for diffuse content.

The rest of this chapter takes each representation in turn, then turns to the files and codecs that carry them.

2. Channel-based layouts and the X.Y.Z notation

Channel-based formats are described by a compact notation of the form X.Y.Z:

  • X: the number of ear-level (horizontal-plane) loudspeakers, including the centre, left/right and surround speakers.
  • Y: the number of LFE (low-frequency effects) channels, almost always $0$ or $1$.
  • Z: the number of height (overhead or upper-layer) loudspeakers.

So 7.1.4 means seven ear-level speakers, one LFE, and four height speakers: twelve speakers in total, of which eleven are full-range. Older two-number notation (5.1, 7.1) predates height layers and simply omits Z (equivalently $Z = 0$). Plain stereo is 2.0: two ear-level speakers, no LFE, no height.
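
The notation is mechanical enough to parse. The following is a minimal sketch, assuming layout names are plain dot-separated integers; the function name `parse_layout` and the dictionary keys are illustrative, not part of any standard.

```python
def parse_layout(name: str) -> dict:
    """Split an X.Y.Z (or older X.Y) layout name into its speaker counts."""
    parts = [int(p) for p in name.split(".")]
    if len(parts) == 2:
        parts.append(0)            # two-number notation: no height layer, Z = 0
    if len(parts) != 3:
        raise ValueError(f"expected X.Y or X.Y.Z, got {name!r}")
    ear_level, lfe, height = parts
    return {
        "ear_level": ear_level,
        "lfe": lfe,
        "height": height,
        "total": ear_level + lfe + height,       # every speaker, LFE included
        "full_range": ear_level + height,        # everything except the LFE
    }

print(parse_layout("7.1.4"))
print(parse_layout("5.1"))
```

For "7.1.4" this reports a total of twelve speakers with eleven full-range, matching the count in the text.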

From stereo to immersive

Stereo (2.0) places two speakers at $\pm 30^\circ$ relative to the listener, forming an equilateral triangle with the listening position. Phantom images form between them by amplitude (and time) differences, the subject of the stereo chapter. Everything that follows is, in a sense, an attempt to extend that phantom-imaging idea around and above the listener.

5.1 adds a hard centre channel and two surround channels, plus the LFE. The canonical geometry is standardized in ITU-R BS.775: Left and Right at $\pm 30^\circ$, Centre at $0^\circ$, and the two surrounds at $\pm 110^\circ$ (i.e. slightly behind the listener, $20^\circ$ back from directly to the side). All five are nominally on a circle of equal radius around the reference listening point, at ear height. The LFE is non-directional and not placed on the circle.

7.1 splits the surround field into side and back pairs to improve rear imaging and lateral envelopment: typically side surrounds near $\pm 90^\circ$ and back surrounds between about $\pm 135^\circ$ and $\pm 150^\circ$. (Cinema 7.1 and the consumer "7.1" of Blu-ray differ slightly in exact angles, but the principle is the same.)

7.1.4 keeps the 7.1 ear-level ring and adds four height speakers, usually two front-height and two rear-height at roughly $\pm 45^\circ$ azimuth and $+45^\circ$ elevation. This is the de facto reference layout for immersive music and home Atmos.

9.1.6 extends the ear-level ring to nine (adding front-wide speakers near $\pm 60^\circ$) and the height layer to six. It is a common professional Atmos mixing-room layout and the maximum bed configuration in several systems.

A layouts table

| Layout | X (ear-level) | Y (LFE) | Z (height) | Total speakers | Typical key angles (azimuth) |
| --- | --- | --- | --- | --- | --- |
| Mono (1.0) | 1 | 0 | 0 | 1 | $0^\circ$ |
| Stereo (2.0) | 2 | 0 | 0 | 2 | $\pm 30^\circ$ |
| 5.1 | 5 | 1 | 0 | 6 | C $0^\circ$, L/R $\pm 30^\circ$, surr $\pm 110^\circ$ |
| 7.1 | 7 | 1 | 0 | 8 | adds side $\pm 90^\circ$, back $\pm 135^\circ$ |
| 5.1.4 | 5 | 1 | 4 | 10 | 5.1 + four heights at $+45^\circ$ elev. |
| 7.1.4 | 7 | 1 | 4 | 12 | 7.1 + four heights at $+45^\circ$ elev. |
| 9.1.6 | 9 | 1 | 6 | 16 | adds front-wide $\pm 60^\circ$, six heights |
| 22.2 | 22 | 2 | (three layers) | 24 | NHK Super Hi-Vision, three height tiers |

22.2 (used in NHK's broadcasting research) is worth noting as the high-water mark of pure channel-based design: three layers — a middle layer of ten, an upper layer of nine, a lower layer of three — plus two LFEs. Its very existence illustrates the scaling problem that object- and scene-based audio were invented to solve: each new degree of immersion costs more dedicated channels.

The centre channel

The centre speaker deserves special mention because it is the one element with no equivalent in stereo. In stereo, a centre image is a phantom: it only exists for a listener seated exactly on the symmetry axis, and it suffers from the comb-filtering and timbre shift discussed under stereo localization. A real centre speaker anchors dialogue and lead vocals to a physical point, so the image stays put as the audience moves off-axis — which is precisely why cinema and broadcast put dialogue there. The cost is that mixing for a discrete centre is a different craft from mixing a phantom centre, and naive downmixes can make the centre too loud or too dry.

LFE versus bass management

The LFE channel (the ".1") is frequently misunderstood. It is not "the subwoofer channel." It is a dedicated, band-limited effects channel (roughly $20$ to $120\ \mathrm{Hz}$) carrying content the mixer deliberately routes there, such as explosions and low rumble. In cinema it carries an additional $+10\ \mathrm{dB}$ of in-band headroom relative to the main channels, so its playback level is calibrated $10\ \mathrm{dB}$ hotter.

Bass management is a separate, playback-side process: a system with small "satellite" main speakers high-passes them (say above $80\ \mathrm{Hz}$) and sums the removed low end into the subwoofer, along with the LFE channel. So the subwoofer reproduces (bass-managed main content) + (LFE), but the LFE channel only ever held the effects content.

Common mistake

The LFE channel and the subwoofer are not the same thing. Confusing the two leads to gain errors of exactly that $10\ \mathrm{dB}$.
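
To make the $10\ \mathrm{dB}$ concrete, here is a small sketch of the summation a bass-managed system performs. It assumes the crossover filtering has already happened upstream and shows only the level arithmetic; `subwoofer_sample` and `LFE_INBAND_GAIN_DB` are illustrative names, not taken from any product.

```python
def db_to_linear(db: float) -> float:
    """Convert a level change in decibels to an amplitude ratio."""
    return 10.0 ** (db / 20.0)

LFE_INBAND_GAIN_DB = 10.0   # the cinema convention described above

def subwoofer_sample(bass_managed_mains: float, lfe: float) -> float:
    """One sample of the subwoofer feed.

    bass_managed_mains: low end already removed from the main channels and summed.
    lfe: the LFE channel sample, reproduced 10 dB hotter than a main channel.
    """
    return bass_managed_mains + db_to_linear(LFE_INBAND_GAIN_DB) * lfe

print(round(db_to_linear(LFE_INBAND_GAIN_DB), 3))   # amplitude factor for +10 dB
```

Treating the LFE as just another contributor to the subwoofer, without the gain term, is the $10\ \mathrm{dB}$ error the callout warns about: the factor is roughly 3.16 in amplitude.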

Because the LFE is non-directional (below roughly $80\ \mathrm{Hz}$ the auditory system cannot localize from interaural cues; see psychoacoustics), its placement is acoustically free, which is why it sits outside the X.Y.Z geometry.

3. Object-based audio

Channel layouts answer "which speaker?" Object-based audio refuses to answer that question at production time, and instead stores intent. An object is:

$$\text{object} = \underbrace{s(t)}_{\text{mono audio essence}} \;+\; \underbrace{\mathbf{m}(t) = \big(x(t),\, y(t),\, z(t),\, g(t),\, \text{size}, \dots\big)}_{\text{time-varying positional metadata}}$$

The audio essence $s(t)$ is just a mono signal. The metadata $\mathbf{m}(t)$ says where it should be, typically in a normalized room coordinate system where $x, y, z \in [-1, 1]$ span the listening space, together with gain, perceived size (how spread or point-like), and flags such as snap-to-speaker. Crucially, no speaker is named.
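
As a data structure, an object is little more than a list of timed positions attached to a mono signal. The sketch below is purely illustrative: the class names `Keyframe` and `AudioObject` are hypothetical, and linear interpolation between keyframes is an assumption made here for simplicity, not a statement about how any particular renderer interpolates.

```python
from bisect import bisect_right
from dataclasses import dataclass, field

@dataclass
class Keyframe:
    time: float          # seconds
    x: float             # normalized room coordinates in [-1, 1]
    y: float
    z: float
    gain: float = 1.0

@dataclass
class AudioObject:
    name: str
    keyframes: list = field(default_factory=list)   # must be sorted by time

    def position_at(self, t: float) -> tuple:
        """Linearly interpolate (x, y, z) between the neighbouring keyframes."""
        times = [k.time for k in self.keyframes]
        i = bisect_right(times, t)
        if i == 0:                      # before the first keyframe: hold it
            k = self.keyframes[0]
            return (k.x, k.y, k.z)
        if i == len(times):             # after the last keyframe: hold it
            k = self.keyframes[-1]
            return (k.x, k.y, k.z)
        a, b = self.keyframes[i - 1], self.keyframes[i]
        w = (t - a.time) / (b.time - a.time)
        return tuple(p + w * (q - p)
                     for p, q in ((a.x, b.x), (a.y, b.y), (a.z, b.z)))

# A two-second move from one side of the room to the other at ceiling height.
flyover = AudioObject("flyover", [Keyframe(0.0, -1.0, 0.0, 1.0),
                                  Keyframe(2.0, 1.0, 0.0, 1.0)])
print(flyover.position_at(1.0))   # (0.0, 0.0, 1.0)
```

Note that nothing in the structure refers to a loudspeaker; that mapping is left entirely to the renderer.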

The renderer

At playback, a renderer turns each object into speaker feeds for the actual layout in the room. For a given object position and a known set of loudspeaker positions, it computes per-speaker gains — most commonly with a vector-based amplitude panning law (covered in Amplitude Panning) so that the energy and the panning vector point at the intended location:

$$y_i(t) = g_i \, s(t), \qquad \sum_i g_i^2 = 1,$$

where $g_i$ is the gain to speaker $i$, derived from the object's position relative to the surrounding speaker triplet, and the constant-power normalization keeps loudness stable as the object moves. The same object metadata can drive a 5.1 living room, a 7.1.4 home theatre, a 64-speaker dubbing stage, or a binaural headphone renderer; each computes different $g_i$ from the same $\mathbf{m}(t)$.
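
The gain computation can be sketched in its simplest form: a horizontal-only, two-speaker version of vector-based panning, rather than the full 3D triplet case described above. The angle convention (degrees, $0^\circ$ at front, positive to the left) and the function names are assumptions made for this example.

```python
import math

def unit(az_deg: float) -> tuple:
    """Unit vector for an azimuth in degrees (0 = front, positive = left)."""
    a = math.radians(az_deg)
    return (math.cos(a), math.sin(a))

def pan_pair(source_az: float, spk_a_az: float, spk_b_az: float) -> tuple:
    """Constant-power gains (g_a, g_b) for a source between two speakers."""
    ax, ay = unit(spk_a_az)
    bx, by = unit(spk_b_az)
    px, py = unit(source_az)
    det = ax * by - bx * ay
    if abs(det) < 1e-12:
        raise ValueError("speaker directions must not be collinear")
    # Solve p = g_a * a + g_b * b for the unnormalized gains (Cramer's rule).
    g_a = (px * by - bx * py) / det
    g_b = (ax * py - px * ay) / det
    if g_a < -1e-9 or g_b < -1e-9:
        raise ValueError("source lies outside the arc spanned by this pair")
    norm = math.hypot(g_a, g_b)          # enforce g_a^2 + g_b^2 = 1
    return (g_a / norm, g_b / norm)

print(pan_pair(0.0, 30.0, -30.0))    # centred source: equal gains
print(pan_pair(15.0, 30.0, -30.0))   # leaning left: more gain on the left speaker
```

A renderer for a real layout would first pick the pair (or triplet) that encloses the source direction, then apply the same solve-and-normalize step.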

The core payoff of object audio

Author once, render everywhere. The same object metadata drives a 5.1 living room, a 7.1.4 home theatre, a 64-speaker dubbing stage or a binaural headphone renderer — each computes its own gains from one description.

Beds and objects

Pure object audio would be wasteful for content that does not move — ambient beds, music stems, reverb returns. So practical systems are hybrid. A bed is a fixed channel-based sub-mix (e.g. a 7.1.2 or 9.1.6 bed) that behaves like a set of static objects locked to canonical speaker positions, while objects are reserved for sounds that move or that need pinpoint placement. The renderer sums (rendered bed) + (rendered objects). This keeps channel count and metadata manageable while preserving flexibility where it matters.

ADM and the BW64 carrier

To exchange object-based content between tools and facilities, the industry needs a standard description language for beds, objects, positions and their relationships. That standard is the Audio Definition Model (ADM), ITU-R BS.2076. ADM is an XML metadata model — a tree of elements such as audioObject, audioPackFormat, audioChannelFormat, audioBlockFormat (the time-segmented position data) and audioTrackUID — that maps every track in a file to its role and, for objects, its time-varying position. ADM can describe all three representations: channel beds, dynamic objects, and Ambisonic scenes (via an HOA pack format).

ADM metadata needs a file to live in. That carrier is BW64 ("Broadcast Wave 64-bit"), standardized as ITU-R BS.2088 and built on the EBU Broadcast Wave format (EBU Tech 3285). BW64 is essentially a WAV file with 64-bit chunk sizes (so it can exceed the $4\ \mathrm{GB}$ limit of classic RIFF/WAV) and a dedicated axml chunk holding the ADM XML, plus a chna chunk mapping track indices to ADM IDs. The audio is plain PCM; the spatial intelligence is in the metadata. A single BW64+ADM file is therefore a complete, self-describing, codec-agnostic master of an immersive mix, and it is the preferred interchange format precisely because it is uncompressed and fully specified.
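
Because the container is chunk-based, finding the metadata is a matter of walking the chunk list. The following is a minimal sketch that lists top-level chunks in an in-memory WAV-style blob. It handles only ordinary 32-bit chunk sizes; a real BW64 reader would also have to resolve 64-bit sizes, which is not attempted here. The toy file and its `<adm/>` payload are placeholders built for the example, not real ADM.

```python
import struct

def list_chunks(blob: bytes) -> list:
    """Return [(chunk_id, size), ...] for the top-level chunks of a WAV/BW64 blob."""
    magic, _size, form = struct.unpack_from("<4sI4s", blob, 0)
    if magic not in (b"RIFF", b"BW64") or form != b"WAVE":
        raise ValueError("not a RIFF/BW64 WAVE file")
    chunks, pos = [], 12
    while pos + 8 <= len(blob):
        chunk_id, size = struct.unpack_from("<4sI", blob, pos)
        chunks.append((chunk_id.decode("ascii"), size))
        pos += 8 + size + (size & 1)     # payloads are padded to an even length
    return chunks

def make_chunk(chunk_id: bytes, payload: bytes) -> bytes:
    """Build one chunk (helper for the toy file below)."""
    pad = b"\x00" if len(payload) % 2 else b""
    return chunk_id + struct.pack("<I", len(payload)) + payload + pad

# Toy file: format, track mapping, placeholder metadata, a few bytes of audio.
body = (b"WAVE"
        + make_chunk(b"fmt ", bytes(16))
        + make_chunk(b"chna", bytes(4))
        + make_chunk(b"axml", b"<adm/>")
        + make_chunk(b"data", bytes(8)))
toy = b"RIFF" + struct.pack("<I", len(body)) + body

print(list_chunks(toy))
```

A tool that ignores unknown chunks still sees a playable PCM file, which is what lets the metadata ride along without disturbing the audio.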

The Dolby Atmos track budget

Dolby Atmos is the most widely deployed object-based system, and its production constraints make the bed-plus-objects model concrete. An Atmos master (delivered as a Dolby Atmos Master File set, or exported as an ADM BWF) provides a fixed budget of 128 simultaneous "tracks", partitioned as:

  • one bed, up to 7.1.2 (ten channels), for static content; plus
  • up to 118 dynamic objects, each a mono essence with its own positional metadata.

$$10\ \text{(7.1.2 bed)} + 118\ \text{(objects)} = 128\ \text{tracks.}$$

A mixer spends this budget deliberately: the bed holds music and ambience, while the 118 objects are allotted to the moving and precisely placed elements. The renderer then collapses all 128 to whatever the venue has — be it a 5.1 soundbar or a commercial cinema — and the same master also feeds the binaural renderer used for headphone playback. For the panning mathematics behind object rendering, see Object-Based Audio.
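
The budget arithmetic is simple enough to check in a few lines. This sketch uses only the figures quoted above (10 bed channels, 118 objects, 128 tracks); the function name and the idea of returning the unused track count are illustrative choices, not part of any Dolby tool.

```python
MAX_TRACKS = 128
MAX_BED_CHANNELS = 10     # a 7.1.2 bed: 7 + 1 + 2
MAX_OBJECTS = 118

def check_budget(bed_channels: int, objects: int) -> int:
    """Return the number of unused tracks, or raise if a limit is exceeded."""
    if not 0 <= bed_channels <= MAX_BED_CHANNELS:
        raise ValueError("bed exceeds the 7.1.2 (ten-channel) limit")
    if not 0 <= objects <= MAX_OBJECTS:
        raise ValueError("too many objects")
    return MAX_TRACKS - bed_channels - objects

print(check_budget(10, 118))   # 0: the budget is fully spent
print(check_budget(10, 40))    # 78 tracks still free
```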

4. Scene-based audio: Ambisonics

Scene-based audio encodes the sound field as a whole, independent of sources and speakers. The dominant scheme is Ambisonics, which represents the pressure field on a sphere around the listening point as a truncated expansion in spherical harmonics $Y_n^m(\theta, \phi)$:

$$p(\theta, \phi, t) \;\approx\; \sum_{n=0}^{N} \sum_{m=-n}^{n} a_n^m(t)\, Y_n^m(\theta, \phi).$$

The functions $Y_n^m$ are a fixed set of angular "shapes" of increasing complexity, indexed by order $n$ and degree $m$. The stored channels are the coefficients $a_n^m(t)$, one per harmonic. The genius of the scheme is that the coefficients carry no assumption about loudspeakers: a decoder later projects them onto any array.

Order and channel count

For full 3D (periphonic) Ambisonics, the number of channels up to order $N$ is the number of spherical harmonics, which is the perfect square

$$K = (N + 1)^2.$$

Order $0$ is a single channel: the omnidirectional pressure, equivalent to a mono "W" signal. Order $1$ ("first-order Ambisonics", FOA) adds three figure-of-eight components (the classic W, X, Y, Z B-format), for $4$ channels. Each higher order adds a ring of finer angular detail, sharpening the spatial resolution but multiplying the channel count.

| Order $N$ | Channels $(N+1)^2$ | Added this order $2N+1$ | Approx. angular resolution |
| --- | --- | --- | --- |
| 0 | 1 | 1 | none (omni / mono) |
| 1 (FOA) | 4 | 3 | coarse; broad "blobs", good for ambience |
| 2 | 9 | 5 | moderate |
| 3 (TOA) | 16 | 7 | good; usable for music/VR |
| 4 | 25 | 9 | tight |
| 5 | 36 | 11 | very tight |
| 6 | 49 | 13 | near-pinpoint over a small sweet spot |
| 7 | 64 | 15 | high-end production/research |
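
The two numeric columns of the table follow directly from the formulas, as this short sketch shows. The function names are illustrative.

```python
def ambisonic_channels(order: int) -> int:
    """Total channels for full 3D Ambisonics up to the given order."""
    return (order + 1) ** 2

def harmonics_added(order: int) -> int:
    """Number of new spherical harmonics contributed by this order alone."""
    return 2 * order + 1

for n in range(8):
    print(n, ambisonic_channels(n), harmonics_added(n))
```

Summing `harmonics_added` over orders $0$ to $N$ gives $(N+1)^2$, which is why the channel count is always a perfect square.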

A rough rule of thumb for the minimum source separation an order can resolve is $\sim 360^\circ / (2N + 2)$, so first order resolves features no finer than roughly $90^\circ$ while third order gets down toward $\sim 45^\circ$. The trade is explicit: resolution costs channels quadratically. This is the mirror image of the channel-based scaling problem, but here you can scale resolution up or down simply by extending or truncating the order, and the format rotates trivially (a sound-field rotation is a linear mix of the coefficients), which is why Ambisonics underpins VR/AR with head-tracking.
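
The claim that rotation is a linear mix of the coefficients is easy to verify at first order. The sketch below assumes channels ordered W, Y, Z, X with the first-order components scaled as plain direction cosines, and azimuth measured counter-clockwise from the front; the function names are illustrative.

```python
import math

def encode_foa(az_deg: float, el_deg: float = 0.0) -> list:
    """Encode a unit-amplitude plane wave as first-order [W, Y, Z, X]."""
    az, el = math.radians(az_deg), math.radians(el_deg)
    return [1.0,
            math.sin(az) * math.cos(el),   # Y: left-right
            math.sin(el),                  # Z: up-down
            math.cos(az) * math.cos(el)]   # X: front-back

def rotate_yaw(b: list, yaw_deg: float) -> list:
    """Rotate the whole field about the vertical axis by mixing X and Y only."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    w, y, z, x = b
    return [w, x * s + y * c, z, x * c - y * s]

rotated = rotate_yaw(encode_foa(20.0, 10.0), 35.0)
direct = encode_foa(55.0, 10.0)
print(max(abs(r - d) for r, d in zip(rotated, direct)))   # effectively zero
```

Rotating an encoded source by $35^\circ$ gives the same coefficients as encoding it $35^\circ$ further round. A head-tracked renderer applies the inverse of the head rotation in the same way, to the entire scene at once, without knowing what sources it contains.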

ACN ordering and normalization

To exchange Ambisonic files you must agree on two conventions: the order in which the channels are stored and the normalization (scaling) of each harmonic.

Channel ordering today follows ACN (Ambisonic Channel Number), a single index that flattens the $(n, m)$ pair:

$$\mathrm{ACN} = n^2 + n + m.$$

So W is ACN 0, then the three first-order components are ACN 1, 2, 3 (in the order Y, Z, X), and so on. ACN is monotonic in order, which makes truncation as simple as keeping the first $(N+1)^2$ channels.
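
The index and its inverse take only a few lines. This is a minimal sketch; the function names are illustrative.

```python
import math

def acn(n: int, m: int) -> int:
    """Flatten (order n, degree m) into an Ambisonic Channel Number."""
    if n < 0 or abs(m) > n:
        raise ValueError("need n >= 0 and -n <= m <= n")
    return n * n + n + m

def from_acn(index: int) -> tuple:
    """Recover (n, m) from an ACN index."""
    n = math.isqrt(index)          # order is the integer square root
    return (n, index - n * n - n)

print([acn(1, m) for m in (-1, 0, 1)])   # [1, 2, 3], i.e. Y, Z, X
print(from_acn(15))                      # (3, 3): last channel of third order
```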

Normalization fixes how each $Y_n^m$ is scaled. Two conventions dominate:

  • SN3D (Schmidt semi-normalized): keeps all components bounded so no harmonic ever exceeds the level of W; convenient for metering and the modern default.
  • N3D (full three-dimensional normalization): orthonormal, with $\int |Y_n^m|^2\, d\Omega$ equal to the same constant for every harmonic; mathematically tidy for DSP, with higher orders scaled up relative to SN3D by a factor $\sqrt{2n+1}$.

Conversion between them is a per-channel gain, so they are losslessly interchangeable as long as you know which you have.
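
Since the conversion depends only on the order $n$ of each channel, it can be sketched as a gain looked up from the ACN index. The example assumes ACN-ordered frames (one sample per channel); the function names are illustrative.

```python
import math

def sn3d_to_n3d_gain(acn_index: int) -> float:
    """Per-channel gain sqrt(2n + 1), where n is the order of this ACN channel."""
    n = math.isqrt(acn_index)
    return math.sqrt(2 * n + 1)

def sn3d_to_n3d(frame: list) -> list:
    return [v * sn3d_to_n3d_gain(k) for k, v in enumerate(frame)]

def n3d_to_sn3d(frame: list) -> list:
    return [v / sn3d_to_n3d_gain(k) for k, v in enumerate(frame)]

frame = [1.0, 0.5, 0.0, 0.5]                 # one first-order SN3D frame
print(sn3d_to_n3d(frame))                    # W unchanged, others scaled by sqrt(3)
print(n3d_to_sn3d(sn3d_to_n3d(frame)))       # back to the original, up to rounding
```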

AmbiX versus FuMa

The two convention bundles you will meet are:

  • AmbiX: ACN ordering + SN3D normalization. The modern standard, used by VR platforms and most current tools, and defined for arbitrary order.
  • FuMa (Furse-Malham): the older "B-format" convention, with a different channel order (W, X, Y, Z...), a different normalization (W attenuated by $1/\sqrt{2}$), and defined only up to third order. Legacy, but still found in older recordings and microphone outputs.

Mixing conventions scrambles the field

AmbiX and FuMa are not compatible without an explicit conversion matrix; mixing them up scrambles the field. When in doubt, label your files — an Ambisonic file is meaningless without its (ordering, normalization) pair stated.

The decoding and microphone side of all this is covered in Ambisonics.
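
At first order the conversion between the two bundles is small enough to write out in full: reorder the channels and undo the W attenuation. The sketch below covers first order only; higher orders need further per-channel weights that are not shown. The function names are illustrative.

```python
import math

def fuma_to_ambix_foa(frame: list) -> list:
    """Convert one first-order frame from FuMa [W, X, Y, Z] to AmbiX [W, Y, Z, X]."""
    w, x, y, z = frame
    return [w * math.sqrt(2.0), y, z, x]     # restore W, then reorder to ACN

def ambix_to_fuma_foa(frame: list) -> list:
    """Convert one first-order frame from AmbiX [W, Y, Z, X] to FuMa [W, X, Y, Z]."""
    w, y, z, x = frame
    return [w / math.sqrt(2.0), x, y, z]

# A unit source straight ahead, as FuMa would store it.
fuma_front = [1.0 / math.sqrt(2.0), 1.0, 0.0, 0.0]
print(fuma_to_ambix_foa(fuma_front))         # approximately [1.0, 0.0, 0.0, 1.0]
```

Feeding a FuMa frame to an AmbiX decoder without this step would put the front-back signal where the decoder expects left-right, and W about $3\ \mathrm{dB}$ low.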

5. Binaural as a delivery format

Binaural audio is a two-channel format — one signal for each ear — and it is tempting to treat it as "just stereo." It is not, and the distinction matters for both production and delivery.

Stereo is made for loudspeakers. Its two channels are designed to be played into a room, where each ear hears both speakers (this is crosstalk), and phantom images form from the inter-speaker differences. Played on headphones, a stereo mix collapses inside the head: with no crosstalk and no acoustic path around the head, the spatial cues your auditory system expects are absent, so the image lateralizes left-to-right on a line through the skull rather than out in the world.

Binaural is made for headphones (or for crosstalk-cancelled loudspeakers; see transaural). Each channel is the signal that should arrive at one eardrum, already carrying the HRTF (head-related transfer function): the frequency- and direction-dependent filtering imposed by the head, torso and pinnae that the brain reads as elevation, front/back and externalization. A binaural signal for a source at direction $d$ is, schematically,

$$\begin{aligned} x_L(t) &= s(t) * h_L(t; d), \\ x_R(t) &= s(t) * h_R(t; d), \end{aligned}$$

where $h_L, h_R$ are the left/right HRTF impulse responses for that direction and $*$ denotes convolution. Those HRTFs encode the very ITD, ILD and spectral cues the brain uses to place sound outside the head.
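
The pair of convolutions can be sketched directly. The impulse responses below are toy values invented for the example (a delay and an attenuation at the far ear), standing in for measured HRTF data, which would be far longer and direction-specific.

```python
def convolve(signal: list, impulse: list) -> list:
    """Direct-form linear convolution of two finite sequences."""
    out = [0.0] * (len(signal) + len(impulse) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse):
            out[i + j] += s * h
    return out

def binauralize(signal: list, hrir_left: list, hrir_right: list) -> tuple:
    """Filter one mono signal into a (left ear, right ear) pair."""
    return convolve(signal, hrir_left), convolve(signal, hrir_right)

# Toy responses for a source on the listener's left: the right ear receives
# the sound two samples later (an ITD) and at half the level (an ILD).
toy_left = [1.0]
toy_right = [0.0, 0.0, 0.5]

left, right = binauralize([1.0, -1.0], toy_left, toy_right)
print(left)    # [1.0, -1.0]
print(right)   # [0.0, 0.0, 0.5, -0.5]
```

Production renderers typically use fast (FFT-based) convolution and interpolate between measured directions, but the operation is the same.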

The practical consequences for a format:

  • A binaural file must be played on headphones (or transaural) to work; on speakers the HRTF filtering double-applies with the listener's own head and the image breaks.
  • Binaural is usually a render target, not an interchange master: you keep the channel/object/scene master and binauralize it for headphone delivery, because the binaural mix bakes in a specific HRTF set and head orientation. For interactive content, you instead ship the scene (often Ambisonic) plus an HRTF and binauralize live with head-tracking.
  • Two channels, then, can mean two completely different things (speaker-stereo or ear-binaural), and they are not interchangeable. Metadata or context must tell the player which it is.

Play binaural on the right device

A binaural file must be played on headphones (or transaural) to work. On speakers the HRTF filtering double-applies with the listener's own head and the image breaks. "Stereo" and "binaural" share a channel count and nothing else.

See Binaural for the synthesis and HRTF detail.

6. File carriers and codecs

So far we have discussed representations. Now: what file actually lands on disk or streams over the wire? Two layers must be distinguished:

  • the carrier/container (WAV, BW64, MP4...), which packages audio plus metadata, and
  • the codec (PCM, AC-4, MPEG-H...), which determines whether the audio is uncompressed or bit-rate-reduced and how channels/objects/scenes are coded.

WAV / BWF. Plain WAV is uncompressed linear PCM in a RIFF container, the universal master format for individual channels and small multichannel files. BWF (Broadcast Wave, EBU Tech 3285) adds a bext metadata chunk (timecode, originator, coding history). Both are limited to $4\ \mathrm{GB}$ by 32-bit size fields.

BW64 + ADM. As covered above, BW64 (ITU-R BS.2088) lifts the $4\ \mathrm{GB}$ limit and adds the axml/chna chunks carrying ADM (ITU-R BS.2076). PCM audio, full object/bed/scene description, codec-agnostic. This is the interchange and archival master of immersive audio: large, lossless, fully self-describing.

Dolby Digital Plus (E-AC-3) with Atmos. A lossy codec widely used for streaming and broadcast. It carries a 5.1 (or 7.1) core plus Dolby Atmos object metadata as a JOC (Joint Object Coding) extension: rather than coding 118 discrete object signals, JOC codes the core channels plus side-information that lets the decoder reconstruct the objects. It is efficient and backward-compatible (legacy 5.1 decoders ignore the extension), with bit rates on the order of hundreds of kbit/s.

Dolby AC-4. Dolby's newer, more efficient next-generation codec for broadcast and streaming, with native object support, dialogue enhancement and loudness metadata. Lower bit-rates than E-AC-3 for comparable quality.

MPEG-H Audio (ISO/IEC 23008-3). An open ISO standard, object- and scene-based (it can carry channels, objects and HOA together), with interactivity (the listener can adjust dialogue, choose languages) and ADM compatibility. It is the audio system of ATSC 3.0 broadcasting and several streaming services. Bit-rate-scalable from broadcast to high-quality.

DTS:X. DTS's object-based immersive format, layout-agnostic like Atmos, used in cinema, Blu-ray and home theatre, built on the DTS-HD core.

| Codec / carrier | Type | Lossy? | Carries objects? | Carries scene (HOA)? | Typical use |
| --- | --- | --- | --- | --- | --- |
| WAV / BWF (PCM) | Container + PCM | No | Via ADM only | Via ADM only | Stems, channel masters |
| BW64 + ADM | Container + PCM | No | Yes | Yes | Immersive interchange / archive |
| Dolby Digital Plus + Atmos (JOC) | Codec | Yes | Yes (JOC) | No | Streaming, broadcast |
| Dolby AC-4 | Codec | Yes | Yes | No | Next-gen broadcast/streaming |
| MPEG-H Audio (23008-3) | Codec | Yes | Yes | Yes | ATSC 3.0, streaming |
| DTS:X | Codec | Yes | Yes | No | Cinema, Blu-ray, home theatre |

The pattern

Uncompressed PCM in BW64+ADM for mastering and exchange; a lossy object codec for delivery. You master in the first and encode to the second for each distribution path. None of the delivery codecs is an authoring format — you do not mix into AC-4, you export to it.

7. Sample rate and bit depth

A persistent myth holds that higher sample rates or bit depths make audio "more spatial." They do not, and understanding why sharpens your sense of where spatial information actually lives.

Sample rate $f_s$ sets the highest frequency that can be represented. By the Nyquist limit, a sampled signal can only carry frequencies below half the sample rate:

$$f_N = \frac{f_s}{2}.$$

At $f_s = 48\ \mathrm{kHz}$, $f_N = 24\ \mathrm{kHz}$, already above the $\sim 20\ \mathrm{kHz}$ ceiling of human hearing. Spatial cues live in timing and level differences between channels, not in ultrasonic content. The interaural time difference that places a source, about $\pm 0.6\ \mathrm{ms}$ at the extremes (see psychoacoustics), is a continuous quantity that sampling captures through the phase of the band-limited signal; it is not quantized to the sample grid. A $48\ \mathrm{kHz}$ signal has a sample period of

$$T_s = \frac{1}{48000}\ \mathrm{s} \approx 20.8\ \mu\mathrm{s},$$

yet the effective timing resolution of a band-limited channel is far finer than $T_s$, because the reconstructed waveform is continuous between samples; sub-sample ITDs are represented faithfully. So $96\ \mathrm{kHz}$ does not buy you spatial precision. It buys headroom for ultrasonic processing, gentler anti-alias filters, and lower distortion in nonlinear plugins. Useful, but not spatial.
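
The sub-sample claim can be checked numerically. The sketch below samples a $1\ \mathrm{kHz}$ tone delayed by $5\ \mu\mathrm{s}$ (about a quarter of a sample period at $48\ \mathrm{kHz}$) and recovers the delay from the phase of the samples alone. The test frequency, block length and estimator are choices made for this illustration; the block holds a whole number of cycles so that the projections are exact.

```python
import math

FS = 48_000.0      # sample rate in Hz
F0 = 1_000.0       # test tone in Hz: exactly 48 samples per cycle
N = 480            # ten whole cycles

def estimate_delay(samples: list) -> float:
    """Recover the delay (seconds) of a sampled F0 sine from its phase."""
    w = 2.0 * math.pi * F0 / FS
    a = sum(x * math.sin(w * n) for n, x in enumerate(samples))
    b = sum(x * math.cos(w * n) for n, x in enumerate(samples))
    # For x[n] = sin(w*n - phase): a is proportional to cos(phase),
    # b to -sin(phase), so atan2(-b, a) returns the phase lag.
    phase = math.atan2(-b, a)
    return phase / (2.0 * math.pi * F0)

true_delay = 5e-6   # 5 microseconds, well under one sample period (about 20.8)
tone = [math.sin(2.0 * math.pi * F0 * (n / FS - true_delay)) for n in range(N)]

print(estimate_delay(tone))   # very close to 5e-06
```

The delay survives sampling even though no sample instant falls on it, which is the sense in which timing is carried by phase rather than by the sample grid.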

Bit depth $N$ sets the dynamic range, the gap between the loudest and quietest representable level, via the quantization-noise relationship

$$\mathrm{DR} \approx 6.02\,N + 1.76\ \mathrm{dB}.$$

For $N = 16$ bits this gives about $98\ \mathrm{dB}$; for $N = 24$ bits, about $146\ \mathrm{dB}$, far beyond any playback chain's noise floor, which is why $24$-bit is the production standard (it gives margin for gain staging and summing, not audible "resolution"). For a $16$-bit master:

$$\mathrm{DR} \approx 6.02 \times 16 + 1.76 = 98.08\ \mathrm{dB}.$$

Bit depth affects noise floor, not imaging. A whisper-quiet reverb tail buried beneath a high noise floor loses envelopment (see Direct, Diffuse and Envelopment) — so dynamic range indirectly protects spatial subtlety — but no number of bits changes where a source images.
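
The dynamic-range figures above can be reproduced in a couple of lines. The sketch evaluates the rounded rule of thumb alongside the unrounded expression it comes from (a full-scale sine against uniform quantization noise); the function names are illustrative.

```python
import math

def dynamic_range_db(bits: int) -> float:
    """Rule-of-thumb dynamic range for linear PCM at the given bit depth."""
    return 6.02 * bits + 1.76

def dynamic_range_db_exact(bits: int) -> float:
    """The same quantity without rounding the two constants."""
    return 20.0 * bits * math.log10(2.0) + 10.0 * math.log10(1.5)

for bits in (16, 24):
    print(bits, round(dynamic_range_db(bits), 2), round(dynamic_range_db_exact(bits), 2))
```

Neither number says anything about where a source images; they only set how far down the noise floor sits.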

Why 48 kHz and above? $48\ \mathrm{kHz}$ is the professional and broadcast baseline: it clears the hearing range with margin, locks cleanly to video frame rates, and is the assumed rate of the immersive delivery codecs above. Rates of $96\ \mathrm{kHz}$ appear in high-end music and post for the processing-headroom reasons noted, and immersive masters are typically delivered at $48\ \mathrm{kHz}$/$24$-bit unless a specific pipeline calls for more.

Rule of thumb

Spend your bandwidth on more channels/objects, not on more samples per channel — the spatial payoff is overwhelmingly in the former. Higher sample rates and bit depths buy processing headroom and a lower noise floor, not better imaging.

8. Interchange, future-proofing, and downmix/upmix

Because playback systems vary wildly and outlive any single mix, formats must survive translation across layouts. Two directions matter.

Downmix: fewer output channels than the mix. A 7.1.4 immersive mix must collapse gracefully to 5.1, to stereo, and to mono. Channel-based downmix is done with a fixed downmix matrix $\mathbf{D}$ mapping the $M$ input channels to $L < M$ outputs:

$$\mathbf{y} = \mathbf{D}\,\mathbf{x}, \qquad \mathbf{D} \in \mathbb{R}^{L \times M}.$$

The classic ITU stereo downmix of 5.1, for instance, folds centre into both L and R at $-3\ \mathrm{dB}$ and the surrounds in at a chosen attenuation:

$$\begin{aligned} L_o &= L + \tfrac{1}{\sqrt{2}}\,C + \alpha\,L_s, \\ R_o &= R + \tfrac{1}{\sqrt{2}}\,C + \alpha\,R_s, \end{aligned}$$

with $\alpha$ a standard surround coefficient (often $-3$ or $-6\ \mathrm{dB}$). Object-based audio sidesteps fixed matrices entirely: the renderer simply re-renders the same objects to the smaller layout, which is one of the strongest arguments for object delivery. The matrix machinery, and the history of matrix surround that pioneered it, are detailed in Surround and Matrix Encoding.
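
Written out as a matrix, the stereo fold-down above looks like the following sketch. The input channel order (L, R, C, LFE, Ls, Rs) is an assumption made for the example, and the LFE column is zero because the equations above do not include it; the function names are illustrative.

```python
import math

CHANNELS = ["L", "R", "C", "LFE", "Ls", "Rs"]   # assumed input order

def stereo_downmix_matrix(surround_db: float = -3.0) -> list:
    """Build the 2 x 6 matrix D for the 5.1-to-stereo fold-down."""
    c = 1.0 / math.sqrt(2.0)                   # centre at -3 dB
    a = 10.0 ** (surround_db / 20.0)           # surround coefficient alpha
    return [
        [1.0, 0.0, c, 0.0, a, 0.0],            # Lo = L + c*C + alpha*Ls
        [0.0, 1.0, c, 0.0, 0.0, a],            # Ro = R + c*C + alpha*Rs
    ]

def apply_matrix(d: list, x: list) -> list:
    """Compute y = D x for one frame of input samples."""
    return [sum(coef * sample for coef, sample in zip(row, x)) for row in d]

D = stereo_downmix_matrix()
centre_only = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(apply_matrix(D, centre_only))            # centre appears equally in Lo and Ro
```

The same two functions cover any fixed channel-based downmix; only the matrix entries change.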

Upmix: more outputs than the source. Turning stereo into 5.1 or an immersive bed is fundamentally an estimation problem: the spatial information was never recorded, so an upmixer must infer it, typically by decomposing the signal into a primary (directional, correlated) component and an ambient (diffuse, decorrelated) component and steering each to appropriate speakers. This is far more delicate than downmix and never fully recovers a true multichannel original. Importantly, though, even a humble two-channel signal already contains exploitable spatial structure (inter-channel correlation, level and phase relationships), which is why upmixers can work at all; the stereo-is-already-spatial chapter unpacks that latent information.

Future-proofing in one principle

Keep the most general representation as your master. An object/scene-based BW64+ADM master can be rendered down to any present or future layout, while a channel-locked stereo or 5.1 master can only ever be upmixed by guesswork. Archive the representation with the geometry not baked in.

9. Choosing a format by use case

There is no single best format; the right choice falls out of where and how the content will be heard. Three archetypes:

Fixed installation (cinema, dome, museum, mixing room). The speaker layout is known and permanent, the listening area may be large, and you want maximal control. Here a channel-based delivery to the exact installed layout (or an object master rendered once to that layout) is ideal: you can calibrate the room to the channels and rely on them being there. For large or unusual arrays, scene-based or Wave Field Synthesis approaches become attractive because they decouple the mix from the speaker count. Permanence rewards specificity.

Travelling content (streaming, broadcast, Blu-ray). The playback system is unknown and varies per viewer — a soundbar here, a 5.1 there, a phone speaker somewhere else. This is the home turf of object-based delivery in a lossy codec (Dolby Digital Plus + Atmos, AC-4, MPEG-H, DTS:X): author once, let each device's renderer adapt. The master stays BW64+ADM; you encode out to whichever codec the platform demands. Adaptability is the whole game.

Headphones (mobile, VR/AR, personal listening). Two ears, no room, no crosstalk. The delivery target is binaural, but how you get there matters: for fixed linear content, binauralize a channel/object master into a two-channel binaural file (or carry the object stream and binauralize on-device, as streaming Atmos-for-headphones does). For interactive content with head-tracking, ship a scene (typically Ambisonic) plus an HRTF and render binaurally in real time, so that head rotation just rotates the field. Personalization (individual HRTFs) lives in this branch too.

A compact decision aid:

| Use case | Speaker layout known? | Preferred master | Preferred delivery |
| --- | --- | --- | --- |
| Fixed install | Yes, permanent | Channel or object | Channel to that layout |
| Travelling content | No, varies | Object (BW64+ADM) | Object codec (Atmos/AC-4/MPEG-H/DTS:X) |
| Headphones, linear | N/A (two ears) | Object or channel | Binaural render |
| Headphones, interactive | N/A | Scene (Ambisonic) + HRTF | Real-time binaural with head-tracking |

The through-line of this chapter: decide what your content is (channel, object, or scene), master in the most general form that captures it, and let the delivery format and renderer adapt to each playback reality. Formats are not the art — but choosing the wrong one quietly throws the art away.

References

  1. ITU-R Recommendation BS.775. Multichannel stereophonic sound system with and without accompanying picture. International Telecommunication Union. (Defines 5.1/3-2 geometry, downmix coefficients.)
  2. ITU-R Recommendation BS.2076. Audio Definition Model (ADM). International Telecommunication Union. (The metadata model for channel/object/scene content.)
  3. ITU-R Recommendation BS.2088. Long-form file format for the international exchange of audio programme materials with metadata (BW64). International Telecommunication Union.
  4. EBU Tech 3285. Specification of the Broadcast Wave Format (BWF). European Broadcasting Union, Geneva.
  5. Dolby Laboratories. Dolby Atmos: Specifications and Dolby Atmos Renderer / Master File documentation. Dolby Laboratories, Inc.
  6. ISO/IEC 23008-3. Information technology – High efficiency coding and media delivery in heterogeneous environments – Part 3: 3D audio (MPEG-H 3D Audio). International Organization for Standardization.
  7. F. Rumsey. Spatial Audio. Focal Press, Oxford, 2001.
  8. M. A. Gerzon. "Periphony: With-Height Sound Reproduction." Journal of the Audio Engineering Society, 21(1), 2–10, 1973. (Foundational Ambisonics.)
