Skip to main content

Surround & Matrix Systems

Surround sound is the family of spatial techniques that grew directly out of cinema and consumer stereo. Where stereo places sound across a single arc of two loudspeakers, surround wraps the listener in a ring — typically five full-range channels plus a low-frequency effects channel, the familiar "5.1". For three decades it was the dominant format for film, broadcast, music and games, and it remains the lingua franca of countless rooms and devices today.

This chapter treats surround through the recurring lens of this guide: encode then decode. Every spatial technique authors a field into some finite representation, then derives the loudspeaker or headphone feeds for the system that is actually present. Channel-based surround is the limiting case where the encode and decode are collapsed into one: the representation is the loudspeaker layout. That collapse is what makes channel audio simple, robust and universally understood — and also what makes it fragile, layout-locked, and ultimately the thing that object-based audio was invented to replace. Along the way the matrix systems — Dolby Stereo, Pro Logic — make the encode/decode split briefly visible again, and in doing so they reveal that matrix decoding is upmixing, a theme picked up in stereo is already spatial.

Channel-Based Audio: The Encode/Decode Collapse

What "channel-based" means

In a channel-based format, the signal stored or transmitted is a fixed set of audio channels, each defined by the loudspeaker it is meant to drive. A 5.1 mix contains a Left channel, a Right channel, a Centre channel, a Left Surround, a Right Surround, and an LFE channel. The semantics of the format are entirely positional: "Left" means "the signal for the loudspeaker at the front left". There is no separate description of where a sound is — only of which speaker plays what.

Contrast this with the general encode/decode pattern. In Ambisonics you encode a sound field into spherical-harmonic coefficients that are independent of any speaker layout, then decode to whatever array you have. In object-based audio you transmit audio objects plus position metadata, and a renderer computes speaker gains at playback time. In both, encode and decode are distinct stages separated in space and time.

Channel-based surround can be described in exactly the same vocabulary — but the two stages are fused at authoring time. The mix engineer, using amplitude panning (see amplitude panning), encodes a source at angle θ\theta into channel gains; the result is baked into the channels. The "decode" is then the trivial identity map: channel ii goes to loudspeaker ii.

The 1:1 layout mapping

Formally, let a source signal s(t)s(t) be panned to a set of NN channels with gains gig_i:

ci(t)=gis(t),i=1N.c_i(t) = g_i \, s(t), \qquad i = 1 \dots N .

The decode is simply

feedi(t)=ci(t),\text{feed}_i(t) = c_i(t),

i.e. the decode matrix is the identity IN\mathbf{I}_N. The entire spatial intent lives in the gains gig_i chosen during the encode, and those gains were computed for one specific geometry — the reference layout. Playback is correct only if the loudspeakers sit where the encoder assumed they would.

Key takeaway

A channel mix is a decode that has already happened, frozen into the file. This is what makes a 5.1 mix trivially playable — no renderer, no metadata, no computation — and also what makes it brittle: everything that isn't the reference layout (stereo headphones, a soundbar, a 7.1.4 room, a phone speaker) must re-derive speaker feeds from channels that were never meant to be re-derived. That re-derivation is downmix and upmix, and it occupies much of this chapter.

Why the collapse is fragile

Three fragilities follow immediately, and we return to each in depth:

  1. Geometric fragility. The panning gains assume specific speaker angles. Move the surrounds from ±110\pm 110^\circ to ±90\pm 90^\circ and every panned phantom shifts.
  2. Layout fragility. A 5.1 file has no meaning on a 22.2 system or a pair of earbuds without a conversion step that guesses the original spatial intent.
  3. Count fragility. Adding height or more surrounds means a new channel format, a new mix, and a new delivery — not a re-decode of the same data.

These are not implementation details; they are structural consequences of fusing encode and decode. Hold them in mind as we build up the rest of the system.

ITU-R BS.775: The Reference Geometry

The 3/2 arrangement

The international reference for multichannel surround is ITU-R Recommendation BS.775. It specifies the "3/2 stereo" arrangement — three front channels, two surround channels — that underlies all 5.1 work, together with an optional LFE.

The listener sits at the centre of a circle of radius rr (the reference distance, commonly 2–4 m, with all speakers equidistant). Azimuth θ\theta is measured from the forward direction, positive to the right (conventions vary; we use front =0= 0^\circ). The canonical positions are:

  • Centre (C): 00^\circ.
  • Left (L) / Right (R): ±30\pm 30^\circ. These are the same angles as conventional two-channel stereo, so an L/R pair is a stereo-compatible front.
  • Left Surround (LS) / Right Surround (RS): ±110\pm 110^\circ (the recommendation permits a range of roughly ±100\pm 100^\circ to ±120\pm 120^\circ).
  • LFE: a low-frequency effects channel, band-limited to roughly 20–120 Hz, with no defined direction (a subwoofer is largely non-localisable in this band — see psychoacoustics).

All main speakers should ideally lie on a circle so that path lengths — and therefore arrival times and levels — are equal at the reference seat. Equal height (ear level) is assumed; BS.775 in later revisions also describes height and additional layouts, but the 3/2 core is the part everyone shares.

Why these angles?

The front ±30\pm 30^\circ inherits directly from stereo: a 6060^\circ included angle gives a usefully wide front image while keeping phantom imaging stable for a centred listener. The centre speaker fills the middle of that arc with a real source.

The surrounds at ±110\pm 110^\circ are a deliberate compromise. They are pushed behind the listener's ±90\pm 90^\circ interaural axis so that they generate a sense of envelopment and "behind-ness" rather than crisp lateral localisation. Human localisation is poorest to the sides and rear (the "cone of confusion" and sparse rear cues), so surrounds are not expected to produce precise phantom images between LS and RS — the gap from 110110^\circ to 250250^\circ (i.e. behind the head) is large and only loosely filled. Surround was designed for ambience and effects, not for pinpoint rear sources. This is a key perceptual asymmetry: the front three are for accurate imaging, the rear two for envelopment (see direct, diffuse and envelopment).

7.1 variants: side vs back

"7.1" adds two channels to 5.1, but there are two incompatible conventions for where they go:

  • Cinema 7.1 (Dolby, "side + back"): the BS.775 surrounds move to the sides at roughly ±90\pm 90^\circ (Left/Right Surround), and two back surrounds are added at roughly ±135\pm 135^\circ to ±150\pm 150^\circ (Left/Right Back). This gives a side pair and a rear pair, improving front-to-back panning continuity.
  • Consumer / SDDS-style 7.1: sometimes used the two extra channels as front wide channels between L/R and C. This is now rare.
Labels do not equal geometry

Because "Ls/Rs" can mean ±110\pm 110^\circ in 5.1 but ±90\pm 90^\circ in cinema 7.1, channel labels are not enough to know geometry — yet another symptom of the encode/decode collapse, where the channel name is supposed to carry the position but does so only by external convention.

Layouts table

LayoutFrontSurroundBackHeightLFETotal
2.0 stereoL 30-30, R +30+302
3.0L, C, R (0,±300, \pm30)3
5.1 (BS.775)L, C, RLs/Rs ±110\pm11016
6.1L, C, RLs/Rs ±110\pm110Bc (mono back)17
7.1 cinemaL, C, RLs/Rs ±90\pm90Lb/Rb ±135\pm13518
7.1.4 (immersive)L, C, RLs/Rs ±90\pm90Lb/Rb ±135\pm1354 tops112
9.1.6L, C, R + widessidesbacks6 tops116
22.2 (NHK)multiplemultiplemultipleupper + lower224

The proliferation of rows is itself the argument of this chapter: each is a separate format requiring a separate mix, because the channels are the layout. The formats taxonomy is treated in formats.

The Centre Channel

Phantom vs discrete centre

In two-channel stereo, a centred image is a phantom: equal signals in L and R fuse, by the summing-localisation mechanism (see psychoacoustics), into an apparent source between the speakers. The phantom is acoustically real to the listener but exists nowhere physically.

The defining addition of 3/2 surround over stereo is a discrete centre loudspeaker at 00^\circ. Now a centred sound can be a real source radiating from the centre position, rather than a phantom synthesised from L and R. This brings three advantages.

Off-centre stability

A phantom centre is exquisitely sensitive to listener position. Move off the median plane and the precedence effect (see psychoacoustics) takes over: the nearer speaker arrives first and louder, and the phantom collapses into that speaker. The shift in interaural time difference for a lateral displacement Δx\Delta x toward (say) the left speaker is roughly

ΔtΔdc=Δx(cosθnearcosθfar)c,\Delta t \approx \frac{\Delta d}{c} = \frac{\Delta x \,(\cos\theta_{\text{near}} - \cos\theta_{\text{far}})}{c},

and even a few hundred microseconds of inter-speaker delay is enough to pull the image hard to one side. A discrete centre is immune: it is a single real source, so wherever you sit the centre energy still comes from the centre speaker. In a cinema with a hundred seats spread across tens of degrees, this is the difference between dialogue locked to the screen and dialogue smearing to the wall.

Dialogue anchoring

In film, dialogue lives almost entirely in the centre channel. Anchoring speech to a discrete centre keeps voices nailed to the actors on screen for every seat in the house, and crucially it separates dialogue from the L/R music-and-effects bed. That separation has downstream value: it lets broadcasters offer dialogue-level adjustment, alternate languages, and cleaner downmixes. It is also why centre-channel level errors are so audible — get the centre wrong and the most important content in the mix (the words) is wrong.

The cost: a third front channel to fill

A discrete centre is not free. Panning a source between L and C, or C and R, now involves three front speakers, and the centre's presence changes the timbre of frontal phantoms (a real centre source has different comb-filtering at the ears than an L+R phantom). Some music engineers deliberately under-use the centre, or fold a phantom-centre stereo bed across L/R while reserving C for specific elements, precisely to control this. The centre is powerful but it constrains the front imaging in ways pure stereo does not.

LFE versus Bass Management

These two are constantly confused, yet they are different things living on opposite sides of the encode/decode boundary. Getting this right is essential for correct level and correct bass.

The LFE channel: a creative .1

The LFE (Low-Frequency Effects) channel is the ".1" in "5.1". It is a content channel — part of the encode — carrying deliberately authored low-frequency material: explosions, rumble, the visceral bottom of a kick. Two facts define it:

  1. It is band-limited to roughly 20–120 Hz (the "0.1" name reflects its restricted bandwidth, about a tenth of a full-range channel, not a tenth of a speaker).
  2. It carries a +10+10 dB in-band gain headroom. By the standard (originating in Dolby cinema practice and codified for AC-3), the LFE channel is recorded and decoded with an extra +10+10 dB of in-band level relative to a single main channel. This lets the format store very loud low-frequency effects without clipping the main channels, since a single explosion might need more low-end energy than one full-range channel's headroom allows.

In linear terms, +10+10 dB is a voltage/amplitude factor of

1010/20=100.53.162.10^{10/20} = 10^{0.5} \approx 3.162 .

So when a decoder routes LFE to a subwoofer, it must re-apply that +10+10 dB: the channel was authored 1010 dB hotter relative to mains on the meter, but it must be reproduced 1010 dB louder again to sit correctly. (Equivalently, calibration places the LFE so that its in-band SPL is +10+10 dB relative to a single main channel for the same digital level.)

Common mistake

Forgetting the +10+10 dB LFE gain is a classic mixing error: the bass effects come out anaemic.

The LFE is optional content. Many mixes barely use it; a film with no authored sub-bass effects can have a near-silent LFE channel. It is not "the subwoofer channel" — that is the confusion we now untangle.

Bass management: a playback-side decode

Bass management is something else entirely. It is a playback-side process — part of the decode/reproduction chain — that exists because real rooms cannot put a full-range loudspeaker at every position. Small satellite speakers cannot reproduce 4040 Hz cleanly, so the system redistributes low frequencies to a subwoofer.

A bass manager does the following to every main channel:

  1. Apply a high-pass filter (typically a 2nd- or 4th-order filter at a crossover frequency fc80f_c \approx 80 Hz, the THX-standard value) to each main channel, removing the deep bass the small speaker cannot handle.
  2. Sum the removed low-frequency portions of all mains together with the LFE channel.
  3. Apply a low-pass filter to that sum and send it to the subwoofer.

So the subwoofer typically reproduces (LPF of  [iLF(ci)+LFE])\bigl(\text{LPF of}\;[\,\sum_i \text{LF}(c_i) + \text{LFE}\,]\bigr), while the mains reproduce their high-passed selves. Schematically, the sub signal is

sub(t)=HLP ⁣[  i=15HLP(ci(t))bass stripped from mains  +  gLFELFE(t)gLFE=3.162 (+10dB)  ],\text{sub}(t) = H_{\text{LP}}\!\left[\; \underbrace{\sum_{i=1}^{5} H_{\text{LP}}\bigl(c_i(t)\bigr)}_{\text{bass stripped from mains}} \;+\; \underbrace{g_{\text{LFE}}\, \text{LFE}(t)}_{g_{\text{LFE}} = 3.162\ (+10\,\text{dB})} \;\right],

and each main is reproduced as HHP(ci(t))H_{\text{HP}}(c_i(t)).

Why they are different

The distinction is now clear:

  • The LFE channel is content authored into the mix (encode side). It is a creative decision by the mixer. It exists in the file whether or not the playback system has a subwoofer.
  • Bass management is a reproduction strategy (decode side). It depends entirely on the speakers in the room — their low-frequency capability and the crossover chosen. A room with five genuinely full-range mains needs no bass management at all, yet still has an LFE channel to reproduce.
Do not confuse LFE with the subwoofer feed

A subwoofer in a bass-managed system plays two distinct things: the redistributed low end of the main channels (managed bass) and the LFE content (with its +10+10 dB applied). They arrive at the same physical box but for completely different reasons. Confusing "LFE" with "the subwoofer feed" leads to wrong levels, double-counted bass, and the perennial question of whether full-range mains should also dump their bass into the sub.

A worked routing example

Suppose a 5.1 decoder feeds a system with five small satellites (capable down to 8080 Hz) plus one subwoofer. Crossover fc=80f_c = 80 Hz, 4th-order Linkwitz–Riley.

Take a single main channel L carrying a kick drum with energy both above and below 8080 Hz, at a reference level. And take the LFE carrying a sustained 4040 Hz rumble at 20-20 dBFS.

  • L channel: high-passed at 8080 Hz. Its content above 8080 Hz goes to the L satellite. Its content below 8080 Hz is stripped and sent to the bass bus.
  • Other mains (C, R, Ls, Rs): same treatment; their sub-8080 Hz parts join the bass bus.
  • LFE: the 4040 Hz rumble at 20-20 dBFS gets the +10+10 dB LFE gain, becoming effectively 10-10 dB relative to a single main at the same digital level, then is low-passed and added to the bass bus.

If the LFE rumble is at 20-20 dBFS and a main's bass component is at, say, 26-26 dBFS, then after the LFE's +10+10 dB the rumble sits at an effective 10-10 dB-equivalent while the main's redistributed bass sits at 26-26 dB. The subwoofer reproduces the sum. In linear amplitude, with the main's bass at 1026/20=0.050110^{-26/20}=0.0501 and the LFE post-gain at 1020/20×3.162=0.1×3.162=0.316210^{-20/20}\times 3.162 = 0.1 \times 3.162 = 0.3162, the combined sub amplitude (assuming coherent worst case) is 0.0501+0.3162=0.3660.0501 + 0.3162 = 0.366, i.e. about 8.7-8.7 dBFS. The point of the worked numbers: the LFE's +10+10 dB makes it dominate the sub bus, which is exactly the intent — authored effects bass is meant to be felt — while the managed bass from the mains rides underneath. A calibration error of "forgot the +10+10 dB" would have the rumble at 0.10.1 instead of 0.31620.3162, i.e. 1010 dB too quiet, and the explosion would land soft.

The Matrix Lineage: Encode/Decode Made Visible

The matrix systems are historically and conceptually central, because they are the one place in channel-audio history where the encode and decode separate again and become visible. A matrix system encodes more channels than it transmits, and decodes them back — the very pattern the channel collapse hides.

Dolby Stereo (1976): a 4:2:4 matrix

By the mid-1970s cinemas wanted surround but the optical soundtrack on 35 mm film had room for only two channels alongside the picture. Dolby Stereo (1976, building on earlier quadraphonic matrix ideas) solved this with a 4:2:4 matrix: four source channels — Left, Centre, Right, Surround — are encoded (matrixed) down to two transmission channels called Lt and Rt ("Left total", "Right total"), carried as ordinary 2-channel optical stereo, then decoded back to four on playback.

The encode is a linear matrix. With unity feeds:

Lt=L+12C+12(jS),Rt=R+12C12(jS),\begin{aligned} L_t &= L + \tfrac{1}{\sqrt{2}}\,C + \tfrac{1}{\sqrt{2}}\,(j\,S),\\[4pt] R_t &= R + \tfrac{1}{\sqrt{2}}\,C - \tfrac{1}{\sqrt{2}}\,(j\,S), \end{aligned}

where the 12\tfrac{1}{\sqrt{2}} (3-3 dB) coefficients split centre and surround equally between the two transmission channels (preserving total power), and the crucial detail is the jj: the surround signal is fed with a +90+90^\circ phase shift into LtL_t and 90-90^\circ into RtR_t (a net 180180^\circ relative phase between the two surround contributions). The ±90\pm 90^\circ Dolby surround phase encoding is what lets a passive decoder separate surround from centre, as we now show.

Passive decode

A passive Dolby Stereo decoder recovers four outputs by simple sums and differences:

C=Lt+Rt,S=LtRt,L=Lt,R=Rt.\begin{aligned} C' &= L_t + R_t,\\ S' &= L_t - R_t,\\ L' &= L_t,\qquad R' = R_t . \end{aligned}

Why does this work? Consider the centre and surround contributions:

  • Centre: appears as +12C+\tfrac{1}{\sqrt 2}C in both LtL_t and RtR_t with the same sign. So Lt+RtL_t + R_t reinforces CC (giving 2C\sqrt 2\,C) while LtRtL_t - R_t cancels it. Centre therefore steers cleanly to the CC' output and is rejected by SS'.
  • Surround: appears as +12jS+\tfrac{1}{\sqrt 2}jS in LtL_t and 12jS-\tfrac{1}{\sqrt 2}jS in RtR_topposite signs. So LtRtL_t - R_t reinforces SS (giving 2jS\sqrt 2\, jS) while Lt+RtL_t + R_t cancels it. Surround steers cleanly to SS' and is rejected by CC'.

The sum/difference operation is the anti-matrix: centre is the in-phase common-mode component, surround is the anti-phase difference component, and they are separated by symmetry. This is a beautiful, fully analogue encode/decode pair — and it is exactly the mid/side logic of stereo (sum = mid, difference = side; see stereo), repurposed to carry a hidden surround channel.

A worked passive-decode example

Encode a single surround-only source of unit amplitude (a helicopter behind the audience), with L=C=R=0L=C=R=0, S=1S=1:

Lt=+12j,Rt=12j.L_t = +\tfrac{1}{\sqrt 2} j, \qquad R_t = -\tfrac{1}{\sqrt 2} j .

Passive decode:

C=Lt+Rt=+12j12j=0,C' = L_t + R_t = +\tfrac{1}{\sqrt 2}j - \tfrac{1}{\sqrt 2}j = 0, S=LtRt=+12j+12j=2j.S' = L_t - R_t = +\tfrac{1}{\sqrt 2}j + \tfrac{1}{\sqrt 2}j = \sqrt{2}\, j .

The centre output is exactly zero — perfect rejection — and the surround output carries the full signal (the 2\sqrt 2 and the residual jj phase are cosmetic and removed by normalisation and the surround delay/filter chain). Now do the reverse: a centre-only source, C=1C=1, L=R=S=0L=R=S=0:

Lt=Rt=12,C=2,S=0.L_t = R_t = \tfrac{1}{\sqrt 2}, \quad C' = \sqrt 2, \quad S' = 0 .

Again perfect separation along the C/S axis. The limitation appears with simultaneous content: a source that is genuinely in L and a source in C both contribute to LtL_t, and the passive matrix cannot tell them apart. Adjacent-channel separation is only about 33 dB in a passive 4:2:4 matrix (L vs C, C vs R, etc.), even though opposite pairs (L vs R, C vs S) separate fully. That weak adjacent separation is the whole motivation for active steering.

Active matrices and steering: Pro Logic and beyond

A passive matrix is open-loop: its decode coefficients are fixed. An active matrix decoder adds a steering control loop. It continuously analyses the dominant direction of the soundfield — by comparing the levels and phase relationships of LtL_t and RtR_t — and dynamically adjusts the decode coefficients to enhance separation along the dominant axis, applying gain to the dominant output and attenuating the others (a voltage-controlled, signal-dependent matrix).

  • Dolby Pro Logic (1987, consumer): the home version of the cinema 4:2:4 decoder, with active steering that pushes adjacent-channel separation from ~33 dB up toward 30304040 dB along the currently dominant direction. It outputs L, C, R and a single, mono, band-limited surround (typically rolled off above ~77 kHz and delayed).
  • Dolby Pro Logic II (2000, Jim Fosgate / Dolby): a major redesign that derives stereo (left/right) surrounds with full bandwidth, plus better steering and an optional "Music" mode for upmixing plain stereo. Its surround encode does not rely on the 9090^\circ phase trick alone, using improved matrix coefficients for better stability.
  • Pro Logic IIx / IIz: extensions to 6.1 / 7.1 (IIx adds back-surrounds) and to height (IIz adds front height channels), again by deriving extra outputs from fewer input channels.

What steering buys and costs

Steering buys apparent separation: dialogue stays glued to the centre even when music is loud in L/R, and a panned effect tracks crisply. It costs:

  • Steering artefacts ("pumping"/"breathing"): because the matrix gains move in response to the signal, the background level of non-dominant channels modulates with the dominant content. A loud centre voice can audibly "duck" the surround ambience.
  • Single-dominant assumption: the steering logic can only chase one dominant direction at a time. Two simultaneous strong sources in different directions confuse it.
  • Non-determinism: the decoded result depends on decoder internals, so a mix does not reproduce identically on different decoders.

These costs are the practical reason the industry moved to discrete transmission as soon as bandwidth allowed. But note the conceptual payoff: an active matrix decoder is, literally, an upmixer — it takes 2 channels and intelligently fabricates more. Hold that thought for the closing section.

Discrete Surround: Dolby Digital and DTS

From matrix to discrete

A matrix is a workaround for a channel-count bottleneck. Once digital compression made it possible to carry five full, independent channels plus LFE in the bit budget of a soundtrack, the workaround was no longer needed: transmit each channel discretely and skip the matrix entirely.

  • Dolby Digital (AC-3), 1992: the perceptual audio codec introduced on 35 mm film (data printed between the sprocket holes) and later the backbone of DVD, ATSC television and Blu-ray. It carries 5.1 discrete channels (L, C, R, Ls, Rs, LFE) using a perceptual transform coder (an MDCT-based codec exploiting masking — again, see psychoacoustics) at bit rates around 384384640640 kbit/s for 5.1. The technical groundwork is documented by Todd, Davidson, Davis and colleagues.
  • DTS (Digital Theatre Systems), 1993: a competing discrete 5.1 system, first used on Jurassic Park, carried on a separate CD-ROM synchronised to the film (and later on DVD/Blu-ray). DTS uses a higher bit rate / lighter compression than AC-3 in many configurations.

Both encode their channels with perceptual compression, but the spatial representation is discrete: channel ii is independent of channel jj.

Why discrete beats matrix

Property4:2:4 matrix (Pro Logic)Discrete 5.1 (AC-3 / DTS)
Transmitted channels25.1
Adjacent separation~3 dB passive, steered otherwiseFull (channels independent)
Surroundmono, band-limited (PL1)full-range stereo Ls/Rs
Steering artefactsyes (pumping)none
Reproducibilitydecoder-dependentdeterministic
Backward stereo-compatibleyes (Lt/Rt plays as stereo)via metadata downmix

Discrete wins on every axis except one: a matrix-encoded Lt/RtL_t/R_t pair is itself a perfectly valid stereo signal, so it is automatically backward-compatible with two-channel playback. Discrete formats recover that compatibility differently — through downmix metadata, the subject of the next section. The trade is: matrix gives you free stereo fallback but poor separation; discrete gives you perfect separation but must carry instructions for how to fold down. The encode/decode collapse returns here too: even discrete 5.1 is still channel-based, so it still cannot play on an arbitrary layout without conversion.

Downmix: Folding Surround Down

The problem

A discrete 5.1 mix must play on stereo and mono systems — billions of them. Because the channels are the layout, there is no renderer to consult; instead the format defines a downmix: a fixed linear combination that folds 5.1 into stereo (Lo/Ro, "Left only/Right only" stereo, or Lt/Rt matrix-stereo) and stereo into mono. This is a decode — re-deriving feeds for a smaller layout from channels that assumed a larger one.

The standard ITU/Dolby coefficients

The widely used stereo downmix (the ITU-R BS.775 / Dolby-style Lo/RoL_o/R_o "stereo" downmix) combines the front, centre and surround channels as:

Lo=L+aC+bLs,Ro=R+aC+bRs,\begin{aligned} L_o &= L + a\,C + b\,L_s,\\ R_o &= R + a\,C + b\,R_s, \end{aligned}

with the canonical coefficients

a=120.707    (3 dB),b=120.707    (3 dB),a = \frac{1}{\sqrt 2} \approx 0.707 \;\; (-3\ \text{dB}), \qquad b = \frac{1}{\sqrt 2} \approx 0.707 \;\; (-3\ \text{dB}),

so that the centre folds equally and at 3-3 dB into both L and R (re-creating a phantom centre at equal power), and each surround folds at 3-3 dB into its same-side front. The LFE is normally discarded in downmix (most stereo systems cannot reproduce it and it would otherwise overload). AC-3 metadata lets the mixer choose the centre and surround mix levels from a defined set; the common alternatives are:

  • Centre mix level: 3.0-3.0 dB (1/21/\sqrt2), 4.5-4.5 dB (0.5950.595), or 6.0-6.0 dB (0.50.5).
  • Surround mix level: 3.0-3.0 dB, 4.5-4.5 dB, or 6.0-6.0 dB.
Rule of thumb

The 4.5-4.5 dB option (104.5/20=0.595710^{-4.5/20} = 0.5957) is a frequent default that pulls the centre and surrounds back slightly relative to the straight 3-3 dB sum, reducing the risk that folded-down dialogue and surround ambience overpower the L/R bed.

For the Lt/RtL_t/R_t matrix-compatible downmix (so the result can be re-expanded by a Pro Logic decoder), the surrounds are instead combined in anti-phase between channels, mirroring the Dolby Stereo encode of the previous section.

A worked downmix example

Take a moment of a 5.1 mix with these instantaneous channel amplitudes (all in the same units, taking a worst-case coherent sum):

L=0.30,C=0.50,R=0.30,Ls=0.20,Rs=0.20,LFE=0.40.L = 0.30,\quad C = 0.50,\quad R = 0.30,\quad L_s = 0.20,\quad R_s = 0.20,\quad \text{LFE} = 0.40 .

Using the Lo/RoL_o/R_o stereo downmix with a=b=1/2=0.707a=b=1/\sqrt 2 = 0.707 and discarding LFE:

Lo=L+0.707C+0.707Ls=0.30+0.707(0.50)+0.707(0.20),L_o = L + 0.707\,C + 0.707\,L_s = 0.30 + 0.707(0.50) + 0.707(0.20), Lo=0.30+0.3536+0.1414=0.795.L_o = 0.30 + 0.3536 + 0.1414 = 0.795 .

By symmetry Ro=0.795R_o = 0.795. Compare this to the loudest single contributing channel (C=0.50C = 0.50): the fold-down produces an amplitude of 0.7950.795, which exceeds unity-referenced headroom risk if several channels are correlated. Now repeat with the safer 4.5-4.5 dB coefficients (a=b=0.5957a=b=0.5957):

Lo=0.30+0.5957(0.50)+0.5957(0.20)=0.30+0.2979+0.1191=0.717.L_o = 0.30 + 0.5957(0.50) + 0.5957(0.20) = 0.30 + 0.2979 + 0.1191 = 0.717 .

The 4.5-4.5 dB choice shaves the peak from 0.7950.795 to 0.7170.717 — about 0.90.9 dB of headroom recovered — which is exactly why broadcasters often prefer it: it lowers the chance that a centre-heavy dialogue scene, folded to stereo, clips or sounds forward. This is the practical face of the layout collapse: the same content must be re-balanced for the smaller layout, and there is no single correct answer, only metadata-driven choices.

Mono and stereo fold-down safety

Mono fold-down sums Lo+RoL_o + R_o (often with a further 3-3 dB or 6-6 dB), and now the phase relationships matter acutely. Anything panned in anti-phase between L and R — wide stereo effects, or surrounds folded in Lt/RtL_t/R_t anti-phase — will cancel in mono.

Mono cancellation

A surround helicopter that sounds great in 5.1 and fine in stereo can vanish on a mono TV. The mixer's defence is mono compatibility checking: monitor the mono fold-down during the mix and avoid placing essential content (especially dialogue, which is why dialogue lives in the in-phase centre) where it can cancel. The general lesson — that every reduction in channel count is a lossy, phase-sensitive decode — is the channel-based system's permanent tax.

Pros and Cons of Channel-Based Audio

Having built the system from the ground up, we can now weigh it honestly.

Strengths

  • Simplicity and ubiquity. A channel mix needs no renderer. Decode is the identity map. Every interface, console, recorder and codec understands "this is the Left channel". Thirty years of films, broadcasts, games and music are encoded this way, and the entire professional infrastructure — monitoring, metering, transmission — is built around it.
  • Deterministic and auditable. What the mixer hears on the reference layout is exactly what is transmitted. There is no playback-time computation to second-guess. For broadcast compliance and archival, this determinism is valuable.
  • Robustness. On the reference layout, with correct calibration, channel audio is rock-solid. No steering artefacts, no rendering assumptions — just speakers playing their channels.
  • Efficient for the target layout. If you know the playback geometry (a cinema built to spec), baking the decode in is maximally efficient: no per-object computation, no metadata overhead.

Weaknesses

  • Layout-lock (the central flaw). Because the channels are the layout, a mix is correct on exactly one geometry. Everything else needs a lossy, choice-laden conversion (downmix or upmix). A 5.1 mix has no native meaning on 7.1.4, on headphones, or on a soundbar.
  • Geometric fragility. Panning gains assume specific speaker angles; non-reference angles shift every phantom.
  • Combinatorial format explosion. Stereo, 5.1, 7.1, 7.1.4, 9.1.6, 22.2 are each separate deliverables requiring separate mixes. Adding height or width means a new format, not a re-decode.
  • No object-level control downstream. You cannot raise just the dialogue, or re-position a single source at playback, because the sources are already summed into channels. The mix is a frozen decode.
  • Lossy reductions are phase-sensitive. As shown, downmix and mono fold-down can cancel content and overload headroom.

The layout-lock problem, stated precisely

Every fragility above reduces to one structural fact: the encode and the decode are the same step. Because the mixer's encode is the decode, there is no representation of the spatial intent that survives independently of the target layout. The intent exists only as it was rendered. Any other layout must reverse-engineer the intent from the rendered channels — and reverse-engineering is, in general, impossible to do exactly. This is not a bug to be fixed within channel audio; it is the defining property of channel audio. Fixing it requires un-collapsing the encode and decode — which is precisely what the next paradigms do.

Key Point: Matrix Decoding Is Upmixing — and Objects Are the Successor

The realisation

We end by drawing together two threads that have run through the chapter. A Dolby Pro Logic decoder takes a 2-channel Lt/RtL_t/R_t signal and fabricates centre, surround and (in PLII) height channels that were not discretely present. State that plainly: an active matrix decoder is an upmixer. It analyses inter-channel level and phase, infers spatial intent, and synthesises additional speaker feeds.

The unbroken lineage

A two-channel signal already contains spatial information — in its inter-channel level differences, phase relationships and decorrelation (see stereo is already spatial) — and a sufficiently clever decoder can extract that information to drive more loudspeakers than the signal nominally has channels. The matrix systems were the first commercial upmixers, and their modern descendants (frequency-domain, source-separation-based upmixers; see DAM Audio's own work in this area and the discussion in the HSR stereo upmixing article) do the same job with far more sophistication. The lineage is unbroken: passive matrix → active steering → modern statistical upmix. All of them are decoders re-deriving a layout from a representation that has fewer channels than the layout demands.

Why this matters for the encode/decode story

Recognising matrix decode as upmix exposes the deepest truth of channel audio: even within the "collapsed" paradigm, the moment you play a mix on anything but its reference layout, you are forced back into a genuine decode — you must reconstruct a field from a representation. The collapse was never permanent; it only held on the one blessed geometry. Off that geometry, downmix and upmix re-introduce the very encode/decode separation the format tried to avoid, but now without the original spatial metadata, so they must guess. The whole apparatus of downmix coefficients, steering logic and statistical upmix exists to paper over the information that the collapse threw away.

Objects as the successor

The logical conclusion is to stop collapsing in the first place — to keep the spatial intent in a layout-independent representation and defer the decode to playback, where the actual speaker geometry is known. That is object-based audio (see object-based): transmit each source as audio plus position metadata, and let a renderer compute the speaker gains for whatever layout is present — 5.1, 7.1.4, 22.2, a soundbar, or binaural headphones — by amplitude panning or other methods at decode time. The encode (author objects) and decode (render to the room) are finally, cleanly separated. The same impulse drives scene-based Ambisonics, which represents the field in spherical harmonics and decodes to any array.

Channel-based surround is therefore best understood not as a dead end but as the first answer to the spatial-audio problem — an answer that traded flexibility for simplicity by baking the decode into the file. It remains everywhere, it remains useful, and on its reference layout it remains excellent. But its fragilities — geometric, layout, count — are the precise problems that the object-based and scene-based successors were designed to solve, by restoring the encode/decode separation that channel audio collapsed. Understanding why the collapse fails is the best possible preparation for understanding why objects and Ambisonics are built the way they are. For the broader taxonomy of how these formats relate, continue to formats; for the perceptual foundations underpinning phantom imaging, the precedence effect and bass non-localisation that this chapter leaned on throughout, revisit psychoacoustics.

References

  1. ITU-R Recommendation BS.775Multichannel stereophonic sound system with and without accompanying picture. International Telecommunication Union, Geneva. (Defines the 3/2 reference geometry, the ±30\pm30^\circ / ±110\pm110^\circ layout, LFE, and standard downmix.)
  2. Todd, C., Davidson, G., Davis, M., Fielder, L., Link, B., Vernon, S. (1994). AC-3: Flexible Perceptual Coding for Audio Transmission and Storage. 96th AES Convention, Preprint 3796.
  3. ATSC A/52Digital Audio Compression Standard (AC-3, E-AC-3). Advanced Television Systems Committee. (LFE +10+10 dB, downmix metadata and coefficients.)
  4. Dressler, R. (2000). Dolby Pro Logic II Decoder — Principles of Operation. Dolby Laboratories.
  5. Gerzon, M. A. (1992). Optimum Reproduction Matrices for Multispeaker Stereo. Journal of the Audio Engineering Society, 40(7/8), 571–589.
  6. Rumsey, F. (2001). Spatial Audio. Focal Press, Oxford. (Channel-based surround, downmix, LFE and bass management.)
  7. Holman, T. (2008). Surround Sound: Up and Running (2nd ed.). Focal Press. (Practical 5.1/7.1 layout, calibration, bass management.)
  8. Blauert, J. (1997). Spatial Hearing: The Psychophysics of Human Sound Localization (revised ed.). MIT Press. (Summing localisation, precedence effect underlying phantom imaging.)
  9. Bosi, M., Goldberg, R. E. (2003). Introduction to Digital Audio Coding and Standards. Kluwer Academic. (AC-3/DTS perceptual coding context.)

← Back to Spatialization Techniques