Stereo Is Already Spatial

Every chapter of this part has followed one logic: an author or a microphone array encodes a sound field into a finite representation, and a renderer decodes that representation into the loudspeaker or headphone feeds for whatever system is physically present. Amplitude panning encodes a direction into two or three gains; Ambisonics encodes the field into spherical-harmonic coefficients; Wave Field Synthesis encodes wavefronts into driving functions; binaural encodes the two ear signals. In each case the representation is the contract, and the decoder is free to serve a stereo pair, a 7.1.4 ceiling rig, a 64-channel ring, or a pair of earbuds.

This chapter applies that same encode/decode lens to the single most common format on Earth: two-channel stereo. The claim is not rhetorical. A stereo mix is not two mono signals that happen to travel together. It is a deliberate, lossy-but-rich spatial encoding, with measurable parameters — inter-channel level, time and correlation differences — that map directly onto the perceptual cues developed in psychoacoustics. Once you see stereo as an encoding, the problem of playing it on many speakers stops being "upmixing" in the pejorative sense (inventing channels) and becomes what it actually is: decoding a field, then re-rendering it with the spatial tools of this part. That reframing is the bridge from the academic techniques of Part II to the DAM Audio production tools — HSR, RIPL, ISE and ICS — that close it.

The Thesis: Stereo Is an Encoding, Not Two Mono Signals

From "two channels" to "one field"

It is tempting to model a stereo file as a $2 \times N$ matrix of samples and stop there. That model is correct about storage and wrong about meaning. The two channels of a competent stereo mix are not independent: they are two projections of an intended spatial scene, and the relationships between them — not the channels in isolation — carry the spatial information. This is exactly the structure introduced in stereo: the perceived position of a source between two loudspeakers is set by the ratio of the two channel signals, not by either one alone.

Consider a single dry source panned by a constant-power law to azimuth parameter $\theta$. The two channel signals are

L(t) = g_L\, s(t), \qquad R(t) = g_R\, s(t), \qquad g_L^2 + g_R^2 = 1 .

Nothing here is "left audio" and "right audio" as separate content. There is one signal $s(t)$ , and a pair of gains that encodes a direction. The listener's auditory system reconstructs that direction from the inter-channel level difference (and, in real rooms, the summing localization of the two arrivals). The encoding is the gain pair; the decoding is whatever maps that pair back onto the available transducers.
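The gain pair can be sketched in a few lines. This is a minimal illustration that assumes the common sine/cosine constant-power law; the function name and the normalized pan parameter `p` are hypothetical and do not come from the text.

```python
import math

def constant_power_gains(p: float) -> tuple[float, float]:
    """Map a pan position p in [0, 1] (0 = hard left, 1 = hard right)
    to (g_L, g_R) with the sine/cosine law, so g_L^2 + g_R^2 = 1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    angle = p * math.pi / 2.0
    return math.cos(angle), math.sin(angle)

# Centre position: both gains are equal and their squares sum to 1.
g_left, g_right = constant_power_gains(0.5)
```

Whatever law is used, the point is the same: one signal plus two numbers, and the two numbers are the direction.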

Why this matters for everything downstream

If stereo is an encoding, then three things follow, and they organize the rest of the chapter:

A stereo signal has internal structure worth measuring — correlation, width, diffuseness — not just two waveforms (next section).
Treating the two channels as independent mono objects, or routing them naively to speaker subsets, destroys that structure (the "why the obvious fixes fail" section).
The correct operation is to estimate the encoded field and re-encode it for the target system — the universal encode/decode contract of techniques, applied in reverse to legacy content.

Blumlein understood this in 1931. His patent did not describe "panning a mono source"; it described capturing and reproducing the directional information of a sound field through inter-channel differences, with explicit conversion between level-difference (intensity) and time-difference (phase) representations.

Key takeaway

Stereo was conceived as spatial from its first day. The industry's later habit of thinking of "the left channel" and "the right channel" as two tapes is a storage convenience, not the physics.

What a Stereo Signal Actually Encodes

A stereo mix simultaneously carries three perceptually distinct kinds of content, and a good decoder must recover all three. They differ in their inter-channel correlation.

Localizable point sources: inter-channel level (and time) differences

A source meant to sit at a definite position is encoded with high inter-channel coherence and a level (and sometimes time) offset. Define the inter-channel level difference in decibels and the inter-channel time difference as

\mathrm{ICLD} = 20\log_{10}\!\frac{g_L}{g_R}, \qquad \mathrm{ICTD} = \tau_L - \tau_R .

For an amplitude-panned source these two channels are scaled copies of one signal, so their normalized cross-correlation at the right lag is essentially $1$ . Perceptually, the summing-localization mechanism fuses the two coherent arrivals into a single phantom image whose angle tracks the ICLD. This is the "primary" content: it has a place.

Extended sources: partial correlation and apparent source width

Real instruments, ensembles and reverberant tails are not points. They are encoded with partial correlation: the two channels share structure but also differ. The degree of similarity is the inter-channel cross-correlation coefficient

\mathrm{ICC} = \max_{\tau}\; \frac{\big|\,\mathbb{E}[L(t)\,R(t+\tau)]\,\big|}{\sqrt{\mathbb{E}[L^2]\,\mathbb{E}[R^2]}}, \qquad 0 \le \mathrm{ICC} \le 1 .

An $\mathrm{ICC}=1$ collapses to a point; $\mathrm{ICC}=0$ spreads to maximum width or diffuseness. Intermediate values control Apparent Source Width (ASW) — the perceived broadening of a source — a relationship developed in direct, diffuse and envelopment. Width is therefore not stored as a "width number"; it is stored as decorrelation, and any decoder that ignores correlation cannot reproduce it.
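As a rough sketch of how the ICC definition above could be estimated from sampled channels, the function below replaces the expectations with sums over a finite block and searches a small range of integer lags. The function name and the `max_lag` parameter are illustrative assumptions.

```python
import math

def icc(left, right, max_lag=0):
    """Peak magnitude of the normalized cross-correlation over lags
    -max_lag..max_lag (in samples). The result lies in [0, 1]."""
    n = len(left)
    if n != len(right) or n == 0:
        raise ValueError("channels must be non-empty and of equal length")
    norm = math.sqrt(sum(x * x for x in left) * sum(y * y for y in right))
    if norm == 0.0:
        return 0.0  # silence: no correlation to report
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        # Sum L[t] * R[t + lag] over the samples where both indices are valid.
        lo, hi = max(0, -lag), min(n, n - lag)
        acc = sum(left[t] * right[t + lag] for t in range(lo, hi))
        best = max(best, abs(acc) / norm)
    return best

# Two scaled copies of one signal: an amplitude-panned point source.
s = [math.sin(0.3 * t) for t in range(200)]
icc_point = icc([0.92 * x for x in s], [0.39 * x for x in s])
```

For scaled copies the estimate comes out at 1 up to rounding, which is the point-source case; lower values indicate width or diffuseness.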

Diffuse content and envelopment: low coherence

Late reverberation, applause, crowd noise and ambience are encoded with low coherence ( $\mathrm{ICC}\to 0$ ) and roughly equal energy in both channels. This near-uncorrelated content is what produces Listener Envelopment (LEV) — the sense of being surrounded — even on a two-speaker system, because the uncorrelated arrivals defeat the precedence-based fusion into a single image. The decoder's hardest job is to separate this enveloping bed from the directional foreground without smearing either.

The measurable parameter set (ISO/IEC 23003-1)

These quantities are not informal. The MPEG Spatial Audio Coding standard, ISO/IEC 23003-1 (MPEG Surround), formalizes a parameter set that captures exactly this structure between channel pairs:

Parameter	Symbol	Encodes perceptually	Range
Channel/Inter-channel Level Difference	CLD / ICLD	Lateral image position	roughly $-50$ to $+50$ dB
Inter-channel Cross-Correlation	ICC	Source width / diffuseness	$0$ to $1$
Inter-channel Phase Difference	IPD	Fine time/phase cue	$-\pi$ to $+\pi$
Inter-channel Time Difference	ICTD / OTT delay	Coarse time cue, envelopment	tens of samples

MPEG Surround's insight is precisely the thesis of this chapter: a multichannel scene can be carried as a downmix plus these spatial parameters, because the parameters are the spatial information. A legacy stereo file lacks the explicit side-chain of parameters — but the parameters are still latent in the two channels, recoverable by analysis. Decoding stereo for many speakers is, in effect, estimating an MPEG-Surround-like parameter set from the signal itself and then synthesizing the target layout.

A worked example: from gains to ICLD and width

Take a source panned with $g_L = 0.92$, $g_R = 0.39$ (note $0.92^2 + 0.39^2 \approx 0.85 + 0.15 = 1.00$, a constant-power pair). The encoded level difference is

\mathrm{ICLD} = 20\log_{10}\!\frac{0.92}{0.39} = 20\log_{10}(2.36) \approx 7.5\ \text{dB}.

By the tangent law of amplitude panning on a $\pm 30^\circ$ pair, that ratio places the phantom near

\tan\theta = \frac{g_L - g_R}{g_L + g_R}\tan 30^\circ = \frac{0.53}{1.31}\times 0.577 \approx 0.234 \;\Rightarrow\; \theta \approx 13^\circ \text{ left of centre}.

Because both channels are scaled copies of one $s(t)$ , $\mathrm{ICC}=1$ : the decoder should render this as a narrow image at $13^\circ$ . Now add an independent reverb return $r_L, r_R$ with equal energy and $\mathrm{ICC}=0.1$ at $-12$ dB relative to the dry source. The measured channel correlation of the sum drops below $1$ , and a competent analyzer must attribute the coherent part to a point at $13^\circ$ and the incoherent part to an enveloping bed — two outputs from one stereo input.
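The arithmetic of this example can be checked with a short script. It is only a sketch: the function names are illustrative, and the sign convention (positive angle toward the left) is an assumption chosen to match the example.

```python
import math

def icld_db(g_left: float, g_right: float) -> float:
    """Inter-channel level difference in dB for a pair of gains."""
    return 20.0 * math.log10(g_left / g_right)

def tangent_law_angle_deg(g_left: float, g_right: float,
                          half_span_deg: float = 30.0) -> float:
    """Phantom angle from the tangent law on a +/- half_span_deg pair.
    Positive values point toward the left loudspeaker."""
    ratio = (g_left - g_right) / (g_left + g_right)
    tan_theta = ratio * math.tan(math.radians(half_span_deg))
    return math.degrees(math.atan(tan_theta))

level = icld_db(0.92, 0.39)                # roughly 7.5 dB
angle = tangent_law_angle_deg(0.92, 0.39)  # roughly 13 degrees left
```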

The Hardware/Content Gap

Speakers have multiplied; content has not

Reproduction systems have exploded in channel count. A modern cinema runs dozens of independent surround and ceiling feeds; immersive music venues install rings and domes; cars ship with 12 to 30 drivers; even living-room soundbars synthesize height. Yet the catalogue that flows through these systems is overwhelmingly two-channel. Music streaming is essentially all stereo; broadcast, podcasts, user video, games' music stems, archival material — stereo. Industry estimates put genuinely immersive (object or scene-based) content at a low single-digit percentage of what is actually played; in round terms, 97 to 99% of real content is stereo, while the hardware to play far more is already deployed.

System	Independent channels	Typical content fed to it
Earbuds / headphones	2	Stereo
Phone / laptop	2	Stereo
Soundbar	3–9 (synth.)	Stereo, some 5.1
Car audio	12–30	Stereo (radio, streaming)
Home cinema	6–16	Mostly stereo, some 5.1/Atmos
Immersive venue	16–96+	Stems, but stereo masters dominate

Why "just make more immersive content" is not the answer

The gap will not be closed at the source. Re-authoring the world's stereo back-catalogue as objects is economically impossible, and live and broadcast pipelines emit stereo in real time with no stems to revisit. The only scalable lever is at playback: take the stereo that exists and render it well onto whatever is installed. That makes high-quality stereo-to-many-speakers a first-class spatialization problem, not a niche convenience — and it is exactly where the techniques of this part must be turned around to decode rather than only to author. (For why even two speakers are already doing spatial work, see stereo is already spatial within a pair.)

Why the Obvious Fixes Fail

Three naive approaches dominate cheap "surround" modes. Each breaks the encoding it is supposed to preserve.

Naive L/R routing: energy error and comb filtering

The simplest trick sends the left channel to several left-side speakers and the right channel to several right-side speakers (or copies L/R to surrounds). Two failures follow immediately.

Energy error. If one coherent signal drives $n$ speakers with equal gain $g$ , the acoustic pressures from coherent sources add roughly in amplitude in the listener's overlap region, not in power. Summing $n$ coherent contributions multiplies pressure by up to $n$ and on-axis energy by up to $n^2$ , whereas correct constant-power distribution requires the per-speaker gain to fall as $g = 1/\sqrt{n}$ so that

\sum_{i=1}^{n} g_i^2 = 1 .

Routing the same signal at unity to $n=3$ speakers therefore over-drives the directional energy by up to $10\log_{10}(n^2)=10\log_{10}9 \approx 9.5$ dB in the worst, fully coherent case relative to a single calibrated source, pulling images and unbalancing the mix.
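A small sketch of the worst-case figure, assuming fully coherent, in-phase arrivals at equal gain; the function names are illustrative.

```python
import math

def coherent_excess_db(n: int, gain: float = 1.0) -> float:
    """Level of n fully coherent, in-phase copies at equal gain,
    relative to a single source at unity gain: 20*log10(n * gain)."""
    return 20.0 * math.log10(n * gain)

def constant_power_gain(n: int) -> float:
    """Per-speaker gain that keeps the sum of squared gains at 1."""
    return 1.0 / math.sqrt(n)

excess_unity = coherent_excess_db(3)                       # about 9.5 dB
gain_sum = 3 * constant_power_gain(3) ** 2                 # 1 up to rounding
excess_cp = coherent_excess_db(3, constant_power_gain(3))  # about 4.8 dB
```

Note that the constant-power gains normalize the summed power of the feeds. At a point where all three copies arrive exactly in phase they still add in amplitude, so about 4.8 dB of excess remains there for $n=3$.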

Comb filtering. Worse, the multiple coherent copies arrive at the ears with different path delays. Two equal coherent arrivals separated by delay $\tau$ sum to a magnitude response

|H(f)| = 2\,\big|\cos(\pi f \tau)\big|,

with nulls at $f_k = (2k+1)/(2\tau)$ . A modest $\tau = 0.5$ ms inter-speaker path difference puts the first null at $1$ kHz, the next at $3$ kHz — right across the speech and presence band — producing the hollow, phasey timbre that destroys timbral neutrality and clouds the direct/diffuse balance.
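The response and its nulls are easy to evaluate numerically. The sketch below assumes two equal-level coherent arrivals, as in the formula above; the function names are illustrative.

```python
import math

def comb_magnitude(freq_hz: float, delay_s: float) -> float:
    """|H(f)| = 2*|cos(pi*f*tau)| for two equal coherent arrivals."""
    return 2.0 * abs(math.cos(math.pi * freq_hz * delay_s))

def comb_nulls(delay_s: float, count: int) -> list[float]:
    """First `count` null frequencies f_k = (2k+1)/(2*tau), in Hz."""
    return [(2 * k + 1) / (2.0 * delay_s) for k in range(count)]

tau = 0.5e-3                             # 0.5 ms path-delay difference
nulls = comb_nulls(tau, 2)               # near 1 kHz and 3 kHz
peak = comb_magnitude(2000.0, tau)       # a peak sits between the two nulls
depth = comb_magnitude(nulls[0], tau)    # essentially zero at the first null
```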

Warning

Routing coherent copies to many speakers is, acoustically, building a comb filter at the listener. (DAM's ICS exists precisely to suppress this residual interference; see the pipeline section.)

Stereo as two mono objects in a spatializer

Common mistake

A more sophisticated mistake is to import the two channels into an object spatializer and place "L" at $-30^\circ$ and "R" at $+30^\circ$ as two independent point objects. This is wrong because the panning machinery of this part — VBAP, Ambisonic encoding, WFS driving functions — assumes each object is a mono point source. The two stereo channels are not independent points; they are correlated projections of a whole scene.

Re-panning them as points:

Ignores inter-channel correlation. A centre phantom (equal coherent L and R) is the listener's fusion of two arrivals; rendering L and R as two separated point objects can either re-fuse into a hard centre or split into two images, depending on the listener's position, and in either case the original ASW is lost.
Mishandles diffuse content. The enveloping low-ICC bed gets forced to two points, collapsing envelopment into two localizable blobs.
Double-encodes phantom geometry. The stereo encoding already is a directional encoding; panning it again applies a second spatial transform on top of the first, with no model relating the two.

The spatializer is a superb decoder of point sources and decoded fields, but a stereo pair is neither until it has been analyzed. The missing step is the decode.

FFT-based upmixers: artefacts, latency, CPU

The classic academic upmixers (and many commercial ones) work in the short-time Fourier transform domain: window, FFT, estimate per-bin correlation and panning, repartition energy into output bins, inverse FFT, overlap-add. Conceptually elegant, but with practical costs:

Latency. A frame-based STFT imposes algorithmic latency on the order of the window length. A $4096$ -point window at $48$ kHz is about $85$ ms before overlap, far too much for live reinforcement or in-car use where lip-sync and feel matter.
Artefacts. Time-varying gains applied per FFT bin produce musical noise (warbling isolated tones), pre-echo on transients (a spread of energy before a click because the gain change smears across the whole window), and spectral-leakage smearing of sharp onsets. These are exactly the artefacts a mastering ear rejects.
CPU and block dependence. Per-channel FFT/IFFT plus overlap-add per output speaker scales poorly to many outputs and couples behaviour to block size, complicating real-time guarantees.
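To put numbers on the latency item in the list above, the sketch below converts a window length into its duration, which the text takes as the order of magnitude of the algorithmic latency. The helper name and the list of window sizes are illustrative.

```python
def window_duration_ms(window_samples: int, sample_rate_hz: float) -> float:
    """Duration of one analysis window in milliseconds."""
    return 1000.0 * window_samples / sample_rate_hz

# Window length versus duration at 48 kHz, for a few common sizes.
durations = {n: window_duration_ms(n, 48000.0) for n in (256, 1024, 4096)}
for size, ms in durations.items():
    print(f"{size:5d} samples -> {ms:6.2f} ms")
```

Shorter windows cut the delay but coarsen the frequency resolution that the per-bin correlation estimate depends on, which is the trade-off behind the figures above.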

None of these are fatal to offline research, but they disqualify FFT upmixing from transparent, low-latency, many-speaker playback — motivating a time-domain approach (HSR, below).

The Right Model: Decode the Field, Then Render

Upmixing is decoding, not fabrication

The pejorative sense of "upmixing" — inventing channels that were never recorded — comes from tools that fabricate energy (synthetic reverb, copied channels, FFT smear). The principled alternative reuses the contract of techniques: a stereo file is an encoding of a field, so the legitimate operation is to estimate that field's components and re-render them for the present system. Nothing is invented; the energy already present is re-attributed to the directions and diffuseness it encodes. This is the same move as Ambisonic decoding (coefficients to speakers) or binaural rendering (HRTF-encoded field to ears), only the source representation is legacy stereo.

Formally, model the stereo pair as a sum of a coherent primary field $\mathbf{p}(t)$ and a decorrelated ambient field $\mathbf{a}(t)$ :

\begin{bmatrix} L(t)\\ R(t) \end{bmatrix} = \underbrace{\begin{bmatrix} g_L\\ g_R \end{bmatrix} s(t)}_{\text{primary (ICC}\approx 1)} + \underbrace{\begin{bmatrix} a_L(t)\\ a_R(t) \end{bmatrix}}_{\text{ambient (ICC}\approx 0)} .

The decoder estimates $s(t)$ and its panning direction (from the coherent part) and the ambient pair (from the residual), then renders each component with the right tool: the primary as a directional source through panning/Ambisonics/WFS, the ambient as an enveloping bed spread across the surround field. This is the encode/decode discipline applied to the most ubiquitous encoding in existence.

Energy and perceptual invariants the decode must hold

A correct decode is constrained, not free. It must conserve total energy (no level inflation), preserve coherence relationships (so width and envelopment survive), and keep timbre neutral (no comb filtering, no spectral tilt). Concretely, across the whole output the decoder should satisfy a power-conservation identity per analysis unit,

\sum_{i=1}^{M} \mathbb{E}\big[y_i^2(t)\big] \;=\; \mathbb{E}\big[L^2(t)\big] + \mathbb{E}\big[R^2(t)\big],

so that distributing one input across $M$ speakers neither gains nor loses energy at the listener. These invariants are what separate a decoder from an effect.
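The identity can be turned into a simple check on a block of samples. This is a sketch only: it compares the summed mean power of the output feeds with that of the stereo input, and the helper names, test signals and tolerance are assumptions made for illustration.

```python
import math

def mean_power(signal) -> float:
    return sum(x * x for x in signal) / len(signal)

def conserves_power(left, right, feeds, rel_tol: float = 1e-9) -> bool:
    """True if the summed mean power of the feeds matches the summed
    mean power of the stereo input within rel_tol."""
    p_in = mean_power(left) + mean_power(right)
    p_out = sum(mean_power(y) for y in feeds)
    return abs(p_out - p_in) <= rel_tol * max(p_in, 1e-30)

def scaled(signal, gain: float):
    return [gain * x for x in signal]

left = [math.sin(0.05 * n) for n in range(1000)]
right = [math.cos(0.11 * n) for n in range(1000)]

g = 1.0 / math.sqrt(2.0)  # constant-power gain for two speakers per channel
good = [scaled(left, g), scaled(left, g), scaled(right, g), scaled(right, g)]
bad = [left, left, right, right]  # unity copies double the feed power

ok_good = conserves_power(left, right, good)
ok_bad = conserves_power(left, right, bad)
```

The check covers feed power only. Acoustic summation at the listener depends on the coherence of the feeds and on geometry, which the chapter treats separately.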

The Academic Grounding: Primary–Ambient Decomposition

Avendano and Jot: the foundational frequency-domain decode

The modern theory of stereo decoding is primary–ambient decomposition (PAD). Avendano and Jot (JAES, 2004) proposed estimating, per time–frequency tile, an inter-channel coherence and a panning index, then using these to extract the panned (primary) components and the ambient residual. Their inter-channel similarity at frequency $f$ is the normalized cross-spectrum

\Phi(f) = \frac{\Re\{\,S_{LR}(f)\,\}}{\sqrt{S_{LL}(f)\,S_{RR}(f)}},

where $S_{LL}, S_{RR}$ are the channel auto-spectra and $S_{LR}$ the cross-spectrum. Tiles with $\Phi\to 1$ are primary and assigned a panning angle from the level ratio; tiles with $\Phi\to 0$ are ambient. The ambient signals are then routed to surround channels to synthesize envelopment. This is the first rigorous statement that a stereo signal can be decomposed into a localizable foreground and a diffuse background using only the inter-channel statistics — i.e., that the spatial information is recoverable.

Worked example: coherence-driven primary/ambient split

Suppose in one frequency band you measure $S_{LL}=1.0$ , $S_{RR}=1.0$ , and $\Re\{S_{LR}\}=0.7$ . Then

\Phi = \frac{0.7}{\sqrt{1.0\times 1.0}} = 0.7 .

A common PAD model treats the primary as the coherent part and the ambient as the residual, so the primary energy fraction is approximately $\Phi$ and the ambient fraction $1-\Phi$ :

P_\text{prim} \approx \Phi\,(S_{LL}+S_{RR}) = 0.7\times 2.0 = 1.4, \qquad P_\text{amb} \approx (1-\Phi)\,(S_{LL}+S_{RR}) = 0.6 .

The decoder sends $1.4$ units of energy to a directional render (angle from the L/R ratio) and $0.6$ units to the surround/ambient bed — and the two sum back to $2.0$ , conserving energy. Slide $\Phi$ to $0.95$ and almost everything is directional (a tight mix); slide it to $0.2$ and the band is mostly envelopment (a reverberant tail).
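The same split in code, as a sketch of the simplified model used in this example rather than of any published implementation. The function name is illustrative, and clamping negative similarity to zero is an added assumption.

```python
import math

def coherence_split(s_ll: float, s_rr: float, re_s_lr: float):
    """Return (phi, primary_energy, ambient_energy) for one band,
    taking phi as the primary fraction and 1 - phi as the ambient one."""
    phi = re_s_lr / math.sqrt(s_ll * s_rr)
    # Assumption: treat negative or out-of-range similarity by clamping.
    phi = min(max(phi, 0.0), 1.0)
    total = s_ll + s_rr
    return phi, phi * total, (1.0 - phi) * total

phi, p_prim, p_amb = coherence_split(1.0, 1.0, 0.7)
```

By construction the two energies always add back to $S_{LL}+S_{RR}$, so the split cannot inflate or lose level.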

Goodwin and Jot: subspace methods and multichannel synthesis

Goodwin and Jot (ICASSP, 2007) generalized PAD with a principal-component / subspace formulation. Treating the short-time stereo vector as samples of a 2-D random vector, the primary direction is the dominant eigenvector of the $2\times 2$ covariance matrix

\mathbf{C} = \begin{bmatrix} \mathbb{E}[L^2] & \mathbb{E}[LR] \\ \mathbb{E}[LR] & \mathbb{E}[R^2] \end{bmatrix},

whose eigen-decomposition $\mathbf{C} = \lambda_1 \mathbf{v}_1\mathbf{v}_1^\top + \lambda_2 \mathbf{v}_2\mathbf{v}_2^\top$ separates a dominant primary subspace ( $\lambda_1, \mathbf{v}_1$ , the panning direction) from an ambient subspace ( $\lambda_2$ ). The primary/ambient power ratio is simply $\lambda_1/\lambda_2$ . They further connected the extracted components to spatial synthesis for arbitrary loudspeaker layouts — i.e., decode then render — which is precisely the pipeline this chapter advocates.

Worked example: eigen-split of the covariance

Let $\mathbf{C} = \begin{bmatrix} 1.0 & 0.6 \\ 0.6 & 1.0 \end{bmatrix}$ . For a symmetric $2\times 2$ matrix with equal diagonal $d$ and off-diagonal $c$ , the eigenvalues are $d \pm c$ :

\lambda_1 = 1.0 + 0.6 = 1.6, \qquad \lambda_2 = 1.0 - 0.6 = 0.4,

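with eigenvectors $\mathbf{v}_1 = \tfrac{1}{\sqrt 2}(1,1)$ (the centred, in-phase primary) and $\mathbf{v}_2 = \tfrac{1}{\sqrt 2}(1,-1)$ (the anti-phase ambient/side). The primary-to-ambient ratio is $\lambda_1/\lambda_2 = 4$ (about $6$ dB), and the primary sits dead centre because $\mathbf{v}_1$ has equal weights. Lower the off-diagonal from $0.6$ to $0.2$ and $\lambda_2$ grows from $0.4$ to $0.8$: more ambience, but the image stays centred, because equal diagonal entries always give eigenvectors along $(1,1)$ and $(1,-1)$. The dominant eigenvector tilts off-centre only when the channel powers differ: with $\mathbb{E}[R^2] = 0.5$ and the off-diagonal back at $0.6$, the eigenvalues are $1.4$ and $0.1$ and $\mathbf{v}_1 \propto (3,2)$, an image pulled toward the left. The covariance is the encoded geometry.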
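For a $2\times 2$ symmetric covariance the eigen-split has a closed form, so no linear-algebra library is needed. The sketch below uses the standard principal-axis formula; the function name and the second, unequal-diagonal matrix are illustrative assumptions.

```python
import math

def eig_sym_2x2(c_ll: float, c_lr: float, c_rr: float):
    """Eigenvalues (descending) and the unit dominant eigenvector of
    the symmetric matrix [[c_ll, c_lr], [c_lr, c_rr]]."""
    mean = 0.5 * (c_ll + c_rr)
    radius = math.hypot(0.5 * (c_ll - c_rr), c_lr)
    lam1, lam2 = mean + radius, mean - radius
    # Principal-axis angle: the direction of largest variance in the (L, R) plane.
    angle = 0.5 * math.atan2(2.0 * c_lr, c_ll - c_rr)
    return lam1, lam2, (math.cos(angle), math.sin(angle))

# The matrix from the worked example: equal channel powers.
lam1, lam2, v1 = eig_sym_2x2(1.0, 0.6, 1.0)

# Hypothetical case with a quieter right channel: the primary tilts left.
lam1_t, lam2_t, v1_t = eig_sym_2x2(1.0, 0.6, 0.5)
```

With equal diagonal entries the dominant eigenvector has equal weights; once the channel powers differ, the larger weight moves to the louder channel.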

Faller and Breebaart: parametric spatial audio and binaural cue coding

Faller and Baumgarte's Binaural Cue Coding and Faller and Breebaart's work on parametric stereo/spatial audio established that the perceptually relevant inter-channel cues (ICLD, ICTD and ICC) carry most of what matters for spatial reproduction: if a decoder reinstates these cues correctly per critical band, the parametric reconstruction is perceptually close to the original spatial image. This is the theoretical license for parametric stereo decoding and the direct ancestor of the ISO/IEC 23003-1 parameterization.

Rule of thumb

The practical lesson for many-speaker playback: get ICLD, ICTD and ICC right per band on the target layout and the spatial impression transfers — independent of how many speakers you have.

The DAM Audio HSR Approach in Depth

DAM Audio's HSR (High Space Resolution) is a production realization of this theory built specifically for the constraints that disqualify FFT upmixers: it decodes the stereo field in the time domain, with strict energy, coherence and timbral guarantees, suitable for real-time many-speaker playback. (See the technical overview at the HSR research page and the article HSR stereo upmixing for multi-speaker systems.)

Decoding the field via inter-channel correlation analysis

HSR's premise is identical to the academic PAD model but reframed for engineering: the stereo file is a decodable encoding, and the key observable is the inter-channel correlation structure described above. Rather than estimate it in FFT bins, HSR tracks correlation and level relationships continuously in time, classifying signal content along the coherence axis — highly correlated (directional foreground), partially correlated (extended sources), and uncorrelated (diffuse envelopment) — the same three classes catalogued earlier in this chapter.

Three stages: analysis, spatial extraction, output distribution

HSR's processing is organized as three explicit stages, mirroring the encode/decode contract:

Analysis. Estimate the running inter-channel correlation, level difference and short-time energy. This yields, instant by instant, a model of what the stereo encodes: where the coherent images sit and how much diffuse energy surrounds them. Conceptually this is computing a time-domain analogue of $\Phi$ and the panning index of Avendano and Jot.
Spatial extraction. Separate the field into N spatial components — a set of directional primaries plus the decorrelated ambient bed — using the analysis to partition energy. This is the decode proper: from two channels to a structured intermediate scene. Crucially, the correlation of each component is preserved, so source width and envelopment are carried as decorrelation, not faked with synthetic reverb.
Output distribution. Render the N components onto the actual speaker layout — few or many, horizontal or with height — distributing each component's energy with constant-power gains so the layout is filled without inflation. This is where HSR hands off to the renderer (RIPL/ISE), exactly as an Ambisonic decoder hands coefficients to a layout decoder.
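To make the analysis stage concrete, here is a generic time-domain sketch: running channel powers and cross-power tracked with one-pole smoothing, from which a correlation and a level difference follow sample by sample. This is not DAM Audio's HSR algorithm, whose internals the text does not specify; the class name, the time constant and the smoothing scheme are all assumptions made for illustration.

```python
import math

class RunningStereoAnalyzer:
    """Exponentially smoothed channel powers and cross-power, updated
    once per sample. A generic sketch, not the HSR implementation."""

    def __init__(self, sample_rate_hz: float = 48000.0,
                 time_constant_s: float = 0.05):
        self.alpha = 1.0 - math.exp(-1.0 / (sample_rate_hz * time_constant_s))
        self.p_ll = self.p_rr = self.p_lr = 0.0

    def update(self, left: float, right: float):
        """Feed one stereo sample; return (correlation, level_diff_db)."""
        a = self.alpha
        self.p_ll += a * (left * left - self.p_ll)
        self.p_rr += a * (right * right - self.p_rr)
        self.p_lr += a * (left * right - self.p_lr)
        eps = 1e-20  # keeps the ratios finite during silence
        corr = self.p_lr / math.sqrt(self.p_ll * self.p_rr + eps)
        level_db = 10.0 * math.log10((self.p_ll + eps) / (self.p_rr + eps))
        return corr, level_db

# One second of a 440 Hz tone panned with the gains 0.92 / 0.39.
analyzer = RunningStereoAnalyzer()
for n in range(48000):
    s = math.sin(2.0 * math.pi * 440.0 * n / 48000.0)
    corr, level_db = analyzer.update(0.92 * s, 0.39 * s)
```

For this coherent, amplitude-panned input the correlation settles at 1 and the level difference at the ICLD of the gain pair. Independent noise in the two channels drives the correlation toward 0 instead.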

Time-domain processing: no FFT, low latency

By operating in the time domain, HSR avoids the three FFT pathologies: there is no frame latency (processing is near-sample-rate, suitable for live and automotive), no musical noise or pre-echo (no per-bin gain switching across a window), and no block-size coupling. The cost — time-domain decorrelation and band-splitting are harder to design than a clean FFT — is paid once in the algorithm so that the playback is transparent.

The three guarantees: energy, coherence, timbre

HSR is specified by invariants, the engineering form of the constraints in the "right model" section:

Energy conservation. Total output energy equals total input energy; distributing one component over $n$ speakers uses $g=1/\sqrt n$ so that $\sum_i g_i^2 = 1$ . No band gets louder merely because more speakers play it.
Coherence preservation. The measured ICC of each reconstructed component matches the source, so apparent width and envelopment are reproduced rather than collapsed or exaggerated — the perceptual link from direct, diffuse and envelopment.
Timbral neutrality. Because coherent copies are not splattered across speakers (and because of downstream ICS), the magnitude response at the listener stays flat: no comb filtering, no spectral tilt. A mastered stereo balance survives the decode.

Worked example: HSR distributing a decoded component

Suppose the analysis stage attributes, in some interval, $0.80$ of the total energy to a single directional primary at $+10^\circ$ and $0.20$ to the ambient bed. The output is a $7$-speaker horizontal ring. HSR renders the primary with VBAP between the two speakers straddling $+10^\circ$, using gains $g_a, g_b$ with $g_a^2 + g_b^2 = 0.80$; say the panning puts $70\%$ on speaker $a$, so $g_a = \sqrt{0.7\times 0.8}=\sqrt{0.56}=0.748$ and $g_b = \sqrt{0.3\times 0.8}=\sqrt{0.24}=0.490$. The ambient $0.20$ is decorrelated and spread over the remaining $k = 5$ speakers, each carrying $0.20/k = 0.04$ in power (a gain of $0.2$). Summing everything:

g_a^2 + g_b^2 + \sum_{\text{surr}} g_j^2 = 0.56 + 0.24 + 0.20 = 1.00,

energy conserved, the image at $+10^\circ$ , the room enveloped — from an ordinary stereo file, with no added reverb and no FFT latency.
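The bookkeeping of this example can be verified directly. The sketch assumes the ambient bed uses the five speakers of the seven-speaker ring that do not carry the primary; the function name and argument names are illustrative.

```python
import math

def distribute(primary_energy: float, pan_share_a: float,
               ambient_energy: float, n_ambient: int):
    """Return (g_a, g_b, g_amb): amplitude gains for the two primary
    speakers and for each of the n_ambient bed speakers."""
    g_a = math.sqrt(pan_share_a * primary_energy)
    g_b = math.sqrt((1.0 - pan_share_a) * primary_energy)
    g_amb = math.sqrt(ambient_energy / n_ambient)
    return g_a, g_b, g_amb

g_a, g_b, g_amb = distribute(0.80, 0.70, 0.20, n_ambient=5)
total = g_a ** 2 + g_b ** 2 + 5 * g_amb ** 2  # sums to 1 up to rounding
```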

RIPL: One Source Model for Every Format

Decoding the stereo field produces components that still need to be rendered. DAM Audio's RIPL (product page) is the real-time spatializer that does this, and its defining design choice fits this chapter's thesis: one unified source model serves every output format, and a decoded stereo field is a first-class source alongside mono objects and native multichannel.

A unified source model

Rather than maintain separate code paths for "stereo," "5.1," "Atmos object," "Ambisonics," RIPL represents everything as positioned sources (or fields) in a common spatial model, then derives feeds for the present layout. This is the encode/decode contract made into software architecture: author/decode into the model once, render to any system. The renderers covered abstractly across this part (VBAP-style amplitude panning, Ambisonic decoding, WFS-style synthesis) are modes of one engine, not separate tools.

Unifying gain-based and delay-based rendering

A subtle strength is that RIPL unifies gain-based rendering (amplitude panning, constant-power distribution: the right tool for far-field phantom imaging and many small contributions) with delay-based rendering (wavefront timing, the basis of WFS and of physically correct distance/parallax). Many systems pick one paradigm and live with its limits — gain panning has a sweet spot and no true depth; pure delay rendering needs dense arrays. By treating gain and delay as two controls of the same source model, RIPL can render a near-field source with the appropriate mix of level and time cues, matching the summing-localization physics that uses both ICLD and ICTD. This is also why RIPL can host HSR's ambient bed (gain-spread, decorrelated) and a sharp primary (panned, possibly delay-aligned) in the same scene without contradiction.

Decoded stereo as a first-class source

Because a decoded stereo field enters RIPL as a native source type, the HSR decode and the spatial render are not bolted together by ad-hoc routing — the N components flow into the same model that handles objects and channels. The renderer never sees "two mono signals to place"; it sees a structured field with primaries and ambience already separated, and it distributes them with the format-appropriate mode. The mistake of "stereo as two mono objects" is structurally impossible here: stereo is admitted only after it has been decoded.

The Combined Pipeline

HSR decode → N components → spatializer → speakers

Putting the pieces together yields the production pipeline that this whole chapter has been building toward:

\text{Stereo} \;\xrightarrow{\text{HSR decode}}\; N\ \text{spatial components} \;\xrightarrow{\text{RIPL render}}\; \text{layout feeds} \;\xrightarrow{\text{speakers}} \text{listener},

with the renderer choosing its mode per system — VBAP/constant-power for discrete rings and domes, Ambisonic decoding where a periphonic scene representation is wanted, WFS-style synthesis where dense arrays permit true wavefronts. DAM's ISE (Immersive Sound Engine) (research page) is the integrated engine that ties decode and render into one real-time system for venue and installation use. Each stage honours the energy/coherence/timbre invariants, so the chain end-to-end is transparent: a stereo master arrives, and a calibrated many-speaker image leaves, with images where the encoding put them and envelopment where the decorrelation lived.

ICS: removing residual comb filtering

Even a correct decode-and-render can suffer acoustic interference at the listener, because multiple real speakers radiating related content arrive with path-length differences — the comb-filter mechanism $|H(f)| = 2|\cos(\pi f\tau)|$ from earlier, now caused by geometry rather than by naive routing. DAM's ICS (Interference Correction System) (research page) addresses this residual: it corrects the inter-speaker interference so that the summed response at the listening area stays flat, protecting the timbral-neutrality guarantee in the physical sound field, not just in the signal. ICS is the acoustic-domain complement to HSR's signal-domain coherence handling — together they keep both the content's correlation and the room's summation under control.

Worked example: why ICS is needed even after a clean decode

Two surround speakers each carry part of the same decoded ambient component, and the listener sits so that their path lengths differ by $\Delta d = 0.34$ m. The arrival delay is

\tau = \frac{\Delta d}{c} = \frac{0.34}{340} = 1.0\ \text{ms},

placing comb nulls at $f_k = (2k+1)/(2\tau) = 500\ \text{Hz}, 1500\ \text{Hz}, 2500\ \text{Hz}, \dots$ — audible coloration despite a perfect signal-domain decode. ICS measures and corrects this geometric interference so the net magnitude response is flat across the band, completing the timbral guarantee that HSR begins.
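Because the path difference changes from seat to seat, the first null moves across the listening area. The sketch below shows that dependence using the same $c = 340$ m/s as the example; the function name and the list of path differences are illustrative, and nothing here describes how ICS itself works.

```python
def first_null_hz(path_difference_m: float,
                  speed_of_sound_m_s: float = 340.0) -> float:
    """First comb null for two equal coherent arrivals:
    f_0 = 1 / (2 * tau) with tau = path_difference / c."""
    tau = path_difference_m / speed_of_sound_m_s
    return 1.0 / (2.0 * tau)

# First null for a few path-length differences, in metres.
first_nulls = {d: first_null_hz(d) for d in (0.085, 0.17, 0.34)}
for d, f0 in first_nulls.items():
    print(f"delta d = {d:5.3f} m -> first null near {f0:7.1f} Hz")
```

Halving the path difference doubles the first null frequency, so a correction tuned for one seat does not automatically hold for the next.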

End-to-end invariants

Read as one system, the pipeline enforces three invariants from input to ear: energy is conserved through the constant-power distribution at every stage; coherence (hence width and envelopment) is preserved by carrying decorrelation as a first-class property rather than synthesizing reverb; and timbre is held flat by avoiding coherent multi-speaker splatter (HSR) and by correcting geometric interference (ICS). These are precisely the perceptual quantities — image position, ASW, LEV, spectral balance — that psychoacoustics and direct, diffuse and envelopment identify as the substance of spatial impression.

Application Domains

The decode-then-render model is general, but its payoff differs by domain, and each domain stresses a different invariant.

Live sound

Front-of-house feeds the room a stereo (or LCR) mix in real time, while the PA may be a distributed line-array system or an immersive ring. Latency is non-negotiable (the band is on stage; lip-sync and feel matter), which rules out FFT upmixers and is the headline reason for HSR's time-domain design. The win: a stereo console output spreads into an enveloping, evenly covered field without the comb filtering that naive multi-zone routing produces, and ICS keeps the overlap zones between arrays flat. See also the calibration and installation parts of this guide for how the system is tuned to the venue.

Automotive

A car cabin has many drivers (doors, pillars, dash, headrests, subs) and a fundamentally off-centre listening geometry — no one sits at the stereo sweet spot. The source content is almost entirely stereo (radio, streaming, phone). Decoding the stereo field and distributing its primaries and ambient bed across the cabin's drivers, with delay-based rendering to compensate the asymmetric geometry and ICS to tame the inevitable reflections, turns an intrinsically compromised listening position into a stable, enveloping image. Energy conservation matters here because over-driving coherent copies in a small, reflective cabin is both fatiguing and unsafe.
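One conventional reading of "delay-based rendering to compensate the asymmetric geometry" is plain time alignment: delay the nearer drivers so that all arrivals coincide with the farthest one. The sketch below shows only that step, under stated assumptions. The driver names and distances are hypothetical, $c = 340$ m/s is taken from the earlier worked example, and nothing here is claimed to be how RIPL computes its delays.

```python
SPEED_OF_SOUND = 340.0  # m/s, same value as the chapter's worked examples


def alignment_delays_ms(distances_m, c=SPEED_OF_SOUND):
    """Delay (ms) per driver so every arrival coincides with the farthest driver's."""
    farthest = max(distances_m.values())
    return {name: (farthest - d) / c * 1000.0 for name, d in distances_m.items()}


# Hypothetical driver-to-listener distances (metres) for the driver's seat.
cabin = {"left_door": 0.80, "dash_centre": 1.10, "right_door": 1.40}
delays = alignment_delays_ms(cabin)
for name, ms in delays.items():
    print(f"{name}: {ms:.2f} ms")
```

The farthest driver gets zero delay and the nearest gets the largest, here a little under 2 ms. Alignment of this kind is exact for one seat only, which is one reason the text pairs it with ICS rather than relying on delay alone.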

Home and consumer

Soundbars and AVRs face the widest layout variety — from two drivers to 7.1.4 — and the most stereo-heavy catalogue. The unified source model is the key asset: decode the stereo field once, render to whatever the user owns, preserving the mastered balance (timbral neutrality) so music does not get the "surround mode" coloration that listeners switch off. The coherence guarantee is what makes a stereo album sound wider and more enveloping rather than re-reverberated.

Broadcast and streaming

Broadcast emits stereo continuously and at scale, often with no stems and tight loudness/compatibility requirements. A decoder that is energy-conserving and timbrally neutral can sit at the playout or device end and immersify a stereo feed reversibly — a downmix of the rendered output returns to the original stereo balance because no energy was fabricated. This reversibility is the practical face of the encode/decode contract: the decode adds spatial rendering for the system present without committing the content to any one layout, exactly as MPEG Surround (ISO/IEC 23003-1) carries a stereo downmix that any decoder can expand.

Common Mistakes and Pitfalls

Avoid these

Thinking of stereo as two mono files. The whole error class in the "obvious fixes" section flows from this. Stereo is an encoding; the information is in the relationships.
Re-panning L and R as point objects. It double-encodes geometry and destroys correlation. Decode first; only structured components should enter a panner.
Adding synthetic reverb to "create space." Envelopment is already in the file as decorrelation; fabricating reverb inflates energy and muddies timbre. Recover the ambient bed instead of inventing one.
Routing coherent copies to many speakers. Guaranteed comb filtering and up to $\sim n^2$ energy error. Use constant-power distribution and decorrelation.
Using FFT upmixers where latency or transparency matters. Frame latency and musical-noise/pre-echo artefacts disqualify them from live, automotive and critical-listening playback.
Ignoring room summation. Even a perfect signal decode combs in the air when real speakers overlap at the listener; correct the acoustic interference (ICS), not just the signal.
Chasing maximum width. Driving ICC to zero everywhere yields an impressively wide but unstable, phasey image with no solid foreground. Preserve the encoded coherence; do not maximize it.
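The $\sim n^2$ figure in the list above can be checked with a short sketch. It assumes idealized limits (copies that are either perfectly in phase or perfectly decorrelated), and the helper names are illustrative rather than taken from any tool described here.

```python
import math


def coherent_sum_power(gains):
    """Power where identical copies arrive in phase: amplitudes add, (sum g)^2."""
    return sum(gains) ** 2


def incoherent_sum_power(gains):
    """Power where the copies are mutually decorrelated: powers add, sum g^2."""
    return sum(g * g for g in gains)


def constant_power_gains(n):
    """Equal gains normalized so that the sum of squared gains is 1."""
    return [1.0 / math.sqrt(n)] * n


n = 4
naive = [1.0] * n               # the same signal routed at unit gain to n speakers
cp = constant_power_gains(n)

print(coherent_sum_power(naive))    # 16.0: n^2 times the source power
print(coherent_sum_power(cp))       # 4.0: constant-power gains alone still sum to n
print(incoherent_sum_power(cp))     # 1.0: constant power plus decorrelation conserves energy
```

The middle line is why the pitfall names both remedies: constant-power gains remove one factor of $n$, and only decorrelation removes the other.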

Limits

The honest boundaries

Decoding stereo is recovery, not clairvoyance, and the honest boundaries are these.

Underdetermination: two channels cannot uniquely separate an arbitrary number of overlapping sources; PAD/HSR recover a primary-plus-ambient structure, not the original multitrack, so two distinct instruments panned to the same spot and equally correlated cannot be pulled apart.
Bad encodings in: a mono-summed, hard-clipped, or aggressively M/S-processed master carries little inter-channel structure to decode; garbage in, limited spatial out.
Geometry dependence: the render is correct over a coverage area, not at literally every point; very large or very reflective spaces still need calibration and ICS, and some sweet-spot dependence remains for sharp phantom images.
Perceptual, not physical, reconstruction: like all the methods of this part, the goal is to reinstate the perceptual cues (ICLD, ICTD, ICC per band) of psychoacoustics, not to physically rebuild the original wavefronts; a true physical reconstruction would require WFS-class capture the stereo file never contained.

Within these limits, the decode is principled and transparent; beyond them, no decoder can manufacture information the encoding did not preserve.
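The "bad encodings in" limit is easy to see in the inter-channel parameters themselves. The sketch below is a broadband, zero-lag simplification (the parameter set discussed in this chapter is evaluated per band) and uses synthetic noise as a stand-in for programme material. It compares a pair of independent channels with the same material after a mono sum.

```python
import math
import random


def icld_db(left, right):
    """Inter-channel level difference in dB: 10*log10(P_left / P_right)."""
    p_left = sum(x * x for x in left)
    p_right = sum(x * x for x in right)
    return 10.0 * math.log10(p_left / p_right)


def icc(left, right):
    """Zero-lag normalized cross-correlation of the two channels."""
    cross = sum(l * r for l, r in zip(left, right))
    norm = math.sqrt(sum(l * l for l in left) * sum(r * r for r in right))
    return cross / norm


rng = random.Random(0)  # fixed seed so the run is repeatable
left = [rng.gauss(0.0, 1.0) for _ in range(4096)]
right = [rng.gauss(0.0, 1.0) for _ in range(4096)]

# Mono-summed master: both channels carry the same (L + R) / 2 signal.
mono = [(l + r) / 2.0 for l, r in zip(left, right)]

print("independent pair ICC:", round(icc(left, right), 3))
print("mono-summed ICC:", round(icc(mono, mono), 3))
print("mono-summed ICLD (dB):", icld_db(mono, mono))
```

After the mono sum the pair reports an ICC of 1 and an ICLD of 0 dB regardless of what the original channels held, so a correlation-based decode has only a single centred primary left to find.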

Summary

Stereo is not a degenerate case to be tolerated; it is the encoding that nearly all real content uses, and it carries genuine spatial information — localizable primaries (high ICC, set by ICLD/ICTD), extended sources (partial ICC, controlling ASW), and diffuse envelopment (low ICC) — formalized by the inter-channel parameter set of ISO/IEC 23003-1. The obvious "surround modes" fail because they ignore that encoding: naive routing combs and inflates energy, two-mono-objects double-encodes and destroys correlation, FFT upmixers add latency and artefacts. The principled path is the same encode/decode contract that governs every technique in this part: decode the field, then render it. Primary–ambient decomposition (Avendano and Jot; Goodwin and Jot; Faller and Breebaart) gives the theory; DAM Audio's HSR realizes it in the time domain with energy, coherence and timbre guarantees; RIPL renders the decoded field through one source model unifying gain- and delay-based modes; ISE integrates the chain and ICS keeps the room's summation flat. The result turns the world's stereo catalogue into immersive sound on the speakers people actually own — by reading stereo for what it has always been: already spatial.

References

Blumlein, A. D. (1931). Improvements in and relating to Sound-transmission, Sound-recording and Sound-reproducing Systems. British Patent No. 394,325. (The originating patent of stereophony as a directional encoding.)
Avendano, C., and Jot, J.-M. (2004). "A Frequency-Domain Approach to Multichannel Upmix." Journal of the Audio Engineering Society, 52(7/8), 740–749.
Goodwin, M. M., and Jot, J.-M. (2007). "Primary-Ambient Signal Decomposition and Vector-Based Localization for Spatial Audio Coding and Enhancement." Proc. IEEE ICASSP, Honolulu, pp. I-9–I-12.
Faller, C., and Baumgarte, F. (2003). "Binaural Cue Coding — Part II: Schemes and Applications." IEEE Transactions on Speech and Audio Processing, 11(6), 520–531.
Breebaart, J., van de Par, S., Kohlrausch, A., and Schuijers, E. (2005). "Parametric Coding of Stereo Audio." EURASIP Journal on Applied Signal Processing, 2005(9), 1305–1322.
ISO/IEC 23003-1:2007. Information technology — MPEG audio technologies — Part 1: MPEG Surround. International Organization for Standardization.
Pulkki, V. (1997). "Virtual Sound Source Positioning Using Vector Base Amplitude Panning." Journal of the Audio Engineering Society, 45(6), 456–466.
Blauert, J. (1997). Spatial Hearing: The Psychophysics of Human Sound Localization (revised ed.). MIT Press.
Rumsey, F. (2001). Spatial Audio. Focal Press, Oxford.
