Stereo & the Phantom Image

Two loudspeakers, fed with two related signals, can make you hear a voice that floats in the empty air between them. There is no source there — no driver, no cone, no air being pushed at that point in space — yet the auditory system places a stable, localizable image at a position that depends only on the relationship between the two channels. This conjured source is the phantom image (also called a virtual or summing image), and the mechanism that creates it is summing localization. Almost every spatialization technique in this part of the guide is, at its core, an elaboration of this single trick: arrange several real loudspeakers, feed them coherent signals with carefully controlled differences, and let the ear's own machinery fuse them into apparent sources at chosen positions.

Stereophony — literally "solid" (στερεός) sound — is therefore not a niche legacy format to be tolerated until "real" 3D audio arrives. It is the foundational case, the minimal system in which the central idea of this guide already appears in full: you encode a spatial scene into a finite set of channels, and the playback system decodes those channels back into a perceived field via the listener's ears. Understanding two-channel stereo deeply — its psychoacoustics, its geometry, its panning mathematics, its compatibility constraints, and its hard limits — gives you the conceptual toolkit to understand surround, object-based audio, Ambisonics, binaural, transaural, and wave field synthesis as variations on a theme.

This chapter builds that understanding from first principles. We begin with the perceptual mechanism (summing localization and the duplex cues it exploits), move to Blumlein's 1931 insight that stereo is an encoding, formalize the intensity/time continuum and its mono compatibility, fix the standard listening geometry, derive the panning law in full with worked gain tables, develop the Mid/Side representation, examine the sweet spot and precedence effect, preview stereo microphone techniques, and close with an explicit account of the limits that motivate everything that follows.

The Phantom Image and Summing Localization

Two coherent sources, one perceived source

Place two identical loudspeakers symmetrically in front of you and feed both with the same signal at the same level. You do not hear two sources at the speaker positions; you hear one source at the center, exactly between them. This is the simplest phantom image. Now raise the level of the right speaker slightly: the image slides toward the right. Delay the left speaker by a fraction of a millisecond: the image again slides toward the right. Push either difference far enough and the image collapses onto the nearer or louder speaker. Between the extremes lies a continuous "pan" from one speaker to the other, entirely controlled by inter-channel level and time differences.

Why does fusion happen at all? Because the two signals are coherent — they are essentially the same waveform — the auditory system interprets them not as two independent events but as one event reaching the ears over a combination of paths. Summing localization is the name for the rule that, for short inter-channel time differences (below roughly 1 ms) and moderate level differences, the perceived direction is a single fused position determined by the sum of the two contributions at each ear. The brain does not have a "two loudspeakers in a room" model; it has an "incoming wavefront" model, and it solves for the single direction most consistent with the ear-input signals it receives.

How loudspeaker differences become ear differences

The deep point is that summing localization works by manufacturing the natural localization cues described in psychoacoustics. A single real source off to one side produces an interaural time difference (ITD) — the wavefront reaches the near ear before the far ear — and an interaural level difference (ILD) — the head shadows the far ear, more strongly at high frequencies. These are the two halves of Lord Rayleigh's duplex theory: ITD dominates localization below about 1.5 kHz, ILD above it.

A stereo pair recreates these interaural differences synthetically. Consider a level difference between the two speakers. Each speaker radiates to both ears (there is no acoustic wall down the middle of your head). The left ear receives a strong contribution from the left speaker plus a weaker, slightly delayed contribution from the right speaker; the right ear receives the mirror image. When the left speaker is louder, the summed pressure at the left ear is larger than at the right ear — an ILD is produced at the eardrums even though the original difference between the channels was purely a level difference between two frontal speakers. Moreover, because the two contributions arrive with a small geometric path difference, their summation also produces a frequency-dependent phase difference between the ears that the low-frequency system reads as an ITD. So a single inter-channel level difference generates both an ITD-like and an ILD-like cue at the ears. This is why amplitude-only panning works for the full audio band: the summation at the head converts channel-level differences into the natural binaural cues.

Numeric example: image shift from a level difference

A widely used empirical rule for the standard $\pm 30^\circ$ stereo setup relates the inter-channel level difference $\Delta L$ (in dB) to the image displacement angle $\varphi$ (measured from center). A rough working figure is that an image reaches a loudspeaker (full deflection to $30^\circ$ ) at about $\Delta L \approx 15\text{–}18$ dB, and the relationship near center is roughly linear at about $2$ – $2.5^\circ$ per dB. (A steeper slope such as $5^\circ$ per dB would reach the loudspeaker at only $6$ dB, which contradicts the full-deflection figure.)

Suppose we want a phantom image at $\varphi = 10^\circ$ to the right. Using $\approx 2.5^\circ$ per dB near the center, we need

\Delta L \approx \frac{10^\circ}{2.5^\circ/\text{dB}} = 4\ \text{dB}.

A level difference of about $4$ dB between right and left channels places the image roughly one-third of the way from center to the right speaker. We will derive which gains produce a given $\Delta L$ from the panning law below; for now the point is that a single decibel already moves the image audibly, and the entire frontal arc between the speakers is mapped onto a level-difference window only $\sim 30$ dB wide.

Blumlein and the Central Claim: Stereo Is an Encoding

The 1931 patent

The conceptual and engineering foundation of two-channel stereophony was laid by Alan Dower Blumlein in his 1931 British patent 394,325, "Improvements in and relating to Sound-transmission, Sound-recording and Sound-reproducing Systems." Working at EMI, Blumlein was explicitly trying to solve a perceptual problem: in early "talkie" cinema, the actor could be on one side of the screen while the single loudspeaker was on the other, breaking the link between seen position and heard position. His goal was directional fidelity — to reproduce the direction of sounds, not merely their timbre.

Blumlein's genius was to recognize that the directional information the ear uses can be carried by two channels and that there are two equivalent ways to encode it. He understood that the natural ITD/ILD cues at a listener's two ears could be reproduced over two loudspeakers either by encoding direction as a level difference (which he could capture with two coincident directional microphones, the "Blumlein pair" of crossed figure-of-eights) or as a time difference (captured with spaced microphones), and crucially that the two could be converted into one another. The famous shuffler network in his patent converts the phase (time) differences picked up by a spaced pair into the amplitude differences that a loudspeaker pair reproduces most robustly. That a circuit can transform one into the other is the first proof that both are encodings of the same underlying spatial quantity: direction.

Two channels, one spatial signal

Key takeaway

This is the central claim that this entire guide rests upon, and it is worth stating plainly: a stereo signal is not two mono signals that happen to be played together. It is a single two-channel spatial encoding, in which the inter-channel relationships (level and time differences, and their frequency dependence) carry the geometry of the scene.

The audible content of stereo — the width of the image, the position of each source, the sense of a recorded space — lives between the channels, in their differences and correlations, not in either channel alone. Sum the two channels to mono and that spatial information is partly or wholly destroyed; the differences cancel or collapse. The left and right channels are better thought of as two measurements of one spatial field, analogous to the two ears, than as two independent program streams.

This reframing has enormous practical consequences, and it is the subject of its own chapter: see Stereo is already spatial. It is also the bridge to Ambisonics, where the same idea — encode the field into a small set of channels (the B-format), then decode to whatever speakers exist — is generalized and made rigorous; see Ambisonics. Blumlein, in 1931, was already doing encode-then-decode: the microphone pair (or the panner) is the encoder; the loudspeaker pair plus your head is the decoder.

Numeric example: the Blumlein pair as a level encoder

A Blumlein pair is two figure-of-eight (bidirectional) microphones, coincident, crossed at $90^\circ$ , each at $45^\circ$ to the front. A bidirectional capsule has the polar gain $g(\theta) = \cos\theta$ , where $\theta$ is the angle from its main axis. For a source at azimuth $\alpha$ measured from the forward center line, the left-facing mic (axis at $-45^\circ$ ) and right-facing mic (axis at $+45^\circ$ ) produce

g_L(\alpha) = \cos(\alpha + 45^\circ), \qquad g_R(\alpha) = \cos(\alpha - 45^\circ).

For a centered source, $\alpha = 0$ : $g_L = g_R = \cos 45^\circ \approx 0.707$ , equal levels, image centered. For a source at $\alpha = 20^\circ$ to the right:

g_L = \cos 65^\circ \approx 0.423, \qquad g_R = \cos 25^\circ \approx 0.906.

The inter-channel level difference is

\Delta L = 20\log_{10}\frac{g_R}{g_L} = 20\log_{10}\frac{0.906}{0.423} \approx 6.6\ \text{dB}.

The pair has encoded a $20^\circ$ source direction as a $6.6$ dB level difference — purely by the geometry of two coincident polar patterns, with no time difference at all (coincidence means both capsules sit at effectively the same point, so the arrival times are equal). Decoding is then the listener's summing localization translating that $6.6$ dB back into a perceived direction.
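The encoding step above is easy to reproduce numerically. The following is a minimal sketch, assuming ideal $\cos\theta$ capsules and the sign convention of the text (azimuth positive toward the right); the function names `blumlein_gains` and `level_difference_db` are illustrative and not taken from any library.

```python
import math

def blumlein_gains(azimuth_deg):
    """Gains of two coincident figure-of-eight capsules crossed at 90 degrees.

    azimuth_deg is the source angle from the forward center line,
    positive toward the right (assumed convention, matching the text).
    """
    a = math.radians(azimuth_deg)
    g_left = math.cos(a + math.radians(45.0))   # capsule axis at -45 degrees
    g_right = math.cos(a - math.radians(45.0))  # capsule axis at +45 degrees
    return g_left, g_right

def level_difference_db(g_left, g_right):
    """Inter-channel level difference in dB, positive when right is louder."""
    return 20.0 * math.log10(abs(g_right) / abs(g_left))

if __name__ == "__main__":
    # Sweep a few frontal source angles inside the +/-45 degree pickup sector.
    for az in (0.0, 10.0, 20.0, 30.0):
        gl, gr = blumlein_gains(az)
        dl = level_difference_db(gl, gr)
        print(f"{az:5.1f} deg  gL={gl:.3f}  gR={gr:.3f}  dL={dl:+.2f} dB")
```

The sweep stops short of $45^\circ$ , where the left capsule's gain passes through zero and the level difference is unbounded.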

Intensity Stereo, Time Stereo, and the Continuum

Two roads to the same image

Because summing localization responds to both inter-channel level and inter-channel time differences, there are two pure strategies and a continuum between them:

Intensity (amplitude) stereo: the two channels are time-aligned and differ only in level. The phantom image position is set entirely by $\Delta L$ . This is what a mixing-console pan pot produces, and what coincident microphone techniques capture.
Time (phase) stereo: the two channels are equal in level and differ only in arrival time $\Delta t$ . The image position is set by the inter-channel delay. This is approximated by widely spaced microphones (the AB technique).

Both produce convincing frontal images, but they exploit slightly different perceptual paths and they behave very differently when the two channels are summed.

Trading time for level

Rule of thumb

A rough perceptual equivalence for the standard setup is that a level difference of about $1$ dB and an inter-channel time difference of about $0.1$ ms ( $100\ \mu\text{s}$ ) shift the image by comparable amounts near the center. Full deflection to a loudspeaker is reached at roughly $\Delta L \approx 15\text{–}18$ dB or $\Delta t \approx 1.0\text{–}1.5$ ms. These are not laws of physics but measured psychoacoustic averages (Blauert, Williams), and they depend on signal content, but they give the order of magnitude.

The continuum matters because real microphone arrays usually produce both differences at once. A near-coincident pair (like ORTF) has capsules that are both angled apart (giving level differences from their directivity) and spaced apart (giving time differences). Such arrays sit between the pure intensity and pure time cases and combine the two cues — often more naturally than either pure form, because that combination resembles what a real head produces.

Mono compatibility: the decisive difference

The single most important practical distinction between intensity and time stereo is mono compatibility — what happens when you sum $L$ and $R$ to a single channel (for a mono speaker, a phone earpiece, a sum bus, AM broadcast, a smart-speaker downmix).

For intensity stereo, the two channels are coherent and time-aligned, so summing simply adds two scaled copies of the same waveform. Nothing cancels; the mono sum is full and well-behaved. A center image (equal $L$ and $R$ ) sums to twice the amplitude (and we manage that with the panning law, below); a panned source simply changes level slightly. Intensity stereo is fully mono compatible.

For time stereo, summing two equal-level copies offset by $\Delta t$ creates comb filtering. Two identical signals delayed by $\Delta t$ and added produce a transfer function

H(f) = 1 + e^{-j 2\pi f \Delta t}, \qquad |H(f)| = 2\left|\cos(\pi f \Delta t)\right|.

This has nulls wherever $\cos(\pi f \Delta t) = 0$ , i.e. at frequencies

f_n = \frac{2n-1}{2\,\Delta t}, \quad n = 1, 2, 3, \dots

and peaks at $f = n/\Delta t$ ( $n = 0, 1, 2, \dots$ ). The mono sum is colored by a comb whose teeth are spaced $1/\Delta t$ apart. With $\Delta t = 0.5$ ms, the first null sits at $f_1 = 1/(2 \times 0.5\,\text{ms}) = 1$ kHz, with further nulls at $3$ , $5$ , $7$ kHz, etc., squarely in the most sensitive region of hearing.

Common mistake

Time stereo is therefore poorly mono compatible, and this is the principal reason that mixing consoles, broadcast practice, and object-based panners overwhelmingly use intensity panning.

This connects directly to the comb-filtering and interference physics discussed in distance and air.

Numeric example: comb depth from a spaced pair

Take a spaced (AB) recording with an effective inter-channel delay of $\Delta t = 0.3$ ms for an off-center source and sum to mono. The first null is at

f_1 = \frac{1}{2 \times 0.0003\ \text{s}} = 1667\ \text{Hz},

with the comb period $1/\Delta t = 3.33$ kHz, so nulls fall at $1.67$ , $5.0$ , $8.3$ kHz. At a null the two contributions cancel completely (if equal level) — a $-\infty$ dB notch in theory, in practice a deep, audible dip. The same source captured by a coincident pair (intensity only) sums to mono with no notches at all. This single comparison explains a great deal of why coincident techniques are favored where downmix robustness matters.
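The null positions for any inter-channel delay follow directly from the formula for $f_n$ . Here is a small sketch, assuming two equal-level copies and an ideal delay; the helper names `comb_magnitude` and `comb_nulls` are made up for this illustration.

```python
import math

def comb_magnitude(freq_hz, delay_s):
    """|H(f)| = 2|cos(pi f dt)| for a signal summed with an equal-level delayed copy."""
    return 2.0 * abs(math.cos(math.pi * freq_hz * delay_s))

def comb_nulls(delay_s, f_max_hz=20000.0):
    """Null frequencies (2n - 1) / (2 dt), n = 1, 2, ..., up to f_max_hz."""
    nulls = []
    n = 1
    while (2 * n - 1) / (2.0 * delay_s) <= f_max_hz:
        nulls.append((2 * n - 1) / (2.0 * delay_s))
        n += 1
    return nulls

if __name__ == "__main__":
    for delay_ms in (0.5, 0.3):
        nulls = comb_nulls(delay_ms * 1e-3, f_max_hz=10000.0)
        listed = ", ".join(f"{f:.0f}" for f in nulls)
        print(f"dt = {delay_ms} ms -> nulls below 10 kHz at {listed} Hz")
```

Evaluating `comb_magnitude` halfway between two nulls returns the full peak value of 2, i.e. the $+6$ dB coherent sum.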

The Standard Geometry: The 60-Degree Triangle

Why plus and minus 30 degrees

The reference two-channel layout, codified in ITU-R BS.775 and standard studio practice, places the two loudspeakers and the listener at the vertices of an equilateral triangle: the speakers subtend $\pm 30^\circ$ from the listener's forward axis, a total included angle of $60^\circ$ , with the listener facing the midpoint and at the same distance from each speaker as the speakers are from each other. The listening position on the axis of symmetry is the sweet spot.

The $60^\circ$ angle is a compromise discovered empirically and confirmed perceptually. Make the angle wider (say $\pm 45^\circ$ or more) and the center image becomes unstable: phantom images near the middle weaken and tend to "pull apart" toward the two speakers, leaving a "hole in the middle." Make the angle narrower (say $\pm 15^\circ$ ) and the stereo image is stable but narrow — little spatial spread, approaching mono. At $\pm 30^\circ$ the frontal phantom images are reasonably stable across the whole arc while the width is generous, and the interaural cues produced by inter-channel level differences map smoothly onto plausible source directions. The equilateral geometry also guarantees equal path lengths from both speakers to the central listener, so a centered, equal-level signal arrives time-aligned and level-matched at the sweet spot, producing a solid center phantom.

The listening axis and symmetry

Two geometric facts dominate stereo quality. First, the listening axis (the perpendicular bisector of the line joining the speakers) is the only locus where a centered signal arrives with zero path difference between the two speakers. Step off that axis and the nearer speaker's sound arrives earlier and louder; by the precedence effect (below) the center image jumps toward it. Second, the geometry should be symmetric: equal distances, equal angles, matched levels, matched speakers, and a symmetric room. Any asymmetry biases every phantom image.

Numeric example: path difference off-axis

In the equilateral setup, suppose the speakers are $2.0$ m apart and the listener is $\sqrt{3}\approx1.73$ m back on axis (equilateral height for a $2$ m base). On axis, both speakers are $\sqrt{1.0^2 + 1.73^2} = 2.0$ m away — equal. Now shift the listener $0.30$ m to the right along the speaker line. Distances become

d_R = \sqrt{(1.0-0.30)^2 + 1.73^2} = \sqrt{0.49 + 3.0} \approx 1.868\ \text{m},

d_L = \sqrt{(1.0+0.30)^2 + 1.73^2} = \sqrt{1.69 + 3.0} \approx 2.166\ \text{m}.

The path difference is $\Delta d = 0.298$ m, giving an arrival time difference of

\Delta t = \frac{\Delta d}{c} = \frac{0.298\ \text{m}}{343\ \text{m/s}} \approx 0.87\ \text{ms}.

That is already near the $\sim 1$ ms full-deflection threshold: a $0.30$ m sideways shift is enough to pull a nominally centered image almost entirely onto the right speaker, illustrating how tightly the good image is bound to the sweet spot.
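The same geometry can be evaluated for any seat. This is a rough sketch, assuming point-source speakers, free-field propagation, and the $343$ m/s speed of sound used above; `off_axis_delay` is a hypothetical helper name.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, the value used in the text

def off_axis_delay(spacing_m, depth_m, shift_m, c=SPEED_OF_SOUND):
    """Distances and arrival-time lead for a listener shifted sideways.

    Speakers sit at x = -spacing/2 and x = +spacing/2 on the speaker line.
    The listener is depth_m behind that line and shift_m to the right of
    the symmetry axis. Returns (d_left, d_right, delta_t), where
    delta_t > 0 means the right speaker's sound arrives first.
    """
    half = spacing_m / 2.0
    d_left = math.hypot(shift_m + half, depth_m)
    d_right = math.hypot(shift_m - half, depth_m)
    return d_left, d_right, (d_left - d_right) / c

if __name__ == "__main__":
    depth = math.sqrt(3.0)  # equilateral height for a 2 m base
    for shift in (0.0, 0.10, 0.30):
        d_l, d_r, dt = off_axis_delay(2.0, depth, shift)
        print(f"shift {shift:.2f} m: dL={d_l:.3f} m  dR={d_r:.3f} m  dt={dt * 1e3:.2f} ms")
```

Even the $0.10$ m shift yields an arrival-time difference of a few tenths of a millisecond, which is already a sizeable fraction of the full-deflection figure.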

The Panning Law in Full

Why a linear crossfade leaves a hole

The mixing console's job is to position a mono source between the two speakers by choosing two gains, $g_L$ and $g_R$ . The naive choice is a linear (amplitude) crossfade: for a pan parameter $p$ running $0$ (left) to $1$ (right), set $g_L = 1-p$ and $g_R = p$ . At center, $p = 0.5$ , both gains equal $0.5$ , so $g_L + g_R = 1$ , matching the gain at either extreme ( $g = 1$ ). This is correct if signals add by amplitude. But two loudspeakers in a room, especially off the exact symmetry axis and for most of the spectrum, are largely uncorrelated at the ears in the sense that their powers add rather than their amplitudes.

The acoustic loudness you perceive tracks power, $\propto g^2$ . Compare the two cases:

At an extreme (hard left): one speaker at $g=1$ . Total power $\propto 1^2 = 1$ .
At center with a linear law: both speakers at $g = 0.5$ . If powers add, total power $\propto 0.5^2 + 0.5^2 = 0.5$ .

Common mistake

The center therefore sits at $10\log_{10}(0.5) = -3.01$ dB relative to the edges, i.e. about $3$ dB quieter. The image level dips as it crosses the middle: this is the audible "hole in the middle" of a linear pan.

(If the signals instead added perfectly in amplitude, as at very low frequencies on the exact axis, the center would be $g_L + g_R = 1$ , the same as the edges, and no dip — but that fully-correlated condition does not hold across the band.)

Constant-power (sine/cosine) law

To keep perceived loudness constant as the source pans, hold the summed power constant:

g_L^2 + g_R^2 = 1 \quad (\text{constant, independent of pan position}).

The natural parameterization that satisfies this identically is the sine/cosine (constant-power) law. Let $\theta$ run from $0$ (hard left) to $\pi/2$ (hard right):

g_L = \cos\theta, \qquad g_R = \sin\theta, \qquad g_L^2 + g_R^2 = \cos^2\theta + \sin^2\theta = 1.

At center, $\theta = \pi/4$ , $g_L = g_R = \cos(\pi/4) = \frac{1}{\sqrt 2} \approx 0.707$ . The center gain is $0.707$ rather than the linear law's $0.5$ — i.e. $+3$ dB hotter per channel — which exactly fills the hole: the summed power is $0.707^2 + 0.707^2 = 1$ , the same as at the edges. Constant-power panning is the default for uncorrelated/diffuse summation and is the standard choice in DAWs and game audio engines.
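The difference between the two laws shows up clearly when the summed power is tabulated across the pan range. The sketch below assumes a normalized pan parameter `p` in $[0, 1]$ that is mapped linearly onto $\theta \in [0, \pi/2]$ for the sine/cosine law; that mapping and the function names are illustrative choices, not a description of any particular console or DAW.

```python
import math

def linear_pan(p):
    """Linear crossfade gains; p = 0 is hard left, p = 1 is hard right."""
    return 1.0 - p, p

def constant_power_pan(p):
    """Sine/cosine gains; p in [0, 1] maps to theta in [0, pi/2]."""
    theta = p * math.pi / 2.0
    return math.cos(theta), math.sin(theta)

def summed_power_db(g_left, g_right):
    """Level of g_L^2 + g_R^2 in dB relative to one speaker at unity gain."""
    return 10.0 * math.log10(g_left ** 2 + g_right ** 2)

if __name__ == "__main__":
    for p in (0.0, 0.25, 0.5, 0.75, 1.0):
        lin = summed_power_db(*linear_pan(p))
        cp = summed_power_db(*constant_power_pan(p))
        print(f"p={p:.2f}  linear {lin:+.2f} dB  constant-power {cp:+.2f} dB")
```

The linear column dips to about $-3$ dB at `p = 0.5`, while the constant-power column stays at $0$ dB for every position.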

The tangent law (and the sine law) for image position

Keeping loudness constant is one requirement; placing the image at a specific angle is another. Two classical laws predict the perceived azimuth $\varphi$ of the phantom image from the gains, for the $\pm\theta_0$ loudspeaker pair.

The stereophonic law of sines (Bauer; derived for low frequencies, where the head casts no appreciable shadow and the two wavefronts simply add) states, with $\varphi$ positive toward the right as in the examples above,

\frac{\sin\varphi}{\sin\theta_0} = \frac{g_R - g_L}{g_R + g_L}.

The tangent law, commonly credited to Bernfeld, better matches localization when the listener's head turns toward the image and is the form most used in practice (it is also the form Pulkki's VBAP generalizes; see amplitude panning):

\frac{\tan\varphi}{\tan\theta_0} = \frac{g_R - g_L}{g_R + g_L}.

Here $\theta_0 = 30^\circ$ for the standard pair. Given a desired image angle $\varphi$ , solve for the gain ratio, then impose the constant-power constraint $g_L^2 + g_R^2 = 1$ to fix the individual gains.
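The two-step recipe (gain ratio from the tangent law, then constant-power normalization) has a closed form. Writing $r = \tan\varphi/\tan\theta_0$ and taking $\varphi$ positive toward the right, so that $r = (g_R - g_L)/(g_R + g_L)$ , the gains are $g_L = (1-r)/\sqrt{2(1+r^2)}$ and $g_R = (1+r)/\sqrt{2(1+r^2)}$ . The following sketch implements that under those assumptions; `tangent_law_gains` is a hypothetical name.

```python
import math

def tangent_law_gains(phi_deg, theta0_deg=30.0):
    """Constant-power gains that place the image at phi_deg by the tangent law.

    phi_deg is positive toward the right speaker (assumed convention) and
    must lie within +/- theta0_deg. Returns (g_left, g_right) satisfying
    g_left**2 + g_right**2 == 1.
    """
    r = math.tan(math.radians(phi_deg)) / math.tan(math.radians(theta0_deg))
    if abs(r) > 1.0 + 1e-12:
        raise ValueError("image angle lies outside the loudspeaker base")
    r = max(-1.0, min(1.0, r))            # absorb rounding at the endpoints
    norm = math.sqrt(2.0 * (1.0 + r * r))  # enforces g_L^2 + g_R^2 = 1
    return (1.0 - r) / norm, (1.0 + r) / norm

if __name__ == "__main__":
    for phi in (-30.0, -10.0, 0.0, 10.0, 30.0):
        g_l, g_r = tangent_law_gains(phi)
        print(f"phi={phi:+5.1f} deg  gL={g_l:.3f}  gR={g_r:.3f}")
```

The function returns equal gains of $0.707$ at $\varphi = 0$ and collapses to a single speaker at $\varphi = \pm\theta_0$ .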

The mono fold-down choice: -3, -4.5, or -6 dB

The center-gain question — how much to attenuate each channel at center relative to the edges — has no single right answer because it depends on how the two channels combine downstream:

If powers add (uncorrelated, diffuse, or stereo playback): constant power, center each channel at $-3$ dB, so the summed power is flat.
If amplitudes add (perfectly correlated, e.g. a mono fold-down where $L+R$ is summed coherently): constant amplitude (the linear law), center each channel at $-6$ dB, so the summed voltage is flat.
A common compromise used in many consoles centers each channel at $-4.5$ dB, splitting the difference so the result is acceptable both in stereo and in a coherent mono sum.

The table below summarizes the three laws (center-channel gains shown per channel relative to a hard-panned reference of $0$ dB).

Law	$g_L, g_R$ at center	Center gain/ch	Flat in stereo (power)	Flat in mono sum (amplitude)	Typical use
Linear (constant amplitude)	$0.5, 0.5$	$-6.0$ dB	No ( $-3$ dB dip)	Yes	Coherent mono, simple fades
Constant power (sine/cosine)	$0.707, 0.707$	$-3.0$ dB	Yes	No ( $+3$ dB bump)	DAWs, games, default
Compromise (e.g. $-4.5$ dB)	$\approx 0.595$	$-4.5$ dB	$\approx -1.5$ dB	$\approx +1.5$ dB	Broadcast, mono-safe mixes
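The stereo and mono columns of the table follow from two sums: $10\log_{10}(2g^2)$ when powers add and $20\log_{10}(2g)$ when amplitudes add, where $g$ is the per-channel center gain. A minimal sketch, assuming a hard-panned reference gain of $1$ and using illustrative names:

```python
import math

def center_sums_db(g_center):
    """Levels in dB for a center-panned source with per-channel gain g_center.

    Returns (per_channel, stereo_power_sum, mono_amplitude_sum), each
    relative to a hard-panned source at unity gain.
    """
    per_channel = 20.0 * math.log10(g_center)
    stereo_power_sum = 10.0 * math.log10(2.0 * g_center ** 2)
    mono_amplitude_sum = 20.0 * math.log10(2.0 * g_center)
    return per_channel, stereo_power_sum, mono_amplitude_sum

# Center gains for the three laws in the table above.
LAWS = {
    "linear": 0.5,
    "constant power": 1.0 / math.sqrt(2.0),
    "-4.5 dB compromise": 10.0 ** (-4.5 / 20.0),
}

if __name__ == "__main__":
    for name, g in LAWS.items():
        ch, stereo, mono = center_sums_db(g)
        print(f"{name:20s} g={g:.3f}  ch={ch:+.2f}  stereo={stereo:+.2f}  mono={mono:+.2f} dB")
```

No choice of $g$ makes both sums zero at once, because the two columns always differ by $20\log_{10}\sqrt{2} \approx 3$ dB; the compromise law simply splits that gap.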

Worked example: gains across several pan positions

Take the constant-power sine/cosine law with $\theta$ sweeping $0 \to \pi/2$ . We compute the per-channel gains and the inter-channel level difference $\Delta L = 20\log_{10}(g_R/g_L)$ , and the predicted image angle from the tangent law with $\theta_0 = 30^\circ$ .

Pan	$\theta$	$g_L = \cos\theta$	$g_R = \sin\theta$	$g_L^2+g_R^2$	$\Delta L$ (dB)	$\dfrac{g_R-g_L}{g_R+g_L}$	$\varphi$ (tangent law)
Hard L	$0^\circ$	$1.000$	$0.000$	$1.00$	$-\infty$	$-1.000$	$-30^\circ$
Mid L	$22.5^\circ$	$0.924$	$0.383$	$1.00$	$-7.66$	$-0.414$	$\approx -13.4^\circ$
Center	$45^\circ$	$0.707$	$0.707$	$1.00$	$0.00$	$0.000$	$0^\circ$
Mid R	$67.5^\circ$	$0.383$	$0.924$	$1.00$	$+7.66$	$+0.414$	$\approx +13.4^\circ$
Hard R	$90^\circ$	$0.000$	$1.000$	$1.00$	$+\infty$	$+1.000$	$+30^\circ$

Check the "Mid R" image angle: the tangent law gives

\tan\varphi = \tan(30^\circ)\times 0.414 = 0.5774 \times 0.414 = 0.239 \implies \varphi = \arctan(0.239) \approx 13.4^\circ.

Notice that the summed power column is flat at $1.00$ everywhere, which is the constant-power guarantee, while $\Delta L$ grows without bound as the source approaches a speaker. Notice also that the image angle is not linear in pan: the first $22.5^\circ$ step of $\theta$ away from center moves the image by $13.4^\circ$ , while the next equal step moves it the remaining $16.6^\circ$ to the speaker. The image therefore moves more slowly per unit of pan near the center, which is why fine center positioning is easy, and more quickly as it approaches a speaker.
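The table rows can be regenerated for any pan angle with a few lines of code. The sketch below assumes the sine/cosine law, a tangent-law image angle taken positive toward the right speaker, and a small threshold below which a gain is treated as silent; `pan_row` is an illustrative name.

```python
import math

def pan_row(theta_deg, theta0_deg=30.0):
    """One table row for the sine/cosine law at pan angle theta_deg (0..90).

    Returns (g_left, g_right, level_diff_db, ratio, phi_deg), where ratio is
    (g_R - g_L) / (g_R + g_L) and phi_deg is the tangent-law image angle,
    positive toward the right speaker (assumed convention).
    """
    t = math.radians(theta_deg)
    g_left, g_right = math.cos(t), math.sin(t)
    ratio = (g_right - g_left) / (g_right + g_left)
    phi = math.degrees(math.atan(math.tan(math.radians(theta0_deg)) * ratio))
    eps = 1e-12  # gains below this count as silent, avoiding log10(0)
    if g_left < eps:
        level_diff = math.inf
    elif g_right < eps:
        level_diff = -math.inf
    else:
        level_diff = 20.0 * math.log10(g_right / g_left)
    return g_left, g_right, level_diff, ratio, phi

if __name__ == "__main__":
    for theta in (0.0, 22.5, 45.0, 67.5, 90.0):
        g_l, g_r, dl, ratio, phi = pan_row(theta)
        print(f"theta={theta:5.1f}  gL={g_l:.3f}  gR={g_r:.3f}  "
              f"dL={dl:+.2f} dB  ratio={ratio:+.3f}  phi={phi:+.1f} deg")
```

Passing a finer grid of `theta` values gives the full pan-to-angle curve rather than five sample points.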

Mid/Side Representation

The encoding made explicit

Any stereo signal can be losslessly transformed into a Mid/Side (M/S) pair, which makes the "stereo is an encoding" claim almost tangible. Define

M = \frac{L + R}{\sqrt 2}, \qquad S = \frac{L - R}{\sqrt 2},

with the inverse

L = \frac{M + S}{\sqrt 2}, \qquad R = \frac{M - S}{\sqrt 2}.

(The $1/\sqrt 2$ keeps the transform power-preserving and self-inverse; many practical implementations drop the scaling and use $M = L+R$ , $S = L-R$ , which is the form Blumlein and most engineers quote.) The Mid channel is the sum — everything common to both channels, the mono-compatible core, where centered sources live. The Side channel is the difference — everything that distinguishes the two channels, i.e. the spatial information itself: width, off-center sources, decorrelated ambience.

M/S makes the spatial content of a mix directly readable. Center-panned content (equal $L$ and $R$ ) has $S = 0$ : a centered source is pure Mid. Anti-phase content ( $L = -R$ ) is pure Side, and a hard-panned source splits its energy equally between $M$ and $S$ . The ratio of Side energy to Mid energy is essentially a measure of stereo width, and the inter-channel correlation that governs phantom-image stability (and mono compatibility) is encoded in how much energy sits in $S$ versus $M$ . This is exactly the basis of the upmixing approach DAM Audio explores, which extracts and redistributes the $M$ and $S$ components across more speakers; see the discussion in HSR stereo upmixing for multi-speaker systems.

What M/S reveals and how engineers use it

Mono check

Because $M$ is the mono sum and $S$ is the part that vanishes in mono, M/S is the natural domain for thinking about compatibility: a mix with huge Side energy and little Mid will sound wide in stereo but thin or hollow in mono.

Engineers exploit M/S directly — M/S equalization (brighten the sides, anchor the center), M/S compression (control width dynamically), and the width control that simply scales $S$ : multiply $S$ by a factor $w$ and reconstruct. With $w = 0$ the signal collapses to mono; $w = 1$ is unchanged; $w > 1$ exaggerates width (at growing risk to mono compatibility). A coincident Blumlein/MS microphone rig captures $M$ and $S$ directly: a forward cardioid (or omni) as Mid and a sideways figure-of-eight as Side, with width set after the fact by the $M$ -to- $S$ ratio — a powerful demonstration that the spatial encoding can be recorded as such.
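The width control described above amounts to three lines of arithmetic per sample. Here is a minimal sketch using the power-preserving $1/\sqrt 2$ convention defined earlier and operating on single sample values; the function names are illustrative, and a real implementation would apply the same arithmetic to whole buffers.

```python
import math

INV_SQRT2 = 1.0 / math.sqrt(2.0)

def ms_encode(left, right):
    """L/R -> M/S with the power-preserving 1/sqrt(2) scaling."""
    return (left + right) * INV_SQRT2, (left - right) * INV_SQRT2

def ms_decode(mid, side):
    """M/S -> L/R; with this scaling the transform is its own inverse."""
    return (mid + side) * INV_SQRT2, (mid - side) * INV_SQRT2

def set_width(left, right, w):
    """Scale the Side component by w and reconstruct L/R.

    w = 0 collapses to mono, w = 1 leaves the signal unchanged,
    w > 1 exaggerates width.
    """
    mid, side = ms_encode(left, right)
    return ms_decode(mid, w * side)

if __name__ == "__main__":
    # A hard-left sample (L = 1, R = 0) at three width settings.
    for w in (0.0, 1.0, 2.0):
        l_out, r_out = set_width(1.0, 0.0, w)
        print(f"w={w:.1f}  L={l_out:+.3f}  R={r_out:+.3f}")
```

At `w = 2` the hard-left sample comes back with a negative right-channel value, which is the anti-phase content that erodes mono compatibility.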

Worked example: centered vs hard-panned source

Centered source. A mono source panned to center with the constant-power law has $L = R = 0.707\,x$ for source $x$ . Then

M = \frac{0.707x + 0.707x}{\sqrt 2} = \frac{1.414x}{1.414} = x, \qquad S = \frac{0.707x - 0.707x}{\sqrt 2} = 0.

All energy is in Mid, none in Side. In mono ( $L+R$ ) the source is fully present; width is zero. This is the maximally mono-compatible case.

Hard-panned source. The same source panned hard right: $L = 0$ , $R = x$ . Then

M = \frac{0 + x}{\sqrt 2} = 0.707x, \qquad S = \frac{0 - x}{\sqrt 2} = -0.707x.

Energy splits equally between Mid and Side: $|M|^2 = |S|^2 = 0.5|x|^2$ . The Side channel is now non-zero and carries the "it's on the right" information. In a coherent mono sum $L + R = x$ : the source survives, about $3$ dB below the $1.414x$ that the centered constant-power source produces in the same sum, but the positional information ( $S$ ) is discarded; mono cannot tell left from right. The contrast between $S=0$ (centered) and $|S|=|M|$ (hard-panned) is the M/S signature of where a source sits, and it is the cleanest illustration that the spatial scene is literally the difference signal.

Sweet Spot, Precedence, and Off-Axis Collapse

The sweet spot

Everything above assumes the listener sits at the apex of the equilateral triangle, on the axis of symmetry. There, the two speakers are equidistant and a centered signal fuses cleanly. The sweet spot is the small region around that point where the geometry is close enough to symmetric that phantom images remain where the mix intended. It is small precisely because, as the off-axis path-difference example showed, fractions of a millisecond of arrival-time asymmetry are enough to drag images sideways.

The precedence (Haas) effect

When the same sound arrives from two directions separated by a delay in the range of roughly $1$ to $\sim 40$ ms, the auditory system fuses them into a single perceived event and localizes it toward the first-arriving sound — the later arrival is largely suppressed for localization even if it is somewhat louder. This is the precedence effect (or law of the first wavefront; the Haas effect is the closely related observation about the level of a delayed echo before it is heard as separate). The fusion window is why a slightly off-center listener still hears a coherent (if displaced) image rather than two separate speakers, and why early room reflections do not destroy localization: the direct sound "wins." Within the first $\sim 1$ ms, however, precedence does not dominate — instead summing localization operates, and the time difference smoothly shifts the fused image. So there are two regimes: below $\sim 1$ ms, summing localization (image shifts continuously); from $\sim 1$ to $\sim 40$ ms, precedence (image locks to the earlier source); beyond, the late arrival is heard as a separate echo. These thresholds are detailed in psychoacoustics.
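The three regimes can be written down as a simple lookup. The sketch below uses the approximate $1$ ms and $40$ ms boundaries quoted above as default parameters; they are order-of-magnitude figures that vary with signal content, not sharp perceptual thresholds, and the function name is hypothetical.

```python
def arrival_regime(delay_ms, summing_limit_ms=1.0, echo_threshold_ms=40.0):
    """Classify the delay between two coherent arrivals.

    The default boundaries are the rough values from the text and are
    assumptions of this sketch, not fixed constants of hearing.
    """
    if delay_ms < 0.0:
        raise ValueError("delay must be non-negative")
    if delay_ms < summing_limit_ms:
        return "summing localization"  # image shifts continuously
    if delay_ms <= echo_threshold_ms:
        return "precedence"            # image locks to the earlier arrival
    return "echo"                      # later arrival heard separately

if __name__ == "__main__":
    for d in (0.3, 0.87, 5.0, 25.0, 80.0):
        print(f"{d:6.2f} ms -> {arrival_regime(d)}")
```

Because the boundaries are parameters, the same function can be rerun with a shorter echo threshold for clicks or a longer one for speech and music.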

Off-axis image collapse

Combine the two effects and you get the characteristic failure of stereo for a listener who is not centered. Sit closer to the right speaker and its sound arrives both earlier (precedence) and louder (inverse-square level difference). Both cues now agree: pull everything toward the right. The carefully balanced phantom images collapse onto the nearer speaker, and the center vanishes. This is why stereo is fundamentally a single-listener technology in its ideal form, and why every multi-speaker successor — surround, object-based, Ambisonics, WFS — is in part an attempt to widen the region over which images stay put.

Numeric example: precedence vs level trade-off

Suppose an off-center listener gets the right speaker $3$ dB louder and $0.4$ ms earlier than the left. Both cues point right, reinforcing. To restore a centered image acoustically you would have to delay the right channel by $\approx 0.4$ ms (cancel the time cue) and lower it $\approx 3$ dB (cancel the level cue) — i.e. re-center for that seat only, at the cost of every other seat. There is no two-channel solution that centers the image for all listeners simultaneously; the sweet spot is intrinsic. (Cross-talk-cancellation systems attack this from another angle entirely — see transaural.)

Capture Preview: Microphone Techniques

Stereo can be synthesized (pan pots) or captured. The capture side belongs to the recording part of this guide, but a short preview shows how the intensity/time continuum appears in microphone practice and what the encode step looks like for real acoustic scenes.

Coincident pairs: XY and Blumlein (pure intensity)

In a coincident pair, two directional capsules sit at effectively the same point, angled apart. Because they share a location, source distance is identical for both, so there is essentially no time difference — direction is encoded purely as a level difference via the capsules' directivity. XY uses two cardioids (gain $g(\theta) = 0.5 + 0.5\cos\theta$ ) crossed at $90$ – $135^\circ$ ; the Blumlein pair uses two figure-of-eights crossed at $90^\circ$ (worked above). Coincident pairs are perfectly mono compatible (intensity stereo) and give precise, stable imaging, at the cost of a somewhat "dry," less enveloping spatial impression because they lack the time-difference cues a real head receives.
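The cardioid formula makes it straightforward to compare an XY pair with the Blumlein pair worked earlier. The following is a sketch assuming ideal cardioids, perfect coincidence, and source azimuth positive toward the right; `xy_gains` and its parameters are illustrative names.

```python
import math

def cardioid(theta_rad):
    """Ideal cardioid polar gain, g = 0.5 + 0.5 cos(theta)."""
    return 0.5 + 0.5 * math.cos(theta_rad)

def xy_gains(azimuth_deg, included_angle_deg=90.0):
    """Gains of a coincident cardioid pair for a source at azimuth_deg.

    The capsule axes sit at -included/2 (left) and +included/2 (right);
    azimuth is measured from the forward center line, positive right.
    """
    a = math.radians(azimuth_deg)
    half = math.radians(included_angle_deg) / 2.0
    return cardioid(a + half), cardioid(a - half)

def level_difference_db(g_left, g_right):
    """Inter-channel level difference in dB, positive when right is louder."""
    return 20.0 * math.log10(g_right / g_left)

if __name__ == "__main__":
    for included in (90.0, 110.0, 135.0):
        g_l, g_r = xy_gains(20.0, included)
        dl = level_difference_db(g_l, g_r)
        print(f"included {included:5.1f} deg: gL={g_l:.3f}  gR={g_r:.3f}  dL={dl:+.2f} dB")
```

For a source at $20^\circ$ , cardioids crossed at $90^\circ$ give roughly $2.5$ dB, well below the $6.6$ dB of the Blumlein pair for the same source; widening the included angle increases the level difference.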

Near-coincident: ORTF (intensity plus time)

ORTF (named for the French broadcaster) uses two cardioids spaced $17$ cm apart and angled $110^\circ$ , roughly mimicking the spacing and shadowing of a human head. The spacing introduces time differences while the angling introduces level differences, so ORTF lives in the middle of the continuum and produces a spatial impression many engineers find natural. Mono compatibility is intermediate: the modest spacing causes some comb filtering on summation, but far less severely than wide AB. The $17$ cm spacing gives a maximum inter-channel delay of about $0.17\ \text{m}/(343\ \text{m/s}) \approx 0.5$ ms for a fully lateral source, which places the first mono-sum comb null near $1$ kHz (the $\Delta t = 0.5$ ms case computed earlier); this is why ORTF mixes are checked in mono.

Spaced pairs: AB (pure time, mostly)

AB uses two (often omnidirectional) microphones spaced widely apart — anywhere from $40$ cm to several meters. With omni capsules there is no directivity-based level difference, so direction is encoded almost entirely as time difference. Spaced pairs deliver a strong sense of spaciousness and low-frequency warmth (omnis extend low and capture room energy), but they are poorly mono compatible (large $\Delta t$ means dense, low-frequency comb filtering) and give less precise imaging. AB is pure(ish) time stereo, the opposite pole from Blumlein.

Technique	Capsules	Spacing	Angle	Cue type	Mono compatibility	Imaging
Blumlein	2 × figure-8	coincident	$90^\circ$	Intensity	Excellent	Very precise
XY	2 × cardioid	coincident	$90$ – $135^\circ$	Intensity	Excellent	Precise
ORTF	2 × cardioid	$17$ cm	$110^\circ$	Intensity + time	Good	Natural
AB	2 × omni	$40$ cm– $2$ m+	$0^\circ$	Time	Poor	Spacious, vague

The progression Blumlein → XY → ORTF → AB is a tour across the entire intensity-to-time continuum, with mono compatibility and imaging precision trading off against spaciousness exactly as the theory predicts.

Limits of Stereo and Why It Seeds Everything

What stereo cannot do

Two-channel stereo extracts a remarkable amount of spatial information from very little, but its limits are structural, not incidental:

Frontal arc only. Phantom images can be placed only on (and slightly beyond, with tricks) the $60^\circ$ arc between the two front speakers. There is no way to place a stable image to the side, behind, or above the listener with two frontal channels. Stereo reproduces a frontal stage, not a surrounding field. Surround and matrix systems exist precisely to extend the arc — see surround and matrix.
Sweet-spot bound. As shown, off-axis listeners suffer image collapse via combined level and precedence cues. The good image exists for essentially one listener in one seat.
No height, no true envelopment. With two horizontal frontal channels there is no vertical cue and no genuine envelopment (sound coming from many directions at once). A sense of space can be suggested via reverberation and decorrelation, but it is a frontal illusion, not a surrounding sound field. Genuine envelopment requires more channels and is treated under direct, diffuse, and envelopment and reverberation.
Distance is indirect. Stereo encodes direction (azimuth) far better than distance; depth is conveyed only through level, reverberation, and air absorption cues, covered in distance and air.
Phantom instability across frequency. The interaural cues synthesized by summing localization are not perfectly consistent across the whole band (the ITD-like cue from level panning is frequency-dependent), so phantom images can be slightly blurry or shift with frequency compared with a real source — a known limitation that Ambisonic decoding theory (Gerzon's velocity/energy vectors) quantifies and that motivates higher-order systems.
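
The last limit can be illustrated with Gerzon's two localization vectors for a single speaker pair. This is a sketch, not a full decoder analysis: it assumes the usual definitions (velocity vector as the gain-weighted mean of the speaker directions, energy vector as the squared-gain-weighted mean), in-phase non-negative gains, and speakers at $\pm 30^\circ$. The function name is hypothetical.

```python
import math


def gerzon_angles_deg(g_left, g_right, half_angle_deg=30.0):
    """Return (velocity_angle, energy_angle) in degrees, positive toward the left.

    Velocity vector direction: tan(a) = tan(a0) * (gL - gR) / (gL + gR),
    which is the stereo tangent law.
    Energy vector direction:   tan(a) = tan(a0) * (gL^2 - gR^2) / (gL^2 + gR^2).
    Requires non-negative gains that are not both zero.
    """
    t0 = math.tan(math.radians(half_angle_deg))
    velocity = math.atan(t0 * (g_left - g_right) / (g_left + g_right))
    energy = math.atan(
        t0 * (g_left**2 - g_right**2) / (g_left**2 + g_right**2)
    )
    return math.degrees(velocity), math.degrees(energy)


for g_l, g_r in ((1.0, 1.0), (1.0, 0.5), (1.0, 0.0)):
    v, e = gerzon_angles_deg(g_l, g_r)
    print(f"gL={g_l}, gR={g_r}: velocity {v:.1f} deg, energy {e:.1f} deg")
```

The two vectors agree at center and at the hard-panned extremes but diverge in between. In Gerzon's theory the velocity vector predicts low-frequency localization and the energy vector predicts it at higher frequencies, so the gap between the two angles is one way to read the frequency-dependent blur described above.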

Key takeaway

Despite these limits, stereo is the generative case. Each subsequent technique in Part II keeps the core idea — encode a scene into channels, decode via real transducers and the listener's ears — and relaxes one of stereo's constraints.

Why every later method grows from this seed

Amplitude panning generalizes the two-speaker tangent law to arbitrary speaker layouts (pairwise and 3D) — amplitude panning.
Surround/matrix extends the frontal arc to a full horizontal ring and adds matrix encode/decode in the spirit of Blumlein's shuffler — surround and matrix.
Object-based audio stores the source positions as metadata and renders the panning at playback to whatever layout is present — object-based.
Ambisonics makes "encode the field, decode to any speakers" mathematically exact via spherical harmonics, with Blumlein's figure-of-eights reappearing as the first-order components — Ambisonics.
Binaural synthesizes the actual ear signals with HRTFs, bypassing loudspeaker summation entirely — binaural.
Transaural delivers binaural signals over loudspeakers by canceling the very cross-talk that stereo summing localization relies on — transaural.
Wave field synthesis abandons phantom imaging for the physical reconstruction of the wavefront itself, removing the sweet-spot limit — WFS.

In every one of these, you will recognize the stereo skeleton: coherent feeds to multiple transducers, inter-channel differences carrying geometry, and the listener's duplex cues doing the final decoding. Master the phantom image and you have the conceptual key to all of them. For the broader map of where these techniques sit, return to the techniques overview and the formats reference.

Summary

The phantom image is the perceptual phenomenon at the heart of all loudspeaker spatialization: two coherent channels are fused by summing localization into a single source whose position is set by inter-channel level and time differences, because those differences manufacture, at the listener's ears, the very ITD/ILD duplex cues that natural hearing uses. Blumlein's 1931 patent established that this is an encoding — two channels carrying one spatial signal — and that intensity and time encodings are interconvertible. Intensity stereo is mono-compatible; time stereo combs on summation. The standard equilateral $\pm 30^\circ$ geometry stabilizes frontal images for a centered listener. The panning law must hold summed power constant ( $g_L^2 + g_R^2 = 1$ , the sine/cosine law) to avoid the $-3$ dB center hole, while the tangent law maps gains to image angle; the $-3$ / $-4.5$ / $-6$ dB choice depends on whether power or amplitude adds downstream. The Mid/Side transform exposes the encoding directly: the center lives in $M$ , the spatial scene lives in $S$ . The sweet spot, precedence, and off-axis collapse bound the technology to essentially one listener and a frontal arc with no height or true envelopment — limits that every later method in this guide is built to relax, while keeping stereo's encode-then-decode DNA.

References

Blumlein, A. D. (1931). Improvements in and relating to Sound-transmission, Sound-recording and Sound-reproducing Systems. British Patent 394,325. (Reprinted in the Journal of the Audio Engineering Society, 1958.)
Blauert, J. (1997). Spatial Hearing: The Psychophysics of Human Sound Localization (revised edition). MIT Press, Cambridge, MA.
Rumsey, F. (2001). Spatial Audio. Focal Press, Oxford.
Lipshitz, S. P. (1986). "Stereo Microphone Techniques: Are the Purists Wrong?" Journal of the Audio Engineering Society, 34(9), 716–744.
Bauer, B. B. (1961). "Phasor Analysis of Some Stereophonic Phenomena." Journal of the Acoustical Society of America, 33(11), 1536–1539.
Gerzon, M. A. (1992). "Panpot Laws for Multispeaker Stereo." Presented at the 92nd AES Convention, Vienna, Preprint 3309.
Pulkki, V. (1997). "Virtual Sound Source Positioning Using Vector Base Amplitude Panning." Journal of the Audio Engineering Society, 45(6), 456–466.
Williams, M. (1987). "Unified Theory of Microphone Systems for Stereophonic Sound Recording." Presented at the 82nd AES Convention, London, Preprint 2466.
ITU-R Recommendation BS.775. Multichannel Stereophonic Sound System with and without Accompanying Picture. International Telecommunication Union, Geneva.
