Stereo Microphone Techniques

Two microphones, correctly placed, can capture a sound field that — played over two loudspeakers or headphones — recreates a convincing sense of where each instrument sat and how the room breathed around it. This is the oldest and still the most important problem in spatial capture, and everything in surround recording and immersive 3D recording is, structurally, an extension of the two-channel decisions you make here.

This chapter develops every classic stereo pair from first principles. The single idea to keep in front of you is the one introduced in Principles of Spatial Capture: capture and reproduction are duals. When we reproduce stereo, panning a source between two speakers uses either a level difference (amplitude panning, see Stereo and Amplitude Panning) or, in some systems, a time difference. When we capture stereo, the microphone array converts a source's physical direction into exactly those same two cues — an inter-channel level difference (ICLD) or an inter-channel time difference (ICTD) — which the listener's ears then re-decode into the interaural cues (ILD and ITD) described in Psychoacoustics. A microphone technique is therefore nothing more than a direction-to-cue encoder, and the family it belongs to is defined by which cue it generates.

1. The three families and their trade map

Every two-microphone stereo technique falls into one of three families, distinguished by the spacing between the capsules and therefore by the cue they primarily encode.

Coincident pairs place two directional capsules at (ideally) the same point in space, one above the other, with their diaphragms vertically stacked so that sound from any horizontal direction arrives at both at the same instant. Because there is no path-length difference, there is no time/phase difference between channels: direction is encoded purely as a level difference, set by the difference between the two polar patterns at the source's angle. This is intensity stereo, the exact capture-side analogue of amplitude panning.
Spaced pairs separate two (usually omnidirectional) capsules by tens of centimetres to a couple of metres. A wavefront arriving off-axis reaches the near mic before the far mic, so direction is encoded primarily as a time difference, $\Delta t = \Delta d / c$ , where $\Delta d$ is the extra path length and $c \approx 343\ \mathrm{m/s}$ . This is time-of-arrival stereo.
Near-coincident pairs sit in between: two directional capsules separated by a small distance (typically 17–30 cm) and splayed apart by an angle. They generate both a modest level difference (from the angled directional patterns) and a modest time difference (from the small spacing). They deliberately exploit both ear cues at once.

The reason three families exist — rather than one "best" method — is that the cues trade against each other. Level-difference (coincident) stereo gives the sharpest, most stable phantom images and is perfectly mono-compatible, because summing two coincident channels never produces destructive interference. But pure intensity stereo tends to sound somewhat "flat" or two-dimensional, with less of the spacious, enveloping quality that real spaces have, because it lacks the decorrelation between channels that the ear reads as spaciousness (see Direct, Diffuse and Envelopment). Time-difference (spaced) stereo does the opposite: it produces gloriously wide, enveloping, decorrelated sound, especially in the bass, but its images are diffuse and can wander, and it carries a real risk of comb filtering when the channels are summed to mono. Near-coincident techniques are engineering compromises that buy much of the spaciousness of spaced arrays while keeping images reasonably tight and mono behaviour tolerable.

Property	Coincident (XY, Blumlein, MS)	Near-coincident (ORTF, NOS, DIN)	Spaced (AB)
Primary cue encoded	Level (ICLD)	Level + time	Time (ICTD)
Capsule spacing	~0 cm	17–30 cm	40 cm – 2 m+
Image sharpness	Very sharp, stable	Good	Diffuse, can wander
Sense of width/space	Modest	Wide	Very wide
Envelopment / "air"	Low–moderate	Good	Excellent (esp. LF)
Mono compatibility	Excellent (no phase cancel)	Good	Poor (comb filtering risk)
Low-frequency weight	Light (directional capsules)	Light–moderate	Heavy (omnis)
Adjustable in post	MS: yes (width); others: no	No	No

Key takeaway

The three families are not "better" and "worse"; they are points on a single trade surface whose axes are localization accuracy, envelopment, and mono compatibility. Choosing a technique is choosing where on that surface your recording should sit.

A central concept that recurs below is the Stereophonic Recording Angle (SRA), sometimes called the recording angle or the angle of acceptance. The SRA is the total angle, measured at the microphone, subtended by the part of the sound stage that will be reproduced across the full width of the loudspeaker pair — from hard-left speaker to hard-right speaker. A source at the edge of the SRA images at the corresponding speaker; a source in the middle images centre; a source outside the SRA images "in the speaker" and cannot move further out (it piles up at the edge). Matching the SRA to the angular width of your ensemble is the single most important geometric decision in stereo capture, and Michael Williams' tables of SRA versus pattern, angle, and spacing are the standard reference for getting it right.

2. Coincident XY: angled directional pairs

The XY pair is the archetype of pure intensity stereo. Take two identical directional microphones — most often cardioids — and place their capsules one directly above the other so they are coincident in the horizontal plane. Splay them symmetrically about the centre line by a total included angle $2\alpha$ (each capsule pointing $\alpha$ off centre). Common choices are $90^\circ$ and $120^\circ$ , sometimes up to $135^\circ$ .

How the patterns encode direction

Because the capsules are coincident, a source at horizontal angle $\theta$ (measured from the array's forward axis, positive toward the right) reaches both at the same time. The only thing that differs is the sensitivity of each capsule at that angle, governed by its polar pattern. A first-order microphone has the directional response

g(\phi) = A + (1-A)\cos\phi

where $\phi$ is the angle between the source and the capsule's own axis, and $A$ is the pattern parameter: $A=1$ omni, $A=0.5$ cardioid, $A=0$ figure-of-eight, with $A=0.37$ a supercardioid and $A=0.25$ a hypercardioid. For the right-pointing capsule $\phi_R = \theta - \alpha$ and for the left-pointing capsule $\phi_L = \theta + \alpha$ . The inter-channel level difference, in decibels, is

\Delta L(\theta) = 20\log_{10}\frac{g(\theta-\alpha)}{g(\theta+\alpha)}.

That level difference is the only stereo information XY produces. The reproduction system then turns it back into a phantom image: roughly, a channel level difference of about $15{-}18\ \mathrm{dB}$ pushes the image fully to one speaker, and $0\ \mathrm{dB}$ keeps it centred (the same encode/decode relationship discussed for amplitude panning in Stereo).

Worked example: SRA of a 90° cardioid XY

Take two cardioids ( $A=0.5$ ) at a total angle of $90^\circ$ , so $\alpha = 45^\circ$ . We find the source angle $\theta$ at which the level difference reaches roughly $16\ \mathrm{dB}$ — the value that maps to a hard-left or hard-right image. The full SRA is then $2\theta$ .

The right capsule points at $+45^\circ$ , the left at $-45^\circ$ . Compute the cardioid gains as the source swings right:

At $\theta = 0^\circ$ : both gains $= 0.5 + 0.5\cos 45^\circ = 0.854$ . $\Delta L = 0\ \mathrm{dB}$ → centre. Good.
At $\theta = 45^\circ$ (source on the right capsule's axis): $g_R = 0.5+0.5\cos 0^\circ = 1.0$ ; $g_L = 0.5 + 0.5\cos 90^\circ = 0.5$ . $\Delta L = 20\log_{10}(1.0/0.5) = 6.0\ \mathrm{dB}$ .
At $\theta = 65^\circ$ : $g_R = 0.5+0.5\cos 20^\circ = 0.970$ ; $g_L = 0.5+0.5\cos 110^\circ = 0.329$ . $\Delta L = 20\log_{10}(0.970/0.329) = 9.4\ \mathrm{dB}$ .
At $\theta = 90^\circ$ : $g_R = 0.5+0.5\cos 45^\circ = 0.854$ ; $g_L = 0.5+0.5\cos 135^\circ = 0.146$ . $\Delta L = 20\log_{10}(0.854/0.146) = 15.3\ \mathrm{dB}$ .

So a source about $90^\circ$ off the centre line produces roughly the $\approx 16\ \mathrm{dB}$ needed for a hard-edge image, giving an SRA on the order of $\pm 90^\circ$ , i.e. a very wide $\sim 180^\circ$ acceptance. That is a hallmark of $90^\circ$ cardioid XY: it captures a wide arc but with relatively gentle level gradients, so the centre of the stage occupies a lot of the reproduced width and the extreme sides feel compressed. To narrow the SRA — concentrate a smaller ensemble across the full stereo width — you either widen the included angle (more $\alpha$ steepens the gradient) or switch to a more directional pattern (hypercardioid or figure-of-eight). Williams' SRA charts tabulate exactly these combinations.

Rule of thumb

For XY, widening the included angle and/or using a tighter polar pattern narrows the recording angle (the stage spreads wider across the speakers). If a chamber group images bunched in the middle, increase the splay from $90^\circ$ toward $120{-}135^\circ$ before you reach for closer placement.

Strengths and weaknesses

XY's great virtues are rock-solid central imaging and flawless mono compatibility — since the channels are time-aligned, summing them only changes level, never phase. Its weaknesses are two. First, off-axis colouration: real cardioids are not perfectly cardioid at high frequencies, so sources at the sides are picked up through a coloured, often dull off-axis response, and because that side energy is the stereo signal, the timbral penalty matters. Second, the relative lack of channel decorrelation makes XY sound less spacious and enveloping than spaced or near-coincident methods — it is accurate but can feel "small". XY is consequently a workhorse for broadcast, sources where mono-safety is paramount, and tight spaces, rather than for lush concert-hall recordings.

3. Coincident Blumlein: crossed figure-of-eights

Alan Blumlein's 1931 patent (the foundational document of stereophony) describes the most elegant coincident array of all: two figure-of-eight microphones, coincident, crossed at exactly $90^\circ$ , each pointing $45^\circ$ off centre. Because the bidirectional pattern is $g(\phi) = \cos\phi$ , the geometry has a beautiful property.

The elegant symmetry

With $A=0$ and $\alpha = 45^\circ$ , the two gains are $g_R = \cos(\theta - 45^\circ)$ and $g_L = \cos(\theta + 45^\circ)$ . Their sum and difference reduce, by the angle-addition identities, to

g_R + g_L = \sqrt{2}\,\cos\theta, \qquad g_R - g_L = \sqrt{2}\,\sin\theta.

The sum (the mono signal) follows a forward-facing figure-of-eight ( $\cos\theta$ ) and the difference (the stereo "side" information) follows a sideways figure-of-eight ( $\sin\theta$ ). The encoded level difference therefore tracks $\tan\theta$ in a clean, almost linear-near-centre way, and over the front $\pm 45^\circ$ quadrant the localization is exceptionally accurate and even — sources are spread across the stage in near-perfect proportion to their true angle. The front SRA of Blumlein is close to $\pm 45^\circ$ (a $90^\circ$ acceptance), which happens to match the angular width many ensembles occupy from a good seat. Many engineers consider Blumlein the most natural-imaging stereo technique that exists.

Rear pickup and the room

A figure-of-eight is equally sensitive front and back, so the crossed pair has four lobes. The two front lobes capture the ensemble; the two rear lobes capture the room behind the microphone — and, crucially, with the opposite left/right sense and opposite polarity. A source arriving from the left-rear quadrant is picked up predominantly by the right-pointing eight's rear lobe, so it images on the right. This is not necessarily a flaw: in a good-sounding hall, the antiphase rear pickup adds a rich, enveloping reverberant return that decorrelates the channels and gives Blumlein much more spaciousness than its small footprint would suggest. But it makes Blumlein unforgiving of the room and of placement. In a poor or noisy room, or too close to reflective side walls, the rear and side lobes drag in unwanted reflections that fold into the image confusingly.

There is also a low-frequency consideration: figure-of-eights are pressure-gradient transducers and roll off in the bass at close range only via proximity effect; at concert distance they have less low-end weight than omnis, so Blumlein recordings can sound slightly lean below 100 Hz compared with a spaced-omni approach. The reward is imaging precision and a wonderfully coherent depth perspective.

Room dependence

Blumlein puts its rear-facing lobes straight onto the room behind the mic, at full sensitivity. In a dead, small, or acoustically ugly space, those lobes will sabotage the recording. Reserve Blumlein for rooms you actually want to hear, and keep the array well away from side walls.

4. Mid/Side (MS): the matrixable coincident pair

Mid/Side is a coincident technique with a unique superpower: its stereo width is continuously adjustable after the fact, and it is perfectly mono-compatible by construction. It is the capture-side twin of the sum-and-difference idea introduced in Stereo.

The matrix

Place two coincident capsules: a Mid microphone (any pattern — omni, cardioid, hypercardioid) pointing straight at the centre of the stage, and a Side microphone that must be a figure-of-eight oriented sideways, its positive lobe to the left and negative lobe to the right (so it is null toward the front). The Mid signal $M$ carries the on-axis, mono content; the Side signal $S = g_8(\theta) \propto \sin\theta$ carries the left/right difference — positive for sources on the left, negative on the right, zero dead centre.

These two signals are not yet left and right. You decode them with a sum-and-difference matrix:

L = M + S, \qquad R = M - S.

A source on the left has $S>0$ , so $L = M+S$ is boosted and $R = M-S$ is reduced — it images left, exactly as desired. A central source has $S=0$ , so $L=R=M$ — dead centre. The decode is identical in form to the MS playback matrix and to Blumlein's sum/difference structure above; indeed Blumlein is simply MS with a figure-of-eight Mid, since a forward eight plus a sideways eight, summed and differenced, regenerate the two crossed eights.

Continuously adjustable width

The decode generalises by scaling the Side level by a width control $w$ before the matrix:

L = M + w\,S, \qquad R = M - w\,S.

Setting $w=0$ yields $L=R=M$ : pure mono, the bare Mid microphone. Increasing $w$ progressively widens the stereo image; large $w$ exaggerates the sides and the room. Because you record $M$ and $S$ as separate tracks, you set $w$ in post — auditioning narrow for a tight pop vocal bus or wide for an orchestra, after the session. No other coincident pair offers this.

Mono compatibility is automatic

Sum the decoded channels:

L + R = (M + wS) + (M - wS) = 2M.

The Side signal cancels exactly, leaving twice the Mid. A folded-to-mono MS recording collapses to the Mid microphone alone — which, being a single sensible pattern aimed at the source, is always clean. There is no comb filtering, no phase cancellation, no level suck-out. This is why MS is beloved in broadcast and film, where a mono fold-down is guaranteed somewhere downstream.

Worked example: width and the effective pattern

Use a cardioid Mid ( $M(\theta)=0.5+0.5\cos\theta$ ) and a figure-of-eight Side ( $S(\theta)=\sin\theta$ ) with width $w=1$ . The effective left-channel pattern is

L(\theta) = M(\theta) + S(\theta) = 0.5 + 0.5\cos\theta + \sin\theta.

Evaluate the left and right gains for a source at $\theta = +30^\circ$ (to the left of centre, by our sign convention positive-left for $S$ — here let positive $\theta$ be left so $\sin\theta>0$ ):

$M = 0.5 + 0.5\cos 30^\circ = 0.5 + 0.433 = 0.933$
$S = \sin 30^\circ = 0.5$
$L = 0.933 + 0.5 = 1.433$ ; $R = 0.933 - 0.5 = 0.433$
$\Delta L = 20\log_{10}(1.433/0.433) = 10.4\ \mathrm{dB}$ to the left — a clearly left-of-centre image.

Now halve the width, $w=0.5$ : $L = 0.933 + 0.25 = 1.183$ , $R = 0.933 - 0.25 = 0.683$ , $\Delta L = 4.8\ \mathrm{dB}$ — the same source pulls back toward centre. Double the width, $w=2$ : $L=1.933$ , $R=-0.067$ — the right channel has gone negative, meaning the effective right pattern now has a rear lobe and the image is pushed hard left with antiphase side content. This shows both the power and the danger of the width knob: small $w$ for safety and focus, large $w$ for drama, but very large $w$ creates out-of-phase energy that can sound unnatural and harm any later mono fold (which, note, still cancels $S$ entirely — the danger is in the stereo extremes, not the mono sum).

MS is the universal coincident pair

Choosing the Mid pattern reshapes the whole technique: an omni Mid gives a warm, room-rich result; a cardioid Mid is the broadcast standard; a hypercardioid Mid tightens the centre. And a figure-of-eight Mid is Blumlein. MS is best understood not as one technique but as the parametric family that contains the others.

Double M/S: extending the matrix to surround

Double M/S adds a second figure-of-eight to a normal M/S pair, this time facing front-to-back, so the coincident cluster becomes a forward Mid (cardioid), a sideways Side (figure-8) and a rear-facing figure-8. Matrixing the front and side eights yields the front L/R exactly as in stereo M/S; matrixing the rear eight against the Mid yields a back-facing L/R pair. Why it exists: it keeps everything coincident — and therefore mono-compatible and continuously adjustable in post — while reaching a surround image from three capsules in one compact body. Strengths: tiny footprint, fully matrixable front/back width after the fact, no spacing artefacts. Weaknesses: being coincident it captures little time-difference envelopment, so the surround field is precise but less "spacious" than a spaced surround array, and it leans heavily on clean figure-8 patterns.

5. Near-coincident pairs: ORTF, NOS, DIN

Near-coincident techniques are the practical compromise that dominates classical and acoustic stereo recording. By splaying two directional capsules and separating them by a small distance, they generate a level difference (from the angle) and a time difference (from the spacing) simultaneously — feeding the listener's ear both of its native localization cues, the ILD and ITD of Psychoacoustics, in roughly the proportions the ear expects from a real source.

Why the small spacing matters

The spacing is chosen to approximate the acoustic separation of the human ears (the head is about 17–18 cm across, ear-to-ear path longer). With capsules ~17 cm apart and angled outward, a source off to one side reaches the near capsule earlier — generating an ICTD — and lies more on-axis for the near capsule — generating an ICLD. The two cues reinforce one another, so near-coincident pairs localise convincingly while the time difference between channels decorrelates them enough to deliver the spaciousness and envelopment that pure XY lacks. The spacing is small enough that, over most of the audio band, the resulting phase differences do not yet produce severe comb filtering in a mono sum, so mono compatibility is merely good rather than perfect.

The named setups

The classic near-coincident configurations are named after the broadcasters and committees that codified them.

Setup	Capsules	Spacing	Included angle	Approx. SRA	Character
ORTF	Two cardioids	17 cm	110°	≈ ±48° (96°)	The reference; natural width, good depth
NOS	Two cardioids	30 cm	90°	≈ ±40° (80°)	Wider spacing, more spacious, narrower SRA
DIN	Two cardioids	20 cm	90°	≈ ±50° (100°)	German broadcast standard, balanced
RAI	Two cardioids	21 cm	100°	≈ ±45° (90°)	Italian broadcast standard; slightly wider splay than DIN
Faulkner (phased array)	Two figure-of-eights	20 cm	0° (parallel)	wide	Forward-facing eights; less side colouration
EBS / "Stereo 180"	Two cardioids	30 cm	60°	≈ ±45°	Wide spacing, narrow splay

ORTF (from the Office de Radiodiffusion-Télévision Française) — two cardioids, capsules 17 cm apart, splayed to a $110^\circ$ included angle — is the canonical near-coincident pair and an excellent default for a single stereo image of an ensemble. Its recording angle of roughly $\pm 48^\circ$ suits a group that subtends about $90{-}100^\circ$ from the microphone, a typical concert perspective. NOS (Nederlandse Omroep Stichting), $30\ \mathrm{cm}$ / $90^\circ$ , trades a touch of central focus for greater spaciousness and a slightly narrower acceptance. DIN ( $20\ \mathrm{cm}$ / $90^\circ$ ) sits between them. All three are "right" — the choice is a matter of how wide the source is and how much room you want.

Practical default

If you do not know which stereo technique to use on an acoustic ensemble in a decent room, set up ORTF, point it at the centre of the group from a height of 2–3 m and a distance roughly equal to the ensemble's width, and you will rarely be far wrong. It is the safest single choice across the widest range of sources.

6. Spaced AB: pure time-difference stereo

Spaced AB is the oldest stereo method in spirit (Bell Labs experimented with spaced arrays in the 1930s, Decca built an empire on them) and the one that sounds, to many ears, the most enveloping. Place two microphones — classically large-diaphragm omnidirectional capsules — separated by a substantial distance, anywhere from $40\ \mathrm{cm}$ to two metres or more, both pointing forward (parallel) or angled slightly out.

The geometry of time-of-arrival stereo

Because omnis are non-directional, the two channels are nearly identical in level regardless of source angle; the stereo information lives almost entirely in the time of arrival. For a source far away at angle $\theta$ off the forward axis, with capsule separation $d$ , the extra path to the far microphone is

\Delta d = d\sin\theta, \qquad \Delta t = \frac{d\sin\theta}{c}.

With $d = 0.5\ \mathrm{m}$ and a source at $\theta = 30^\circ$ : $\Delta d = 0.5\sin 30^\circ = 0.25\ \mathrm{m}$ , so $\Delta t = 0.25/343 \approx 0.73\ \mathrm{ms}$ . An inter-channel time difference of roughly $1.1\ \mathrm{ms}$ is about what fully lateralises a phantom image, so this source images well off to the side. The reproduction system reads that ICTD via the precedence effect (see Psychoacoustics) and places the image toward the earlier speaker.

Why it is enveloping but imprecise

Two consequences follow from encoding direction as time rather than level. First, the low-frequency envelopment is superb. Widely spaced omnis pick up the low-frequency room field at two points far enough apart that the bass is substantially decorrelated between channels, which the ear interprets as a large, enveloping space — and omnis themselves have full, extended low-frequency response with no proximity roll-off. Spaced omnis simply sound big at the bottom in a way coincident directional pairs cannot. Second, the imaging is diffuse and can wander. Time-difference cues are frequency-dependent and ambiguous: a given $\Delta t$ corresponds to different phase angles at different frequencies, so a single source does not collapse to one crisp point but spreads, and its apparent position can shift with frequency or with small head movements. Centre images in particular are weakly defined with only two widely spaced mics — which is precisely why a third, central microphone is so often added (the Decca tree, below).

The fundamental spaced-AB trade is therefore spacing versus image stability: wider spacing increases the time differences (more dramatic width and decorrelation) but worsens the wandering and the comb-filter risk; narrower spacing tightens the image but reduces the spaciousness that motivated using omnis in the first place. Common starting points are $40{-}60\ \mathrm{cm}$ for a focused, still-spacious image and up to a metre or more for a deliberately huge, atmospheric one. A Wide AB (spacing around $60\ \mathrm{cm}$ and beyond) deliberately sits at the spacious end of this trade — chosen for big-room, atmospheric captures where envelopment matters more than a tight centre.

The 3-to-1 rule for multi-mic situations

Whenever you place more than one microphone on a source — two spaced mics, or a main pair plus spot mics — the time/phase difference between them creates comb filtering when their outputs are summed: a periodic series of peaks and notches whose first null sits at the frequency where the path difference equals half a wavelength. The classic guard against audible comb filtering is the 3-to-1 rule: the distance between any two microphones should be at least three times the distance from each microphone to its intended source. If a spot mic is $0.3\ \mathrm{m}$ from its instrument, keep it at least $0.9\ \mathrm{m}$ from the next mic. The rule works because tripling the path difference drops the leaked source roughly $9{-}10\ \mathrm{dB}$ relative to the direct sound, making the comb notches shallow enough to be inaudible.

\text{If } d_{\text{mic-to-source}} = r, \text{ then keep } d_{\text{mic-to-mic}} \ge 3r.

Comb filtering ruins spaced recordings

The number-one way to wreck a spaced-mic recording is to ignore path differences. Two omnis a metre apart summed to mono, or a spot mic too close to the main pair, produce comb filtering — a hollow, phasey, frequency-dependent coloration that you cannot fix later. Honour the 3-to-1 rule, check your mono fold-down during the session, and listen for the tell-tale "flange" before you commit.

7. Spaced arrays and the Decca-tree preview

The diffuse-centre weakness of two spaced omnis is solved by adding microphones, and the most celebrated solution is the Decca tree. Developed by Decca's engineers (Roy Wallace, Kenneth Wilkinson and colleagues) in the 1950s for orchestral recording, it is an L–C–R cluster of three omnidirectional microphones — historically Neumann M50s, omnis on a sphere with a gentle treble lift — mounted on a T-shaped bar. A typical geometry places the left and right mics about $2\ \mathrm{m}$ apart and the centre mic pushed forward, roughly $1{-}1.5\ \mathrm{m}$ ahead of the L–R axis, suspended high above and slightly behind the conductor.

The forward centre microphone is the key. It anchors central images that two flanking spaced omnis alone would leave vague, using the precedence effect: sound from the centre of the orchestra reaches the forward centre mic first, fixing soloists and the conductor's podium firmly in the middle, while the wide L and R omnis supply width, depth, and the enveloping hall sound. The result is the lush, deep, wide orchestral perspective that defined decades of classical recordings. The Decca tree is, in essence, a three-channel spaced array, and it is the natural bridge from two-mic stereo to multichannel capture. Its full treatment — including how it extends to five-channel and immersive layouts — belongs to Surround Recording; here it stands as proof that the spaced-AB family scales by adding time-of-arrival channels rather than by changing principle.

8. Choosing a technique by source and goal

With the families and their physics in hand, technique selection becomes a reasoned mapping from source, room, and intended use onto the trade map of Section 1.

Source / goal	Recommended technique	Why
Solo instrument, tight focus, mono-safe (broadcast)	XY or MS	Sharp centre, perfect mono fold
Orchestra in a great hall, max envelopment	Spaced AB / Decca tree	Big LF, enveloping room, depth
Chamber ensemble, natural single image	ORTF (near-coincident)	Balanced width, depth, good mono
Vocal/instrument needing post-adjustable width	MS	Width set after the session
Most accurate front-stage imaging in a good room	Blumlein	Near-linear angle mapping over ±45°
Drum overheads, mono-critical	XY or MS	No phase issues when summed
Ambience/room pair for later mixing	Wide spaced omnis	Decorrelated, spacious
Film/TV dialogue and effects	MS	Mono-compatible, steerable width

Three questions resolve most decisions. How wide is the source, and how must it map to the speakers? — set the SRA via family, pattern, angle, and spacing. How important is envelopment versus pin-point imaging? — slide along the coincident-to-spaced axis. Will this ever be heard in mono? — if yes, lean coincident (ideally MS); if never, spaced is free to be as wide as you like.

9. Worked example: predicting image position for XY, ORTF, and AB

Consider a single source at $\theta = 30^\circ$ off the centre axis, and predict where it images for three techniques. We use the same heuristics throughout: an image reaches the hard-left/right speaker at about $\Delta L \approx 16\ \mathrm{dB}$ or about $\Delta t \approx 1.1\ \mathrm{ms}$ , and the perceived position scales roughly linearly with the cue up to that limit.

XY (two cardioids, $90^\circ$ , $\alpha=45^\circ$ ). No time difference. Gains: $g_R = 0.5+0.5\cos(30^\circ-45^\circ) = 0.5+0.5\cos(-15^\circ) = 0.983$ ; $g_L = 0.5+0.5\cos(30^\circ+45^\circ) = 0.5+0.5\cos 75^\circ = 0.629$ . $\Delta L = 20\log_{10}(0.983/0.629) = 3.9\ \mathrm{dB}$ . As a fraction of the $\approx 16\ \mathrm{dB}$ full-scale, the image sits about $3.9/16 \approx 24\%$ of the way from centre to the right speaker — i.e. only modestly right of centre.

ORTF (two cardioids, 17 cm, $110^\circ$ , $\alpha=55^\circ$ ). Both cues act. Level: $g_R = 0.5+0.5\cos(30^\circ-55^\circ) = 0.5+0.5\cos(-25^\circ) = 0.953$ ; $g_L = 0.5+0.5\cos(30^\circ+55^\circ) = 0.5+0.5\cos 85^\circ = 0.544$ . $\Delta L = 20\log_{10}(0.953/0.544) = 4.9\ \mathrm{dB}$ . Time: with $d=0.17\ \mathrm{m}$ , $\Delta t = 0.17\sin 30^\circ/343 = 0.085/343 = 0.248\ \mathrm{ms}$ . So the right channel is both ~5 dB louder and ~0.25 ms earlier. The level cue alone gives $\approx 31\%$ of full deflection; the time cue ( $0.248/1.1 \approx 23\%$ ) reinforces it. The cues combine to place the source clearly to the right — noticeably further out than XY put it, and with a more "present" quality from the added time difference. ORTF's recording angle of $\sim \pm 48^\circ$ means a $30^\circ$ source images roughly $30/48 \approx 63\%$ of the way to the right speaker, consistent with the combined cues.

Spaced AB (two omnis, $d=0.5\ \mathrm{m}$ ). Essentially no level difference (omnis): $\Delta L \approx 0\ \mathrm{dB}$ . Time only: $\Delta t = 0.5\sin 30^\circ/343 = 0.25/343 = 0.73\ \mathrm{ms}$ . Against the $\approx 1.1\ \mathrm{ms}$ full-deflection figure that is about $66\%$ — the source images well to the right, further out than XY and comparable to or beyond ORTF, but with a softer, less precisely defined edge because the cue is purely temporal and frequency-dependent.

The three techniques thus place the same physical source at three different apparent positions — XY conservatively near centre, ORTF and AB further right — and with three different "feels": XY sharp and modest, ORTF tight and present, AB wide and diffuse. This single comparison crystallises the entire chapter: the technique you choose literally rewrites the geometry of the reproduced scene, because each encodes direction into a different cue at a different rate.

Same source, different scene

A microphone technique is a transfer function from physical angle to reproduced angle. Two pairs aimed identically at the same orchestra will hand the listener different stage widths, different image sharpness, and different envelopment. There is no "neutral" pair — only pairs whose mapping suits, or fails to suit, your source and intent.

10. Common mistakes and pitfalls

Comb filtering from spacing. Covered above and worth repeating: any two mics on one source comb-filter when summed. Honour the 3-to-1 rule, audition mono, and beware of wide spaced pairs in mono-critical chains.

Wrong SRA — bunched or stretched stage. If the ensemble images squeezed into the middle, the SRA is too wide: narrow it (more splay, tighter pattern, or move in). If the ensemble's edges pile up "in the speakers" with a hole in the middle, the SRA is too narrow: widen it (less splay, wider pattern, move back). Consult Williams' SRA tables rather than guessing; SRA depends jointly on pattern, angle, and spacing.

Off-axis colouration. Real directional microphones are only on-pattern near their axis; off-axis, especially above a few kHz, cardioids lose top end and become coloured. In coincident pairs the side of the stage is captured off-axis, so a $90^\circ$ XY can sound dull at the edges. Use microphones with well-behaved off-axis response, and don't splay wider than the capsule's off-axis quality supports.

The "hole in the middle". Spaced pairs (and over-wide near-coincident pairs) with weak central definition leave a gap between the speakers. Add a centre mic (toward the Decca tree), reduce spacing, or change technique.

Polarity and Side orientation in MS. If the figure-of-eight Side is flipped, left and right swap; if the matrix polarity is wrong, the image inverts. Always verify that a source moved physically to the left appears on the left after decoding, and label which lobe of the eight faces left.

Vertical non-coincidence in "coincident" pairs. XY and Blumlein capsules are stacked vertically, so they are coincident only in the horizontal plane; sources well above or below the array do incur small time differences. Keep the array's vertical axis sensible and don't expect perfect coincidence for height.

Ignoring the room. Especially with Blumlein (rear lobes) and spaced omnis (full room pickup), the room is part of the instrument. A technique that is glorious in a fine hall can be a liability in a dead or noisy space.

11. Limits of two-microphone stereo

Two channels, however cleverly encoded, have intrinsic limits that motivate the multichannel and Ambisonic methods of the chapters that follow.

Front-stage only. A stereo pair maps a roughly $±45{-}60^\circ$ frontal arc onto the speaker base. It cannot reproduce sources behind or to the side of the listener, nor true height — those require surround recording, immersive 3D recording, or ambisonic recording.
Sweet-spot dependence and the cue conflict. Phantom images are stable only near the symmetry line between the speakers; off-centre, the precedence effect collapses the image to the nearer speaker. And because capture techniques mix level and time cues that the playback geometry may not perfectly invert, the reproduced scene is always an approximation of the captured field, not a reconstruction of it — unlike the physically-grounded reconstruction aimed at by Wave Field Synthesis or high-order Ambisonics.
The accuracy/envelopment/mono trilemma. You cannot have maximal imaging accuracy, maximal envelopment, and perfect mono compatibility from one pair — the families exist precisely because improving one degrades another. Multichannel capture relaxes the trilemma by giving direction more channels to live in.
No true depth coding. Stereo conveys depth only indirectly, through level, the direct-to-reverberant ratio, and high-frequency air loss (see Distance and Air and Reverberation). It has no explicit distance channel.

These limits are not failures of craft; they are the boundary of the two-channel format itself. Everything that follows in Part IV is, in one way or another, an attempt to push past them while preserving the hard-won encode/decode insight of the stereo pair: direction becomes a level cue, a time cue, or both — and the choice is yours to make.

References

Eargle, J. (2004). The Microphone Book, 2nd ed. Focal Press. — Comprehensive treatment of polar patterns, coincident/near-coincident/spaced techniques, and MS matrixing.
Bartlett, B. & Bartlett, J. (2016). Practical Recording Techniques, 7th ed. Focal Press. — Practical setup, 3-to-1 rule, spaced and near-coincident arrays.
Williams, M. (1987). "Unified Theory of Microphone Systems for Stereophonic Sound Recording." AES 82nd Convention, preprint 2466. — The standard SRA framework relating pattern, angle, and spacing to recording angle.
Rumsey, F. (2001). Spatial Audio. Focal Press. — Spatial cues, capture/reproduction duality, and microphone-array theory.
Theile, G. (1991). "On the Performance of Two-Channel and Multi-Channel Stereophony." AES 88th Convention. — Association model of localization and the basis of near-coincident imaging.
Blumlein, A. D. (1931). British Patent 394,325 ("Improvements in and relating to Sound-transmission, Sound-recording and Sound-reproducing Systems"). — The founding patent of coincident stereophony and the Blumlein pair.
Streicher, R. & Dooley, W. (1985). "Basic Stereo Microphone Perspectives — A Review." Journal of the Audio Engineering Society, 33(7/8), 548–556. — Concise comparison of the major stereo techniques.
Howie, W., King, R. & Martin, D. (2016). Studies on main-microphone arrays for surround and 3D recording. AES Conventions. — Extension of stereo-array principles to surround and immersive capture.

← Back to Recording

1. The three families and their trade map​

2. Coincident XY: angled directional pairs​

How the patterns encode direction​

Worked example: SRA of a 90° cardioid XY​

Strengths and weaknesses​

3. Coincident Blumlein: crossed figure-of-eights​

The elegant symmetry​

Rear pickup and the room​

4. Mid/Side (MS): the matrixable coincident pair​

The matrix​

Continuously adjustable width​

Mono compatibility is automatic​

Worked example: width and the effective pattern​

Double M/S: extending the matrix to surround​

5. Near-coincident pairs: ORTF, NOS, DIN​

Why the small spacing matters​

The named setups​

6. Spaced AB: pure time-difference stereo​

The geometry of time-of-arrival stereo​

Why it is enveloping but imprecise​

The 3-to-1 rule for multi-mic situations​

7. Spaced arrays and the Decca-tree preview​

8. Choosing a technique by source and goal​

9. Worked example: predicting image position for XY, ORTF, and AB​

10. Common mistakes and pitfalls​

11. Limits of two-microphone stereo​

References​