Principles of Spatial Capture

Everything in this guide so far has been about the output side: how loudspeakers and headphones recreate direction, width, and depth, and how the ear decodes those cues. Part IV turns the chain around. Before a sound field can be reproduced, it has to be captured, and the way we place and combine microphones determines how much spatial information actually makes it onto the recording. This chapter lays the physical and perceptual groundwork that every later recording chapter builds on — stereo microphone techniques, surround recording, immersive 3D recording, and ambisonic recording.

The central idea is a duality. Reproduction systems encode a direction into a set of loudspeaker signals; the listener's ears decode it back into a perceived direction. Microphone arrays do exactly the same thing in reverse: they encode the direction of an incoming wavefront into differences between channels, so that the playback system can later decode those differences into a phantom image. Capture and reproduction are two halves of one transmission system, and they speak the same language — the language of interchannel level differences and interchannel time differences. Get the encoding wrong at the microphone and no amount of clever decoding downstream can recover the missing spatial cue.

What Spatial Capture Is Trying To Do

A real acoustic event — an orchestra in a hall, a quartet in a church, a thunderstorm over a valley — radiates a continuous, three-dimensional sound field. At any single point in space, that field collapses to a single time-varying pressure (a scalar) plus a particle-velocity vector. A microphone diaphragm samples some projection of this field at one location. The art of spatial capture is choosing where to sample, what directional weighting to apply, and how many samples to take, so that when the captured channels are played back the listener's auditory system reconstructs a convincing spatial scene.

What does "convincing" mean? Recall from psychoacoustics that the human auditory system localizes sound primarily through two binaural cues — the interaural time difference (ITD) and the interaural level difference (ILD) — supplemented by spectral (pinna) cues for elevation and front/back disambiguation. These are the duplex cues: ITD dominates below roughly 1.5 kHz, ILD above. A recording technique succeeds to the extent that, on playback, it delivers level and time differences to the listener's two ears that are consistent with a source at the intended direction.

This is the key point that ties the whole of Part IV together: a microphone array encodes a source direction as interchannel differences, and the playback system converts those interchannel differences into the interaural differences the ear actually reads. A coincident microphone pair, for example, encodes direction purely as a level difference between two channels; played back over a stereo pair, that interchannel level difference becomes a phantom image via summing localization (see stereo and amplitude panning). The microphone is, in effect, a panner driven by physics instead of by a knob.

Three things we want to capture

A complete spatial recording aims to preserve three perceptual attributes:

Direction — the apparent angle of each source (azimuth, and in 3D also elevation). Encoded as interchannel level and/or time differences.
Width / spaciousness — the apparent breadth of sources and of the ensemble (Apparent Source Width, ASW), driven mainly by early lateral reflections and by low interchannel correlation. See direct, diffuse and envelopment.
Depth / distance — the sense of how far away each source is, carried by the direct-to-reverberant ratio, early-reflection delay, and high-frequency air absorption.

Key takeaway

A microphone array is a spatial encoder. Its job is to translate the geometry of a real sound field into interchannel level and time differences that a reproduction system can decode back into ITD and ILD at the listener's ears. Everything in this part is a variation on that single idea.

Microphone Polar Patterns From First Principles

The directional behaviour of a microphone — its polar pattern — is the single most important property for spatial capture, because it controls how strongly a microphone weights sound arriving from each direction. Polar patterns arise from how the diaphragm is exposed to the sound field.

A pressure transducer has its diaphragm open on one side only and sealed on the other. It responds to the scalar sound pressure, which is the same from every direction (at least for wavelengths long compared with the capsule). The result is an omnidirectional pattern: equal sensitivity at all angles. A pressure-gradient transducer exposes both sides of the diaphragm to the field; the diaphragm responds to the difference in pressure across its two faces, which is proportional to the component of particle velocity along the capsule axis. Because particle velocity is a vector, the gradient response varies as $\cos\theta$ , where $\theta$ is the angle from the front axis. This is the figure-of-eight (bidirectional) pattern: maximum on axis front and rear, a deep null at $\pm 90^\circ$ , and opposite polarity front versus rear.

The first-order pattern equation

Every first-order microphone pattern is a weighted sum of a pressure (omni) component and a pressure-gradient (figure-of-eight) component. If we normalize so that the on-axis ( $\theta = 0$ ) sensitivity equals 1, the directional sensitivity is

p(\theta) = a + b\cos\theta, \qquad a + b = 1, \qquad 0 \le a \le 1.

Here $a$ is the omni (pressure) fraction and $b$ the figure-of-eight (gradient) fraction. Sliding $a$ from 1 down to 0 sweeps continuously through the entire family of first-order patterns:

$a = 1,\ b = 0$ → omnidirectional, $p(\theta) = 1$ .
$a = 0.5,\ b = 0.5$ → cardioid, $p(\theta) = 0.5 + 0.5\cos\theta = \cos^2(\theta/2)$ , with a single null at the rear ( $180^\circ$ ).
$a = 0,\ b = 1$ → figure-of-eight, $p(\theta) = \cos\theta$ , nulls at $\pm 90^\circ$ .

Between cardioid and figure-of-eight lie the supercardioid ( $a \approx 0.37$ ) and hypercardioid ( $a \approx 0.25$ ), which trade a small amount of rear pickup for a tighter front lobe and nulls moved off the rear axis. The angle of the null is found by setting $p(\theta_{\text{null}}) = 0$ :

\cos\theta_{\text{null}} = -\frac{a}{b} = -\frac{a}{1-a}.

For the hypercardioid ( $a = 0.25$ ), $\cos\theta_{\text{null}} = -1/3$ , so $\theta_{\text{null}} \approx 109.5^\circ$ — the deepest rejection sits about $70^\circ$ off the back, with a small rear lobe behind it.

Directivity factor and rejection

Two figures of merit summarise a pattern. The directivity factor $Q$ is the ratio of on-axis power sensitivity to the average over all directions; the directivity index is $\text{DI} = 10\log_{10} Q$ in dB. A higher $Q$ means a tighter pattern that picks up proportionally more direct sound relative to diffuse (reverberant) sound — which, as we will see, lets you sit the microphone farther from the source for a given direct-to-reverberant ratio. The random-energy efficiency (REE) $= 1/Q$ is the fraction of diffuse-field energy picked up relative to an omni.

Pattern	$a$	$b$	Null angle	Directivity factor $Q$	DI (dB)	Rear pickup (180°)
Omnidirectional	1.00	0.00	none	1.0	0.0	0 dB (full)
Wide cardioid	0.75	0.25	none	1.7	2.3	−6 dB
Cardioid	0.50	0.50	180°	3.0	4.8	−∞ (null)
Supercardioid	0.37	0.63	≈126°	3.7	5.7	−11.7 dB
Hypercardioid	0.25	0.75	≈110°	4.0	6.0	−6 dB
Figure-of-eight	0.00	1.00	±90°	3.0	4.8	0 dB (rear lobe)

Two patterns share the maximum first-order directivity factor of about 4: the hypercardioid has the highest front-to-total ratio, while the supercardioid has the best front-to-back ratio (most energy rejected from the rear hemisphere). The figure-of-eight and cardioid both have $Q = 3$ but distribute their rejection very differently — the cardioid kills the rear, the eight kills the sides.

These are the same directivity ideas that describe loudspeakers and instruments in source directivity: a polar pattern is a polar pattern whether the device radiates or receives. By reciprocity, a microphone's pickup pattern and the radiation pattern of the same transducer used as a source are identical.

Patterns drift with frequency

The clean $a + b\cos\theta$ model is an idealization. Real gradient capsules become more omnidirectional at low frequencies and beam (narrow) at high frequencies as the wavelength approaches the capsule diameter. A "cardioid" is genuinely cardioid only over a band in the midrange. This frequency-dependence of the polar pattern is a major reason coincident pairs can sound subtly different from their textbook predictions, and why off-axis coloration matters so much in spatial recording.

The Two Ways To Encode Direction

There are exactly two physical quantities a microphone array can use to register the direction of an incoming wavefront, and they map one-to-one onto the duplex cues of hearing.

Intensity (level) encoding. If two microphones sit at the same point but face different directions, a wavefront from angle $\theta$ reaches both capsules at the same instant — there is no time difference — but the two patterns weight it differently, producing an interchannel level difference (ICLD). This is the principle of coincident (intensity) techniques such as XY and Blumlein. Direction is encoded entirely as level. On playback over loudspeakers, the level difference becomes a phantom image through summing localization, which is essentially an ILD-like cue.

Time encoding. If two microphones are spaced apart, a wavefront from an off-axis angle reaches the nearer microphone first, producing an interchannel time difference (ICTD) — and, if the microphones are directional, a level difference too. This is the principle of spaced (time-difference) techniques such as a spaced omni pair (AB). On playback, the interchannel time difference translates (imperfectly) into an ITD-like cue, and also generates strong low-frequency decorrelation that the ear reads as spaciousness.

The deep symmetry is this: the ear localizes with ITD + ILD; microphone arrays encode with ICTD + ICLD. Coincident techniques use only ICLD; spaced techniques use mostly ICTD (plus some ICLD if directional). Most real-world setups live somewhere on the continuum between these poles.

Near-coincident: the practical middle

Near-coincident techniques place two directional microphones a small distance apart (a few centimetres up to ~30 cm) and angle them apart. They therefore generate both an ICLD (from the angling and pattern) and an ICTD (from the spacing) for off-axis sources. The two cues reinforce each other, giving a more stable, naturally wide image than either pure approach. The classic near-coincident pairs — ORTF (two cardioids, 17 cm apart, splayed 110°), NOS, DIN, Faulkner — are essentially engineered points on the intensity-time continuum, tuned so that the combined ICLD and ICTD steer the image the way the ear expects. We dissect them in stereo microphone techniques.

Class	Spacing	Direction cue	Mono fold-down	Spaciousness	Image stability
Coincident (XY, Blumlein, MS)	0 cm	Level only (ICLD)	Excellent	Modest	Very high, precise
Near-coincident (ORTF, NOS, DIN)	5–30 cm	Level + time	Good	Good	High
Spaced (AB omni)	40–300+ cm	Time-dominant (ICTD)	Poor (comb risk)	Very high, enveloping	Looser, "billowy"

Mapping cues to playback

This same intensity-versus-time split runs through the reproduction chapters. Amplitude panning (amplitude panning) is pure level-difference encoding — the playback dual of a coincident microphone. Time-delay and decorrelation methods are the dual of spaced microphones. Capture and playback are mirror images of one encode/decode chain.

Coincidence and Why It Matters

A pair is coincident when both capsules occupy the same point in space, so that any wavefront arrives at both simultaneously. In practice this means stacking two capsules vertically as close as mechanically possible, one directly above the other, so that horizontal arrival times are identical.

Coincidence has one enormous consequence: because the two channels carry no time difference, summing them to mono produces no comb filtering. The signals are time-aligned, so $L + R$ simply reinforces; the tonal balance in mono is essentially the same as in stereo. This is why coincident techniques are described as mono-compatible, and why they were historically favoured for broadcast, where a large fraction of listeners heard a mono fold-down.

The price of coincidence is reduced spaciousness. With no time difference and highly correlated low-frequency content, a coincident pair produces a tighter, more "in-the-box" image with less of the enveloping, decorrelated low end that makes spaced arrays sound expansive. The image is precise and stable but can feel narrow and acoustically "dry" relative to the room. Coincident arrays also pick up less of the uncorrelated diffuse field that the ear reads as envelopment.

The Blumlein pair as the limiting case

The purest intensity technique is the Blumlein pair: two figure-of-eight microphones coincident and crossed at $90^\circ$ , so each points $\pm 45^\circ$ from centre. Because $p(\theta) = \cos\theta$ for an eight, a source at angle $\phi$ from the centre line produces channel sensitivities $\cos(45^\circ - \phi)$ and $\cos(45^\circ + \phi)$ , an essentially sinusoidal level law that maps the front $\pm 45^\circ$ stage onto the full stereo width with excellent linearity. Alan Blumlein patented this and the entire stereo concept in 1931 — the foundational moment of spatial recording (British Patent 394,325). The rear quadrant is also captured, with reversed left/right, picking up the room behind the array; this is part of the Blumlein "sound" but demands a good room.

Spaced pairs, by contrast, fold down less cleanly because their interchannel time differences become comb-filter notches when summed — the subject of a later section. The trade is fundamental: coincidence buys mono compatibility at the cost of envelopment; spacing buys envelopment at the cost of mono compatibility.

Rule of thumb

If your delivery includes a mono fold-down (broadcast, club PA, mobile speakers), bias toward coincident or near-coincident techniques. If the deliverable is strictly multichannel for attentive listening and you want maximum envelopment, spacing earns its keep — just monitor the mono sum so you know what you are risking.

Path-Difference Geometry for Spaced Microphones

When microphones are spaced, geometry sets the interchannel time difference. Consider two microphones separated by a distance $d$ along a baseline. A plane wavefront (a distant source, so the rays are parallel) arrives from an angle $\theta$ measured from the perpendicular bisector of the baseline (so $\theta = 0$ is dead centre, broadside to the array). The wavefront reaches the nearer microphone first; it must then travel an extra path

\Delta d = d\sin\theta

to reach the farther microphone. Dividing by the speed of sound $c \approx 343\ \text{m/s}$ (at 20 °C) gives the interchannel time difference

\Delta t = \frac{\Delta d}{c} = \frac{d\sin\theta}{c}.

This is the spaced-microphone analogue of the geometry the head itself imposes to create ITD — the baseline $d$ plays the role of the interaural distance. The maximum time difference, at $\theta = 90^\circ$ (source in line with the baseline), is $\Delta t_{\max} = d/c$ .

Worked example: 30 cm spacing at 45°

Take a spaced pair with $d = 0.30\ \text{m}$ and a source at $\theta = 45^\circ$ :

\Delta d = 0.30 \times \sin 45^\circ = 0.30 \times 0.707 = 0.212\ \text{m},

\Delta t = \frac{0.212}{343} = 6.18 \times 10^{-4}\ \text{s} = 0.62\ \text{ms} = 618\ \mu\text{s}.

For comparison, the human head produces a maximum ITD of roughly $650\ \mu\text{s}$ . So a 30 cm spaced pair at $45^\circ$ generates an interchannel time difference comparable to a full natural ITD — which is exactly why even a modest spacing creates a vivid, enveloping sense of stereo width once those channels feed two loudspeakers. The catch, addressed below, is that the same $0.62\ \text{ms}$ delay becomes a comb filter the instant the two channels are summed.

From time difference back to perceived position

On playback over two loudspeakers, an interchannel time difference is perceived through the precedence effect and summing localization: differences up to roughly $1$ – $1.5\ \text{ms}$ pull the phantom image toward the earlier loudspeaker, with the image reaching the speaker at around $1.0$ – $1.1\ \text{ms}$ for many programme types. Beyond that the precedence effect locks the image hard to the leading side. This non-linear, frequency-dependent mapping is why spaced-pair imaging is looser and harder to predict than the clean level-law mapping of a coincident pair — but it is also why spaced arrays sound "big". The relationship between $\Delta t$ and perceived angle is the capture-side counterpart of the playback localization curves discussed in stereo.

Recording Angle and the Stereo Recording Angle (SRA)

A microphone array does not map the entire surrounding scene onto the full stereo image. Instead, there is a specific range of source angles — the Stereo Recording Angle (SRA), also called the recording angle or acceptance angle — that is mapped onto the full distance between the two loudspeakers. Sources inside the SRA appear within the stereo stage; a source exactly at the edge of the SRA appears at a loudspeaker; sources outside the SRA collapse into (or beyond) the nearest speaker and lose positional information.

The SRA is the capture-side dual of the loudspeaker base angle. If your SRA is $\pm 50^\circ$ ( $100^\circ$ total) and you point that array at an orchestra that subtends $100^\circ$ from the microphone position, the orchestra will just fill the loudspeakers — left edge at the left speaker, right edge at the right. If the SRA is narrower than the ensemble, the outer players pile up at the speakers and the image is exaggerated and unstable. If the SRA is wider than the ensemble, the orchestra shrinks into a narrow patch between the speakers, leaving "holes" of empty stage at the sides.

What sets the SRA

The recording angle is determined by how quickly the array's interchannel differences grow as a source moves off-centre — i.e. by the combination of polar pattern, mutual angle, and spacing:

For a coincident pair, the SRA depends on the pattern and the included angle. Widening the angle between two cardioids, or using a tighter pattern (hyper instead of cardioid), makes the ICLD grow faster with source angle and therefore narrows the SRA. The Blumlein pair has an SRA of about $\pm 45^\circ$ .
For a spaced/near-coincident pair, increasing the spacing makes the ICTD grow faster and narrows the SRA, while increasing the splay angle adds ICLD and narrows it further. ORTF's 17 cm / 110° geometry yields an SRA of roughly $\pm 90^\circ$ (about $180^\circ$ total — though commonly quoted nearer $96^\circ$ total for a tightly defined image), well suited to a wide ensemble in a smaller room.

Michael Williams systematized this in his work on microphone array analysis, publishing curves and tables ("Microphone Arrays for Stereo and Multichannel Sound Recording") that let an engineer choose spacing and angle to hit a target SRA. The practical recipe: measure (or estimate) the angular width of your source as seen from the planned microphone position, then choose an array whose SRA matches it.

Worked example: matching SRA to an ensemble

A chamber orchestra is $9\ \text{m}$ wide. You plan to place the main pair $6\ \text{m}$ back, on the centre line. The half-angle subtended is

\theta_{1/2} = \arctan\!\left(\frac{9/2}{6}\right) = \arctan(0.75) = 36.9^\circ,

so the ensemble subtends about $\pm 37^\circ$ , i.e. $74^\circ$ total, from the array. To fill the loudspeakers without exaggeration you want an array with an SRA near $74^\circ$ . A Blumlein pair (SRA $\approx 90^\circ$ ) would leave the orchestra slightly narrow; an ORTF-style near-coincident pair adjusted to a $\sim 75^\circ$ SRA (a little more spacing or splay than standard ORTF) would map the ensemble neatly across the stage. If you instead moved closer — say $4\ \text{m}$ back — the subtended angle jumps to $\pm 48^\circ$ ( $96^\circ$ total), now demanding a narrower-SRA array, and you would also pick up far more direct sound relative to the room (next section).

Setting up by SRA, not by habit

Do not memorise "ORTF = good." Memorise: find the angular width of the source from the mic position, then pick the array whose SRA matches it. The same physical array gives a different image at $4\ \text{m}$ than at $8\ \text{m}$ because the source subtends a different angle. SRA matching is the single most reliable predictor of whether your stereo image will be correctly proportioned.

Distance, Direct-to-Reverberant Ratio, and Mic Placement

Where you place the array along the front-to-back axis controls the direct-to-reverberant ratio (DRR), which the ear reads as distance and depth. This is the spatial-capture application of distance and air absorption and reverberation.

In a room, the direct sound from a source obeys the inverse-square law: its level falls by $6\ \text{dB}$ per doubling of distance, with pressure $\propto 1/r$ . The reverberant (diffuse) field, by contrast, is roughly uniform throughout the room. There is therefore a distance at which the two are equal — the critical distance (or reverberation radius) $r_c$ :

r_c \approx 0.057\,\sqrt{\frac{Q\,V}{T_{60}}}\quad(\text{metres, with }V\text{ in }\text{m}^3,\ T_{60}\text{ in s}),

where $Q$ is the source directivity factor, $V$ the room volume, and $T_{60}$ the reverberation time. Inside $r_c$ the direct sound dominates (close, dry, detailed); outside $r_c$ the reverberant field dominates (distant, washed, enveloping). The main pair almost always belongs near or just inside the critical distance, where the DRR gives a natural sense of the room without burying the sources.

Worked example: where is critical distance?

A concert hall has $V = 15\,000\ \text{m}^3$ and $T_{60} = 2.0\ \text{s}$ . Treating the ensemble as roughly omnidirectional radiators ( $Q \approx 1$ ):

r_c \approx 0.057\sqrt{\frac{1 \times 15{,}000}{2.0}} = 0.057\sqrt{7500} = 0.057 \times 86.6 \approx 4.9\ \text{m}.

So the critical distance is about $5\ \text{m}$ . A main pair placed at $4$ – $6\ \text{m}$ will sit right around the balance point — a classic, natural orchestral perspective. Push the pair back to $12\ \text{m}$ and you are well into the reverberant field: distant, diffuse, and likely too wet. Pull it in to $2\ \text{m}$ and you get a dry, close, detailed but spatially flat image with little hall. Note that a directional source (a brass section blasting forward, $Q \approx 5$ ) more than doubles its own critical distance, which is why solo instruments and the main pickup can need very different distances even in the same room.

The two-microphone reach of directivity

Because the critical-distance formula scales with the array pattern's reach too, a more directional main microphone effectively lets you sit farther from the source for the same DRR. A pair of cardioids ( $Q = 3$ ) reaches roughly $\sqrt{3} \approx 1.7\times$ farther than omnis for the same direct/diffuse balance, which is precisely why cardioid main pairs are common in reverberant churches and omni pairs in drier studios. The pattern choice is thus simultaneously a directional-encoding choice and a depth/distance choice — the two cannot be decoupled.

Don't fix distance with faders

The direct-to-reverberant ratio is baked in at the moment of capture by where the microphone sits. You cannot meaningfully add "closeness" later by turning a fader up — you only make the whole (direct + reverberant) signal louder, leaving the ratio, and thus the perceived distance, unchanged. Get the array distance right on the day. This is one reason spot/accent microphones exist: to re-inject direct detail that a distant main pair cannot, see surround recording.

Mono Compatibility and the Comb-Filter Trap

We now return to the great weakness of spaced techniques: what happens when their channels are summed. Whenever two copies of the same signal are added with a time delay $\Delta t$ between them, the sum exhibits comb filtering — a series of evenly spaced cancellation notches and reinforcement peaks across the spectrum. This happens on any mono fold-down, but also subtly within a multichannel mix wherever two channels carrying delayed copies of the same source overlap.

The physics: a delay $\Delta t$ corresponds to a phase shift $2\pi f \Delta t$ that grows with frequency. At frequencies where the delay equals a half-period (or odd multiple thereof), the two copies are in anti-phase and cancel. The first notch sits at

f_{\text{notch},1} = \frac{1}{2\,\Delta t} = \frac{c}{2\,\Delta d},

with further notches at every odd multiple, $f_n = (2n-1)/(2\Delta t)$ , and reinforcement peaks (combs of $+6\ \text{dB}$ ) at integer multiples $f = n/\Delta t$ . The spacing between successive notches is $1/\Delta t$ . The deeper the two signals' correlation and the closer their levels, the deeper the notches — full cancellation for equal-level copies.

Worked example: comb notches from a 30 cm pair

Recall the 30 cm spaced pair with a source at $45^\circ$ : $\Delta d = 0.212\ \text{m}$ , $\Delta t = 0.62\ \text{ms}$ . Summed to mono, the first cancellation lands at

f_{\text{notch},1} = \frac{c}{2\,\Delta d} = \frac{343}{2 \times 0.212} = \frac{343}{0.424} \approx 809\ \text{Hz},

with further notches at $\approx 2.43\ \text{kHz}, 4.05\ \text{kHz}, 5.66\ \text{kHz}, \dots$ — a notch every $1/\Delta t \approx 1.62\ \text{kHz}$ . An 809 Hz notch sits squarely in the most audible part of the spectrum; in mono this off-centre source would sound hollow and phasey. A coincident pair, with $\Delta t = 0$ , pushes that first notch to infinity — there is no notch at all, which is the mathematical statement of "mono-compatible".

This is why mono compatibility and spaciousness pull in opposite directions. The same interchannel time differences that decorrelate the low end and create envelopment in stereo become comb filters on summation. There is no free lunch: you choose where on the continuum to sit based on the deliverable.

Always monitor the mono sum

The single most common way to ruin an otherwise beautiful spaced recording is to never check it in mono. A recording that is gorgeous and wide on a calibrated stereo system can turn thin, hollow, and phasey the moment a broadcaster, a club, or a phone sums it. Make pressing the mono button a reflex during setup, before the take, and adjust spacing or move toward a near-coincident geometry if the sum collapses.

Interchannel correlation as a meter

A practical way to quantify all of this is the interchannel cross-correlation coefficient, ranging from $+1$ (identical channels, mono, no width) through $0$ (uncorrelated, maximally spacious) to $-1$ (anti-phase, which cancels in mono). Coincident pairs sit near high positive correlation (safe in mono, narrower); spaced pairs drive correlation toward zero at low frequencies (wide and enveloping, risky in mono). A correlation meter on the main pair is the engineer's early-warning system for both envelopment and mono trouble. The same correlation logic governs decorrelation in transaural and binaural rendering on the playback side.

A Short History of the Two Schools

The intensity-versus-time split is not just a textbook taxonomy; it is a genuine historical rivalry. In 1931, Alan Blumlein at EMI patented stereophony, the figure-of-eight crossed pair, and the "shuffling" circuit that converts spaced-omni signals into level differences — the entire intensity school in one document. He reasoned, correctly, that for loudspeaker reproduction a level difference between channels would translate into the ITD/ILD the ears need, and that a coincident pair would be inherently mono-compatible.

In parallel, Bell Labs (Steinberg and Snow, 1934) pursued the spaced "wall of microphones" approach — an array of spaced pickups feeding a matching array of loudspeakers, conceptually a sampled wavefront and the spiritual ancestor of Wave Field Synthesis. Spaced-omni recording grew from this lineage and dominated American classical recording (the Mercury "Living Presence" and RCA "Living Stereo" three-mic Decca/spaced approaches) for decades, prized for its airy, enveloping sound.

The near-coincident compromise was codified in 1960 by French and Dutch broadcasters: ORTF (Office de Radiodiffusion-Télévision Française) and NOS (Nederlandse Omroep Stichting) each fixed a spacing-plus-angle geometry to get the best of both worlds for radio, where some mono listeners remained. Michael Gerzon's Ambisonics (1970s) then reframed coincident capture as the sampling of a sound field's spherical-harmonic components — the subject of ambisonic recording — generalizing the Blumlein idea to full 3D. The arc from 1931 to today is essentially a continuing negotiation between the level-difference and time-difference schools, now extended into surround and 3D.

Putting It Together: Practical Workflow Basics

The principles above translate into a short, repeatable setup procedure that the later chapters elaborate technique by technique.

A setup checklist

Identify the deliverable and its fold-downs. Stereo only? Surround? Will anyone hear mono? This decides how far toward the spaced end of the continuum you can safely go.
Measure the source's angular width from candidate microphone positions, and estimate the critical distance from the room volume and reverberation time. These two numbers fix your distance and your target SRA.
Choose pattern + geometry to hit the SRA and the desired direct-to-reverberant ratio. Tighter patterns and more distance trade detail for room; spacing and splay trade mono safety for envelopment.
Place the main pair, then audition — in stereo and in mono. Walk the correlation meter. Move the array a few tens of centimetres and listen to how depth and width shift; small moves matter near the critical distance.
Add spot/accent and room/ambience microphones only after the main pair is right. They refine balance and re-inject detail or envelopment; they cannot rescue a badly placed main pair. Time-align spots to the main pair to avoid introducing new comb filtering.
Document everything — positions, patterns, angles, distances — so a good result is repeatable and a bad one is diagnosable.

What the later chapters build on

Everything here is foundational scaffolding:

Stereo microphone techniques applies the continuum, SRA, and mono-compatibility ideas to specific two-microphone arrays (XY, Blumlein, MS, ORTF, NOS, spaced AB, Decca tree).
Surround recording extends the front pickup to a rear/ambience array and the spot-microphone logic to a full 5.1/7.1 stage.
Immersive 3D recording adds the height dimension, where decorrelation and pattern choice for the upper layer dominate.
Ambisonic recording recasts coincident capture as spherical-harmonic sampling, a fully steerable, format-agnostic encoding of the whole sound field.

Across all of them the through-line never changes: a microphone array is a spatial encoder whose interchannel level and time differences are the capture-side mirror of the ITD and ILD your ears decode — the same duality that runs through psychoacoustics, the reproduction techniques, and the field-and-room behaviour of direct, diffuse, and envelopment.

The one sentence to remember

Coincident pairs encode direction as level differences and fold down to mono perfectly but sound tighter; spaced pairs encode direction as time differences and sound enveloping but comb-filter in mono; near-coincident pairs blend both. Choose your position on that continuum from the deliverable, the room's critical distance, and the source's angular width.

Limits of This Model

The first-principles picture in this chapter is powerful but idealized, and a working engineer should know where it stops:

Plane-wave assumption. The $\Delta t = d\sin\theta / c$ geometry assumes a distant source so the wavefronts are parallel. For close sources the wavefronts are curved and the near microphone also sees a higher level (proximity-like effects, inverse-square within the array), shifting the encoding partly back toward intensity.
Frequency-dependent patterns. As warned earlier, real capsules are not textbook $a + b\cos\theta$ across the whole band, so SRA and rejection vary with frequency.
Diffuse-field complexity. The clean direct/reverberant split ignores discrete early reflections, which are not diffuse and which dominate Apparent Source Width. Two positions with the same DRR can sound very different because of their early-reflection patterns.
Localization is non-linear. The translation from interchannel differences to perceived angle (the precedence effect, summing localization) is non-linear and frequency-dependent, so an array tuned by geometry still needs to be confirmed by listening.
Two ears, not two channels. Loudspeaker playback delivers both channels to both ears with crosstalk, so the interchannel-to-interaural mapping is itself approximate — the reason binaural and transaural (binaural, transaural) take a different route entirely.

None of these invalidate the framework; they mark the boundary where experience and monitoring take over from calculation. The equations get you to a strong starting position; your ears finish the job.

References

Rumsey, F. Spatial Audio. Focal Press, 2001. (Foundational treatment of spatial cues, microphone techniques, and the intensity/time duality.)
Eargle, J. The Microphone Book, 3rd ed. Focal Press, 2012. (Polar patterns from first principles, coincident and spaced techniques, directivity factor and critical distance.)
Bartlett, B., and Bartlett, J. Practical Recording Techniques, 7th ed. Routledge, 2016. (Hands-on stereo and surround microphone setup, mono compatibility, spot-microphone workflow.)
Williams, M. "Unified Theory of Microphone Systems for Stereophonic Sound Recording." AES 82nd Convention, preprint 2466, 1987. (Systematic derivation of the Stereo Recording Angle from pattern, spacing, and angle.)
Williams, M. Microphone Arrays for Stereo and Multichannel Sound Recording. Editrice Il Rostro, 2004. (SRA tables and array design for stereo and surround.)
Blumlein, A. D. "Improvements in and relating to Sound-transmission, Sound-recording and Sound-reproducing Systems." British Patent 394,325, 1931. (The original stereophony and coincident-pair patent.)
Theile, G. "On the Naturalness of Two-Channel Stereo Sound." Journal of the Audio Engineering Society, vol. 39, 1991. (Association model of localization; basis for near-coincident array design.)
Howie, W., King, R., Martin, D., and Grond, F. "Subjective and Objective Evaluation of 9-Channel Surround Recording Techniques." Journal of the Audio Engineering Society, vol. 67, 2019. (Modern evaluation of immersive main-array techniques and decorrelation.)

← Back to Recording

What Spatial Capture Is Trying To Do​

Three things we want to capture​

Microphone Polar Patterns From First Principles​

The first-order pattern equation​

Directivity factor and rejection​

The Two Ways To Encode Direction​

Near-coincident: the practical middle​

Coincidence and Why It Matters​

The Blumlein pair as the limiting case​

Path-Difference Geometry for Spaced Microphones​

Worked example: 30 cm spacing at 45°​

From time difference back to perceived position​

Recording Angle and the Stereo Recording Angle (SRA)​

What sets the SRA​

Worked example: matching SRA to an ensemble​

Distance, Direct-to-Reverberant Ratio, and Mic Placement​

Worked example: where is critical distance?​

The two-microphone reach of directivity​

Mono Compatibility and the Comb-Filter Trap​

Worked example: comb notches from a 30 cm pair​

Interchannel correlation as a meter​

A Short History of the Two Schools​

Putting It Together: Practical Workflow Basics​

A setup checklist​

What the later chapters build on​

Limits of This Model​

References​