Immersive & 3D Recording

Every technique covered so far in Recording & Capture has lived in a horizontal plane. The stereo pair places phantom images on a frontal arc; surround recording wraps that arc into a full $360^\circ$ ring around the listener. Both encode a single angle: the azimuth $\theta$ of each source relative to the array. Immersive, or "3D", recording adds a second angle — elevation $\phi$ — so that the captured scene fills a hemisphere (and, in principle, a sphere) rather than a disc. This is the input-side dual of the height channels you met in immersive formats and in the object-based and channel-based playback layouts of Techniques. If 7.1.4 reproduction can place a helicopter overhead, then 3D capture must furnish the four overhead signals that put it there.

This chapter builds the height axis from first principles. We start with why height is perceptually different from azimuth — and harder to capture — then develop the dominant practical paradigm (a horizontal main array plus a separate height layer), survey the named 3D arrays in the literature, and confront the central engineering problem of immersive capture: keeping the height layer decorrelated from the ear-level bed so the ceiling does not collapse onto the floor. Throughout, keep the recurring duality of Principles of Spatial Capture in mind: coincident arrays encode direction as level differences, spaced arrays encode it as time differences, and both are simply the recording-side mirror of the ITD and ILD cues your ears decode (see psychoacoustics).

The New Axis: Adding Height to Capture

From a ring to a hemisphere

A horizontal surround array samples the wavefield on a circle in the listener's ear plane. Mathematically, each microphone capsule has a directional response $g(\theta)$ and a position on that circle; the ensemble of capsules samples incoming plane waves as a function of azimuth only. A source at $30^\circ$ left and a source $30^\circ$ left and $30^\circ$ up produce, in such an array, very nearly the same set of inter-channel differences. The array is blind to elevation because none of its sampling baselines have a vertical component.

To capture elevation you must introduce vertical structure: either capsules physically raised above the main layer (the layer-based approach), or capsules whose directivity discriminates between up and down (the higher-order Ambisonic approach, treated in its own chapter). In channel-based immersive recording the dominant method is the former — a height layer of microphones mounted above the main layer that already does the horizontal job.

The target playback layouts make the requirement concrete. A 7.1.4 bed needs four height channels (front-left/right and rear-left/right, the " $.4$ "); a 9.1.6 or 22.2 bed needs six or nine. The recording array's job is to feed those channels signals that (a) carry the correct elevated sources, and (b) remain different enough from the ear-level channels that the listener perceives a vertical extent rather than a louder version of the floor.

Why height is special

There are three reasons the height axis cannot be treated as "just another pair of channels":

No physical baseline for time-of-arrival in the median plane. For azimuth, your two ears straddle the head, giving an interaural baseline of roughly $0.18\ \mathrm{m}$ and a usable ITD. For elevation in the median plane, both ears are equidistant from the source: $\mathrm{ITD}\approx 0$ and $\mathrm{ILD}\approx 0$ . The ear must fall back on spectral cues (pinna filtering), which are weaker, more individual, and more easily masked.
Asymmetric content. Real scenes are bottom-heavy: most direct sound — instruments, voices, footsteps — originates near or below ear level, while the upper hemisphere is dominated by reflections and reverberation. Height capture is therefore mostly about ambience and early reflections, not hard-localized direct sources. This shapes every design decision below.
Fragile cues. Because the height percept rests on subtle spectral and decorrelation cues, it is easily destroyed by the very thing channel-based recording is prone to: correlation between layers. Add a height channel that is a delayed copy of the main channel and the auditory system fuses them by the precedence effect into a single image at ear level. The height vanishes.

Key takeaway

Horizontal capture is a problem of baselines and directivity; height capture is a problem of decorrelation and spectral content. The two layers are engineered to do different jobs, which is why immersive arrays are not simply "surround arrays stacked twice."

The Perceptual Basis: How Height Is Heard

Spectral cues and the pinna

Localization in the median plane is governed largely by direction-dependent spectral filtering by the pinna, head, and torso — the elevation-dependent part of the HRTF. As a source rises, the pinna's folds introduce notches and peaks in the high-frequency spectrum (roughly $5$ – $16\ \mathrm{kHz}$ ) that the brain has learned to read as elevation. The most studied feature is the pinna notch, a spectral null whose center frequency $f_n$ sweeps upward as the source rises from below to above. A useful descriptive relation models the notch as arising from a path-length difference $\Delta\ell(\phi)$ between the direct sound and a pinna reflection:

f_n(\phi) \approx \frac{c}{2\,\Delta\ell(\phi)},

with $c \approx 343\ \mathrm{m/s}$ . For $\Delta\ell$ on the order of a centimeter the first notch sits near $f_n \approx \frac{343}{2\times 0.01} \approx 17\ \mathrm{kHz}$ ; as the path lengthens at lower elevations the notch drops toward $6$ – $8\ \mathrm{kHz}$ . These are fragile, high-frequency, listener-specific features — which is exactly why height is the most individually variable of all spatial cues and why microphone capture of height is so dependent on preserving high-frequency content and decorrelation.

A second important cue is the directional band effect (Blauert's "boosted bands"): broadband or noisy sources are perceived overhead when energy is concentrated near $8\ \mathrm{kHz}$ , in front when near $400\ \mathrm{Hz}$ and $4\ \mathrm{kHz}$ , and behind near $1\ \mathrm{kHz}$ . Height capture therefore lives and dies in the upper octaves: a height array that is dull or band-limited delivers little elevation regardless of its geometry.

Vertical interchannel cues are not the same as horizontal ones

Here is the crucial asymmetry for recording. In the horizontal plane, two loudspeakers reproduce a stable phantom image between them via summing localization, because the ear's ITD/ILD machinery is built for left–right differences. In the vertical plane, two loudspeakers (one at ear level, one above) do not form a clean phantom partway up. Instead, the perceived elevation is dominated by a phenomenon Hyunkook Lee has characterized as the Vertical Image Spread and phantom-image elevation, where the spectral content of the signal — not a simple amplitude ratio between the two speakers — controls perceived height. Panning a source between an ear-level and a height speaker does not move it smoothly upward the way left–right panning moves it sideways; the result is more an increase in vertical spread than a precisely placed elevated image.

The recording consequence is profound: you cannot "pan with height" by feeding correlated, level-scaled copies to the height layer. The height layer must carry genuinely different signal — different reflections, different reverberant build-up, decorrelated noise — for the auditory system to register elevation. This is why the engineering problem of 3D capture reduces, again and again, to managing vertical interchannel coherence.

Why this matters for microphone choice

Because elevation cues live above $\sim 5\ \mathrm{kHz}$ and depend on decorrelation, the height layer rewards microphones with extended, smooth top end and rewards spacing (which decorrelates via time differences) more than it rewards tight coincidence (which preserves correlation). This single fact explains most named-array geometries below.

Layer-Based Capture: A Main Array Plus a Height Layer

The dominant paradigm

The workhorse of channel-based immersive recording is the two-layer array: a complete horizontal main array (a surround array such as OCT, the Decca tree, or a Hamasaki square) feeds the ear-level bed, and a separate set of upward- or forward-facing microphones mounted above it feeds the height layer. The two layers are recorded simultaneously through the same air, so they share the room's reverberant tail and large-scale dynamics, but their geometry is chosen so that the height capsules collect a different mixture of direct sound and reflections.

Two sub-strategies exist for orienting the height capsules:

Upward-facing height mics. Cardioids or supercardioids aimed at the ceiling collect predominantly the upper hemisphere — ceiling reflections and the reverberant field — and strongly reject the direct sound from the stage. This maximizes decorrelation from the main layer (which is aimed at the stage) and is favored for concert-hall and ambience capture.
Forward-facing height mics. Cardioids aimed at the stage, mounted above the main pair, capture the same direct sources but from a different vertical perspective. This keeps the height layer coherent enough to "lift" frontal sources slightly, at the risk of too much correlation (and comb filtering) with the main layer.

In practice many designs split the difference: front height mics tilted partly upward, rear height mics aimed fully up at the hall.

Spacing the height layer above the main pair

The single most important parameter is the vertical spacing $h$ between a height capsule and the main capsule directly below it. This spacing converts elevation into an inter-layer time difference — exactly the spaced-array logic of Principles of Spatial Capture, now rotated into the vertical plane. For a source at elevation $\phi$ arriving at a vertical pair separated by $h$ , the path difference between the lower and upper capsule is

\Delta d = h\,\sin\phi, \qquad \Delta t = \frac{h\,\sin\phi}{c}.

Choosing $h$ is a balance. Too small, and the two layers are nearly coincident: $\Delta t \to 0$ , the height channel is a near-copy of the main channel, coherence is high, and the height collapses by precedence. Too large, and $\Delta t$ grows past the fusion limit, producing audible comb filtering for direct sound (when the layers are correlated) and a smeared, doubled image. The literature converges on roughly $h \approx 0.7$ – $1.5\ \mathrm{m}$ for main arrays, with $1.0\ \mathrm{m}$ a common default.

Worked example — choosing the height spacing. Take $h = 1.0\ \mathrm{m}$ and a ceiling reflection arriving at $\phi = 45^\circ$ . The path difference is

\Delta d = 1.0 \times \sin 45^\circ = 0.707\ \mathrm{m}, \quad \Delta t = \frac{0.707}{343} \approx 2.06\ \mathrm{ms}.

A $2\ \mathrm{ms}$ offset is comfortably inside the precedence-effect fusion window ( $\sim 1$ – $30\ \mathrm{ms}$ ) for reflections, so the height capsule's signal is heard as part of the same event but spatially spread upward — desirable. For a direct frontal source at $\phi = 0^\circ$ , however, $\Delta d = h\sin 0^\circ = 0$ : a forward-facing height mic at the same horizontal position as the main mic would receive the direct sound with no time offset, hence maximal correlation. This is precisely why height mics are tilted up or positioned to look away from the direct source — to break that zero-elevation correlation.

Rule of thumb

Mount the height layer about $1\ \mathrm{m}$ above the main layer and aim the height capsules away from the principal direct sources (toward the ceiling or the hall). You want the height channels rich in reflections and poor in direct sound — that is what reads as "above" without comb filtering.

Named 3D Arrays

Several named arrays formalize the layer-based idea with specific capsule types, angles, and spacings. They differ in how the main layer is built (coincident vs. spaced), how many height channels they target, and how the height layer is decorrelated.

PCMA-3D

The Perspective Control Microphone Array – 3D, developed by Hyunkook Lee, is built explicitly around vertical decorrelation research. Its main layer is a near-coincident arrangement of four supercardioids (front-left, front-right, rear-left, rear-right) whose look directions and small spacings are tuned to a target horizontal recording angle. Above each main capsule sits a vertically stacked supercardioid, typically $1\ \mathrm{m}$ higher, aimed to maximize vertical decorrelation in the critical $1$ – $8\ \mathrm{kHz}$ band. Lee's design rationale is to keep the inter-layer coherence low enough that the height layer produces genuine vertical image spread rather than a fused ear-level image. PCMA-3D targets a 7.1.4-style bed (four main + four height, plus added surrounds/spots as needed).

OCT-9

OCT-9 extends Theile's OCT (Optimum Cardioid Triangle) surround array — a supercardioid L/R pair flanking a forward cardioid centre, with rear cardioids — by adding a height layer of (typically four) cardioids mounted above the front and side capsules. OCT's defining feature is high channel separation in the main layer (the supercardioids' rear rejection minimizes crosstalk between L and R), and OCT-9 carries that philosophy upward: the nine channels (the OCT main array plus the height quartet, with the LFE making the " $.1$ ") map naturally onto a 7.1.4 or 9.1 immersive bed. The height mics are spaced above the mains and aimed to favor early reflections.

2L-Cube

The 2L-Cube, associated with the Norwegian label 2L and engineer Morten Lindberg, arranges nine omnidirectional microphones on the vertices and faces of an imaginary cube roughly $1\ \mathrm{m}$ on a side, surrounding the ensemble (the players are typically seated in a circle around the array). Omnis are used for their full, natural low end and open sound; spacing alone provides the decorrelation, both horizontally and vertically. Because omnis have no directivity to reject the direct sound, the 2L-Cube relies entirely on the cube's geometry — the $\sim 1\ \mathrm{m}$ vertical edges give the $h\sin\phi$ time differences that lift the height layer. It is a purist, minimalist approach favored for acoustic and classical immersive recordings.

Hamasaki Square with height

The Hamasaki Square is a square (commonly $2$ – $3\ \mathrm{m}$ per side) of four figure-of-eight microphones with their nulls aimed at the stage, used to capture diffuse late reverberation while rejecting direct sound. Adding height means placing a second set of ambience capsules — typically four upward- or outward-facing mics — roughly $1\ \mathrm{m}$ above the square (sometimes called the "Hamasaki Cube" arrangement). Because the figure-of-eights already null the direct sound, the square's signals are intrinsically decorrelated from any main/spot array, and raising a second layer extends that diffuse capture into the upper hemisphere. This is an ambience array, almost always combined with a separate main/spot system for the direct sound.

ESMA-3D

The Equal Segment Microphone Array – 3D generalizes the ESMA concept (a ring of cardioids spaced so each adjacent pair subtends an equal segment of the playback circle, giving gap-free $360^\circ$ horizontal imaging) into a two-layer cube-like form. A lower ring of four cardioids covers the horizontal plane with equal $90^\circ$ segments; an upper ring of four cardioids, spaced above, covers the height layer. ESMA-3D aims for a coherent, evenly distributed surround-with-height image suitable for ambience and room capture, again at the 7.1.4 channel count.

Decca Cuboid

The Decca Cuboid takes the spaced-omni philosophy of the classic Decca tree and extends it vertically. The lower layer is a wide ring of omnidirectional capsules (think of the tree's L/C/R generalized to four corners of a roughly $1.4\ \mathrm{m}$ square), and an upper layer of four omnis sits about a metre above, tilted slightly skyward. Like the 2L-Cube it relies on pure spacing for decorrelation — there is no directivity to reject anything — but the corner geometry and larger horizontal footprint give it the big, weighty low end and "Decca sound" that engineers chose the tree for in the first place. Why it exists: to keep the unmistakably full, open omni timbre while adding a height layer that is decorrelated by the vertical spacing alone. Strengths: natural timbre, deep bass, broad enveloping image. Weaknesses: physically large, no direct-sound rejection, and localization is soft — the same trade every all-omni array makes.

ORTF-3D

ORTF-3D (a Schoeps configuration) stacks two near-coincident layers on the ORTF idea. The lower layer is the familiar ORTF-style splay of directional capsules covering the front/horizontal stage; an upper layer, raised and angled upward and outward, captures the height. Because each layer is near-coincident, the array keeps ORTF's good image focus and reasonable mono behaviour, while the modest spacing between layers supplies just enough vertical decorrelation to lift a height image. Why it exists: to bring the well-loved, compact, gig-friendly ORTF approach into the immersive era without resorting to a bulky spaced array. Strengths: compact, sharp horizontal imaging, travels well. Weaknesses: the small inter-layer spacing means vertical decorrelation must be watched, and like all near-coincident designs it gives less raw envelopment than a spaced cube.

Bowles Array

The Bowles array is a compact 3D design that pairs a lower ring of four cardioids covering the horizontal plane with an upper ring of four cardioids aimed more steeply upward. The defining choice is that steep height aim: by pointing the upper capsules well above the horizontal, the array maximizes the level and spectral difference between the bed and the height layer, which strengthens the perceived vertical spread. Why it exists: to get a clearly separated, convincing height layer from a small, easily rigged array of identical cardioids. Strengths: compact, uniform capsule type, strong height separation. Weaknesses: the steep height mics pick up less of the direct sound's character, so the ceiling can feel slightly disconnected if the bed/height balance is pushed too far.

Double-MS 3D

The Double-MS 3D array is the fully coincident answer to immersive capture: a single point holding a forward mid cardioid, a sideways figure-of-eight (the classic Mid–Side pair), and a third figure-of-eight oriented vertically, its nulls up and down. Matrixing the mid against the lateral eight recovers width exactly as in Mid–Side stereo; matrixing it against the vertical eight recovers height. Because every capsule shares one position there are no inter-capsule time differences and therefore no comb filtering — the array is perfectly mono-compatible and tiny. Why it exists: to drop an immersive mic where a spaced layer cannot go (a stand in a tight booth, a boom over a stage) while keeping a single, rotatable perspective. Strengths: coincident (no phase or comb problems), compact, mono-safe, matrixable to any width/height balance after the fact. Weaknesses: its vertical cue is a level difference from the vertical eight, and — as this chapter stresses — the vertical auditory system responds far better to decorrelation and spectral content than to level ratios, so coincident height reads as a gentle vertical spread rather than a placed overhead image. Like the first-order Ambisonic mic to which it is closely related, it trades envelopment for compactness.

Auro-3D (9.1)

The Auro-3D main array mirrors the 9.1 playback layout that gives the format its name: a 5.0 ear-level base (centre, front L/R, surround L/R) with a height layer placed directly above the base channels at roughly $30^\circ$ elevation — not at the ceiling. This is Auro-3D's signature geometry. Where Atmos thinks of a separate overhead "top" layer, Auro stacks a second ring only $\sim 30^\circ$ up, so each height capsule sits above its corresponding base mic and leans up by about that angle. Why it exists: to capture straight into an Auro 9.1 bed with a one-to-one mic-to-channel map, and to exploit the perceptually gentle $30^\circ$ elevation the format is built around. Strengths: maps directly onto Auro 9.1/10.1; a natural moderate height that integrates smoothly with the bed. Weaknesses: because each height capsule sits directly above its base mic and looks only modestly upward, inter-layer correlation runs higher than in ceiling-aimed designs — the vertical spacing and a touch of up-tilt must be watched, exactly as in the decorrelation discussion above, or the height collapses into the base.

Cardioid Cube

The Cardioid Cube is the directional sibling of the 2L-Cube: eight cardioids on the corners of a $\sim 1\ \mathrm{m}$ cube, the lower four aimed outward in the horizontal plane and the upper four aimed outward and up. Where the 2L-Cube leans entirely on spacing for decorrelation (its omnis reject nothing), the cardioid cube adds directivity — each capsule looks away from its diagonal opposite, so the cube captures more separated, less overlapping signals while keeping the same spacing-based vertical decorrelation. Why it exists: to get the broad, enveloping cube image with more channel separation and less mutual bleed than an all-omni cube, from one uniform capsule type. Strengths: better source separation and direct-sound rejection than the omni cube; uniform cardioids; strong envelopment. Weaknesses: cardioids give up the full, weighty low end and perfectly open timbre of omnis, and their off-axis colour can tint the diffuse field — the standard omni-vs-cardioid trade, now in three dimensions.

Fukada Tree (3D)

The Fukada Tree 3D lifts the much-loved Fukada tree into height. The base keeps Fukada's hybrid recipe — five cardioids (L/C/R plus surround L/R) for localization, flanked by two spaced omni outriggers that restore the warm low end and width cardioids lose — and adds a height quartet of cardioids about a metre above, aimed up. Why it exists: to carry the Fukada tree's celebrated balance of cardioid localization and omni warmth into an immersive bed, rather than the drier separation of OCT-9 or the soft all-omni image of the Decca Cuboid. Strengths: combines crisp frontal imaging, full low end from the outriggers, surround envelopment, and a decorrelated height layer in one coherent 11-channel main. Weaknesses: large and channel-heavy to rig, and the mix of cardioid and omni capsules across base and height demands care so the layers share a consistent timbre (see "Mismatched main and height capsule types" below).

Equally-spaced line arrays

A line array of microphones is simply $N$ identical capsules placed in a straight line at a uniform spacing $d$ . Unlike the imaging arrays above — which encode a single front stage into a small channel set — a line samples the sound field at regular spatial intervals, exactly the capture-side dual of a loudspeaker line array or of Wave Field Synthesis. Why it exists: uniform spatial sampling is the right tool when you want to beamform after the fact (steer or widen a pickup lobe by delaying and summing the capsules), feed a WFS/holographic reconstruction, or simply cover a wide, flat source (an organ, a row of singers) without favouring any one point. The regular spacing makes the array's spatial response predictable: closer spacing pushes the spatial-aliasing frequency higher (cleaner beamforming up to higher frequencies) at the cost of more capsules for a given length. Strengths: predictable, steerable, scales to any width, ideal for post-hoc beamforming and WFS capture. Weaknesses: not an imaging technique on its own — a raw line summed to a channel bed just sounds like comb-filtered mono unless it is beamformed or decoded; and like any spaced array it combs when its channels overlap in frequency.

In RIPL

Every array above is available as a one-click Array preset (and as a Virtual Microphone capsule rig), grouped by capsule count. A Spacing control scales the whole rig, and the line presets (4 and 8 forward cardioids) expose that scaling as the inter-capsule distance.

Comparison

Array	Main-layer capsules	Decorrelation principle	Height capsules	Height aim	Typical $h$	Target bed
PCMA-3D	4 near-coincident supercardioids	Vertical stacking tuned for $1$ – $8\ \mathrm{kHz}$ decorrelation	4 supercardioids stacked above mains	Tuned for vertical decorrelation	$\sim 1\ \mathrm{m}$	7.1.4
OCT-9	OCT (supercard. L/R, cardioid C, cardioid rears)	Directivity (rear rejection) + spacing	4 cardioids above front/sides	Up / forward-up	$\sim 1\ \mathrm{m}$	7.1.4 / 9.1
2L-Cube	4 omnis (lower face)	Pure spacing (no directivity)	4–5 omnis (upper face)	Open / outward	$\sim 1\ \mathrm{m}$	7.1.4 / 9.1.x
Hamasaki Square + height	4 figure-8 (nulls to stage)	Directivity null + large spacing	4 ambience mics above	Up / outward	$\sim 1\ \mathrm{m}$	Ambience for any bed
ESMA-3D	4 cardioids, equal $90^\circ$ segments	Equal-segment spacing + directivity	4 cardioids, upper ring	Up	$\sim 1\ \mathrm{m}$	7.1.4
Double-MS 3D	Coincident mid cardioid + lateral & vertical fig-8	Coincident — level/directivity only (no spacing)	Vertical figure-8 (matrixed)	Up/down null	$0$ (coincident)	Matrixable bed / scene
Auro-3D (9.1)	5.0 cardioid base (LCR + surround)	Directivity + spacing	4 cardioids above base, $\sim 30^\circ$ up	$\sim 30^\circ$ up	$\sim 0.6$ – $1\ \mathrm{m}$	Auro 9.1
Cardioid Cube	4 cardioids (lower, outward)	Spacing + directivity	4 cardioids (upper, outward-up)	Out / up	$\sim 1\ \mathrm{m}$	7.1.4
Fukada Tree 3D	5 cardioids + 2 omni outriggers	Directivity + spacing (omni width)	4 cardioids above, up	Up	$\sim 1\ \mathrm{m}$	7.1.4 / 9.1

A second table contrasts the strategies these arrays embody, echoing the coincident-vs-spaced dichotomy of the whole of Part IV:

Strategy	Encodes direction as	Vertical coherence	Strength	Weakness
Near-coincident main + stacked height (PCMA-3D, OCT-9)	Level (main) + small time offset (layers)	Low if tuned; risk of high	Compact, mono-compatible main	Height correlation must be actively managed
Spaced omni cube (2L-Cube)	Time differences only	Naturally low (spacing)	Natural timbre, broad image	Bulky; no direct-sound rejection; localization soft
Figure-8 ambience (Hamasaki + height)	Diffuse field, direction de-emphasized	Very low (null + spacing)	Superb envelopment, clean direct rejection	Captures no direct sound; needs a separate main array

Choosing an array

If your scene is a hall with a clear stage, pair a direct-sound main array (OCT, Decca tree) with a diffuse height system (Hamasaki-with-height) — let each layer do the job it is good at. If your scene is an immersive ensemble with no single front (a choir in the round, ambient field recording), a single coherent cube (2L-Cube, ESMA-3D) is more elegant.

Decorrelation Between Ear-Level and Height Layers

Vertical interchannel coherence

The governing quantity is the interchannel coherence between a height channel $x_h(t)$ and the main channel $x_m(t)$ beneath it. Using the normalized cross-correlation,

\gamma_{mh} = \frac{\displaystyle \int x_m(t)\,x_h(t)\,dt}{\sqrt{\displaystyle \int x_m^2(t)\,dt \int x_h^2(t)\,dt}}, \qquad -1 \le \gamma_{mh} \le 1.

When $\gamma_{mh} \to 1$ the two channels are essentially the same waveform (possibly scaled and delayed); the auditory system fuses them and, by the precedence effect, locks the image to whichever arrives first or louder — almost always the ear-level main layer. Height collapses. When $\gamma_{mh}$ is low (say, $|\gamma_{mh}| < 0.3$ across the elevation-critical band), the layers are heard as distinct contributions and a sense of vertical extent emerges.

This is the direct vertical analogue of horizontal decorrelation in direct, diffuse and envelopment: low inter-aural coherence widens and envelops the horizontal image; low inter-layer coherence lifts and opens the vertical image. In both cases the perceptual reward of decorrelation is spaciousness, and in both cases the penalty for too much decorrelation is loss of a stable, localizable image.

How geometry sets the coherence

Three knobs control $\gamma_{mh}$ :

Vertical spacing $h$ . Larger $h$ increases the time offset $\Delta t = h\sin\phi/c$ across frequency, which lowers broadband coherence — but only for sound arriving off the horizontal ( $\phi \ne 0$ ). For the direct frontal sound ( $\phi=0$ ) spacing alone does nothing.
Directivity and aim. Aiming the height capsules at the ceiling exploits the direction of arrival: ceiling reflections hit the height mic strongly and the main mic weakly (and vice versa for direct sound), driving $\gamma_{mh}$ down structurally rather than just statistically.
Frequency band. Coherence is naturally lower at high frequencies (where wavelengths are short relative to $h$ , so even small position differences scramble phase) and higher at low frequencies (where the layers see nearly the same pressure). Since elevation cues live high, height arrays are helped by the natural high-frequency decorrelation of any spacing.

Worked example — coherence vs. frequency for a $1\ \mathrm{m}$ vertical pair. Two spaced capsules separated by $h=1\ \mathrm{m}$ in a diffuse (isotropic) field have a theoretical coherence following the classic sinc relation

\gamma(f) = \mathrm{sinc}\!\left(\frac{2\pi f h}{c}\right) = \frac{\sin\!\big(2\pi f h / c\big)}{2\pi f h / c}.

The first zero occurs when $2\pi f h/c = \pi$ , i.e. $f = c/(2h) = 343/2 = 171.5\ \mathrm{Hz}$ . So above about $170\ \mathrm{Hz}$ a $1\ \mathrm{m}$ vertical spacing already yields substantially decorrelated diffuse field — and by the multi-kilohertz region where elevation cues live, coherence oscillates close to zero. The lesson: a $\sim 1\ \mathrm{m}$ spaced height layer is automatically well-decorrelated from the main layer in the diffuse field. The danger is never the reverberation; it is the direct sound, which is correlated regardless of spacing and must be handled by aim and directivity.

The cardinal sin of 3D recording

Feeding the height layer a delayed, level-scaled copy of the main bed — whether by accident (height mics pointed straight at the stage, no spacing) or in the mix (duplicating the bed "up top") — produces high $\gamma_{mh}$ , precedence-effect fusion, and comb filtering. The result is not "more height"; it is a hollow, phasey version of the ear-level image with no elevation at all. If in doubt, decorrelate.

Capturing for Object/Scene Workflows vs Channel Beds

Immersive delivery splits into two philosophies (see object-based and formats), and the capture strategy differs accordingly.

Channel beds

A channel bed is a fixed loudspeaker layout — 7.1.4, 9.1.6 — that the recording feeds directly. Layer-based arrays are made for this: the main array maps to the ear-level channels, the height layer to the overhead channels, with little or no re-rendering. The microphone geometry is the spatial format; what the array captures is what plays back. The Decca-tree-plus-height example below is a channel-bed workflow.

Object and scene (Ambisonic) workflows

In object-based delivery, sources are stored as mono signals plus positional metadata and rendered to whatever layout the listener has. In scene-based (Ambisonic) delivery, the whole sound field is encoded into a layout-independent set of spherical-harmonic channels and rendered on playback. For these, capture priorities shift:

Spot mics become objects. For object workflows you often want clean, isolated, mono spot microphones on individual sources, each later authored as an object with an arbitrary 3D position (including elevation set in the mix, not captured acoustically). The immersive array then provides only the bed — ambience, room, glue — beneath the objects.
One coherent point for scene capture. For Ambisonic scene capture you want a single coincident higher-order microphone (a spherical array) at one point, not a spread layer, because the spherical-harmonic model assumes a single observation point. A spaced 3D array and an Ambisonic mic are philosophically opposite: the former samples the field at many points and maps to channels; the latter samples directivity at one point and maps to a rotatable scene.

Capture once, decide later

A robust immersive session often records both: a coherent main-plus-height array for the channel bed, a Hamasaki-style diffuse height system for envelopment, and close spot mics on the key sources. This furnishes a channel-bed mix immediately and leaves the spots available to be re-authored as objects — buying flexibility against an uncertain delivery format.

A Single Coherent Array vs Many Spot Mics

This is the immersive restatement of the oldest debate in recording: purist main array versus multi-mic / spot capture. The trade-offs sharpen in 3D.

A single coherent array (2L-Cube, ESMA-3D, a tree-plus-height) captures the natural spatial and timbral relationships of the room in one consistent perspective. Its inter-channel time and level relationships are physically correct, so the height, depth, and width all hang together. The cost is control: you cannot rebalance one instrument without moving the whole image, the direct-to-reverberant ratio is fixed by mic placement, and a poor room is captured warts and all. The array also fixes the perspective at one listening point.

Many spot mics give per-source control of level, EQ, and (for object delivery) position, and they rescue dry or difficult rooms. But spots are time-incoherent with the main array (each is a different distance from each source) and from each other, so summing them risks comb filtering, smeared imaging, and a loss of the very spatial coherence that immersive playback is supposed to showcase. Worse, a spot mic carries no native height information; if it is panned up in the mix as a correlated copy, it triggers exactly the coherence collapse described above.

The practical reconciliation mirrors classical surround practice: a coherent immersive main + height array establishes the spatial frame and the height, while spots are added sparingly, time-aligned to the main array (delay each spot by its extra path length, $\Delta t_\text{spot} = d_\text{spot}/c$ ) and mixed low, used to add presence and intelligibility rather than to build the image. Spots intended to be "lifted" should be fed to the height layer through a decorrelator (a short, frequency-dependent delay or all-pass network) so they raise $\gamma_{mh}$ as little as possible.

Spots and height do not mix naively

A spot mic panned into both an ear-level and an overhead channel with the same signal is the duplicated-bed mistake in miniature. If you must give a spot apparent height, decorrelate the height feed (all-pass or micro-delay) and accept that you are buying spread, not a pinpoint elevated image — that is all the vertical auditory system will give you.

Worked Example: Extending a Decca Tree to 7.1.4

Let us design a concrete immersive array by adding a height quartet to a classic Decca tree, targeting a 7.1.4 bed (seven ear-level channels, one LFE, four height channels).

The starting main layer

A Decca tree is three spaced omnidirectional microphones in a "T": a centre omni forward, with left and right omnis spread to its sides and slightly behind. Take a standard medium tree:

Left–Right spacing: $2.0\ \mathrm{m}$ .
Centre forward of the L–R axis: $1.5\ \mathrm{m}$ .
Height of the tree above the stage floor sightline: set by the conductor's podium, typically $3$ – $3.5\ \mathrm{m}$ .

Spacing of $2.0\ \mathrm{m}$ gives generous inter-channel time differences. For a source $30^\circ$ off-axis, the L–R path difference is

\Delta d = 2.0 \times \sin 30^\circ = 1.0\ \mathrm{m}, \quad \Delta t = \frac{1.0}{343} \approx 2.9\ \mathrm{ms},

a strong time cue that places the source convincingly toward the relevant side — the spaced-array, time-difference logic of stereo microphone techniques extended across the front. For 7.1, the side and rear surround channels are supplied by a separate spaced pair or a Hamasaki square behind the tree; we focus on the height extension.

The height quartet

Mount four cardioids $h = 1.0\ \mathrm{m}$ above the main capsules and aim them upward (toward the ceiling/reflections), arranged as a rectangle to feed the four overhead channels:

Front height L / R: above and slightly behind the tree's L and R omnis, spread by about $2.0\ \mathrm{m}$ (matching the L–R baseline), tilted up by $\sim 30$ – $45^\circ$ .
Rear height L / R: about $1.5$ – $2.0\ \mathrm{m}$ behind the front height pair, also spread $\sim 2.0\ \mathrm{m}$ , aimed up at the hall.

Check the decorrelation. For the direct sound from the stage arriving at the front-height cardioids, their upward tilt of $\sim 45^\circ$ places the stage near the cardioid's $-3$ to $-6\ \mathrm{dB}$ off-axis region and, more importantly, exploits direction so the height mic hears mostly the first ceiling reflection rather than the direct path. For the diffuse reverberant field, the $1\ \mathrm{m}$ vertical spacing alone (first coherence zero at $\sim 170\ \mathrm{Hz}$ , per the sinc relation above) guarantees low $\gamma_{mh}$ across the elevation-critical band. The front-to-rear spacing of $\sim 1.75\ \mathrm{m}$ gives the overhead layer its own internal time differences:

\Delta t_\text{F-R} = \frac{1.75}{343} \approx 5.1\ \mathrm{ms},

enough to produce a sense of overhead depth (front-up vs rear-up) without exceeding fusion for reflections.

Time-aligning the layers (optional)

If a forward-facing height variant is used and comb filtering with the main layer is audible on direct sound, delay the main channels (or the height channels) so the layers' direct-sound arrivals coincide, then rely on directivity for decorrelation. With the height mics raised $1\ \mathrm{m}$ and aimed up, the direct-sound path from a stage source at horizontal distance $D$ differs between layers by

\Delta d_\text{layer} = \sqrt{D^2 + (z+h)^2} - \sqrt{D^2 + z^2},

where $z$ is the main mic's height above the source plane. For $D = 6\ \mathrm{m}$ , $z = 1.5\ \mathrm{m}$ , $h = 1.0\ \mathrm{m}$ :

\Delta d_\text{layer} = \sqrt{36 + 6.25} - \sqrt{36 + 2.25} = \sqrt{42.25} - \sqrt{38.25} \approx 6.50 - 6.18 = 0.32\ \mathrm{m},

so $\Delta t_\text{layer} \approx 0.32/343 \approx 0.93\ \mathrm{ms}$ — about a $1\ \mathrm{ms}$ inter-layer offset on direct sound, corresponding to comb-filter notches spaced $\Delta f = 1/\Delta t \approx 1.07\ \mathrm{kHz}$ apart. If the height mics are correlated enough with the main layer for this to matter, a $1\ \mathrm{ms}$ delay alignment (or, better, aiming the height mics up so the direct sound is rejected) removes it. With upward aim, the direct sound is attenuated enough that the comb is inaudible and no alignment is needed.

Resulting channel map

7.1.4 channel	Source in this array
L, C, R	Decca tree L, C, R omnis
L surround, R surround	Spaced rear pair / Hamasaki square sides
L back, R back	Hamasaki square rears (or rear spaced pair)
LFE	Derived/optional (low-pass sum or dedicated)
Top front L / R	Height cardioids front L / R (up-tilted)
Top rear L / R	Height cardioids rear L / R (up-aimed)

This array delivers a natural, coherent immersive image: the tree provides the warm, well-localized frontal stage; the rear/surround mics provide horizontal envelopment; the height quartet provides decorrelated overhead reflection energy that opens the hall vertically — all captured in one consistent perspective and mapping directly onto the channel bed with no re-rendering.

Common Mistakes and Pitfalls

Height too correlated with the bed

The most frequent failure, discussed throughout this chapter: height mics aimed straight at the stage with little vertical spacing, or a height layer synthesized in the mix from copies of the bed. The symptom is that switching the height channels on and off changes the level and perhaps the tone but not the elevation — the image stays glued to ear level. The cause is high $\gamma_{mh}$ and precedence-effect fusion. The cure is decorrelation: raise the spacing toward $1\ \mathrm{m}$ , aim the height capsules up, and never duplicate the bed upward without a decorrelator.

Comb filtering between layers

When the layers are correlated and offset by a few milliseconds, their sum is comb-filtered: a series of notches at $f_n = n/\Delta t$ . On direct sound this is heard as hollowness or a metallic edge. The fix is either to remove the correlation (aim height mics away from the direct source) or to time-align the layers and accept the resulting (now coherent) sum — but the former is almost always preferable because it also delivers the decorrelation that height needs.

Over-decorrelation and a detached ceiling

The opposite error: a height layer so independent (huge spacing, totally different content, artificial decorrelation piled on) that it floats free of the bed, perceived as a separate "upstairs" event rather than an extension of the same space. The scene loses cohesion. Height should be of the same room as the bed — share the reverberant tail, keep the levels moderate ( $-3$ to $-9\ \mathrm{dB}$ relative to the bed is a common starting region), and avoid stacking artificial decorrelation on top of geometric decorrelation that is already sufficient.

Bright or dull height layer

Because elevation cues live above $\sim 5\ \mathrm{kHz}$ , a height layer rolled off in the top (cheap capsules, off-axis coloration from extreme up-tilt, excessive distance) delivers little perceived height. Conversely, an over-bright height layer can pull all sources upward via the directional-band effect, smearing the vertical image. Match the timbre of the height capsules to the mains and keep their high-frequency response honest.

Mismatched main and height capsule types

Mixing omni mains with cardioid heights (or vice versa) is sometimes deliberate, but it changes the direct-to-reverberant ratio and timbre between layers, which can make the height sound like a different recording. If a natural, seamless dome is the goal, matching capsule types (the 2L-Cube's all-omni philosophy, or an all-cardioid ESMA-3D) yields the most coherent result.

Limits of 3D Capture

Even a perfectly executed immersive array faces hard limits, most rooted in the psychoacoustics of elevation:

Vertical localization is intrinsically coarse. Median-plane localization blur is far larger than horizontal blur (tens of degrees versus a few degrees), and it rests on individual pinna cues that no loudspeaker array reproduces exactly. Captured height is therefore always more "spread" than "pinpoint." You can convincingly raise the envelopment; you can rarely place a precise point overhead the way you place one to the side.
The single-perspective problem. A spaced 3D array captures the field correct for one listening point. Off the sweet spot, inter-channel time differences no longer correspond to the listener's geometry, and both horizontal and vertical images degrade — the same sweet-spot limitation that constrains all channel-based reproduction, now in three dimensions.
Bottom-hemisphere capture is essentially impossible in practice. Sources and reflections from below the array (below-ear-level content) are rarely captured or reproduced; floor reflections arrive at low elevation and are usually folded into the ear-level layer. Full-sphere capture belongs to higher-order Ambisonics, and even there the lower hemisphere is sparsely furnished by real loudspeaker layouts.
Format dependence. A channel-bed array baked for 7.1.4 does not automatically translate to 9.1.6 or to binaural; down/up-mixing introduces its own coherence and timbre artifacts. Object and scene workflows trade this for rendering flexibility, at the cost of needing a fundamentally different (coincident/spherical) capture approach.
Reverberation-dominated upper hemisphere. Because the upper hemisphere is mostly reflections, 3D capture is largely spatial enhancement (spaciousness, envelopment, listener envelopment) rather than the addition of new localizable sources. This is a real perceptual benefit — height demonstrably increases preference and immersion in listening tests — but it is not the same as "more instruments, now overhead."

The honest summary

3D capture reliably buys you vertical spaciousness and envelopment and an open, hall-like top end; it buys you precise overhead localization only weakly and only for broadband, high-frequency-rich content. Design for the former, and the latter takes care of itself where the physics allows.

Connecting Back to the Whole Guide

The immersive array closes the loop opened in Principles of Spatial Capture. The horizontal main layer encodes azimuth exactly as a stereo or surround array does — coincident sub-arrays via level, spaced sub-arrays via time — and those are the same ILD and ITD cues your ears use (psychoacoustics) and the same cues the playback techniques re-synthesize. The height layer adds a third coding axis, but it does so with a different currency: not a clean phantom-panning cue but decorrelation and spectral content, because the vertical auditory system is built differently from the horizontal one. Capture and reproduction remain duals — a 7.1.4 array feeds a 7.1.4 rig, a scene mic feeds an Ambisonic decoder — and the immersive recordist's craft is to respect, at the microphone, the perceptual machinery that will ultimately decode the result. The next step on the input side, Ambisonic recording, takes the opposite path to the same goal: not many spaced capsules mapped to channels, but one coincident array mapped to a rotatable, format-agnostic sound field.

References

Howie, W. Capturing Orchestral Music for Three-Dimensional Audio Playback. Doctoral dissertation / AES papers on 9.1 (Auro-3D) and immersive orchestral recording, McGill University.
Lee, Hyunkook. "Capturing 360° Audio Using an Equal Segment Microphone Array (ESMA)." Journal of the Audio Engineering Society, and related papers on PCMA-3D and vertical localization / Vertical Image Spread.
Lee, Hyunkook, and Christof Gribben. "Effect of Vertical Microphone Layer Spacing for a 3D Microphone Array." Journal of the Audio Engineering Society, 62(12), 2014.
Theile, Günther, and Helmut Wittek. "Principles in Surround Recordings with Height." Proceedings of the AES 130th Convention, 2011.
Rumsey, Francis. Spatial Audio. Focal Press, 2001 (chapters on multichannel and 3D microphone technique).
Eargle, John. The Microphone Book, 3rd ed. Focal Press, 2012 (stereo/surround arrays, capsule directivity).
Williams, Michael. "Microphone Array Analysis for Multichannel Sound Recording" and SRA / Multichannel Microphone Array Design (MMAD) publications, AES.
Wittek, Helmut, and Günther Theile. "Spatial Perception of Decorrelated and Coherent Signals in Multichannel Reproduction." AES Convention papers.

← Back to Recording

The New Axis: Adding Height to Capture​

From a ring to a hemisphere​

Why height is special​

The Perceptual Basis: How Height Is Heard​

Spectral cues and the pinna​

Vertical interchannel cues are not the same as horizontal ones​

Layer-Based Capture: A Main Array Plus a Height Layer​

The dominant paradigm​

Spacing the height layer above the main pair​

Named 3D Arrays​

PCMA-3D​

OCT-9​

2L-Cube​

Hamasaki Square with height​

ESMA-3D​

Decca Cuboid​

ORTF-3D​

Bowles Array​

Double-MS 3D​

Auro-3D (9.1)​

Cardioid Cube​

Fukada Tree (3D)​

Equally-spaced line arrays​

Comparison​

Decorrelation Between Ear-Level and Height Layers​

Vertical interchannel coherence​

How geometry sets the coherence​

Capturing for Object/Scene Workflows vs Channel Beds​

Channel beds​

Object and scene (Ambisonic) workflows​

A Single Coherent Array vs Many Spot Mics​

Worked Example: Extending a Decca Tree to 7.1.4​

The starting main layer​

The height quartet​

Time-aligning the layers (optional)​

Resulting channel map​

Common Mistakes and Pitfalls​

Height too correlated with the bed​

Comb filtering between layers​

Over-decorrelation and a detached ceiling​

Bright or dull height layer​

Mismatched main and height capsule types​

Limits of 3D Capture​

Connecting Back to the Whole Guide​

References​