Ambisonic & Scene-Based Recording

Every other microphone technique in this part of the guide answers the question "where do I place microphones so the channels I record sound right on my loudspeakers?" Coincident pairs, spaced pairs, surround trees and 3D arrays are all designed around a particular reproduction layout: a stereo pair, a 5.1 ring, a 7.1.4 dome. Change the layout and the recording, strictly speaking, no longer fits. Scene-based recording asks a different and more ambitious question: "how do I capture the sound field itself, at one point in space, in a form that does not yet commit to any loudspeaker layout?" The answer is Ambisonics, and the recording side of that answer is the tetrahedral microphone and its higher-order spherical cousins.

This chapter is the capture-side companion to the Ambisonics playback chapter. There we treated B-format as something a panner or synthesizer produces; here we treat it as something a microphone array measures. The same spherical-harmonic representation that lets us pan and decode to arbitrary layouts is, it turns out, almost exactly what a cluster of closely spaced directional capsules naturally captures. That coincidence — that a physical microphone and a mathematical basis function line up — is what makes Ambisonic recording so powerful and so peculiar.

Why Capture a Whole Sound Field at a Point

Recall from the principles of spatial capture that all spatial microphone techniques encode direction using two cues the ear itself uses (see psychoacoustics): level differences between channels (the basis of coincident/intensity techniques) and time differences (the basis of spaced techniques). Scene-based recording is squarely an intensity technique — its capsules are coincident, or nearly so — but it generalizes the idea. Instead of two channels whose level difference encodes one azimuth, it records a small set of channels whose relative levels encode the full directional distribution of energy arriving at a point.

The point-field idea

Imagine standing at one spot in a concert hall and asking: from which directions is sound arriving, and how much from each? At any instant the answer is a function on the sphere — an intensity (and phase) for every direction $(\theta, \phi)$ of azimuth and elevation. That directional function is exactly what we want to capture. We cannot record an infinite number of directions, so we expand the function in a basis of spherical harmonics and keep the first few terms. A first-order expansion keeps four terms; that four-channel signal is B-format, and it is the Ambisonic "scene."

The crucial property is that the scene is stored independently of any loudspeaker layout. A B-format recording does not "know" whether it will be played on headphones, a quad rig, a 22.2 dome, or a wave field synthesis array. The commitment to a layout happens later, at decode time, exactly mirroring the encode/decode split described in the Ambisonics chapter. This is the defining advantage of scene-based capture, and the reason it dominates VR and 360 video.

Capture and reproduction as duals

The recurring theme of this part of the guide is that capture and reproduction are duals. In channel-based stereo, an intensity pair records a level difference that a loudspeaker pair later converts back into a phantom image. In Ambisonics the duality is even cleaner: the encoding equations are the same whether the spherical-harmonic coefficients come from a synthetic panner or from a microphone. A tetrahedral mic is, in effect, a four-channel Ambisonic encoder whose "panning law" is dictated by physics rather than chosen by a mixing engineer. Everything you learned about decoding B-format to loudspeakers in /guide/techniques/ambisonics applies unchanged to a recorded scene.

Key takeaway

A scene-based recording stores the sound field at a point, expanded in spherical harmonics, not a set of speaker feeds. The layout decision is deferred to decode time — which is precisely why the same recording can serve headphones, a 5.1 ring, or a dome.

The Tetrahedral Microphone

The original Ambisonic microphone, conceived by Michael Gerzon and Peter Craven in the early 1970s and commercialized as the Calrec/SoundField microphone, places four capsules on the faces of a regular tetrahedron. The tetrahedron is the smallest arrangement that samples all three spatial axes symmetrically: four points, maximally spread, no two pointing the same way.

Capsule geometry

Each capsule is a directional element — classically sub-cardioid to cardioid — mounted on one face of the tetrahedron so its main axis points outward from the center along that face's normal. Label the four capsules by the sign pattern of their pointing directions. A common convention places them at:

LF (Left-Front-Up): direction $(+,+,+)$
RF (Right-Front-Down): direction $(+,-,-)$
LB (Left-Back-Down): direction $(-,+,-)$
RB (Right-Back-Up): direction $(-,-,+)$

so that, normalizing, each capsule axis points to a vertex of a tetrahedron inscribed in a cube. Two capsules point generally forward and two generally backward; two point up and two down; two left and two right. This balanced sign structure is what later lets simple sums and differences recover the cardinal axes.
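
As a quick numerical check on this geometry, the four axes can be written out and tested. The sketch below is illustrative only: it assumes the axis convention x = front, y = left, z = up implied by the sign labels above, and the names `CAPSULE_AXES` and `dot` are made up for this example.

```python
import math

# Assumed axes: x = front, y = left, z = up (matches the sign labels above).
_S = 1.0 / math.sqrt(3.0)  # normalizes (+-1, +-1, +-1) to unit length
CAPSULE_AXES = {
    "LF": (+_S, +_S, +_S),  # left-front-up
    "RF": (+_S, -_S, -_S),  # right-front-down
    "LB": (-_S, +_S, -_S),  # left-back-down
    "RB": (-_S, -_S, +_S),  # right-back-up
}

def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(p * q for p, q in zip(a, b))

# The four axes cancel component by component, which is why summing the
# capsules removes the dipole parts and leaves pressure only.
axis_sum = tuple(sum(ax[i] for ax in CAPSULE_AXES.values()) for i in range(3))

# Every pair of axes has the same dot product, -1/3 (about 109.5 degrees apart).
names = list(CAPSULE_AXES)
pair_dots = [dot(CAPSULE_AXES[a], CAPSULE_AXES[b])
             for i, a in enumerate(names) for b in names[i + 1:]]
```

The equal pairwise angle is what "maximally spread" means in practice: no axis is favoured, so the same sum/difference pattern works for all three dipoles.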

The raw four-channel output of these capsules is called A-format. A-format is not directly usable: each channel is just the pressure-plus-velocity response of one tilted cardioid. It must be matrixed into B-format before it means anything in the Ambisonic sense.

Why a tetrahedron, and how big

Two requirements pull in opposite directions. To behave as a coincident (intensity) array, the four capsules should occupy the same point, because any spacing introduces inter-capsule time differences that corrupt the directional encoding at high frequencies. But four physical capsules cannot occupy the same point; they sit on a tetrahedron of some finite radius $r$ (typically on the order of $r \approx 1.2$ – $1.5\,\text{cm}$ for a first-order mic). The radius is a compromise: small enough that the array is "coincident" across most of the audio band, large enough to fit real capsules. We return to the frequency consequences in the spatial-aliasing section.

A-format vs B-format

A-format is the raw capsule signals — four tilted cardioids, useful to no one until processed. B-format is the spherical-harmonic representation — $W$ (omni/pressure) plus $X, Y, Z$ (the three figure-of-eight velocity components). All Ambisonic processing, decoding and rotation operates on B-format, never on A-format.

A-Format to B-Format Conversion

The heart of a tetrahedral microphone is not the capsules — it is the matrixing and filtering that turns four tilted cardioids into the four B-format channels $W, X, Y, Z$ .

The ideal matrix

A first-order cardioid pointing in unit direction $\hat{\mathbf{u}}$ has the directional response

g(\theta,\phi) = \tfrac{1}{2}\bigl(1 + \hat{\mathbf{u}}\cdot\hat{\mathbf{d}}\bigr),

where $\hat{\mathbf{d}}$ is the direction of an incoming plane wave. The $\tfrac{1}{2}$ constant is the omnidirectional (pressure) part and the $\tfrac{1}{2}\,\hat{\mathbf{u}}\cdot\hat{\mathbf{d}}$ term is the figure-of-eight (velocity) part projected onto the capsule axis. So each cardioid already contains a pinch of omni and a pinch of a dipole aligned with its own axis.

The B-format channels we want are themselves an omni and three orthogonal dipoles:

\begin{aligned} W &\propto 1 & &\text{(pressure, omnidirectional)}\\ X &\propto \cos\phi\cos\theta & &\text{(front–back dipole)}\\ Y &\propto \cos\phi\sin\theta & &\text{(left–right dipole)}\\ Z &\propto \sin\phi & &\text{(up–down dipole)} \end{aligned}

Because each capsule is a known linear combination of pressure and the three velocity components, recovering $W, X, Y, Z$ is a matter of taking the right sums and differences of the four A-format signals. With the sign convention above, the recovery is wonderfully simple:

\begin{aligned} W &= \tfrac{1}{2}\bigl(\text{LF} + \text{RF} + \text{LB} + \text{RB}\bigr),\\ X &= \tfrac{1}{2}\bigl(\text{LF} + \text{RF} - \text{LB} - \text{RB}\bigr),\\ Y &= \tfrac{1}{2}\bigl(\text{LF} - \text{RF} + \text{LB} - \text{RB}\bigr),\\ Z &= \tfrac{1}{2}\bigl(\text{LF} - \text{RF} - \text{LB} + \text{RB}\bigr). \end{aligned}

Read these aloud and the geometry is transparent. Sum all four capsules and the dipole parts (which have opposite signs front/back, left/right, up/down) cancel, leaving pure pressure — that is $W$ . Add the two front-pointing and subtract the two back-pointing capsules and the pressure parts cancel while the front–back velocity reinforces — that is $X$ . The same logic, with the appropriate sign pattern, yields $Y$ (left–right) and $Z$ (up–down). This is the same "encode direction as a level pattern, decode by sum/difference" principle as M/S stereo in the stereo techniques chapter, generalized to three dimensions.
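
The sum/difference matrix can be sketched in a few lines. This is a minimal illustration under the idealizations of this section (perfect cardioids, zero capsule spacing, no correction filters), not a usable A-to-B converter; the function names and the x = front, y = left, z = up axis convention are assumptions of the example.

```python
import math

S = 1.0 / math.sqrt(3.0)
AXES = {"LF": (S, S, S), "RF": (S, -S, -S), "LB": (-S, S, -S), "RB": (-S, -S, S)}

def direction(azimuth_deg, elevation_deg):
    """Unit vector toward the source; x = front, y = left, z = up."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))

def cardioid_gain(axis, d):
    """Ideal cardioid response g = 0.5 * (1 + u . d)."""
    return 0.5 * (1.0 + sum(u * v for u, v in zip(axis, d)))

def a_to_b_ideal(lf, rf, lb, rb):
    """Sum/difference matrix only; spacing-correction filters are omitted."""
    w = 0.5 * (lf + rf + lb + rb)
    x = 0.5 * (lf + rf - lb - rb)
    y = 0.5 * (lf - rf + lb - rb)
    z = 0.5 * (lf - rf - lb + rb)
    return w, x, y, z

def encode_plane_wave(azimuth_deg, elevation_deg):
    """A-format gains for one plane wave, matrixed to (W, X, Y, Z)."""
    d = direction(azimuth_deg, elevation_deg)
    a = {name: cardioid_gain(axis, d) for name, axis in AXES.items()}
    return a_to_b_ideal(a["LF"], a["RF"], a["LB"], a["RB"])

w, x, y, z = encode_plane_wave(90.0, 0.0)  # source hard left
```

For the hard-left source only $Y$ is non-zero among the dipoles, and $W$ comes out the same for every direction. In this idealization the dipoles peak at $1/\sqrt{3}$ rather than $1$ ; a real converter applies its own gains on top of the filters discussed next.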

Why filtering is unavoidable

If the four capsules were truly coincident, that matrix would be the whole story. They are not: they sit at the vertices of a tetrahedron of radius $r$ . Two effects follow, both frequency-dependent.

First, phase/time differences. A wave arriving from a given direction reaches the four capsules at slightly different times, $\Delta t = (\hat{\mathbf{d}}\cdot \Delta\mathbf{r})/c$ , where $\Delta\mathbf{r}$ is the spacing between two capsules and $c \approx 343\,\text{m/s}$ . At low frequencies the wavelength dwarfs $r$ and the four capsules are effectively coincident; the simple matrix is exact. As frequency rises and the wavelength approaches $r$ , the sums and differences no longer cleanly separate pressure from velocity. The difference signals ( $X, Y, Z$ ) lose level (the capsules start to "see the same thing") while the sum ( $W$ ) gains, distorting the relative gains.

Second, the omni and dipole derivations roll off differently. The pressure estimate $W$ behaves well, but the velocity estimates $X, Y, Z$ — being differences of nearby pressures — fall off at low frequency relative to $W$ unless boosted.

The fix is a set of frequency-dependent correction (equalization) filters, sometimes called the A-to-B filters or capsule-spacing compensation. These shelving/phase filters boost the figure-of-eight channels at the frequencies where capsule spacing starts to attenuate them and gently correct the $W$ / $XYZ$ balance, so that the resulting B-format matches the ideal spherical-harmonic responses over as wide a band as possible. SoundField-type processors implement these filters in hardware or software; modern plugin "A-format decoders" do the same in DSP, often with per-capsule calibration data baked in.

Rule of thumb

Never use the raw sum/difference matrix alone on a real tetrahedral mic. Always run A-format through the manufacturer's (or a calibrated) A-to-B conversion, which includes the spacing-correction filters. Skipping the filters gives a B-format that is dull and directionally smeared at high frequency and weak in the dipoles at low frequency.

Worked example: deriving W and X

Suppose a plane wave arrives from directly in front, $\theta = 0^\circ, \phi = 0^\circ$ , so $\hat{\mathbf{d}} = (1,0,0)$ . Take perfect cardioids and ignore spacing. The forward-leaning capsules LF and RF have a positive $X$ -projection of their axes, the rearward LB and RB a negative one. Plugging $\hat{\mathbf{u}}\cdot\hat{\mathbf{d}}$ for each into $g = \tfrac{1}{2}(1+\hat{\mathbf{u}}\cdot\hat{\mathbf{d}})$ and summing per the matrix gives $W \propto 1$ (full pressure, the same for any direction) and $X$ at its maximum positive value, with $Y = Z = 0$ . Rotate the source to the rear, $\hat{\mathbf{d}} = (-1,0,0)$ , and $X$ flips to its maximum negative value while $W$ is unchanged. The pair $(W, X)$ therefore encodes front-vs-back as a level-and-sign pattern — exactly the intensity-difference cue, now lifted into a full 3D basis.
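
Written out with numbers, under the same idealizations (perfect cardioids, zero spacing, axes normalized to $(\pm1,\pm1,\pm1)/\sqrt{3}$ ), the frontal case gives $\hat{\mathbf{u}}\cdot\hat{\mathbf{d}} = +1/\sqrt{3}$ for LF and RF and $-1/\sqrt{3}$ for LB and RB. The shorthand $g_F$ and $g_B$ for the two capsule gains is introduced only for this sketch:

\begin{aligned} g_F &= \tfrac{1}{2}\bigl(1 + \tfrac{1}{\sqrt{3}}\bigr) \approx 0.789, & g_B &= \tfrac{1}{2}\bigl(1 - \tfrac{1}{\sqrt{3}}\bigr) \approx 0.211,\\ W &= \tfrac{1}{2}(2g_F + 2g_B) = 1, & X &= \tfrac{1}{2}(2g_F - 2g_B) = \tfrac{1}{\sqrt{3}} \approx 0.577,\\ Y &= \tfrac{1}{2}(g_F - g_F + g_B - g_B) = 0, & Z &= \tfrac{1}{2}(g_F - g_F - g_B + g_B) = 0. \end{aligned}

For a source directly behind, $g_F$ and $g_B$ swap, so $W$ stays at $1$ and $X$ becomes $-1/\sqrt{3}$ . In this idealization the dipole peaks at $1/\sqrt{3}$ rather than $1$ ; any fixed gain needed to reach a particular normalization would be part of the converter and is not modelled here.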

First-Order Limits and Higher-Order Microphones

A first-order B-format scene has four channels and an angular resolution that is, frankly, coarse. The first-order directivity patterns ( $W$ omni plus three dipoles) are broad; a decoded source is spread over a wide arc, localization is soft, and two sources close in angle blur together. To sharpen the scene you need higher-order Ambisonics (HOA), which means more spherical-harmonic terms — and capturing more terms means more capsules.

How many channels per order

The number of spherical harmonics up to and including order $N$ is

\text{channels} = (N+1)^2.

So first order ( $N=1$ ) has $4$ channels, second order ( $N=2$ ) has $9$ , third order ( $N=3$ ) has $16$ , fourth order has $25$ , and so on. Each successive order adds $2N+1$ new harmonics of finer angular detail. The angular "spot size" of the equivalent beam narrows roughly in inverse proportion to the order, so higher order means tighter localization and better separation of nearby sources.

Order $N$	Channels $(N+1)^2$	Min. capsules (practical)	Angular character
1	4	4 (tetrahedron)	Broad, soft images
2	9	$\geq 9$	Noticeably tighter
3	16	16–20	Good separation
4	25	$\geq 25$ (often 32)	Sharp, large sweet area
5–7	36–64	50+	Very sharp; rarely from a single mic
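
The channel arithmetic in the table is easy to reproduce. The helper names below are hypothetical and exist only to make the $(N+1)^2$ and $2N+1$ counts concrete.

```python
def hoa_channel_count(order):
    """Total spherical-harmonic channels up to and including `order`."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2

def harmonics_added_by(order):
    """New harmonics contributed by `order` alone: 2N + 1."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return 2 * order + 1

# (order, total channels, channels added by that order) for orders 1..7
table = [(n, hoa_channel_count(n), harmonics_added_by(n)) for n in range(1, 8)]
```

Summing `harmonics_added_by(n)` from order 0 up to $N$ gives `hoa_channel_count(N)` again, which is a handy consistency check when sizing a multichannel file or bus.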

Spherical microphone arrays

To capture order $N$ you need at least $(N+1)^2$ capsules, distributed as evenly as possible over a rigid or open sphere so that all harmonics up to order $N$ are sampled without bias. In practice arrays use more than the bare minimum to over-determine the fit and improve robustness. The best-known example is the 32-capsule rigid-sphere array (the em32 / Eigenmike type), which supports clean fourth-order capture ( $25$ channels) with the extra capsules providing headroom and noise averaging. The capsules sit flush on a solid sphere a few centimetres in radius; the rigid baffle's known scattering is part of the encoding model.

Conversion from the raw capsule signals to HOA B-format is the higher-order analogue of A-to-B: a matrix of modal beamforming filters, derived from the array geometry and the sphere's scattering, projects the capsule pressures onto each spherical harmonic. Because the higher-order harmonics are increasingly hard to extract (the differences between nearby capsules get tiny), the encoding filters apply large low-frequency boosts to the high-order channels — which amplifies self-noise. This noise penalty is the practical ceiling on usable order, not the channel count alone.

Order is not free

Doubling the spherical-harmonic order does not just add channels — it adds self-noise and radius-dependent band limits. A 32-capsule sphere advertised as "fourth order" delivers clean fourth order only over a limited mid-band; outside it, the effective order drops. Treat the headline order as a best-case, not a guarantee across the whole spectrum.

Spatial Aliasing and the Frequency Range of Arrays

Every microphone array, first-order or high-order, is a spatial sampler: it samples the sound field at discrete capsule positions. Just as time sampling aliases above the Nyquist frequency, spatial sampling aliases above a frequency set by the spacing of the capsules. This single fact governs the usable bandwidth of any Ambisonic mic.

The aliasing frequency

The array behaves correctly while the acoustic wavelength is large compared with the capsule spacing — that is, while the array is effectively coincident. Aliasing sets in when the wavelength shrinks to roughly twice the capsule spacing (or, equivalently, the array radius), so that the spatial pattern can no longer be told apart from a different incoming direction. A useful estimate of the upper limit is

f_{\text{alias}} \approx \frac{c}{2\,d},

where $c \approx 343\,\text{m/s}$ and $d$ is the relevant capsule spacing (for a sphere of radius $a$ capturing order $N$ , the relevant scale is roughly the arc between adjacent capsules, which shrinks as $a/N$ ). The headline consequences:

Small radius → high aliasing frequency but weak low-frequency directionality (the dipole differences vanish into noise at low $f$ ).
Large radius → strong low-frequency directionality but low aliasing frequency (the scene degrades to a duller order in the highs).
Higher order on a fixed radius → higher aliasing frequency, because the arc between capsules gets smaller as you cram in more of them, but also a higher noise floor at low $f$ , because the higher-order channels need ever larger low-frequency boosts.

There is no escaping this trade: the array radius and the target order jointly set the usable band. Above $f_{\text{alias}}$ the high-order channels are not wrong so much as unreliable — they describe a direction the array can no longer distinguish — and good encoders gracefully reduce the effective order with rising frequency.

Worked example: tetrahedral aliasing frequency

Take a first-order tetrahedral mic whose capsules sit at radius $r \approx 1.2\,\text{cm}$ , giving an effective inter-capsule spacing of roughly $d \approx 2\,\text{cm} = 0.02\,\text{m}$ (the edge of the tetrahedron). Then

f_{\text{alias}} \approx \frac{343}{2 \times 0.02} \approx 8.6\,\text{kHz}.

So a typical first-order mic is genuinely coincident and directionally accurate up to roughly $8$ – $10\,\text{kHz}$ ; above that the spacing-correction filters are extrapolating, the directional cues soften, and the mic increasingly behaves like a cluster of slightly spaced cardioids. This is why even excellent first-order recordings can sound a touch "loose" in the top octaves and why manufacturers tune the high-end EQ as much by ear as by theory.

Worked example: a sphere at higher order

Take a rigid-sphere array of radius $a = 4.2\,\text{cm}$ with $32$ capsules. The capsules are spread over the sphere, so the nearest-neighbour arc spacing is on the order of $d \approx 2.5\,\text{cm}$ . That gives

f_{\text{alias}} \approx \frac{343}{2 \times 0.025} \approx 6.9\,\text{kHz}

for the full fourth-order content. Below a few kHz the array delivers clean fourth order; through the upper mids the usable order drops progressively (third, then second), and by the top of the band it is effectively first order. The headline "fourth-order, 32 capsules" is true only in the band where both the aliasing limit and the low-frequency noise limit permit it. A well-designed encoder therefore renders fourth order in the mids, tapering to lower order at the extremes — a frequency-dependent order that the listener should never notice if it is done well.
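
Both worked examples use the same rule of thumb, so they can be checked with one small function. This is a sketch of the estimate $f_{\text{alias}} \approx c/(2d)$ only; the function names are assumptions, and `tetrahedron_edge` uses the standard relation between a regular tetrahedron's edge and the radius of its vertices.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, the value used in the text

def alias_frequency(spacing_m, c=SPEED_OF_SOUND):
    """Rule-of-thumb upper limit f ~ c / (2 d), in Hz."""
    if spacing_m <= 0:
        raise ValueError("spacing must be positive")
    return c / (2.0 * spacing_m)

def tetrahedron_edge(radius_m):
    """Edge length of a regular tetrahedron whose vertices lie at radius_m."""
    return radius_m * math.sqrt(8.0 / 3.0)

f_tetra = alias_frequency(0.02)    # 2 cm spacing: about 8.6 kHz
f_sphere = alias_frequency(0.025)  # 2.5 cm spacing: about 6.9 kHz
edge = tetrahedron_edge(0.012)     # just under 2 cm for r = 1.2 cm
```

The estimate is deliberately coarse: it says where the trouble starts, not how fast performance degrades above that point.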

The single biggest misconception

"More capsules / higher order = better at all frequencies." It is not. Every array has a band of full performance bounded below by self-noise and above by spatial aliasing. Outside that band the effective order collapses. Match the array (radius and order) to the bandwidth and localization sharpness your content actually needs, not to the number on the box.

Processing: Calibration, Conventions, and Rotation

A recorded scene is only as good as its calibration and only as portable as its channel conventions. This section is the bridge between the raw capture and a usable, exchangeable B-format file.

Calibration

Real capsules differ: gain mismatches of a fraction of a decibel and phase mismatches of a few degrees between the four (or thirty-two) capsules are normal. Because B-format is built from differences of capsules, these small mismatches do not average out — they leak. A gain error between LF and LB, for instance, injects a spurious $X$ component even for an omnidirectional source, tilting the whole scene forward or back. Calibration measures each capsule's gain and phase (often per frequency band) and bakes correction into the A-to-B (or modal-beamforming) matrix. Many mics ship with a per-unit calibration file for exactly this reason; using the generic filter set instead of the unit-specific one is a common, avoidable error.
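
The leak is easy to demonstrate numerically. The sketch below assumes ideal cardioids, the sum/difference matrix without spacing filters, and a hypothetical 0.5 dB gain error on the LF capsule; it encodes a source at hard left, where the ideal $X$ is zero.

```python
import math

S = 1.0 / math.sqrt(3.0)
AXES = [(S, S, S), (S, -S, -S), (-S, S, -S), (-S, -S, S)]  # LF, RF, LB, RB

def b_format_xy(source_dir, capsule_errors_db):
    """Return (X, Y) for one plane wave, with a gain error in dB per capsule."""
    a = []
    for axis, err_db in zip(AXES, capsule_errors_db):
        g = 0.5 * (1.0 + sum(u * d for u, d in zip(axis, source_dir)))
        a.append(g * 10.0 ** (err_db / 20.0))
    lf, rf, lb, rb = a
    x = 0.5 * (lf + rf - lb - rb)
    y = 0.5 * (lf - rf + lb - rb)
    return x, y

hard_left = (0.0, 1.0, 0.0)  # x = front, y = left, z = up
x_ok, y_ok = b_format_xy(hard_left, [0.0, 0.0, 0.0, 0.0])
x_bad, y_bad = b_format_xy(hard_left, [0.5, 0.0, 0.0, 0.0])  # LF is 0.5 dB hot

# Apparent azimuth from the horizontal dipoles; 90 degrees is the true value.
az_ok = math.degrees(math.atan2(y_ok, x_ok))
az_bad = math.degrees(math.atan2(y_bad, x_bad))
```

With matched capsules the encoded azimuth is 90 degrees; with the single hot capsule a positive $X$ appears and the image is pulled a couple of degrees toward the front. Real mismatches are frequency dependent, which this scalar sketch ignores.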

B-format conventions: ACN and SN3D

Historically, first-order B-format used the FuMa (Furse-Malham) convention: channel order $W, X, Y, Z$ with the famous $W$ scaled by $1/\sqrt{2}$ so that the four channels had comparable levels. Modern practice — and the format that VR pipelines, game engines and the YouTube 360 standard expect — is AmbiX, which combines two conventions:

ACN (Ambisonic Channel Number) fixes the channel ordering. Each spherical harmonic of order $n$ and degree $m$ gets index $\text{ACN} = n^2 + n + m$ . So the channels run $W$ (ACN 0), then $Y, Z, X$ (ACN 1, 2, 3; note the $Y, Z, X$ order, not $X, Y, Z$ ), then the five second-order harmonics (ACN 4–8), and so on.
SN3D (Schmidt semi-normalized) fixes the gain of each harmonic so that no channel ever exceeds the level of $W$ for a single plane wave, which keeps levels well-behaved as order rises. (The closely related N3D is fully orthonormal and preferred for some DSP; AmbiX uses SN3D.)
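
The two conventions reduce to a few lines of arithmetic. The helpers below are an illustrative sketch with made-up names; they implement the ACN formula given above and the per-order factor $\sqrt{2n+1}$ that relates SN3D to N3D.

```python
import math

def acn_index(order, degree):
    """ACN = n^2 + n + m, with -n <= m <= n."""
    if order < 0 or abs(degree) > order:
        raise ValueError("need order >= 0 and |degree| <= order")
    return order * order + order + degree

def acn_to_order_degree(acn):
    """Inverse mapping: n = floor(sqrt(ACN)), m = ACN - n^2 - n."""
    n = math.isqrt(acn)
    return n, acn - n * n - n

def sn3d_to_n3d_gain(order):
    """Factor that converts an SN3D channel of this order to N3D."""
    return math.sqrt(2 * order + 1)

# First-order names mapped to their ACN slots: Y is m = -1, Z is m = 0, X is m = +1.
first_order = {name: acn_index(1, m) for name, m in (("Y", -1), ("Z", 0), ("X", 1))}
```

`first_order` comes out as Y = 1, Z = 2, X = 3, which is the $W, Y, Z, X$ ordering described above.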

Mixing conventions is the single most common cause of "the scene is rotated, mirrored, or has wrong elevation" problems. A FuMa file fed to an AmbiX decoder, or an $X/Y/Z$ ordering sent to a $Y/Z/X$ decoder, produces a plausible-but-wrong field. Always tag and convert explicitly.
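
For first order, the explicit conversion between the two conventions is small enough to write by hand. This is a sketch for first order only, with hypothetical function names; higher-order FuMa material needs additional per-channel factors that are not covered here.

```python
import math

def fuma_to_ambix_first_order(w, x, y, z):
    """Convert one first-order FuMa sample (W, X, Y, Z) to AmbiX (W, Y, Z, X).

    FuMa stores W scaled by 1/sqrt(2); at first order X, Y and Z already have
    the SN3D gain, so only W is rescaled and the channels are reordered.
    """
    return (w * math.sqrt(2.0), y, z, x)

def ambix_to_fuma_first_order(w, y, z, x):
    """Inverse: AmbiX (W, Y, Z, X) back to FuMa (W, X, Y, Z)."""
    return (w / math.sqrt(2.0), x, y, z)

# Front-left source at 45 degrees azimuth, expressed in FuMa.
fuma = (1.0 / math.sqrt(2.0),
        math.cos(math.radians(45.0)),
        math.sin(math.radians(45.0)),
        0.0)
ambix = fuma_to_ambix_first_order(*fuma)
```

Applying the function per sample (or per block, channel by channel) is all a first-order converter has to do; the important part is knowing which convention the file started in.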

Convention mismatch ruins the scene

The most frequent post-production failure with recorded Ambisonics is not a bad mic — it is feeding a B-format file in one convention (say FuMa, $WXYZ$ , $W$ at $-3\,\text{dB}$ ) to a tool expecting another (AmbiX, ACN/SN3D, $WYZX$ ). Symptoms: front-back flips, left-right mirrors, tilted horizon. Confirm channel order and normalization before you decode.

Rotation, tilt, and zoom in post

Here lies the standout advantage of scene-based capture. Because B-format is a representation of the whole field in an orthogonal basis, you can rotate the entire recorded scene after the fact by applying a rotation matrix to the channels — and the result is acoustically equivalent to having physically turned the microphone. For first order this is a simple $3\times 3$ rotation acting on $(X, Y, Z)$ (with $W$ untouched); for higher orders it is a block-diagonal matrix of larger rotation blocks per order. A yaw of $\psi$ about the vertical axis, for example, mixes $X$ and $Y$ :

\begin{aligned} X' &= X\cos\psi + Y\sin\psi,\\ Y' &= -X\sin\psi + Y\cos\psi,\\ Z' &= Z, \quad W' = W. \end{aligned}

This is the recording-side embodiment of the rotation advantage discussed in the Ambisonics chapter: in 360 video you slave this rotation to the viewer's head-tracking so the soundscape stays locked to the visual world as they look around. "Zoom" controls (front-back emphasis) and dominance transforms similarly reweight the harmonics to push the apparent perspective forward or back. None of this is possible with a fixed channel-based recording: you cannot re-aim a finished 5.1 mix by matrix multiplication, but you can re-aim a B-format scene perfectly.
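
The yaw equations translate directly into code. The sketch below applies them to a single first-order sample; the function names are assumptions, and a real implementation would process whole buffers and interpolate $\psi$ between head-tracker updates.

```python
import math

def rotate_yaw(w, x, y, z, psi_deg):
    """Apply the yaw equations above; W and Z pass through unchanged."""
    psi = math.radians(psi_deg)
    c, s = math.cos(psi), math.sin(psi)
    return (w, x * c + y * s, -x * s + y * c, z)

def encoded_azimuth_deg(x, y):
    """Azimuth implied by the horizontal dipoles (positive = left)."""
    return math.degrees(math.atan2(y, x))

# Source encoded at +45 degrees, then yawed with psi = -30 degrees.
w, x, y, z = 1.0, math.cos(math.radians(45.0)), math.sin(math.radians(45.0)), 0.0
w2, x2, y2, z2 = rotate_yaw(w, x, y, z, -30.0)
new_azimuth = encoded_azimuth_deg(x2, y2)
```

With this sign convention a source encoded at azimuth $\theta$ ends up at $\theta - \psi$ , so the example moves from $45^\circ$ to $75^\circ$ . The dipole energy $X^2 + Y^2$ is unchanged, as expected of a pure rotation.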

Binaural and Loudspeaker Decoding of Recorded Scenes

A B-format scene is inert until decoded. Because the recording is layout-agnostic, the same file drives wildly different reproduction systems — the dual of the multi-layout panning described in the techniques part.

Loudspeaker decoding

To play a recorded scene over loudspeakers you apply an Ambisonic decoder matrix matched to the speaker layout, exactly as in /guide/techniques/ambisonics. Each loudspeaker feed is a weighted sum of the B-format channels, the weights chosen so the reconstructed field's velocity and energy vectors point at the intended direction. A first-order scene decodes to a square, a 5.0/7.0 ring, a cube, or a full dome with nothing more than a change of decoder; a fourth-order scene decodes to large rigs with a wide sweet area. Importantly, the decoder can be a mode-matching decoder for small/regular arrays or an All-Round Ambisonic Decoder (AllRAD) for irregular ones — and for large arrays the decode philosophy connects to wave field synthesis, which can be seen as the high-density, near-field-correct limit of driving many loudspeakers from a sound-field description.
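
The velocity and energy vectors mentioned above can be computed from any set of speaker gains. The sketch below uses the usual Gerzon definitions (gain-weighted and squared-gain-weighted averages of the speaker directions) for a horizontal layout; the feed values are hypothetical numbers chosen to favour the front-left speaker.

```python
import math

def gerzon_vectors(gains, azimuths_deg):
    """Return (rV, rE), each as a (magnitude, azimuth_deg) pair.

    rV weights each speaker direction by its gain, rE by its squared gain.
    """
    def weighted(weights):
        total = sum(weights)
        vx = sum(wt * math.cos(math.radians(a))
                 for wt, a in zip(weights, azimuths_deg)) / total
        vy = sum(wt * math.sin(math.radians(a))
                 for wt, a in zip(weights, azimuths_deg)) / total
        return math.hypot(vx, vy), math.degrees(math.atan2(vy, vx))
    return weighted(gains), weighted([g * g for g in gains])

# Hypothetical feeds for a square layout: FL, FR, RL, RR.
square = [45.0, -45.0, 135.0, -135.0]
feeds = [0.60, 0.25, 0.25, -0.10]
(rv_mag, rv_az), (re_mag, re_az) = gerzon_vectors(feeds, square)
```

Both vectors point at $45^\circ$ for these feeds. Their magnitudes are below $1$ , which is the numerical signature of the broad first-order image described earlier.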

Binaural decoding

For headphones — the dominant case in VR — the scene is decoded binaurally (see /guide/techniques/binaural). The elegant method is virtual loudspeaker decoding: decode the B-format to a dense set of virtual speaker directions, then convolve each virtual speaker with the HRTF for its direction and sum to two ears. Equivalently, one precomputes a set of spherical-harmonic-domain HRTF filters so the binaural signal is a direct weighted sum of the B-format channels — one filter pair per Ambisonic channel — which is far cheaper. When the listener turns their head, you first apply the rotation matrix above to the scene, then the fixed binaural filters; the soundscape stays world-locked with very low latency. This rotate-then-render chain is the single most important reason scene-based capture won VR audio.

Decode once you know the destination

Keep your masters in B-format (AmbiX) for as long as possible and decode last, per target: binaural for headphones, a ring/dome decoder for speakers. The same scene, one rotation engine, many outputs. This is the practical embodiment of "capture the field, commit to a layout late."

Pros and Cons vs Coincident, Spaced, and Main Arrays

How does scene-based capture stack up against the stereo and surround techniques from the earlier chapters? The honest answer is that it trades raw sound quality and image precision for flexibility and rotatability.

Property	Tetrahedral / HOA (scene-based)	Coincident pair (XY/MS, stereo techniques)	Spaced pair / main array (surround)
Direction cue used	Level pattern across SH channels (intensity)	Level difference (intensity)	Time + level difference
Mono compatibility	Excellent ( $W$ is true omni)	Excellent	Poor (comb filtering on sum)
Layout flexibility	Total — any layout by decode	Fixed to the pair	Fixed to the array
Rotatable in post	Yes, exactly	No	No
First-order localization	Soft / broad	Sharp	Sharp + spacious
Envelopment / spaciousness	Good (full field), order-limited	Modest	Excellent
Channel count	$(N+1)^2$	2	4–12+
Self-noise	Higher (esp. HOA)	Low	Low
Best for	VR, 360, rotatable content	Stable stereo imaging	Immersive room sound

When scene-based capture wins

Scene-based capture is the clear winner whenever the listener's orientation is not fixed at capture time: 360-degree video, VR and AR, and any rotatable or interactive content. It is also unbeatable when you must deliver one recording to many layouts (headphones, 5.1, 7.1.4, dome) without re-recording, and when you need a true point-of-view ambience that head-tracks. For room tone, atmospheres, foley ambiences and immersive backgrounds it is superb because $W$ gives a genuine mono-compatible omni and the field rotates to match picture.

When the older techniques win

For a front-stage classical or jazz recording auditioned on a fixed stereo or 5.1 system, a well-judged coincident pair or a spaced main array (Decca tree, ORTF-surround, the 3D arrays in /guide/recording/immersive-3d-recording) usually beats a first-order tetrahedral mic on image sharpness, depth and sheer envelopment, because spaced microphones exploit time cues that a coincident sound-field mic, by construction, does not capture well. Many engineers therefore use the tetrahedral mic as a room/ambience component blended with spot mics and a main array — getting the best of both worlds: precise direct sound from the main system, rotatable enveloping ambience from the scene mic.

Hybrid is normal

In professional immersive work the Ambisonic mic is rarely the only mic. It is commonly the room/atmosphere layer, while spot and main microphones carry the sharp foreground. Scene-based and channel-based capture are complementary, not rivals.

Worked Example: From Tetrahedral Capture to a Decoded Square

Let us trace one full pipeline end to end: a tetrahedral mic in a room, a single source in front-left, through to four loudspeakers.

Step 1 — Capture (A-format). A source sits at azimuth $\theta = +45^\circ$ (front-left), $\phi = 0^\circ$ . The four capsules each output a tilted-cardioid pressure: LF, which leans front-left, sees it strongest; RB, which leans rear-right, sees it weakest; RF and LB fall in between at equal level. The raw four channels are A-format and look like four similar-but-shifted versions of the source plus the room.

Step 2 — A-to-B conversion. Run A-format through the calibrated converter (matrix + spacing filters). Using the sum/difference matrix and then the correction EQ, we obtain B-format. For a source at $\theta = 45^\circ, \phi = 0^\circ$ the ideal first-order coefficients (SN3D) are

W = 1,\quad Y = \sin 45^\circ \approx 0.707,\quad Z = 0,\quad X = \cos 45^\circ \approx 0.707.

So the scene encodes the front-left source as full omni plus equal positive $X$ (front) and $Y$ (left) dipole content, zero $Z$ — geometrically a unit vector pointing front-left in the horizontal plane. The room reflections arrive from many directions and populate $X, Y, Z$ with their own time-varying pattern; that is the enveloping ambience we wanted.

Step 3 — (Optional) rotate. Suppose the picture is later re-framed so "front" should move $30^\circ$ to the right. Apply the yaw matrix with $\psi = -30^\circ$ to $(X, Y)$ ; the source's encoded direction swings accordingly, without re-recording. Leave $\psi = 0$ for now.

Step 4 — Decode to a square. Place four speakers at $\theta = \pm 45^\circ$ (front L/R) and $\theta = \pm 135^\circ$ (rear L/R), all at $\phi = 0$ . A simple first-order decoder with max- $r_E$ weighting forms each speaker feed as

S_i = \tfrac{1}{4}\Bigl(W + \sqrt{2}\,(X\cos\theta_i + Y\sin\theta_i)\Bigr),

evaluated at each speaker azimuth $\theta_i$ . The $\sqrt{2}$ is the dipole gain of $2$ that a basic horizontal decoder would apply to SN3D input, scaled by the first-order max- $r_E$ weight $\cos 45^\circ$ . Plugging in our source coefficients ( $W=1, X=Y=0.707$ ):

Front-left speaker ( $\theta = 45^\circ$ ): $X\cos45^\circ + Y\sin45^\circ = 0.707(0.707)+0.707(0.707) = 1.0$ , so $S \propto W + \sqrt{2}(1.0)$ — the largest feed.
Front-right ( $\theta=-45^\circ$ ): $0.707(0.707) + 0.707(-0.707) = 0$ , so $S \propto W$ only — moderate.
Rear-left ( $\theta=135^\circ$ ): $0.707(-0.707)+0.707(0.707)=0$ — moderate.
Rear-right ( $\theta=-135^\circ$ ): $0.707(-0.707)+0.707(-0.707) = -1.0$ — smallest (and out of phase).

The front-left speaker dominates, the two adjacent speakers fill, the opposite corner is suppressed: the phantom image lands at front-left, exactly where the source was. The room ambience, carried by the fluctuating $X, Y, Z$ , spreads across all four speakers and envelops the listener. We have gone from a four-capsule cluster to a layout-specific four-speaker reproduction — and could just as easily have decoded the same B-format to headphones (binaural) or a dome by swapping only the decoder. That swap-the-decoder property is the entire point.
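
The four feeds in Step 4 can be reproduced with a direct transcription of the decoder formula. This is a sketch of that one formula for a horizontal square, with a hypothetical function name; it is not a general-purpose decoder.

```python
import math

def decode_square(w, x, y, azimuths_deg=(45.0, -45.0, 135.0, -135.0)):
    """Feeds S_i = (W + sqrt(2) * (X cos(theta_i) + Y sin(theta_i))) / 4."""
    feeds = []
    for az in azimuths_deg:
        t = math.radians(az)
        feeds.append(0.25 * (w + math.sqrt(2.0) * (x * math.cos(t) + y * math.sin(t))))
    return feeds

# Source at 45 degrees azimuth, SN3D coefficients from Step 2.
w, x, y = 1.0, math.cos(math.radians(45.0)), math.sin(math.radians(45.0))
front_left, front_right, rear_left, rear_right = decode_square(w, x, y)
# front_left is about 0.604, front_right and rear_left are 0.25,
# rear_right is about -0.104 (small and inverted).
```

The four feeds sum to $W$ , and the small negative rear-right feed is the "out of phase" component noted above.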

Common Mistakes and Limits

Common mistakes

Skipping or mismatching A-to-B calibration. Using the generic filter set instead of the per-unit calibration, or forgetting the spacing-correction filters entirely, smears high frequencies and tilts the scene. Always use the unit's calibration data.

Convention confusion (FuMa vs AmbiX, ACN vs WXYZ, SN3D vs N3D). As covered above, this is the number-one cause of rotated, mirrored or wrong-elevation scenes. Label every B-format file with its convention and convert deliberately.

Expecting high-order sharpness from a small/low-order mic. A first-order tetrahedral mic will never localize as sharply as a coincident XY pair on a fixed stereo system; that is physics, not a defect. Choose the order to match the sharpness you need.

Treating headline order as full-band. As the aliasing examples show, the advertised order holds only over a band. Do not expect fourth-order precision at $12\,\text{kHz}$ from a $4\,\text{cm}$ sphere.

Wind, handling and proximity. Closely spaced capsules and the differencing math make Ambisonic mics exquisitely sensitive to wind, low-frequency rumble and handling noise — these corrupt the delicate dipole differences. Use proper blimps and shockmounts outdoors, and high-pass cautiously (it interacts with the dipole low-frequency correction).

Placing the mic at the wrong distance or in a bad room. Because the mic captures the whole field, including all reflections, a poor room or excessive distance is captured faithfully; there is no directional "spotlight" to reject it. Distance and room choice matter as much as for any other technique (see distance and air and reverberation).
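To make the convention-confusion item above concrete, here is a hedged sketch of a deliberate first-order conversion between FuMa and AmbiX. It relies on two widely documented differences at first order: FuMa orders the channels W, X, Y, Z and carries W attenuated by 1/√2, while AmbiX uses ACN order (W, Y, Z, X) with SN3D normalization. The function names are illustrative and do not come from any particular library, and the sketch handles a single sample frame only. Higher orders need per-channel gain tables that are not shown here.

```python
import math

SQRT2 = math.sqrt(2.0)

def fuma_to_ambix_first_order(w, x, y, z):
    """Convert one first-order FuMa frame (W, X, Y, Z) to AmbiX.

    AmbiX uses ACN channel order (W, Y, Z, X) and SN3D normalization.
    At first order only W changes gain: FuMa W is 3 dB lower, so it is
    multiplied by sqrt(2). X, Y and Z are reordered but not rescaled.
    """
    return (w * SQRT2, y, z, x)

def ambix_to_fuma_first_order(acn0, acn1, acn2, acn3):
    """Inverse of the conversion above: AmbiX (W, Y, Z, X) to FuMa (W, X, Y, Z)."""
    return (acn0 / SQRT2, acn3, acn1, acn2)

# Illustrative frame, not measured data.
fuma_frame = (0.5, 0.1, 0.2, 0.3)
ambix_frame = fuma_to_ambix_first_order(*fuma_frame)
round_trip = ambix_to_fuma_first_order(*ambix_frame)
```

Feeding a FuMa file to an AmbiX decoder without this step swaps the X, Y and Z axes among themselves and leaves W 3 dB low, which is how rotated or wrong-elevation scenes arise.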

Limits

Resolution is bounded by order, which is bounded by physics. For a given order, amplified self-noise sets the lowest usable frequency and spatial aliasing sets the highest. You cannot have arbitrarily sharp localization from a compact array across the full band.

Single point of view. A scene mic captures the field at one point. It captures direction beautifully but not the full distance and translation cues you would need to walk through the scene with six degrees of freedom (6DoF). True walkable audio needs multiple arrays or parametric reconstruction, a frontier topic beyond first-order capture.

Time-cue spaciousness is limited. Being a coincident (intensity) technique, it lacks the inter-channel time differences that give spaced arrays their enveloping width; first-order scenes can sound narrower or "in the head" compared with a good spaced 3D array until decoded to many speakers.

Self-noise and channel count grow fast. Higher order means many channels (a fourth-order file is $25$ channels), heavy encoding boosts, and storage/processing cost — a practical limit for live and broadcast pipelines.
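The channel growth in the last item follows directly from the (N+1)² rule. The sketch below tabulates channel count per order and, as a rough indication of pipeline cost, the uncompressed data rate. The 48 kHz sample rate and 24-bit depth are assumptions chosen for illustration, not figures from this chapter, and the function names are hypothetical.

```python
def ambisonic_channel_count(order):
    """Number of channels in a full 3D Ambisonic signal of the given order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2

def raw_data_rate_mbit_s(order, sample_rate_hz=48_000, bits_per_sample=24):
    """Uncompressed PCM data rate in Mbit/s (sample rate and depth are assumed)."""
    bits_per_second = ambisonic_channel_count(order) * sample_rate_hz * bits_per_sample
    return bits_per_second / 1e6

for order in range(1, 6):
    channels = ambisonic_channel_count(order)
    rate = raw_data_rate_mbit_s(order)
    print(f"order {order}: {channels:2d} channels, {rate:.1f} Mbit/s")
```

Under these assumptions a fourth-order scene carries 25 channels, a little over six times the channel count and raw data rate of a first-order one.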

Where this fits in the pipeline

A calibrated B-format scene is the natural feedstock for any Ambisonic renderer, including real-time immersive engines such as the DAM Audio ISE (Immersive Sound Engine, dam-audio.com/research/ise-immersive-sound-engine). Capture once as a layout-agnostic scene; rotate, mix and decode downstream to whatever the venue or headset demands.

Summary

Scene-based recording captures the sound field at a point as a set of spherical-harmonic channels (B-format) that commit to no loudspeaker layout until decode time. The tetrahedral microphone realizes first order with four cardioid capsules whose A-format output is matrixed and spacing-corrected into $W, X, Y, Z$; spherical arrays of at least $(N+1)^2$ capsules extend this to higher orders with sharper localization at the cost of self-noise and a narrower aliasing-limited band. The whole approach is an intensity technique in the spirit of coincident stereo, with direction encoded as a level pattern across channels, but lifted into a full 3D basis that can be rotated, zoomed and decoded to speakers, domes, WFS arrays or binaural headphones from a single recording. That deferral of the layout decision, and the exact post-hoc rotation it enables, is why scene-based capture is the backbone of VR, 360 video and any rotatable immersive content. It is also why, for fixed front-stage work, it still often plays a supporting role beside the sharper coincident and spaced techniques of the previous chapters.

References

Gerzon, M. A. (1975). "The Design of Precisely Coincident Microphone Arrays for Stereo and Surround Sound." Preprint, 50th AES Convention. The foundational paper on coincident sound-field capture and B-format.
Farrar, K. (1979). "Soundfield Microphone." Wireless World, parts 1–2. Engineering description of the tetrahedral capsule arrangement and A-to-B conversion.
Zotter, F. and Frank, M. (2019). Ambisonics: A Practical 3D Audio Theory for Recording, Studio Production, Sound Reinforcement, and Virtual Reality. Springer (open access). Comprehensive modern treatment of encoding, decoding, conventions and arrays.
Daniel, J. (2000). Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimédia. PhD thesis, Université Paris VI. Definitive HOA and near-field/decoding theory.
Bertet, S., Daniel, J., Parizet, E. and Warusfel, O. (2013). "Investigation on Localisation Accuracy for First and Higher Order Ambisonics Reproduction." Acta Acustica united with Acustica, 99(4). Order vs localization accuracy, including microphone capture.
Rumsey, F. (2001). Spatial Audio. Focal Press. Accessible overview placing Ambisonic capture among other spatial techniques.
Eargle, J. (2004). The Microphone Book, 2nd ed. Focal Press. Capsule directivity, coincident-array fundamentals and the M/S sum/difference logic generalized here.
Moreau, S., Daniel, J. and Bertet, S. (2006). "3D Sound Field Recording with Higher Order Ambisonics — Objective Measurements and Validation of a 4th-Order Spherical Microphone." Preprint, 120th AES Convention. Spherical-array encoding, aliasing limits and order-vs-frequency behaviour.
