Ambisonics

Every spatialization method in this part of the guide shares the same skeleton: encode a sound scene into a finite representation, then decode that representation to whatever loudspeakers or headphones happen to be present. Amplitude panning (see /guide/techniques/amplitude-panning) makes the encode and the decode the same step — you choose the speaker gains directly. Object-based audio (see /guide/techniques/object-based) keeps positions abstract and defers the decode to a renderer. Ambisonics takes the abstraction one step further than either: it does not store sources or even directions, it stores the sound field itself, sampled at a single point and expanded into a series of spatial basis functions. Nothing in that representation knows how many speakers you own or where they are. The speaker layout enters only at decode time, through a matrix.

That single design decision — describe the field, not the channels — is what makes Ambisonics rotatable, layout-independent, and the natural carrier for head-tracked virtual reality and 360-degree video. It is also what makes it subtle: the quality you hear depends far more on the decoder than on the recording, and the elegant mathematics hides a number of practical traps. This chapter builds the whole system from first principles, with worked numbers at each step.

The scene-based idea: describe the field at a point

Imagine you stand at one spot in a concert hall and ask a precise physical question: what is the acoustic pressure, as a function of time, at my exact location, and from which directions is the sound arriving? The pressure at the point is a single number, but the directional information — sound coming from the violins on the left, the timpani behind, the reflection off the ceiling — lives on the sphere of directions surrounding the point.

A function defined on the surface of a sphere can be expanded in a series, exactly as a function on a circle can be expanded in a Fourier series of sines and cosines. The natural basis functions on the sphere are the spherical harmonics $Y_{n}^{m}(\theta, \phi)$ , indexed by an order $n = 0, 1, 2, \dots$ and a degree $m$ running from $-n$ to $+n$ . Direction is given by azimuth $\theta$ and elevation $\phi$ (conventions vary; here azimuth is measured counter-clockwise from front, elevation up from the horizon). Any directional sound field $f(\theta,\phi)$ can be written

f(\theta,\phi) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} B_{n}^{m}\, Y_{n}^{m}(\theta,\phi).

The coefficients $B_{n}^{m}$ are the Ambisonic signals. They are ordinary audio channels — one mono waveform per coefficient. The order-zero term $B_{0}^{0}$ is just the omnidirectional pressure (a constant on the sphere); the first-order terms describe a left-right, front-back and up-down gradient; higher orders carve the sphere into finer and finer lobes.

This is the scene-based paradigm. The representation is independent of the reproduction system in the same way that a Fourier spectrum is independent of the loudspeaker that will eventually play it. Truncating the infinite sum at some finite order $N$ gives a finite, fixed channel count that approximates the field with a spatial resolution set by $N$ . Everything that follows is consequence and engineering: how to pick the basis precisely (conventions), how to put a source into the coefficients (encoding), and how to get loudspeaker feeds back out (decoding).

Why a point expansion can fill a room

A reasonable worry: if we only describe the field at one point, how can a whole audience hear it correctly? The honest answer is that we cannot, exactly — the field is reconstructed perfectly only in a small region around the expansion point, and that region shrinks as frequency rises. This is the sweet spot problem in its most rigorous form, and we return to it under Limits. For now, hold the idea that order $N$ buys you both finer angular detail and a larger region of valid reconstruction.

First-order B-format: W, X, Y, Z

The historical and conceptual core of Ambisonics is first-order ( $N=1$ ), classically called B-format, with four channels named W, X, Y, Z.

$W$ is the order-zero coefficient: the omnidirectional pressure. A pressure microphone (zero directivity) captures it. It answers "how much total sound is here."
$X$ , $Y$ , $Z$ are the three first-order coefficients: the components of the particle-velocity vector, equivalently the outputs of three figure-of-eight (bidirectional) microphones pointing along the front-back, left-right and up-down axes. $X$ is positive for sound from the front and negative from the rear; $Y$ positive from the left; $Z$ positive from above.

The directivity patterns make the picture concrete. A figure-of-eight pointing forward has gain $\cos\theta\cos\phi$ as a function of direction — unity at the front, zero at the sides, $-1$ at the rear. So the four classic patterns, as functions of source direction, are

W = 1, \qquad X = \cos\theta\cos\phi, \qquad Y = \sin\theta\cos\phi, \qquad Z = \sin\phi .

These are precisely the (suitably scaled) spherical harmonics of orders 0 and 1. The omni $W$ is $Y_0^0$ ; the three dipoles are $Y_1^{1}, Y_1^{-1}, Y_1^{0}$ . The "harmonic picture by order" is therefore literal: order 0 is a monopole (a sphere of constant value), order 1 is a set of dipoles (each a pair of opposite lobes), order 2 is quadrupoles (four-lobed clover patterns), and so on. Each added order multiplies the number of lobes and so the angular sharpness with which a direction can be represented.

What each channel does perceptually

The pressure channel $W$ carries the energy; the velocity channels $X, Y, Z$ carry the direction. The relation between them is the acoustic intensity, $\mathbf{I} \propto W \cdot (X, Y, Z)$ , which points in the direction of net energy flow. The auditory system localizes largely by interaural cues derived from exactly this kind of directional energy flow at low frequencies (see /guide/fundamentals/psychoacoustics). This is why first-order Ambisonics, despite its coarseness, already produces a usable sense of direction: it gets the first-order intensity vector right at the listening point.

The historical thread

Ambisonics grew out of work by Michael Gerzon and colleagues at Oxford in the early 1970s, generalizing Blumlein's stereo (see /guide/techniques/stereo) and Duane Cooper's matrix ideas into a coherent, mathematically grounded, periphonic (full-sphere) system. Gerzon's insight was to treat reproduction as the recovery of the pressure and velocity at the listener, and to derive speaker gains that make the reconstructed intensity vector match the original. The commercial UHJ carrier and the Soundfield microphone came from the same circle. First-order remained the practical ceiling for decades because four channels were all that storage and microphones could comfortably deliver; the higher-order generalization came later, principally through the doctoral work of Jérôme Daniel.

Higher orders: channel count, resolution and usable area

Truncating the spherical-harmonic series at order $N$ keeps all coefficients with $n \le N$ . The number of $(n,m)$ pairs is

\text{channels} = \sum_{n=0}^{N} (2n+1) = (N+1)^2 .

So the channel count is $1, 4, 9, 16, 25, \dots$ for orders $0, 1, 2, 3, 4, \dots$ This is full-sphere (periphonic) counting. If you only care about the horizontal plane, you keep just the $|m|=n$ "sectoral" harmonics and the count is $2N+1$ instead — a cheaper, 2D variant sometimes used for horizontal-only rigs.
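The two counting rules are easy to check numerically. The following is a minimal sketch; the function name `channel_count` is hypothetical and chosen only for illustration.

```python
def channel_count(order: int, horizontal_only: bool = False) -> int:
    """Number of Ambisonic channels kept when the series is truncated at `order`."""
    if order < 0:
        raise ValueError("order must be non-negative")
    if horizontal_only:
        # One omni channel plus two sectoral harmonics per order.
        return 2 * order + 1
    # Full sphere: the sum of (2n + 1) for n = 0..N equals (N + 1)^2.
    return (order + 1) ** 2


for order in range(5):
    print(order, channel_count(order), channel_count(order, horizontal_only=True))
```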

Order controls angular resolution. A useful rule of thumb is that an order- $N$ system can synthesize an angular feature roughly as narrow as $\sim 360^\circ / (2N+2)$ , so each order sharpens the effective "spotlight" with which a source is painted onto the sphere. Equivalently, the half-width of the main lobe of a max- $r_E$ panning function scales roughly as $137^\circ/(N+1)$ .

Order also controls the usable area — the radius around the centre within which the field is faithfully reconstructed. The reconstruction is accurate up to a sphere whose radius $r$ satisfies approximately

k r \lesssim N, \qquad k = \frac{2\pi f}{c},

where $k$ is the acoustic wavenumber, $f$ frequency and $c \approx 343\ \text{m/s}$ the speed of sound. Below this radius the truncated series matches the true field; above it, errors grow.

Key takeaway

This single inequality ties together order, frequency and the size of the listening region, and it is the most important quantitative fact about higher-order Ambisonics.

Worked example: how big is the sweet spot?

Take a human head, radius about $r = 0.09\ \text{m}$ (half the roughly 0.18 m spacing between the ears). At what frequency does the reconstruction start to fail for each order? Set $kr = N$ :

f = \frac{N c}{2\pi r} = \frac{N \cdot 343}{2\pi \cdot 0.09} \approx N \times 607\ \text{Hz}.

So first order ( $N=1$ ) reconstructs the field accurately across the head only up to roughly $600\ \text{Hz}$ ; third order up to about $1.8\ \text{kHz}$ ; seventh order up to about $4.2\ \text{kHz}$ . Above those frequencies the two ears no longer sit inside the valid region and the spatial image relies on the more forgiving energy-based behaviour rather than exact reconstruction. This is the rigorous reason higher order helps and why even seventh order does not give pinpoint high-frequency imaging over a large audience.
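The same rule can be turned around to give either the highest faithful frequency for a region of given radius or the radius that is valid at a given frequency. This is a small sketch that assumes the $kr = N$ boundary is taken literally; the helper names are illustrative, not part of any library.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, the value used in the text


def max_valid_frequency(order: int, radius_m: float, c: float = SPEED_OF_SOUND) -> float:
    """Frequency in Hz at which k*r reaches `order` for a region of radius `radius_m`."""
    return order * c / (2.0 * math.pi * radius_m)


def valid_radius(order: int, frequency_hz: float, c: float = SPEED_OF_SOUND) -> float:
    """Radius in metres of the region where k*r <= order at `frequency_hz`."""
    return order * c / (2.0 * math.pi * frequency_hz)


# Reproduce the head-sized example (r = 0.09 m) for the orders discussed.
for order in (1, 2, 3, 4, 5, 7):
    print(order, round(max_valid_frequency(order, 0.09)), "Hz")
```

Calling `valid_radius(3, 1000.0)` answers the inverse question: how far from the centre a third-order field stays within the rule at 1 kHz.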

The table summarizes the trade.

Order $N$	Channels $(N+1)^2$	Horizontal-only $2N+1$	Approx. lobe half-width	Faithful up to ( $r=0.09$ m)
0	1	1	omni	—
1	4	3	$\sim 69^\circ$	$\sim 0.6$ kHz
2	9	5	$\sim 46^\circ$	$\sim 1.2$ kHz
3	16	7	$\sim 34^\circ$	$\sim 1.8$ kHz
4	25	9	$\sim 27^\circ$	$\sim 2.4$ kHz
5	36	11	$\sim 23^\circ$	$\sim 3.0$ kHz
7	64	15	$\sim 17^\circ$	$\sim 4.2$ kHz

The headline cost is the quadratic growth of the channel count: doubling the order roughly quadruples the storage, the transmission bandwidth and the number of convolutions needed for binaural rendering. Production formats therefore cluster at orders 1, 3 and (occasionally) 5–7; orders beyond that appear mainly in research and in DAM Audio's own renderers.

Conventions: ordering, normalization and the format wars

The spherical harmonics are only defined up to two arbitrary choices: what order you list the channels in, and how you scale each one. Different communities chose differently, and the resulting incompatibilities have caused more confusion in Ambisonic practice than any acoustic subtlety. A file is just $(N+1)^2$ channels of audio; without knowing the ordering and normalization you literally cannot interpret it. Interchange therefore requires an explicit convention.

Channel ordering: ACN

The modern standard is ACN (Ambisonic Channel Number), which assigns each $(n,m)$ a single index

\text{ACN} = n^2 + n + m .

So $W$ is ACN 0; the first-order channels are ACN 1, 2, 3 for $(n,m) = (1,-1), (1,0), (1,1)$ , i.e. $Y, Z, X$ in the old letter names. Note the reordering: ACN puts $Y$ before $Z$ before $X$ , which trips up everyone migrating from the historical W, X, Y, Z layout. Order 2 occupies ACN 4–8, order 3 ACN 9–15, and so on, filling each order's block contiguously.
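The ACN formula and its inverse fit in a few lines. This sketch uses hypothetical helper names; the inverse relies on the fact that order $n$ occupies indices $n^2$ to $(n+1)^2 - 1$, so $n$ is the integer square root of the index.

```python
import math


def acn_index(n: int, m: int) -> int:
    """ACN channel index for order n and degree m."""
    if n < 0 or abs(m) > n:
        raise ValueError("require n >= 0 and -n <= m <= n")
    return n * n + n + m


def acn_to_order_degree(acn: int) -> tuple[int, int]:
    """Inverse mapping: recover (n, m) from an ACN index."""
    if acn < 0:
        raise ValueError("ACN index must be non-negative")
    n = math.isqrt(acn)        # order n starts at index n^2
    return n, acn - n * n - n  # degree follows from ACN = n^2 + n + m


# First-order block: (1,-1), (1,0), (1,1) are Y, Z, X in the old letter names.
print([acn_index(1, m) for m in (-1, 0, 1)])
```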

Normalization: N3D vs SN3D

The harmonics can be scaled so that they are fully orthonormal over the sphere (each integrates to unit energy); this is N3D. Alternatively they can be scaled so that no channel's peak gain for a single source exceeds that of $W$ ; this is SN3D (Schmidt semi-normalized). The two are related by a per-channel factor that depends only on the order:

Y^{\text{N3D}}_{n,m} = \sqrt{2n+1}\; Y^{\text{SN3D}}_{n,m}.

N3D is mathematically tidiest (it makes the decoder formulae cleanest because the basis is orthonormal). SN3D keeps all channels in a comparable amplitude range, which is gentler on fixed-point hardware and on metering, and it is the choice of the dominant practical format. A third, older scheme, FuMa (Furse-Malham), is a semi-normalized convention defined only up to third order that additionally scales $W$ by $1/\sqrt{2}$ , a legacy of making the four B-format channels carry comparable energy on tape.
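Because the N3D/SN3D factor depends only on the order, the conversion is a fixed gain per channel. The sketch below assumes the frame is one sample per channel in ACN order; the function names are illustrative.

```python
import math


def sn3d_to_n3d_gains(order: int) -> list[float]:
    """Per-channel gains, in ACN order, that turn SN3D signals into N3D."""
    gains = []
    for acn in range((order + 1) ** 2):
        n = math.isqrt(acn)               # order of this ACN channel
        gains.append(math.sqrt(2 * n + 1))
    return gains


def _order_of(frame: list[float]) -> int:
    order = math.isqrt(len(frame)) - 1
    if (order + 1) ** 2 != len(frame):
        raise ValueError("channel count must be a perfect square")
    return order


def sn3d_to_n3d(frame: list[float]) -> list[float]:
    """Scale one multichannel sample frame from SN3D to N3D."""
    return [g * s for g, s in zip(sn3d_to_n3d_gains(_order_of(frame)), frame)]


def n3d_to_sn3d(frame: list[float]) -> list[float]:
    """Scale one multichannel sample frame from N3D back to SN3D."""
    return [s / g for g, s in zip(sn3d_to_n3d_gains(_order_of(frame)), frame)]
```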

The two combinations you will actually meet

In practice two bundles dominate:

AmbiX: ACN ordering + SN3D normalization. This is the de-facto modern interchange format, used by VR/360 video pipelines, game engines and most current tools.
FuMa (sometimes called "classic" or Gerzon/Malham): the historical W, X, Y, Z ordering with the FuMa normalization, capped at third order. You meet it in older Soundfield recordings and legacy software.

Common mistake

Converting between them is a fixed, per-channel gain plus a reordering — a diagonal matrix and a permutation, no signal processing. But you must do it, and getting a single channel's sign or scale wrong silently rotates or mirrors the whole scene. Always carry the convention with the file.

When DAM Audio's Immersive Sound Engine ingests an Ambisonic bed, the very first thing it does is normalize everything to a single internal convention (ACN/SN3D), precisely to avoid these errors downstream.

Convention	Ordering	Normalization	$W$ scaling	Typical use
AmbiX	ACN	SN3D	none	VR, 360 video, modern tools, interchange
N3D variant	ACN	N3D	none	DSP/research, clean decoder maths
FuMa	W,X,Y,Z (classic)	FuMa (semi)	$\times 1/\sqrt 2$	Legacy Soundfield, old software
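At first order the FuMa/AmbiX conversion is small enough to write out in full. This sketch covers first order only, where the sole scale difference is the $1/\sqrt{2}$ on $W$ and the rest is reordering. Higher-order FuMa channels need additional per-channel factors that are not reproduced here. The function names are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)


def fuma_to_ambix_first_order(w: float, x: float, y: float, z: float) -> tuple[float, float, float, float]:
    """Map FuMa (W, X, Y, Z) to AmbiX (ACN 0..3 = W, Y, Z, X with SN3D scaling)."""
    # Undo the FuMa 1/sqrt(2) scaling on W, then reorder to ACN.
    return (w * SQRT2, y, z, x)


def ambix_to_fuma_first_order(acn0: float, acn1: float, acn2: float, acn3: float) -> tuple[float, float, float, float]:
    """Map AmbiX (ACN 0..3) back to FuMa (W, X, Y, Z)."""
    return (acn0 / SQRT2, acn3, acn1, acn2)
```

Keeping the pair of functions side by side makes the permutation easy to audit, which matters because a swapped channel here does not produce an error, only a rotated or mirrored scene.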

Encoding a mono source

To place a single mono signal $s(t)$ at direction $(\theta_s, \phi_s)$ you simply evaluate every basis function at that direction and use those values as the per-channel gains:

B_{n}^{m}(t) = s(t)\; Y_{n}^{m}(\theta_s, \phi_s).

That is the entire encoder: a vector of $(N+1)^2$ scalar gains, one per Ambisonic channel, multiplied by the source. There is no panning law to invent and no speaker layout involved — the source is written into the field exactly as a real source in that direction would write itself into the coefficients. This is the most elegant encoder in all of spatial audio, and it is why object panners can target Ambisonics so cheaply: an order-3 encoder is 16 multiplies per sample per object.

The first-order (SN3D, in classic letters) gains are

W = 1, \quad X = \cos\theta_s\cos\phi_s, \quad Y = \sin\theta_s\cos\phi_s, \quad Z = \sin\phi_s .

Worked example: a source at 90 degrees, on the horizon

Place a source hard left — azimuth $\theta_s = 90^\circ$ , elevation $\phi_s = 0^\circ$ — and feed it a signal of amplitude 1. Evaluate:

W = 1,

X = \cos 90^\circ \cos 0^\circ = 0\times 1 = 0,

Y = \sin 90^\circ \cos 0^\circ = 1 \times 1 = 1,

Z = \sin 0^\circ = 0 .

So only $W$ and $Y$ are non-zero, both equal to 1; $X$ and $Z$ are exactly zero. This makes complete physical sense: a sound from the left produces full pressure ( $W$ ), a full positive left-right velocity ( $Y$ ), no front-back velocity ( $X=0$ , the source is at the side where the figure-of-eight has its null), and no vertical velocity ( $Z=0$ , it is on the horizon). The pair $(W, Y) = (1, 1)$ encodes "all the energy is flowing from the left," which is exactly what we wanted.

Warning

In AmbiX/ACN terms, the non-zero channels are ACN 0 ( $W$ ) and ACN 1 ( $Y$ ); if you used N3D, the $Y$ channel would carry $\sqrt{3}\approx 1.732$ instead of 1, which is why you must know the normalization before you compare numbers.
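The first-order encoder from this section can be sketched directly. The example assumes the conventions used in this chapter (azimuth counter-clockwise from front, elevation up from the horizon, SN3D gains) and returns the channels in ACN order; the function name is illustrative.

```python
import math


def encode_first_order(sample: float, azimuth_deg: float, elevation_deg: float = 0.0) -> list[float]:
    """Encode one mono sample to first-order Ambisonics, ACN order, SN3D gains."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample
    x = sample * math.cos(az) * math.cos(el)
    y = sample * math.sin(az) * math.cos(el)
    z = sample * math.sin(el)
    return [w, y, z, x]  # ACN 0..3 corresponds to W, Y, Z, X


# Source hard left on the horizon, amplitude 1.
print([round(v, 6) for v in encode_first_order(1.0, 90.0, 0.0)])
```

In floating point the $X$ gain comes out as a tiny number rather than an exact zero, because $\cos 90^\circ$ is evaluated from a rounded value of $\pi/2$.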

Encoding to higher order

At higher order the same source also lights up the second-order channels through $Y_2^m(90^\circ, 0^\circ)$ , third-order channels through $Y_3^m$ , and so on. Each added order narrows the angular "spotlight" the source casts when later decoded, but the encoder is always the same trivial evaluation of the basis at the source direction. A moving source simply has time-varying gains; to avoid zipper noise the gains are interpolated, exactly as in amplitude panning.

Decoding: from field coefficients to loudspeaker feeds

Decoding is where Ambisonics earns or loses its reputation. Given $L$ loudspeakers at directions $(\theta_\ell, \phi_\ell)$ , we want a decoder matrix $\mathbf{D}$ of size $L \times (N+1)^2$ that maps the Ambisonic signal vector $\mathbf{b}$ to loudspeaker feeds $\mathbf{g}$ :

\mathbf{g} = \mathbf{D}\,\mathbf{b}.

The decoder is computed once for a given rig and then applied as a fixed matrix multiply per sample. The art is in choosing $\mathbf{D}$ . Several principled recipes exist, and they trade off differently.

Sampling (projection) decoders

The simplest idea: treat each loudspeaker as a virtual "listening direction" and feed it the field evaluated in that direction. The decoder row for speaker $\ell$ is just the spherical harmonics sampled there:

D_{\ell, (n,m)} = \frac{1}{L}\, Y_n^m(\theta_\ell, \phi_\ell).

This is the sampling or projection decoder (sometimes SAD, sampling Ambisonic decoder). As written it assumes N3D-normalized harmonics and coefficients; with SN3D each order needs an extra factor $2n+1$ . It is exact and well-behaved when the speakers are arranged on a near-uniform spherical sampling (a Platonic solid, a t-design) because then the harmonics are nearly orthogonal over the speaker set. On irregular or partial rigs (a typical cinema array with a dense front and sparse rear, or a horizontal-only ring) it produces uneven loudness and smeared images, because the sampling assumption is violated.

Mode-matching (pseudo-inverse) decoders

A more general approach demands that re-encoding the loudspeaker feeds reproduce the original coefficients. Let $\mathbf{Y}$ be the $(N+1)^2 \times L$ matrix whose columns are the harmonics evaluated at the speaker directions. We want $\mathbf{Y}\,\mathbf{g} = \mathbf{b}$ for any field, which gives the mode-matching decoder as the pseudo-inverse

\mathbf{D} = \mathbf{Y}^{+} = \mathbf{Y}^{\mathsf{T}}\big(\mathbf{Y}\mathbf{Y}^{\mathsf{T}}\big)^{-1}.

For a uniform layout this collapses to the sampling decoder, but for irregular layouts it does the right thing in a least-squares sense. Its danger is numerical: if the layout has large gaps the matrix $\mathbf{Y}\mathbf{Y}^{\mathsf{T}}$ becomes ill-conditioned and the decoder produces wild gains, driving some speakers very hard, often in anti-phase, to synthesize directions where there is no coverage. Regularization or constraints are then needed.
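To make the pseudo-inverse concrete, here is a minimal sketch restricted to the three horizontal first-order channels $W, X, Y$ with unit-peak gains, an assumption made so that $\mathbf{Y}\mathbf{Y}^{\mathsf{T}}$ is only $3 \times 3$. The helper names and the small Gauss-Jordan inverse are illustrative; a production decoder would use a linear-algebra library and regularization.

```python
import math


def harmonics_matrix(azimuths_deg):
    """Rows W, X, Y (horizontal first order); one column per loudspeaker."""
    az = [math.radians(a) for a in azimuths_deg]
    return [[1.0 for _ in az], [math.cos(a) for a in az], [math.sin(a) for a in az]]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    size = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(size)] for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("Y Y^T is singular: the layout cannot support these channels")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for row in range(size):
            if row != col:
                factor = aug[row][col]
                aug[row] = [v - factor * p for v, p in zip(aug[row], aug[col])]
    return [row[size:] for row in aug]


def mode_matching_decoder(azimuths_deg):
    """Return D = Y^T (Y Y^T)^-1 as a list of rows, one row per loudspeaker."""
    y = harmonics_matrix(azimuths_deg)
    k, speakers = len(y), len(y[0])
    gram = [[sum(y[i][s] * y[j][s] for s in range(speakers)) for j in range(k)] for i in range(k)]
    gram_inv = invert(gram)
    return [[sum(y[i][s] * gram_inv[i][j] for i in range(k)) for j in range(k)] for s in range(speakers)]


# A regular square: the result matches a sampling decoder with 2D weights.
for row in mode_matching_decoder([45.0, 135.0, -135.0, -45.0]):
    print([round(v, 4) for v in row])
```

On the square, $\mathbf{Y}\mathbf{Y}^{\mathsf{T}}$ is diagonal, which is the numerical face of the statement that a uniform layout reduces mode matching to sampling. Trying a layout with a large gap shows the entries of $\mathbf{D}$ growing as the conditioning worsens.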

Energy-preserving and AllRAD decoders

For real, imperfect rigs the modern workhorses are energy-preserving decoders (EPAD) and AllRAD (All-Round Ambisonic Decoding, Zotter and Frank). The goal shifts from exact field reconstruction — impossible over an audience anyway — to keeping the total energy of a panned source constant as it moves around the sphere, so loudness does not pump, and keeping the energy localized in the right direction.

AllRAD is especially elegant and robust. It decodes the Ambisonic field onto a large set of virtual loudspeakers arranged on an ideal, uniform spherical grid (a t-design) using a clean sampling decoder, then renders each virtual speaker onto the real loudspeakers with VBAP (vector-base amplitude panning; see /guide/techniques/amplitude-panning). Because the virtual grid is uniform the Ambisonic step is well-conditioned, and because VBAP handles the real layout the result is robust to gaps, irregularity and partial coverage. AllRAD has become the default for irregular cinema and installation rigs precisely because it degrades gracefully.

Dual-band decoding: velocity vs energy vectors

The deepest practical idea in Ambisonic decoding is that the ear localizes by different cues at low and high frequencies, so the decoder should behave differently in two bands (see /guide/fundamentals/psychoacoustics).

Two diagnostic vectors capture this. Given speaker gains $g_\ell$ and unit direction vectors $\mathbf{u}_\ell$ , the velocity vector (the Makita/low-frequency localization vector) is

\mathbf{r}_V = \frac{\sum_\ell g_\ell \mathbf{u}_\ell}{\sum_\ell g_\ell},

and the energy vector (high-frequency localization vector) is

\mathbf{r}_E = \frac{\sum_\ell g_\ell^2 \,\mathbf{u}_\ell}{\sum_\ell g_\ell^2}.

At low frequencies, where the wavelength is large compared with the head and the field is genuinely reconstructed, the ear follows the velocity vector; a good decoder makes $\mathbf{r}_V$ point exactly at the intended source with length 1 (a basic or velocity decoder achieves this). At high frequencies, where reconstruction fails and only energy summation is meaningful, the ear follows the energy vector; here we want $\mathbf{r}_E$ to point in the right direction and to be as long as possible, because $|\mathbf{r}_E|$ near 1 means the energy is tightly concentrated and the image is sharp, while a short $\mathbf{r}_E$ means the energy is spread over many speakers and the image is diffuse.

You cannot maximize both with one set of gains, so practical decoders are dual-band: a crossover (around 400 Hz to 1 kHz, scaled with order and rig size) splits the signal, a basic decoder handles the low band to nail $\mathbf{r}_V$ , and a max- $r_E$ decoder handles the high band to maximize $|\mathbf{r}_E|$ . The max- $r_E$ decoder is obtained by multiplying each order's channels by a weight $a_n$ chosen to maximize the energy-vector length; for order $N$ the optimal weights are

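a_n = P_n\!\left(\cos\frac{137.9^\circ}{N+1.51}\right),

where $P_n$ is the Legendre polynomial of order $n$ and $137.9^\circ/(N+1.51)$ is the standard closed-form approximation to the angle whose cosine is the largest root of $P_{N+1}$ (for first order this gives $a_1 \approx 0.57$ , close to the exact $1/\sqrt{3}$ ). For a horizontal-only ring the corresponding weights are $a_n = \cos\big(n\pi/(2N+2)\big)$ . The two bands are also gain-matched so the overall loudness is flat. This basic/max- $r_E$ dual-band design, due in its modern form to work by Gerzon, Daniel, Heller and others, is why a well-built order-3 decoder sounds dramatically better than a naive one on the same signal.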
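The per-order weights can be computed with a short Legendre recurrence. This sketch uses the widely quoted full-sphere approximation $r_E \approx \cos\big(137.9^\circ/(N+1.51)\big)$, which is an assumption of the example, and hypothetical function names.

```python
import math


def legendre(n: int, x: float) -> float:
    """Legendre polynomial P_n(x) via the three-term (Bonnet) recurrence."""
    if n == 0:
        return 1.0
    previous, current = 1.0, x
    for k in range(1, n):
        # (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x)
        previous, current = current, ((2 * k + 1) * x * current - k * previous) / (k + 1)
    return current


def max_re_weights(order: int) -> list[float]:
    """Weights a_0..a_N for full-sphere max-rE decoding (approximate formula)."""
    r_e = math.cos(math.radians(137.9 / (order + 1.51)))
    return [legendre(n, r_e) for n in range(order + 1)]


for order in (1, 3, 5):
    print(order, [round(a, 3) for a in max_re_weights(order)])
```

Each weight multiplies every channel of its order, so in ACN order the weight $a_n$ is repeated $2n+1$ times along the diagonal of the weighting matrix.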

Worked example: first-order on a 4-speaker square

Put four speakers on the horizontal plane at azimuths $\pm 45^\circ, \pm 135^\circ$ (a square), and decode a first-order signal encoding a source at the front, $\theta_s = 0^\circ$ . The encoded horizontal channels are $W=1, X=\cos 0 = 1, Y=\sin 0 = 0$ . A sampling decoder feeds speaker $\ell$ the value $\tfrac{1}{4}(W + 2X\cos\theta_\ell + 2Y\sin\theta_\ell)$ (the factor 2 restores the dipole energy for unit-peak first-order channels on a horizontal ring). For the two front speakers at $\pm 45^\circ$ , $\cos\theta_\ell = \cos 45^\circ = 0.707$ , so each gets $\tfrac14(1 + 2\cdot1\cdot0.707) = \tfrac14(1+1.414) = 0.604$ . For the two rear speakers at $\pm135^\circ$ , $\cos\theta_\ell = -0.707$ , so each gets $\tfrac14(1 - 1.414) = -0.104$ . The two front speakers dominate and the rears get a small negative (anti-phase) feed, the classic first-order signature. The velocity vector is exact: $\sum g_\ell = 2(0.604) - 2(0.104) = 1$ and $\sum g_\ell\cos\theta_\ell = 0.707\,(1.207 + 0.207) = 1$ , so $|\mathbf{r}_V| = 1$ . Now the energy vector: the squared gains are $0.364$ (front) and $0.011$ (rear), so the denominator is $2(0.364 + 0.011) = 0.75$ and the forward numerator is $2 \cdot 0.707\,(0.364 - 0.011) = 0.50$ , giving $|\mathbf{r}_E| \approx 0.67$ , pointing forward. That sub-unity length is the quantitative statement of first-order's blurry imaging: even with perfect gains the energy is spread across more than one speaker. Now re-run with the max- $r_E$ weight applied to $X,Y$ . For a horizontal ring the weight is $a_n = \cos\big(n\pi/(2N+2)\big)$ , so $a_1 = \cos 45^\circ = 0.707$ . The front feeds become $\tfrac14(1 + 2\cdot 0.707\cdot 0.707) = 0.5$ , the rear feeds become exactly $0$ , and $|\mathbf{r}_E|$ rises to $0.707$ , at the cost of $|\mathbf{r}_V|$ falling from $1$ to $0.707$ . That is the dual-band trade in miniature: the basic weights are right for the low band, the max- $r_E$ weights for the high band.
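The arithmetic of this worked example is easy to re-run. The sketch below recomputes the speaker feeds and both localization vectors for the square, with an optional first-order weight `a1`; the function names are illustrative, and the decoder is the horizontal sampling decoder quoted in the example.

```python
import math

SQUARE = [45.0, -45.0, 135.0, -135.0]  # speaker azimuths in degrees


def decode_ring(w: float, x: float, y: float, azimuths_deg, a1: float = 1.0) -> list[float]:
    """Horizontal first-order sampling decode; a1 scales the first-order channels."""
    count = len(azimuths_deg)
    feeds = []
    for az_deg in azimuths_deg:
        az = math.radians(az_deg)
        feeds.append((w + 2.0 * a1 * (x * math.cos(az) + y * math.sin(az))) / count)
    return feeds


def localization_vector(gains, azimuths_deg, power: int) -> tuple[float, float]:
    """power=1 gives the velocity vector r_V, power=2 the energy vector r_E."""
    weights = [g ** power for g in gains]
    total = sum(weights)
    vx = sum(wt * math.cos(math.radians(a)) for wt, a in zip(weights, azimuths_deg)) / total
    vy = sum(wt * math.sin(math.radians(a)) for wt, a in zip(weights, azimuths_deg)) / total
    return vx, vy


basic = decode_ring(1.0, 1.0, 0.0, SQUARE)
weighted = decode_ring(1.0, 1.0, 0.0, SQUARE, a1=math.cos(math.radians(45.0)))
for name, gains in (("basic", basic), ("max-rE", weighted)):
    r_v = localization_vector(gains, SQUARE, 1)
    r_e = localization_vector(gains, SQUARE, 2)
    print(name, [round(g, 3) for g in gains], round(math.hypot(*r_v), 3), round(math.hypot(*r_e), 3))
```

Changing the source azimuth in the call to `decode_ring` shows how both vector lengths vary as the source moves between and onto the speakers.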

Why decoder design dominates quality

Notice what the encoder did not decide: loudness uniformity, image sharpness, the front-back balance, the crossover, the behaviour on your particular speakers. All of that lives in the decoder.

Key lesson

Two engineers can play the identical Ambisonic file and one hears a crisp, stable, even soundfield while the other hears a lopsided smear, purely because of the decoder matrix and its dual-band weighting. This is the single most important practical lesson of the format and the reason tools expose so many decoder options.

Rotation: turning the whole field with a matrix

Here is the property that made Ambisonics the carrier of choice for VR and 360 video. Because the representation is the field itself in a direction-symmetric basis, rotating the entire scene is a linear operation on the channels — no re-panning of individual sources, no knowledge of what is in the scene at all.

A rotation acts within each order independently: order- $n$ channels are mixed only among themselves by a $(2n+1)\times(2n+1)$ rotation matrix $\mathbf{R}_n$ (a Wigner-D matrix), and the full rotation is block-diagonal:

\mathbf{b}' = \mathbf{R}(\alpha,\beta,\gamma)\,\mathbf{b}, \qquad \mathbf{R} = \mathrm{diag}\big(\mathbf{R}_0, \mathbf{R}_1, \dots, \mathbf{R}_N\big).

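Order 0 ( $W$ ) is invariant: it is a single omnidirectional signal, untouched by rotation, so $\mathbf{R}_0 = 1$ . For first order, compensating a head yaw of $\alpha$ (positive to the left, matching the azimuth convention) is just a 2D rotation of the $X$ and $Y$ channels that moves a source at azimuth $\theta$ to $\theta - \alpha$ :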

\begin{bmatrix} X' \\ Y' \end{bmatrix} = \begin{bmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{bmatrix} \begin{bmatrix} X \\ Y \end{bmatrix}, \qquad Z' = Z .

Worked example: head turn of 30 degrees

A listener with a head tracker turns 30 degrees to the left, which in this chapter's counter-clockwise azimuth convention is a head yaw of $\alpha = +30^\circ$ . To keep the scene fixed in the world the renderer must counter-rotate the field by the same angle, and the matrix above does exactly that when evaluated at the head yaw: it maps a source at azimuth $\theta$ to azimuth $\theta - \alpha$ . Take a source that was encoded dead ahead, $X=1, Y=0$ . After the counter-rotation:

X' = \cos(30^\circ)\cdot 1 + \sin(30^\circ)\cdot 0 = 0.866,

Y' = -\sin(30^\circ)\cdot 1 + \cos(30^\circ)\cdot 0 = -0.5 .

The pair $(X', Y') = (0.866, -0.5)$ encodes a source at azimuth $\operatorname{atan2}(-0.5, 0.866) = -30^\circ$ , i.e. now 30 degrees to the listener's right, exactly compensating the head turn so the source stays put in the room. Crucially this cost four multiplies and two adds per sample and did not depend on how many sources were in the field, whether it was synthetic or microphone-recorded, or how complex the scene was. For a 360 video with a recorded ambient bed of hundreds of effective sources, head-tracked rotation is still just one small matrix multiply per sample. This is why VR audio pipelines so commonly standardize on Ambisonics for the head-locked-to-world transform, and why DAM Audio's renderers perform tracker compensation in the Ambisonic domain before the binaural decode.
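The first-order yaw matrix from this section is short enough to apply by hand in code. This sketch reads $\alpha$ as the listener's head yaw, positive toward the left to match the counter-clockwise azimuth convention, so a source at azimuth $\theta$ ends up at $\theta - \alpha$. The function names are illustrative.

```python
import math


def compensate_yaw(x: float, y: float, head_yaw_deg: float) -> tuple[float, float]:
    """Apply the first-order yaw matrix to the X and Y channels (W and Z are unchanged)."""
    a = math.radians(head_yaw_deg)
    x_new = math.cos(a) * x + math.sin(a) * y
    y_new = -math.sin(a) * x + math.cos(a) * y
    return x_new, y_new


def azimuth_deg(x: float, y: float) -> float:
    """Apparent source azimuth implied by a horizontal (X, Y) pair."""
    return math.degrees(math.atan2(y, x))


# A source encoded hard left (X=0, Y=1) while the head turns 30 degrees left:
x_rot, y_rot = compensate_yaw(0.0, 1.0, 30.0)
print(round(x_rot, 3), round(y_rot, 3), round(azimuth_deg(x_rot, y_rot), 1))
```

The source that was at 90 degrees now sits 60 degrees to the left of the new facing direction, which is what a listener who has turned partway toward it should hear.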

Near-field control and distance: NFC-HOA

The plain encoder above models every source as a plane wave — infinitely far away, arriving from a direction with no curvature to its wavefront. Real nearby sources radiate spherical waves whose curvature carries distance information, and reproducing that curvature is what lets a source appear inside the room rather than on a distant sphere. Near-Field Compensated Higher-Order Ambisonics (NFC-HOA), formalized by Daniel, extends the theory to finite source distance.

The mathematics introduces, for each order $n$ , a radial filter $F_n(kr)$ built from spherical Hankel functions that encodes the wavefront curvature for a source at distance $r$ . The crucial and troublesome fact is that these near-field encoding filters have infinite bass boost: as frequency $\to 0$ , the order- $n$ filter gain grows like

|F_n(kr)| \sim \left(\frac{c}{2\pi f\, r}\right)^{n},

so a first-order near-field component rises at 6 dB/octave toward DC, a second-order at 12 dB/octave, a third-order at 18 dB/octave, and so on.
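To get a feel for the size of the boost, the asymptote can be evaluated in decibels. This sketch implements only the low-frequency limit $(c / 2\pi f r)^n$ quoted above, not the full Hankel-function filter, so it is meaningful only well below $kr = 1$; the function name is hypothetical.

```python
import math


def near_field_boost_db(order: int, frequency_hz: float, distance_m: float, c: float = 343.0) -> float:
    """Low-frequency asymptote of the order-n near-field gain, in dB."""
    if frequency_hz <= 0.0 or distance_m <= 0.0:
        raise ValueError("frequency and distance must be positive")
    return 20.0 * order * math.log10(c / (2.0 * math.pi * frequency_hz * distance_m))


# Source at 0.5 m: asymptotic boost at 20 Hz for orders 1 to 3.
for n in (1, 2, 3):
    print(n, round(near_field_boost_db(n, 20.0, 0.5), 1), "dB")
```

Halving the frequency adds about 6 dB per order, which is the per-octave slope described in the text.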

The bass-boost problem

Encoding a literally close source would demand unbounded sub-bass and overflow any channel. This is the bass-boost problem, and it is fundamental, not a bug: the curvature of a nearby spherical wave genuinely has enormous low-frequency content relative to a plane wave.

The cure is never to store the raw near-field signal but a near-field-compensated one. Conceptually the encoder applies the source-distance filter $F_n(kr_s)$ and the decoder applies the inverse loudspeaker-distance filter $F_n^{-1}(kr_{\text{spk}})$ ; in practice the two are combined so that only their stable, finite ratio is ever applied to stored signals. The stored channels stay bounded because they only ever carry the ratio of source curvature to speaker curvature. The decoder's compensating filters depend on the actual loudspeaker radius, which is one more reason the decode is layout-specific. In practice many productions sidestep all this by staying with plane-wave (far-field) encoding and conveying distance through level, air absorption and reverberation instead (see /guide/field-and-room/distance-and-air and /guide/field-and-room/reverberation), reserving NFC-HOA for cases where genuine wavefront curvature and very-near sources matter.

Capture: Ambisonic microphones and A-to-B conversion

So far we have synthesized B-format by encoding mono sources. The other path is to record a real soundfield directly. You cannot physically build a perfect coincident set of one omni and three figure-of-eights, so Ambisonic microphones use an array of near-coincident capsules and convert.

The classic first-order Ambisonic microphone is the tetrahedral array: four cardioid (or sub-cardioid) capsules arranged on the faces of a tetrahedron, pointing out symmetrically. Their raw outputs are called A-format — four capsule signals, each a directional pickup, none of them a clean harmonic. A fixed linear matrix converts A-format to B-format (W, X, Y, Z). For capsules labelled by their pointing directions, the conversion sums and differences them:

W = \tfrac12(F_{\text{LFU}} + F_{\text{RFD}} + F_{\text{LBD}} + F_{\text{RBU}}),

X = \tfrac12(F_{\text{LFU}} + F_{\text{RFD}} - F_{\text{LBD}} - F_{\text{RBU}}),

and similarly for $Y$ and $Z$ with the appropriate sign patterns (the labels denote Left/Right, Front/Back, Up/Down). Summing all four capsules cancels their directional components and leaves pressure ( $W$ ); differencing the front-pointing from the back-pointing pair leaves the front-back gradient ( $X$ ); and so on. Because the capsules are not perfectly coincident, the conversion also includes frequency-dependent correction filters to compensate for the physical capsule spacing and to flatten the directional response; the spacing sets an upper frequency beyond which the conversion smears, mirroring the $kr \lesssim N$ limit from a microphone rather than a listener standpoint.
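The static part of the conversion can be sketched as a four-by-four sum-and-difference. The $W$ and $X$ rows follow the equations above; the $Y$ and $Z$ rows use the sign patterns implied by the capsule labels (left minus right, up minus down). The frequency-dependent correction filters are deliberately left out, and the function name is illustrative.

```python
def a_to_b_format(lfu: float, rfd: float, lbd: float, rbu: float) -> tuple[float, float, float, float]:
    """Convert one A-format sample frame to B-format (W, X, Y, Z), without correction filters."""
    w = 0.5 * (lfu + rfd + lbd + rbu)  # sum of all capsules: pressure
    x = 0.5 * (lfu + rfd - lbd - rbu)  # front pair minus back pair
    y = 0.5 * (lfu - rfd + lbd - rbu)  # left pair minus right pair
    z = 0.5 * (lfu - rfd - lbd + rbu)  # up pair minus down pair
    return w, x, y, z


# Equal signal on all capsules leaves only the pressure channel.
print(a_to_b_format(1.0, 1.0, 1.0, 1.0))
```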

Higher-order Ambisonic microphones use more capsules on a rigid sphere — for example a 32-capsule array delivers up to fourth order — with an analogous but larger, frequency-dependent encoding matrix, and they carry their own low-frequency and high-frequency limits set by the sphere radius and capsule count. The detailed engineering of microphone choice, placement and calibration belongs to the recording and calibration parts of this guide, named here but covered there; the point for this chapter is that capture produces exactly the same B-format coefficients that synthesis does, so a recorded ambience and a panned object live in one common representation and can be summed, rotated and decoded together.

Binaural decoding of Ambisonics

Most listeners hear Ambisonics not on a speaker dome but on headphones, which means the field must be decoded to two ears. The construction is a special, beautiful case of the general decoder: render to a set of virtual loudspeakers on an ideal grid, then replace each virtual speaker with a pair of head-related transfer functions (HRTFs) for its direction (see /guide/techniques/binaural).

Concretely, pick $V$ virtual loudspeaker directions forming a good spherical grid, build an ordinary (ideally max- $r_E$ ) Ambisonic decoder $\mathbf{D}$ to them, and pre-convolve each virtual speaker with its left/right HRTF $H_{L,v}, H_{R,v}$ . Because all the speaker-domain operations are linear, the whole chain collapses into a single pair of per-channel Ambisonic-to-ear filters: the binaural rendering of the entire field is

\text{Ear}_{L} = \sum_{(n,m)} b_{n}^{m} \ast h_{L}^{(n,m)}, \qquad h_{L}^{(n,m)} = \sum_{v} D_{v,(n,m)}\, H_{L,v},

and likewise for the right ear. There are only $(N+1)^2$ such filters per ear, computed once, so binaural decoding costs $2(N+1)^2$ convolutions regardless of how busy the scene is — 8 convolutions at first order, 32 at third order. This fixed cost, combined with the cheap rotation matrix in front of it, is exactly the architecture behind head-tracked VR audio: rotate the field by the tracker, then convolve with the fixed Ambisonic-HRTF filter bank. The perceptual quality then depends on the HRTF set, the virtual grid density and, once again, the decoder weighting — diffuse-field equalization and max- $r_E$ weighting are applied to keep the timbre neutral and the image as sharp as the order allows. DAM Audio's Immersive Sound Engine uses precisely this virtual-speaker-plus-HRTF route for its headphone monitoring path.
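The collapse of decoder plus HRTFs into one filter per Ambisonic channel can be sketched with plain lists. This example assumes time-domain impulse responses (HRIRs) of equal length, one per virtual speaker, for a single ear; the names `fold_hrirs`, `convolve` and `render_ear` are hypothetical, and a real renderer would use fast (FFT-based) convolution.

```python
def fold_hrirs(decoder, hrirs):
    """Combine decoder gains and per-speaker HRIRs into one FIR per Ambisonic channel.

    decoder[v][c] is the gain from Ambisonic channel c to virtual speaker v.
    hrirs[v] is the impulse response of virtual speaker v for one ear.
    """
    speakers, channels, taps = len(decoder), len(decoder[0]), len(hrirs[0])
    return [
        [sum(decoder[v][c] * hrirs[v][t] for v in range(speakers)) for t in range(taps)]
        for c in range(channels)
    ]


def convolve(signal, kernel):
    """Direct-form linear convolution of two sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def render_ear(ambisonic_channels, filters):
    """Sum of each Ambisonic channel convolved with its folded filter."""
    length = len(ambisonic_channels[0]) + len(filters[0]) - 1
    ear = [0.0] * length
    for channel, fir in zip(ambisonic_channels, filters):
        for i, value in enumerate(convolve(channel, fir)):
            ear[i] += value
    return ear


def convolution_count(order: int) -> int:
    """Convolutions per block for both ears: 2 (N + 1)^2."""
    return 2 * (order + 1) ** 2
```

Running `fold_hrirs` once per ear at start-up is what turns a decode to many virtual speakers into a fixed bank of filters, independent of how many speakers the virtual grid contains.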

Pros, cons and transcoding

What Ambisonics buys you

Layout independence. One master encodes the scene; decoders target stereo, 5.1, 7.1.4, a 24-speaker dome or headphones from the same channels. The scene-based promise of "encode once, decode anywhere" is real here in a way it is not for channel-based formats.
Rotatability. The whole field turns with one block-diagonal matrix, which is why it dominates VR and 360 video and head-tracked binaural.
Mathematical elegance and composability. Encoding is a basis evaluation, mixing is addition, rotation and decoding are matrix multiplies. Recorded and synthetic content combine in one representation.
Graceful scalability. Raise the order for more resolution without changing the paradigm; drop channels to transcode down.

Where it hurts

Sweet-spot shrink. Exact reconstruction holds only within $kr \lesssim N$ , so off-centre listeners and high frequencies suffer; a large audience never all sits in the valid region. Channel-based and object-based approaches can be tuned per-speaker to spread coverage more evenly.
Channel cost. $(N+1)^2$ grows quadratically; high orders are expensive to store, transmit and convolve.
Decoder sensitivity. Quality is dominated by a component the listener often does not control; a bad decoder ruins a good recording, and irregular rigs demand sophisticated decoders (AllRAD, EPAD) to sound even.
Coloration and blur at high frequency. Where the field cannot be reconstructed, summing many speakers tints the timbre and widens images; dual-band decoding mitigates but does not eliminate this.

Transcoding between orders and formats

Because higher orders simply add channels, order reduction is truncation: drop the high-order channels and you have a lower-order, lower-resolution but valid field — an order-5 master plays through an order-1 decoder by keeping only the first four channels. Order increase cannot invent detail; you can re-encode known objects at higher order, but you cannot recover resolution that was never captured. Format conversion (AmbiX to FuMa, N3D to SN3D) is the fixed diagonal-scale-plus-permutation discussed earlier — lossless and instantaneous, but mandatory and unforgiving of sign errors. And a complete Ambisonic master can itself be transcoded to other paradigms: decode it to a 7.1.4 channel bed for a channel-based deliverable, or convert it to binaural for headphones, all from the one scene-based source. This interoperability — Ambisonics as a neutral interchange and processing hub that feeds channel-based, object-based and binaural outputs alike — is increasingly how immersive production pipelines, including DAM Audio's, are organized.
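Order truncation and the N3D/SN3D rescaling are simple enough to sketch directly. The snippet below is a hedged illustration: it assumes ACN channel ordering, stores each channel as a plain list of samples, and uses hypothetical function names. It does not cover FuMa, whose conversion also involves a channel permutation.

```python
import math


def channel_count(order):
    """Number of channels in a full-sphere signal of the given order."""
    return (order + 1) ** 2


def truncate_order(channels, target_order):
    """Keep only the channels up to target_order (ACN ordering assumed)."""
    keep = channel_count(target_order)
    if keep > len(channels):
        raise ValueError("truncation cannot raise the order")
    return channels[:keep]


def acn_order(acn):
    """Order n of the channel at ACN index acn, where acn = n*n + n + m."""
    return math.isqrt(acn)


def sn3d_to_n3d(channels):
    """Scale each ACN channel by sqrt(2n + 1); the ordering is unchanged."""
    return [
        [s * math.sqrt(2 * acn_order(acn) + 1) for s in ch]
        for acn, ch in enumerate(channels)
    ]


def n3d_to_sn3d(channels):
    """Inverse scaling: divide each ACN channel by sqrt(2n + 1)."""
    return [
        [s / math.sqrt(2 * acn_order(acn) + 1) for s in ch]
        for acn, ch in enumerate(channels)
    ]


# A dummy order-5 master: 36 channels of two samples each.
master = [[1.0, -1.0] for _ in range(channel_count(5))]
first_order = truncate_order(master, 1)   # keeps ACN channels 0..3
as_n3d = sn3d_to_n3d(first_order)
round_trip = n3d_to_sn3d(as_n3d)
```

The truncation helper refuses to raise the order, which mirrors the point above: resolution that was never captured cannot be added by a format operation.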

Limits

Three limits deserve to be stated plainly, because they are physical, not implementational.

First, the truncation limit $kr \lesssim N$ : a finite order reconstructs the field only inside a sphere that shrinks with frequency. No decoder beats this; it is set by how many spatial modes you kept. Worked earlier: first order is faithful across a head only to about 600 Hz.
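The rule of thumb $kr \lesssim N$ turns into numbers with $k = 2\pi f / c$. The sketch below assumes a speed of sound of 343 m/s and a head radius of 0.09 m; both values are assumptions chosen for illustration, as are the function names.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed (air at roughly 20 degrees C)
HEAD_RADIUS_M = 0.09    # m, assumed typical head radius


def max_frequency_hz(order, radius_m, c=SPEED_OF_SOUND):
    """Frequency at which k * r = N, with k = 2 * pi * f / c."""
    return order * c / (2.0 * math.pi * radius_m)


def required_order(frequency_hz, radius_m, c=SPEED_OF_SOUND):
    """Smallest integer order N with N >= k * r at this frequency."""
    return math.ceil(2.0 * math.pi * frequency_hz * radius_m / c)


# Upper frequency of head-sized reconstruction for orders 1 to 5.
for order in range(1, 6):
    print(order, round(max_frequency_hz(order, HEAD_RADIUS_M)), "Hz")
```

With these assumptions first order comes out at roughly 607 Hz, in line with the "about 600 Hz" figure, and the limit grows only linearly with order while the channel count grows quadratically.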

Second, the energy-vector ceiling: above the reconstruction band the best you can do is concentrate energy, and $|\mathbf{r}_E|$ stays strictly below 1 for any finite order, so high-frequency images are inherently softer-edged than a single real loudspeaker. Raising the order raises the ceiling, but it never reaches 1, the value for a single point-like source.
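To put numbers on that ceiling, the sketch below assumes the classic full-sphere max- $r_E$ result, in which the attainable $|\mathbf{r}_E|$ at order $N$ equals the largest root of the Legendre polynomial $P_{N+1}$. A different weighting or a horizontal-only rig gives different values. The function names are hypothetical, and the root is found by Newton iteration from a standard starting guess.

```python
import math


def legendre_pair(n, x):
    """Return (P_n(x), P_{n-1}(x)) via the three-term recurrence."""
    if n == 0:
        return 1.0, 0.0
    p_prev, p = 1.0, x  # P_0 and P_1
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p, p_prev


def max_re_3d(order, iterations=50):
    """Largest root of P_{N+1}, taken here as the 3D max-r_E ceiling."""
    n = order + 1
    x = math.cos(math.pi * 0.75 / (n + 0.5))  # starting guess near the root
    for _ in range(iterations):
        p, p_prev = legendre_pair(n, x)
        dp = n * (x * p - p_prev) / (x * x - 1.0)  # derivative of P_n
        x -= p / dp
    return x


ceilings = {order: max_re_3d(order) for order in range(1, 8)}
```

Under this assumption the ceiling is about 0.577 at first order, 0.775 at second and 0.861 at third. It keeps rising with order but stays below 1 at every finite order.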

Third, single-point expansion: the whole edifice describes the field at one location. Sources at different distances are disambiguated only through NFC-HOA's curvature or through level and reverberation cues; without them, everything sits on a single radius. Wave Field Synthesis (see /guide/techniques/wfs) takes the opposite stance — synthesize the field over an extended area using many sources — and is the natural comparison when a large, walkable listening zone matters.

Note: Understanding these limits is what separates using Ambisonics from merely invoking it. The format's elegance is genuine, but it is the decoder, the order and the conventions, not the recording alone, that determine what the listener finally hears.

For the perceptual reasoning behind the velocity/energy split and the localization cues the decoder is optimizing, return to /guide/fundamentals/psychoacoustics; for how this technique sits among the others, see the overview at /guide/techniques.

References

Gerzon, M. A. (1973). "Periphony: With-Height Sound Reproduction." Journal of the Audio Engineering Society, 21(1), 2–10.
Gerzon, M. A. (1985). "Ambisonics in Multichannel Broadcasting and Video." Journal of the Audio Engineering Society, 33(11), 859–871.
Daniel, J. (2000). Représentation de champs acoustiques, application à la transmission et à la reproduction de scènes sonores complexes dans un contexte multimédia. PhD thesis, Université Paris VI.
Daniel, J. (2003). "Spatial Sound Encoding Including Near Field Effect: Introducing Distance Coding Filters and a Viable, New Ambisonic Format." AES 23rd International Conference, Copenhagen.
Zotter, F., & Frank, M. (2019). Ambisonics: A Practical 3D Audio Theory for Recording, Studio Production, Sound Reinforcement, and Virtual Reality. Springer Open.
Zotter, F., & Frank, M. (2012). "All-Round Ambisonic Panning and Decoding." Journal of the Audio Engineering Society, 60(10), 807–820.
Heller, A. J., Lee, R., & Benjamin, E. M. (2008). "Is My Decoder Ambisonic?" AES 125th Convention, San Francisco.
Pulkki, V. (1997). "Virtual Sound Source Positioning Using Vector Base Amplitude Panning." Journal of the Audio Engineering Society, 45(6), 456–466.
Rumsey, F. (2001). Spatial Audio. Focal Press, Oxford.
