Binaural — Headphone Spatialization

Every other technique in this part of the guide asks the same final question: how do we move air at the listener's two ears so that the brain reconstructs a position in space? Stereo, amplitude panning, surround, Ambisonics and Wave Field Synthesis all answer it indirectly: they drive a set of loudspeakers and trust the room and the head to deliver the right pressures to the eardrums. Binaural synthesis answers it directly. It models the path from a source to each eardrum and reproduces the two resulting signals — usually over headphones, where the left and right channels are physically isolated. If we get those two signals exactly right, the listener should hear a source out in the world, externalised and localised, even though only two tiny transducers are involved.

That promise — full three-dimensional sound from two channels — is what makes binaural the natural rendering target for VR, AR, mobile immersive audio and headphone monitoring. It is also why binaural is the hardest technique to get right: it depends on the individual anatomy of the listener in a way no loudspeaker method does. This chapter builds the method from first principles: the acoustic goal, the head-related transfer function that captures it, how we measure and store it, why a stranger's ears never quite fit, and the engineering — head-tracking, room simulation, convolution, headphone equalisation — that turns a good model into a convincing illusion. Throughout, the recurring theme of this part holds: we encode a sound scene into a compact representation, then decode it into ear signals for the system actually present, here a pair of headphones on a particular head.

The goal: reconstruct the eardrum signals

From source to eardrum

Consider a single point source at position $\mathbf{r}$ relative to the listener's head, radiating a signal $s(t)$ . The sound propagates through the air, diffracts around the torso, head and pinnae (outer ears), and arrives at the entrance of each ear canal. Because the two ears occupy different points in space and are shadowed differently by the head, the left and right signals differ — and those differences are precisely the cues the auditory system uses to localise sound (see Psychoacoustics).

The path from source to the left eardrum is, to a good approximation, linear and time-invariant for a fixed geometry. Any LTI path is fully described by its impulse response. Call the left and right paths $h_L(t)$ and $h_R(t)$ . Then the signals at the two ears are convolutions:

p_L(t) = (s * h_L)(t), \qquad p_R(t) = (s * h_R)(t).

If we can deliver exactly $p_L(t)$ to the left eardrum and $p_R(t)$ to the right, the listener's auditory system receives the same input it would receive from the real source. The brain has no way to tell the difference: localisation, externalisation and timbre all follow. This is the entire premise of binaural reproduction, and it is the cleanest example of the "reconstruct the physical stimulus" philosophy in audio.
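As a minimal sketch of the two convolutions above, assuming NumPy is available and using an invented toy impulse-response pair (not a measured HRIR), the ear signals can be computed like this. The function name `render_binaural` is ours, chosen for illustration.

```python
import numpy as np

def render_binaural(s, h_left, h_right):
    """Convolve a mono signal with a left/right impulse-response pair."""
    p_left = np.convolve(s, h_left)     # p_L = s * h_L
    p_right = np.convolve(s, h_right)   # p_R = s * h_R
    return np.stack([p_left, p_right])  # shape (2, len(s) + len(h) - 1)

# Toy, made-up impulse responses of equal length: the right ear hears the
# source two samples later and at half the level of the left ear.
h_left = np.array([1.0, 0.0, 0.0])
h_right = np.array([0.0, 0.0, 0.5])
s = np.array([1.0, -1.0, 0.5])

ears = render_binaural(s, h_left, h_right)
```

The right-ear row is a delayed, attenuated copy of the left-ear row, which is the crudest possible stand-in for the interaural differences a real HRIR pair carries.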

Why headphones are the natural transducer

Headphones place one transducer in front of each ear, with strong acoustic isolation between sides. That isolation is the gift: it lets us deliver $p_L$ to the left ear and $p_R$ to the right with negligible crosstalk — almost none of the left signal leaks to the right ear, and vice versa. Over loudspeakers the same binaural signals are corrupted because each ear hears both speakers; recovering ear-level isolation from speakers requires the crosstalk-cancellation machinery described in Transaural. Headphones make binaural rendering straightforward in principle, leaving us free to concentrate on the harder problem: getting $h_L$ and $h_R$ right.

What "exactly right" demands

The two impulse responses are not arbitrary. Their difference in arrival time encodes the interaural time difference (ITD); their difference in level across frequency encodes the interaural level difference (ILD); and their fine spectral shape, sculpted by the pinna, encodes elevation and front/back. All of this is direction-dependent. To synthesise a moving source, or to support a listener who turns their head, we need a whole family of impulse responses — one pair for every direction. That family is the head-related transfer function, and it is the subject of the next section.

The HRTF and the HRIR

Definition and normalisation

The head-related impulse response (HRIR) is the impulse response of the source-to-ear path for a given direction $(\theta, \phi)$ (azimuth and elevation), measured in anechoic (echo-free) conditions and with the head's own contribution isolated from the loudspeaker, microphone and room. Its Fourier transform is the head-related transfer function (HRTF):

H_{L}(\theta,\phi,f) = \mathcal{F}\!\left\{ h_{L}(\theta,\phi,t) \right\}, \qquad H_{R}(\theta,\phi,f) = \mathcal{F}\!\left\{ h_{R}(\theta,\phi,t) \right\}.

To make the HRTF a property of the listener rather than of the measurement chain, it is defined as a ratio. Let $P_{L}(\theta,\phi,f)$ be the sound pressure at the ear with the head present, and $P_{0}(f)$ the pressure that the same source would produce at the centre of the head with the head absent. Then

H_{L}(\theta,\phi,f) = \frac{P_{L}(\theta,\phi,f)}{P_{0}(f)}.

Dividing by the free-field reference removes the source spectrum, the distance gain and the measurement system, leaving only what the head, torso and pinna do to the sound — exactly the part that carries spatial information.

How the HRIR encodes ITD, ILD and pinna cues

The HRIR is a remarkably compact object — typically a few hundred samples at 44.1 or 48 kHz — yet it packs all three classes of localisation cue into its shape.

Interaural time difference (ITD). For a source off to one side, sound reaches the near ear before the far ear. In the HRIR pair this appears as a relative delay between $h_L$ and $h_R$. A useful first-principles estimate treats the head as a rigid sphere of radius $a$; for a distant source at azimuth $\theta$ (in radians, measured from straight ahead, with $0 \le \theta \le \pi/2$), Woodworth's formula gives

\mathrm{ITD}(\theta) \approx \frac{a}{c}\left(\theta + \sin\theta\right),

with $c \approx 343\ \mathrm{m/s}$ . The term $a\,\theta/c$ is the extra arc the wave travels around the shadowed side; $a\sin\theta/c$ accounts for the straight-line offset. ITD dominates localisation below roughly 1.5 kHz, where the wavelength is long enough that the phase difference is unambiguous.

Interaural level difference (ILD). Above about 1.5 kHz the head casts an acoustic shadow, so the far ear receives less energy. In the HRTF this is a frequency-dependent magnitude difference

\mathrm{ILD}(\theta,\phi,f) = 20\log_{10}\frac{\lvert H_{L}(\theta,\phi,f)\rvert}{\lvert H_{R}(\theta,\phi,f)\rvert}\ \ [\mathrm{dB}],

which can reach 20 dB or more at high frequencies for a lateral source. ILD is the dominant lateral cue above 1.5 kHz, where ITD becomes ambiguous because the wavelength is shorter than the head.
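The ILD definition above can be evaluated directly from an HRIR pair. The following is a sketch under stated assumptions: the pair is already loaded as two NumPy arrays, the FFT size and the small `eps` guard against division by zero are our choices, and the toy pair is invented rather than measured.

```python
import numpy as np

def ild_db(h_left, h_right, n_fft=512, eps=1e-12):
    """ILD in dB per frequency bin: 20*log10(|H_L| / |H_R|)."""
    H_left = np.fft.rfft(h_left, n_fft)
    H_right = np.fft.rfft(h_right, n_fft)
    return 20.0 * np.log10((np.abs(H_left) + eps) / (np.abs(H_right) + eps))

# Toy pair: the right ear receives a delayed copy at one tenth the amplitude,
# so the ILD is a flat 20 dB in favour of the left ear.
h_left = np.array([1.0, 0.0, 0.0, 0.0])
h_right = np.array([0.0, 0.0, 0.1, 0.0])
ild = ild_db(h_left, h_right)
```

A measured pair would give a strongly frequency-dependent curve instead of this flat line; the delay in the toy right-ear response affects only phase, so it does not appear in the ILD.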

Pinna spectral cues. The folds of the outer ear introduce direction-dependent reflections and resonances. A reflection off a pinna ridge of path-length difference $\Delta d$ relative to the direct path arrives a time $\tau = \Delta d / c$ later, and the sum of direct and reflected paths produces a comb-filter notch at

f_{\text{notch}} = \frac{c}{2\,\Delta d} = \frac{1}{2\tau},

with further notches at odd multiples. As elevation changes, $\Delta d$ changes, so the notch frequencies (typically in the 6–10 kHz region) slide. These pinna notches are the principal cue for elevation and for resolving the front/back ambiguity that ITD and ILD alone cannot, because a front and a rear source can share the same interaural differences. The interaural cues sit on the so-called cone of confusion; only the spectral pinna cues break the tie. All three cue families are explained from the perception side in Psychoacoustics; the HRTF is simply their physical encoding.
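The notch formula can be turned into a small helper. This sketch assumes the idealised single-reflection model of the text (one reflection of the same polarity as the direct sound), and the path-length difference used in the example is an illustrative value, not a measured pinna dimension.

```python
def pinna_notch_frequencies(delta_d, c=343.0, f_max=20_000.0):
    """Comb-filter notch frequencies in Hz for one reflection.

    delta_d is the extra path length of the reflection in metres.
    Notches fall at odd multiples of c / (2 * delta_d).
    """
    f_first = c / (2.0 * delta_d)
    notches = []
    k = 0
    while (2 * k + 1) * f_first <= f_max:
        notches.append((2 * k + 1) * f_first)
        k += 1
    return notches

# A 24.5 mm path difference puts the first notch at about 7 kHz;
# the next one (about 21 kHz) is above the 20 kHz limit and is dropped.
notches = pinna_notch_frequencies(0.0245)
```

Shortening `delta_d` raises the notch frequency, which is the sliding behaviour described above as elevation changes.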

A frequency-dependent picture

It helps to keep a mental map of which cue rules which band:

Band	Wavelength vs. head	Dominant cue	Mechanism in HRTF
below ~700 Hz	$\lambda \gg$ head	ITD (phase)	relative delay between HRIRs
700 Hz – 1.5 kHz	$\lambda \sim$ head	ITD + ILD transition	both present, phase still usable
1.5 kHz – 5 kHz	$\lambda <$ head	ILD	magnitude difference grows
5 kHz – 16 kHz	$\lambda \ll$ ear	pinna spectral cues	direction-dependent notches/peaks

This is the duplex theory of Lord Rayleigh (1907) extended upward by the pinna. The HRTF is valuable precisely because it captures all of these simultaneously and consistently for each direction, which no parametric panner does.

Worked example: ITD for a lateral source

Take a head radius $a = 0.0875\ \mathrm{m}$ (a common average) and a source at $\theta = 60^\circ = 1.047\ \mathrm{rad}$. Woodworth gives

\mathrm{ITD} \approx \frac{0.0875}{343}\,(1.047 + \sin 60^\circ) = 2.551\times10^{-4}\,(1.047 + 0.866) = 2.551\times10^{-4}\times1.913.

So $\mathrm{ITD} \approx 4.88\times10^{-4}\ \mathrm{s} = 488\ \mu\mathrm{s}$. At a 48 kHz sample rate that is $488\ \mu\mathrm{s}\times48\ \mathrm{kHz} \approx 23.4$ samples of relative delay between the left and right HRIR. The renderer must reproduce this difference to sub-sample accuracy (by fractional-delay interpolation) if the image is to sit cleanly at $60^\circ$ rather than jumping in coarse steps.
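The same arithmetic as a short function, using only the standard library. It is a sketch of the spherical-head estimate with the values of the worked example as defaults; the function name is ours.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, as in the text

def woodworth_itd(azimuth_deg, head_radius=0.0875, c=SPEED_OF_SOUND):
    """Spherical-head ITD in seconds, for 0 <= azimuth <= 90 degrees."""
    theta = math.radians(azimuth_deg)
    return (head_radius / c) * (theta + math.sin(theta))

itd = woodworth_itd(60.0)
delay_samples = itd * 48_000  # relative delay at a 48 kHz sample rate
print(f"{itd * 1e6:.0f} us, {delay_samples:.1f} samples")
# prints "488 us, 23.4 samples"
```

The fractional part of `delay_samples` is the reason a renderer cannot simply shift one channel by a whole number of samples.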

Measuring HRTFs

The anechoic measurement

The defining HRTF measurement is made in an anechoic chamber so that the only path from loudspeaker to microphone is the direct one — no room reflections to contaminate the impulse response. A small microphone is placed at the blocked ear-canal entrance (the canal is occluded so that only the head/pinna transfer function is captured, not the canal resonance, which is later supplied by the headphone). A loudspeaker is moved to a grid of directions around the subject, often on a motorised arc or a spherical gantry, at a fixed radius (typically 1–2 m, far enough to be in the acoustic far field where the HRTF is distance-independent).

For each direction the system plays a known excitation and recovers the impulse response by deconvolution. Exponential sine sweeps (Farina's method) are preferred: a logarithmic sweep lets the linear impulse response be separated from harmonic distortion, which lands at negative times in the deconvolved result and can be windowed away. The captured pressure is divided by a free-field reference measurement (microphone at head centre, head removed) to yield the normalised HRTF. The result for one subject is a dense set of HRIR pairs — hundreds to thousands of directions, each a few-hundred-tap filter.
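The sweep-and-deconvolve step can be sketched as follows. This is a simplified illustration of the exponential-sweep idea, not a full measurement chain: the sweep parameters are arbitrary, the "system" is an invented delay-and-gain stand-in for the loudspeaker-to-microphone path, and no distortion, noise or free-field normalisation is modelled.

```python
import numpy as np

def exponential_sweep(f1, f2, duration, fs):
    """Return an exponential sine sweep and a matching inverse filter."""
    n = int(round(duration * fs))
    t = np.arange(n) / fs
    rate = np.log(f2 / f1)
    phase = 2.0 * np.pi * f1 * duration / rate * (np.exp(t * rate / duration) - 1.0)
    sweep = np.sin(phase)
    # Time-reverse the sweep and tilt its amplitude so that the low
    # frequencies, where the sweep lingers longest, are attenuated. The
    # convolution sweep * inverse is then roughly a band-limited impulse.
    inverse = sweep[::-1] * np.exp(-t * rate / duration)
    return sweep, inverse

fs = 8000
sweep, inverse = exponential_sweep(50.0, 3500.0, 0.5, fs)

# Invented system under test: 3 samples of delay and a gain of 0.5.
system = np.array([0.0, 0.0, 0.0, 0.5])
recorded = np.convolve(sweep, system)

reference = np.convolve(sweep, inverse)    # chain with no system in it
response = np.convolve(recorded, inverse)  # deconvolved measurement

ref_peak = int(np.argmax(np.abs(reference)))
peak = int(np.argmax(np.abs(response)))
estimated_delay = peak - ref_peak
estimated_gain = abs(response[peak]) / abs(reference[ref_peak])
```

Comparing the deconvolved peak against the reference recovers the delay and gain of the stand-in system; a real measurement would window the response around that peak to discard the distortion products that land before it.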

Dummy heads and KEMAR

Measuring real people is slow and demanding — the subject must stay motionless for the whole grid. A dummy head (manikin) solves this for a representative anatomy. The most influential is KEMAR (Knowles Electronics Manikin for Acoustic Research), introduced by Burkhard and Sachs in 1975 with pinnae and ear-canal dimensions designed to match population medians. KEMAR-based HRTFs, particularly the MIT Media Lab measurement by Gardner and Martin (1994), became a de facto standard for early binaural research and for products that ship a single non-individual HRTF. A dummy head also doubles as a recording device: a binaural microphone placed in its ears captures real scenes directly in binaural form — the basis of binaural field recording, covered in the recording part of this guide.

Public databases

Modern binaural work draws on open multi-subject databases, which is what allows individualization research (next section). The major ones:

Database	Subjects	Notable feature
CIPIC (UC Davis)	45	Includes detailed anthropometric measurements per subject
LISTEN (IRCAM)	51	Measured with morphological data, widely used in research
ARI (Austrian Acad. of Sciences)	200+	Large, high-resolution grids, in-the-canal and blocked variants
SADIE / SADIE II (York)	20 + KEMAR	High spatial resolution, designed for Ambisonic decoding to binaural
RIEC, BiLi, HUTUBS, others	varies	Growing ecosystem, many SOFA-native

CIPIC is especially important because each subject's HRTFs come with anthropometric parameters — head width, head depth, pinna height, concha dimensions and so on — which is the raw material for anthropometry-based individualization.

The SOFA file format

Before 2013 every laboratory stored HRTFs in its own ad hoc layout, which made data exchange painful. The SOFA format — Spatially Oriented Format for Acoustics — standardised this. It is now the AES standard AES69. SOFA is built on a self-describing scientific container (netCDF/HDF5) and stores not just the impulse responses but the full geometry: the coordinates of every source position, the listener and receiver (ear) positions and orientations, the sample rate, and metadata about the measurement. A SimpleFreeFieldHRIR convention, for instance, stores an array of dimensions (measurements × receivers × taps) plus a SourcePosition array giving $(\theta,\phi,r)$ for each measurement.

Because SOFA is self-describing and standard, a renderer can load any compliant database, read the geometry, and interpolate without bespoke parsing. Essentially all serious binaural tools — and DAM Audio's own Binaural Sound Engine (BSE) — consume SOFA, which is why it appears in the glossary and the formats chapter as the lingua franca of HRTF interchange.
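Once the geometry and impulse responses are in memory, the simplest possible lookup is a nearest-neighbour search on the measurement grid. The sketch below assumes the file has already been read by some SOFA library into two NumPy arrays: `source_position` (measurements × 3, holding azimuth and elevation in degrees plus radius) and `hrirs` (measurements × receivers × taps). Those variable names, the loading step and the toy data are hypothetical; real renderers interpolate between neighbours instead of snapping to one.

```python
import numpy as np

def sph_to_unit(az_deg, el_deg):
    """Unit vector(s) for azimuth/elevation given in degrees."""
    az = np.radians(az_deg)
    el = np.radians(el_deg)
    return np.stack(
        [np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)], axis=-1
    )

def nearest_hrir(source_position, hrirs, az_deg, el_deg):
    """Return (index, left HRIR, right HRIR) of the closest measured direction."""
    grid = sph_to_unit(source_position[:, 0], source_position[:, 1])
    target = sph_to_unit(az_deg, el_deg)
    index = int(np.argmax(grid @ target))  # largest dot product = smallest angle
    return index, hrirs[index, 0], hrirs[index, 1]

# Toy grid: four horizontal directions plus straight up, radius 1.2 m.
source_position = np.array(
    [[0, 0, 1.2], [90, 0, 1.2], [180, 0, 1.2], [270, 0, 1.2], [0, 90, 1.2]],
    dtype=float,
)
hrirs = np.arange(5 * 2 * 4, dtype=float).reshape(5, 2, 4)  # placeholder taps

index, h_left, h_right = nearest_hrir(source_position, hrirs, 350.0, 0.0)
```

Comparing directions by dot product rather than by raw angle differences handles the wrap-around at 0°/360° without special cases.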

The individualization problem

Why a stranger's ears do not fit

The HRTF is a fingerprint of one person's head and ears. Pinna geometry varies enormously between people, and the spectral notches that encode elevation and front/back fall at different frequencies for different listeners. When you listen through someone else's HRTF — a dummy head, or a database average — your brain receives notches at the "wrong" frequencies. The interaural cues (ITD, ILD) are governed by gross head size and so transfer reasonably well between people; the monaural spectral cues do not. The result is a characteristic cluster of failures:

Front/back confusion. A source intended for the front is heard behind, or vice versa, because the pinna cue that would disambiguate the cone of confusion is wrong. Reported confusion rates with non-individual HRTFs commonly reach 10–30%, against a few percent for individual HRTFs.
Weak or collapsed elevation. Without correctly placed notches, sources tend to flatten toward the horizontal plane or sit at an ambiguous height.
In-head localization (IHL). The sound fails to externalise and is heard inside the head, like ordinary stereo. This is the most damning failure for the whole premise of binaural.

Solutions: from selection to machine learning

Several strategies try to close the gap between a generic HRTF and the listener's own:

Best-match selection. Present the listener with a handful of HRTF sets from a database and let them choose the one that localises best, often via a short listening test (e.g. judging the elevation of moving stimuli, or minimising front/back errors). This needs no measurement hardware and meaningfully reduces confusions, though it only selects from existing sets.

Anthropometric tuning. Measure or estimate a few body dimensions — head width and depth, pinna height, concha depth — and either pick the database subject whose anthropometry is closest, or scale a reference HRTF. Because notch frequencies scale roughly inversely with pinna size, a listener with a larger pinna gets notches shifted down. Formally, if a feature vector $\mathbf{x}$ of anthropometric measurements is known, one can regress HRTF parameters $\mathbf{y}$ on $\mathbf{x}$ from a training database:

\hat{\mathbf{y}} = f(\mathbf{x}), \qquad \text{minimising}\quad \sum_{i} \lVert \mathbf{y}_i - f(\mathbf{x}_i)\rVert^2 .
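A minimal instance of this regression, assuming $f$ is linear with a bias term and using NumPy's least-squares solver. The training numbers are synthetic and deliberately exactly linear; they are not measurements from any database and serve only to show the mechanics.

```python
import numpy as np

# Synthetic training set (made-up numbers, not measurements):
# x = pinna height in cm, y = first-notch frequency in Hz.
pinna_height_cm = np.array([5.5, 6.0, 6.5, 7.0, 7.5])
notch_hz = 14_000.0 - 1_000.0 * pinna_height_cm

# Design matrix with a bias column, so that f(x) = w0 + w1 * x.
A = np.column_stack([np.ones_like(pinna_height_cm), pinna_height_cm])
coef, *_ = np.linalg.lstsq(A, notch_hz, rcond=None)  # minimises the squared error

def predict_notch_hz(height_cm):
    """Predicted notch frequency for an unseen listener."""
    return float(coef[0] + coef[1] * height_cm)

prediction = predict_notch_hz(6.25)
```

The fitted slope is negative, matching the statement that a larger pinna shifts the notches down. Published approaches regress many HRTF parameters on many anthropometric features and often use nonlinear models; the structure of the fit is the same.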

Morphing and structural models. Decompose measured HRTFs into a low-dimensional basis (for example by principal component analysis) so that

H(\theta,\phi,f) \approx \bar{H}(f) + \sum_{k=1}^{K} w_k(\theta,\phi)\,\Phi_k(f),

where $\Phi_k$ are basis spectra shared across people and $w_k$ are direction-dependent weights. Individualization then reduces to estimating a handful of subject weights, which can be tuned, interpolated or "morphed" between subjects.
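The decomposition can be sketched with a singular value decomposition, which is one standard way to compute principal components. Everything below is synthetic: the "spectra" are generated from two invented basis shapes so that a two-component model reconstructs them exactly, which real HRTF data would not allow.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_freq, K = 40, 64, 2

# Synthetic spectra: a constant mean plus random mixtures of two shapes.
f = np.linspace(0.0, 1.0, n_freq)
true_basis = np.stack([np.sin(2 * np.pi * f), np.cos(3 * np.pi * f)])
weights = rng.normal(size=(n_obs, 2))
spectra = 1.0 + weights @ true_basis  # shape (n_obs, n_freq)

mean_spectrum = spectra.mean(axis=0)  # plays the role of H-bar(f)
centered = spectra - mean_spectrum
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

basis = vt[:K]            # Phi_k(f): the K strongest basis spectra
w = centered @ basis.T    # w_k: K weights per observation
reconstruction = mean_spectrum + w @ basis
max_error = np.max(np.abs(reconstruction - spectra))
```

Each row of `spectra` stands for one direction of one subject. With measured data the singular values decay gradually, and the choice of $K$ trades model size against reconstruction error.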

ML-estimated HRTFs. The current frontier estimates a full individual HRTF set from easy inputs: one or several photographs or a 3D mesh of the head and ears, fed to a neural network trained on databases that pair anthropometry/scans with measured HRTFs. Alternatively, numerical simulation (boundary-element or finite-difference methods) computes the HRTF directly from a 3D scan of the ear by solving the acoustic wave equation around the geometry — physically exact in the limit, bounded mainly by scan resolution and compute. DAM Audio's BSE sits in this space, combining individualized HRTF handling with the rendering and head-tracking pipeline described below.

Key takeaway

Even the best individualization leaves residual externalisation problems if the dynamic and room cues are absent, which is why the next two sections matter as much as the HRTF itself.

Head-tracking: the single biggest improvement

Why static binaural under-performs

A static binaural render fixes the scene to the head: turn your head and the whole world turns with you, which never happens with real sources. The brain treats this rigidly-coupled field as evidence that the sound is attached to the head — i.e. inside it. Worse, the listener loses the most powerful disambiguating cue of all: dynamic cues. In real life we resolve front/back confusion almost instantly by making tiny involuntary head movements and noticing how the interaural cues change. A source in front and a source behind respond oppositely to a head turn, so a fraction of a second of motion settles the question that the static spectrum left ambiguous.

Rotating the scene to keep cues consistent

Head-tracking restores this by making the rendered scene world-stationary. A sensor (IMU in the headphones, or optical tracking) reports the head orientation as a rotation $\mathbf{R}(t)$ . The renderer applies the inverse rotation to every source direction before looking up the HRTF, so a source that is "north" stays north as the head turns:

\hat{\mathbf{d}}_{\text{render}}(t) = \mathbf{R}(t)^{-1}\,\hat{\mathbf{d}}_{\text{world}} .

For Ambisonic scenes this is even cheaper: the whole sound field is rotated in the spherical-harmonic domain by a single rotation matrix applied to the channels, before the binaural decode, with no per-source lookup at all. Either way, when the listener turns their head the interaural and spectral cues change exactly as they would for real external sources, externalisation locks in, and front/back confusions drop dramatically.
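For the per-source case, the inverse rotation is a single matrix product. This sketch handles yaw only and fixes a coordinate convention that is our assumption, not something the text prescribes: x points forward, y to the right, z up, and azimuth and head yaw are both positive toward the right.

```python
import numpy as np

def yaw_matrix(yaw_deg):
    """Rotation R taking head-frame vectors to world-frame vectors (yaw only)."""
    a = np.radians(yaw_deg)
    return np.array(
        [[np.cos(a), -np.sin(a), 0.0],
         [np.sin(a),  np.cos(a), 0.0],
         [0.0,        0.0,       1.0]]
    )

def world_to_head(direction_world, head_rotation):
    """Apply R^-1; for a rotation matrix the inverse is the transpose."""
    return head_rotation.T @ direction_world

def azimuth_deg(direction):
    return float(np.degrees(np.arctan2(direction[1], direction[0])))

front_world = np.array([1.0, 0.0, 0.0])  # a source straight ahead in the world
head = yaw_matrix(15.0)                  # the listener turns 15 degrees right
render_az = azimuth_deg(world_to_head(front_world, head))  # about -15
```

The source that was straight ahead is now looked up at 15° to the left of the nose, so it stays fixed in the world. A full tracker would supply pitch and roll as well, usually as a quaternion, but the inverse-rotation step is the same.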

Biggest single win

Head-tracking is consistently reported as the largest single improvement available to binaural — often larger than individualization — because it supplies the dynamic cues that static HRTFs structurally cannot.

The latency budget

The catch is latency. If the audio scene lags the head motion, the world appears to "swim", externalisation breaks, and in VR the mismatch with vision contributes to discomfort. The total motion-to-sound latency is a sum:

T_{\text{total}} = T_{\text{sensor}} + T_{\text{filter}} + T_{\text{transmit}} + T_{\text{render}} + T_{\text{buffer}} + T_{\text{transducer}} .

The widely cited perceptual threshold for the detectability of audio scene lag during head motion is around 60–85 ms; for high-quality immersive work designers target well under that, often 20–30 ms end to end, to leave headroom.

Worked example: a latency budget

Suppose we render at 48 kHz with the following chain:

Stage	Latency
IMU sampling + readout	2 ms
Orientation filter (sensor fusion)	3 ms
Wireless transmission to renderer	4 ms
HRTF convolution (block = 256 samples)	256/48000 = 5.33 ms
Output buffering (one extra block)	5.33 ms
DAC + transducer	1 ms
Total	≈ 20.7 ms

This comfortably beats the 60 ms detectability threshold. Now suppose we naively raise the convolution block to 1024 samples to save CPU: the convolution and buffer stages each become $1024/48000 = 21.3\ \mathrm{ms}$ , and the total jumps to roughly $2 + 3 + 4 + 21.3 + 21.3 + 1 = 52.6\ \mathrm{ms}$ — still under threshold but now perilously close, with no margin for a wireless hiccup. This is the central tension of real-time binaural: smaller blocks lower latency but raise CPU cost, which is exactly why partitioned convolution (below) exists.
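The budget is simple enough to keep as a function and re-run whenever the block size changes. This sketch hard-codes the stage values of the example table as defaults; the function and parameter names are ours.

```python
def latency_budget_ms(block_size, fs=48_000, sensor=2.0, fusion=3.0,
                      wireless=4.0, transducer=1.0, extra_blocks=1):
    """Motion-to-sound latency in ms for the example chain.

    The convolution stage costs one block, and output buffering adds
    `extra_blocks` more blocks of the same duration.
    """
    block_ms = 1000.0 * block_size / fs
    return sensor + fusion + wireless + block_ms * (1 + extra_blocks) + transducer

small = latency_budget_ms(256)   # about 20.7 ms
large = latency_budget_ms(1024)  # about 52.7 ms unrounded; the text's 52.6
                                 # comes from summing the rounded stage values
block_share = (large - 10.0) / large  # fraction of the total due to block size
```

At 1024 samples roughly four fifths of the total comes from the two block-sized stages, which is why the audio block size, not the sensor, is usually the first thing to shrink.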

Common mistake

Note that the audio path latency and the head-tracking path latency add up here; a common mistake is to budget only the tracking sensor and forget that the audio block size dominates.

Adding a room: BRIRs and externalisation

Why reverberation helps

A bare anechoic HRTF render, even individualized and head-tracked, can still sound slightly unnatural because we almost never hear sources in anechoic conditions. Real listening always includes the room: an early reflection pattern and a reverberant tail. These cues do enormous work for externalisation and distance perception. Early reflections give the brain extra, time-shifted copies of the source from other directions, which is strong evidence that the source is out in a space rather than inside the head; the direct-to-reverberant energy ratio encodes distance (see Distance and air and Reverberation).

Rule of thumb

Adding even a modest simulated room to a binaural render typically improves externalisation more than any amount of HRTF tweaking on a dry signal.

The BRIR

The extension of the HRIR to a reverberant space is the binaural room impulse response (BRIR): the impulse response from a source in a room to each ear, measured (or simulated) with the head in place. A BRIR contains the direct-path HRIR followed by the room's reflections, each itself filtered by the HRTF for its direction of arrival:

h^{\text{BRIR}}_{L}(t) = h^{\text{HRIR}}_{L,\,\hat{\mathbf{d}}_0}(t) + \sum_{i} a_i\, h^{\text{HRIR}}_{L,\,\hat{\mathbf{d}}_i}\!\big(t - \tau_i\big) + \text{(diffuse tail)} .

Here $\hat{\mathbf{d}}_0$ is the direct direction, and each reflection $i$ arrives from direction $\hat{\mathbf{d}}_i$ with delay $\tau_i$ and gain $a_i$, convolved with that direction's HRIR. A BRIR can be captured directly in a real room with a dummy head, or synthesised: trace early reflections geometrically (image-source method), assign each its HRIR, and append a statistically appropriate diffuse reverberation tail. Under head-tracking the direct sound and early reflections must be re-rendered for the current head orientation so that they stay world-stationary, while the diffuse tail can stay roughly orientation-independent, so renderers often split the BRIR into a tracked early part and an un-tracked late part.
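The synthesis recipe (direct HRIR, plus delayed and scaled reflection HRIRs, plus an optional tail) can be sketched for one ear as follows. Delays are assumed to be whole samples, the reflection list is assumed to come from some geometric model that is not shown, and the toy impulse responses are invented.

```python
import numpy as np

def synthesize_brir(direct_hrir, reflections, diffuse_tail=None):
    """Build a one-ear BRIR.

    reflections: iterable of (gain a_i, delay tau_i in samples, HRIR for d_i).
    diffuse_tail: optional array added from time zero (zero-padded at its
    start if the tail should begin later).
    """
    direct_hrir = np.asarray(direct_hrir, dtype=float)
    reflections = [(g, d, np.asarray(h, dtype=float)) for g, d, h in reflections]
    ends = [len(direct_hrir)] + [d + len(h) for _, d, h in reflections]
    if diffuse_tail is not None:
        ends.append(len(diffuse_tail))
    brir = np.zeros(max(ends))
    brir[: len(direct_hrir)] += direct_hrir
    for gain, delay, hrir in reflections:
        brir[delay : delay + len(hrir)] += gain * hrir  # a_i * h(t - tau_i)
    if diffuse_tail is not None:
        brir[: len(diffuse_tail)] += diffuse_tail
    return brir

# Toy example: a two-tap direct HRIR and two overlapping reflections.
brir = synthesize_brir(
    [1.0, 0.5],
    [(0.5, 3, [1.0, -1.0]), (0.25, 4, [2.0, 0.0])],
)
```

Running the same function with the right-ear HRIRs gives the other channel. Keeping the tail as a separate argument mirrors the split between a tracked early part and an un-tracked late part.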

The trade-off is that BRIRs are much longer than HRIRs — a room tail can be tens of thousands of taps — so room rendering is far more expensive than dry HRTF rendering, which again points to partitioned convolution. The art is to add enough room to externalise without smearing the scene or destroying the dry intelligibility the application needs (a monitoring engineer wants a controlled, repeatable room; a VR game wants the room to match the visual environment).

The rendering pipeline

Encode the scene, then decode to the ears

Binaural rendering is the encode/decode pattern made concrete. Two scene representations dominate.

Object-based scenes. Each source is an audio signal plus time-varying metadata: direction $(\theta,\phi)$ , distance, gain. This is the object-based representation. To render, for each object the engine fetches the HRIR pair for its current direction (interpolating between measured grid points) and convolves:

y_L(t) = \sum_{n} \big(s_n * h_{L,\,\hat{\mathbf{d}}_n}\big)(t), \qquad y_R(t) = \sum_{n} \big(s_n * h_{R,\,\hat{\mathbf{d}}_n}\big)(t).

This is direct-HRTF rendering: maximally accurate per source, but it costs two convolutions per object and demands careful interpolation as objects move so that the ITD (a delay) and the spectral envelope cross-fade without clicks.

Ambisonic scenes. A scene authored or captured in Ambisonics is a fixed set of spherical-harmonic channels regardless of how many sources it contains. Head-tracking is a single rotation of those channels. The binaural decode then needs only a fixed number of filters, independent of source count — a decisive advantage for dense or unpredictable scenes (e.g. live VR).

Virtual-loudspeaker vs. direct-HRTF rendering

There are two ways to turn an Ambisonic (or surround) scene into ear signals.

Virtual loudspeakers. First decode the scene to a virtual loudspeaker array (say a $t$-design of $V$ points on a sphere) exactly as you would for real speakers, then replace each virtual speaker with the HRIR pair for its direction and sum:

y_L(t) = \sum_{v=1}^{V} \big(g_v * h_{L,\,\hat{\mathbf{d}}_v}\big)(t),

where $g_v$ is the loudspeaker feed produced by the Ambisonic decoder. This reuses the entire loudspeaker decoding chain and only needs $V$ fixed HRIR pairs, but it inherits the spatial resolution limits of the virtual array and of the Ambisonic order.

Direct / SH-domain binaural decode. More efficient and now standard is to pre-combine the Ambisonic decode and the HRTF lookup into a single set of binaural decoding filters: one filter per spherical-harmonic channel for each ear. The ear signal is then a sum of convolutions over the $(N+1)^2$ channels of an order-$N$ scene:

Y_L(f) = \sum_{m} B_{L,m}(f)\, A_m(f),

where $A_m$ are the Ambisonic channel signals and $B_{L,m}$ are precomputed binaural filters. The number of filters depends only on the Ambisonic order, not the number of sources, which is why scene-based binaural scales so gracefully. The trade-off is the usual one: a low Ambisonic order blurs high-frequency spatial detail, so high-frequency individualized cues are partly averaged out — mitigated by techniques such as magnitude-least-squares filter design.

CPU cost and partitioned convolution

The work is dominated by convolution. A direct time-domain FIR of length $N$ taps costs $N$ multiply-adds per output sample per filter. For an HRIR of $N = 256$ taps, two ears, 48 kHz and 32 objects that is

256 \times 2 \times 48\,000 \times 32 \approx 7.9\times10^{8}\ \text{MACs/s},

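already heavy. A BRIR of $N = 48\,000$ taps (1 s) costs $48\,000 \times 48\,000 \approx 2.3\times10^{9}$ MACs/s per ear for a single source, and about $1.5\times10^{11}$ MACs/s for the same 32 objects and two ears, which is infeasible in the time domain.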
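The time-domain cost is just the product of taps, sample rate, ears and sources, so the cases can be checked with a one-line helper (the function name is ours):

```python
def time_domain_macs_per_second(taps, fs=48_000, ears=2, sources=1):
    """Multiply-accumulates per second for direct FIR convolution."""
    return taps * fs * ears * sources

hrir_cost = time_domain_macs_per_second(256, sources=32)       # 786_432_000
brir_one_ear = time_domain_macs_per_second(48_000, ears=1)     # 2_304_000_000
brir_scene = time_domain_macs_per_second(48_000, sources=32)   # 147_456_000_000
```

A one-second BRIR therefore costs 187.5 times as much as a 256-tap HRIR for the same scene, the ratio of the tap counts.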

The standard remedy is fast convolution via the FFT. Block convolution with overlap-add reduces the cost per output sample from $O(N)$ to $O(\log N)$ , but a single large FFT over the whole filter forces you to wait for a full block of $N$ input samples — reintroducing the very latency we fought in the head-tracking budget. Partitioned convolution resolves the conflict: split the long filter into $K$ shorter partitions, convolve the most recent input block against the first partition with a small FFT (low latency), and convolve older input blocks against later partitions with progressively larger FFTs whose latency is hidden because those blocks are already available. The result is zero (or low) input-to-output latency with near-FFT efficiency, which is what makes head-tracked BRIR rendering of long room tails real-time. In practice a renderer uses a small first partition (e.g. 128–256 samples, matching the latency budget) and larger later partitions for the reverb tail. This is the workhorse of every serious binaural engine, including the convolution core inside DAM's BSE.
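A compact sketch of the idea, with one simplification that should be stated plainly: this is the uniform variant, in which every partition has the same length, rather than the non-uniform scheme with progressively larger later partitions described above. It processes the input block by block with overlap-save and keeps the spectra of past input blocks in a frequency-domain delay line. The block size is an arbitrary choice.

```python
import numpy as np

def partitioned_convolve(x, h, block=64):
    """Uniformly partitioned overlap-save FFT convolution of x with h."""
    B = block
    n_parts = -(-len(h) // B)  # ceil(len(h) / B)
    h_padded = np.zeros(n_parts * B)
    h_padded[: len(h)] = h
    # Spectrum of each B-tap partition, zero-padded to the 2B FFT size.
    H = np.fft.rfft(h_padded.reshape(n_parts, B), n=2 * B, axis=1)

    out_len = len(x) + len(h) - 1
    n_blocks = -(-out_len // B)
    x_padded = np.zeros(n_blocks * B)
    x_padded[: len(x)] = x

    fdl = np.zeros((n_parts, B + 1), dtype=complex)  # past input spectra
    prev = np.zeros(B)
    y = np.zeros(n_blocks * B)
    for b in range(n_blocks):
        cur = x_padded[b * B : (b + 1) * B]
        X = np.fft.rfft(np.concatenate([prev, cur]))  # previous + current block
        fdl = np.roll(fdl, 1, axis=0)                 # age the stored spectra
        fdl[0] = X
        Y = np.sum(fdl * H, axis=0)                   # partition p meets the input from p blocks ago
        y[b * B : (b + 1) * B] = np.fft.irfft(Y, n=2 * B)[B:]  # keep the alias-free half
        prev = cur
    return y[:out_len]

rng = np.random.default_rng(1)
x = rng.normal(size=500)
h = rng.normal(size=200)
y = partitioned_convolve(x, h, block=64)
```

Each output block depends only on the current and earlier input blocks, so the algorithmic latency is one block regardless of how long `h` is. Non-uniform schemes keep that property while cutting the number of spectral multiplications for very long tails.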

Headphone equalization

Why the headphone is part of the path

We measured the HRTF at the blocked ear-canal entrance, deliberately excluding the ear canal and the playback transducer. But the actual signal reaching the eardrum during playback passes through the headphone and the residual ear-canal coupling. The headphone itself has a frequency response; call its transfer function from electrical input to eardrum pressure $H_{\text{HP}}(f)$. If we ignore it, the listener hears the HRTF coloured by the headphone, corrupting the very spectral notches that carry elevation and front/back.

The fix is headphone equalization: insert a compensation filter $C(f)$ that flattens the headphone-to-ear path so that the only spectral shaping that survives is the HRTF we intend:

C(f) \approx \frac{1}{H_{\text{HP}}(f)} .

Because a literal inverse can blow up at deep response nulls, $C(f)$ is built as a regularised inverse — limiting gain where $\lvert H_{\text{HP}}(f)\rvert$ is small, and using minimum-phase or smoothed magnitude targets to avoid audible pre-ringing:

C(f) = \frac{H_{\text{HP}}^{*}(f)}{\lvert H_{\text{HP}}(f)\rvert^{2} + \beta(f)},

with a frequency-dependent regularisation $\beta(f)$ raised in the problematic high-frequency region.
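The regularised inverse is a one-line formula per frequency bin. In this sketch the headphone response and the regularisation profile are three invented bins chosen to show the behaviour at a deep null; a real design would use a measured response and would usually follow this step with smoothing or a minimum-phase conversion.

```python
import numpy as np

def regularized_inverse(H_hp, beta):
    """C(f) = conj(H) / (|H|^2 + beta), evaluated per frequency bin."""
    return np.conj(H_hp) / (np.abs(H_hp) ** 2 + beta)

# Invented response: two well-behaved bins and one deep null.
H_hp = np.array([1.0 + 0.0j, 0.5j, 0.001 + 0.0j])
beta = np.array([0.0, 0.0, 0.01])  # regularise only the problematic bin

C = regularized_inverse(H_hp, beta)
equalized = C * H_hp  # what remains of the headphone response after EQ
```

Where `beta` is zero the product `C * H_hp` is exactly one, a perfect inverse. At the null a literal inverse would demand a gain of 1000, whereas the regularised filter applies a gain of about 0.1 and leaves the null largely uncorrected, which is the intended trade.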

Practical realities

Headphone equalization is harder than loudspeaker EQ for two reasons. First, the headphone-to-ear response is highly individual above a few kHz, because it depends on how the cushion couples to your ear; a single factory curve can only partly compensate. Second, the response changes every time you reseat the headphones — a few millimetres of repositioning can shift high-frequency features by several dB, which is why repeatability is a known weakness. The best results use per-headphone-model compensation (and ideally per-listener), measured with a probe microphone, and a headphone with low reseating variance.

Common mistake

A common mistake is to skip headphone EQ entirely and blame the resulting front/back errors on the HRTF, when the headphone's own peaks were the culprit.

For monitoring applications the headphone EQ is as important as the HRTF, because the whole point is a neutral, predictable spectrum.

Limits and use cases

Inherent limits

Binaural over headphones is the most direct route to 3D audio, but it carries structural limitations:

Individualization dependence. As covered above, non-individual HRTFs cause front/back confusion, flattened elevation and in-head localization. This is the defining weakness; everything else is engineering.
The "phantom" still needs dynamics and room. Even individual HRTFs externalise poorly when static and dry. Head-tracking and a room are not luxuries; they are part of the method.
Headphone EQ instability. Reseating variance and individual ear coupling limit how repeatable the high-frequency spectrum can be.
High-frequency interpolation and bandwidth. Measured grids are finite, so directions between grid points must be interpolated; naive interpolation smears the high-frequency notches. Ambisonic binaural decoding additionally band-limits spatial detail by order.
No shared experience. Headphone binaural is inherently single-listener and isolating; it cannot serve a room full of people the way loudspeakers can, and it removes the bone-conduction and tactile components of loud real sources.
Listener fatigue and pressure. Long sessions under headphones, especially closed-back, bring comfort and isolation issues that loudspeaker monitoring does not.

Not the same as headphone stereo

These limits are why binaural complements rather than replaces loudspeaker techniques. Note also that ordinary two-channel stereo over headphones is not binaural: it carries ILD-like level differences but no consistent HRTF or ITD structure, which is why it images inside the head; see Is stereo already spatial?.

Use cases

Despite the limits, binaural is the default rendering target wherever the listener is on headphones:

VR / AR. Head-tracking is already present for the visuals, so world-locked binaural audio is natural and essential; the head-tracking that binaural needs most is free.
Mobile and streaming immersive audio. Object and Ambisonic immersive formats are delivered to phones and earbuds, where binaural is the only viable spatial reproduction. Many earbuds now include IMUs, enabling head-tracking on consumer hardware.
Immersive monitoring. Engineers mixing immersive content can audition a multichannel or object mix over headphones via a virtual loudspeaker binaural render — useful for working on the move, or as a cross-check against the dubbing stage. Repeatability and headphone EQ are paramount here.
Research, accessibility and gaming. Localisation training, hearing research, navigation aids for the visually impaired, and competitive game audio all rely on accurate binaural cues.

Worked example: a front source failing, then fixed by head-tracking

Consider a single object placed dead ahead in the horizontal plane, $(\theta,\phi) = (0^\circ, 0^\circ)$ , rendered with a non-individual HRTF whose pinna notch sits at 8.0 kHz, played to a listener whose own front pinna notch is at 7.0 kHz and whose rear notch is at 8.2 kHz.

Static render. The listener's brain compares the received spectrum against its internal templates. The rendered notch at 8.0 kHz is far from the listener's front template (7.0 kHz) but close to the rear template (8.2 kHz). With ITD and ILD both near zero (the source is on the median plane, so interaural cues cannot disambiguate front from back), the spectral cue dominates — and it points backward. The listener perceives the source behind them: a classic front/back reversal. Quantitatively, the interaural cues sit on the cone of confusion ( $\mathrm{ITD}\approx 0$ , $\mathrm{ILD}\approx 0\ \mathrm{dB}$ ), so the only tie-breaker is the (wrong) pinna spectrum.
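The static failure can be caricatured in a few lines. The sketch below is a toy stand-in for spectral template matching, not a perceptual model: it assumes the listener simply picks whichever stored template has its notch closest to the rendered one on a log-frequency scale. The nearest-notch rule and the function names are assumptions made for illustration; only the three notch frequencies come from the example.

```python
import math

# Notch frequencies from the worked example, in Hz.
LISTENER_TEMPLATES_HZ = {"front": 7000.0, "rear": 8200.0}
RENDERED_NOTCH_HZ = 8000.0

def octave_distance(f1_hz, f2_hz):
    """Absolute distance between two frequencies, in octaves."""
    return abs(math.log2(f1_hz / f2_hz))

def closest_template(notch_hz, templates):
    """Return the direction whose template notch is nearest (toy decision rule)."""
    return min(templates, key=lambda d: octave_distance(notch_hz, templates[d]))

distances = {
    direction: octave_distance(RENDERED_NOTCH_HZ, notch)
    for direction, notch in LISTENER_TEMPLATES_HZ.items()
}
perceived = closest_template(RENDERED_NOTCH_HZ, LISTENER_TEMPLATES_HZ)

print(distances)   # the front template is several times farther away than the rear
print(perceived)
```

With no interaural cue to contradict it, this nearest-template choice is the only evidence available, and it lands on the rear.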

Add head-tracking. Now the listener turns their head 15° to the right while the renderer keeps the object world-stationary. The object's rendered direction becomes $\theta_{\text{render}} = -15^\circ$ relative to the now-rotated head — i.e. it moves toward the listener's left-front. The renderer fetches the HRIR for $-15^\circ$ , which has a non-zero ITD and ILD favouring the left ear. Using the spherical model with $a = 0.0875\ \mathrm{m}$ :

$$\mathrm{ITD}(15^\circ) \approx \frac{0.0875}{343}\big(0.262 + \sin 15^\circ\big) = 2.551\times10^{-4}\,(0.262 + 0.259) = 1.33\times10^{-4}\ \mathrm{s} \approx 133\ \mu\mathrm{s},$$

with the left ear leading. Crucially, the sign and rate of change of this ITD as the head rotates are consistent with a frontal source: turn right, and a front source's image moves to the left, so the left ear leads. A rear source would produce the opposite dynamic: the same head turn would move it toward the right and give the right ear the lead. The brain reads the dynamic cue, overrides the ambiguous static spectrum, and re-localises the object to the front where it belongs. This is the mechanism behind the empirical result that head-tracking can cut front/back confusions from tens of percent down to a few percent, often a larger improvement than swapping in a better static HRTF. It is the clearest demonstration of why, in binaural, motion is information.
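The arithmetic and the front/rear asymmetry can be checked with a short script. This is a minimal sketch under the same spherical-head model, with $a = 0.0875\ \mathrm{m}$ and $c = 343\ \mathrm{m/s}$. It assumes azimuth is positive to the listener's right (so the rendered direction after the head turn is $-15^\circ$, as above) and folds rear directions onto the equivalent lateral angle; the function names and the sign convention for the returned ITD are illustrative choices.

```python
import math

HEAD_RADIUS_M = 0.0875     # spherical-head radius used in the text
SPEED_OF_SOUND = 343.0     # m/s, as in the text

def wrap_deg(angle_deg):
    """Wrap an angle into [-180, 180) degrees."""
    return (angle_deg + 180.0) % 360.0 - 180.0

def signed_itd(world_azimuth_deg, head_yaw_deg):
    """ITD in seconds for a world-fixed source and a rotated head.

    Azimuth and yaw are positive to the right. The result is negative
    when the left ear leads and positive when the right ear leads.
    """
    relative = math.radians(wrap_deg(world_azimuth_deg - head_yaw_deg))
    # Fold rear directions onto the same cone of confusion as their
    # front mirror image: only the lateral angle matters for ITD.
    lateral = math.asin(math.sin(relative))
    return (HEAD_RADIUS_M / SPEED_OF_SOUND) * (lateral + math.sin(lateral))

head_yaw = 15.0                                  # listener turns 15 degrees right
itd_front = signed_itd(0.0, head_yaw)            # source dead ahead in the world
itd_rear = signed_itd(180.0, head_yaw)           # source directly behind

print(f"front source: {itd_front * 1e6:+.1f} us")  # negative: left ear leads
print(f"rear source:  {itd_rear * 1e6:+.1f} us")   # positive: right ear leads
```

The two magnitudes are identical, which is the static ambiguity; only the sign differs for the same head turn, and that sign is what the head-tracked renderer makes available to the listener.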

Summary

Binaural synthesis is the most literal answer to spatial audio's central question: reconstruct the two eardrum signals a real source would create, and the brain does the rest. The encode is a sound scene — objects or an Ambisonic field — and the decode is convolution with the HRTFs of the listener's head, delivered isolated over headphones. The physics is captured in the HRIR, which packs ITD, ILD and pinna spectral cues into a few hundred taps; the data is measured anechoically, stored as SOFA/AES69, and shared through databases like CIPIC and SADIE. The hard part is that this fingerprint is personal: non-individual HRTFs cause front/back confusion, weak elevation and in-head localization, which individualization — selection, anthropometry, morphing, ML and BEM simulation — tries to fix. But the two interventions that most reliably turn a model into a convincing illusion are head-tracking, which supplies the dynamic cues the static spectrum lacks and keeps the scene world-stationary within a tight latency budget, and a room (the BRIR), whose reflections externalise the image. The whole chain runs in real time through partitioned convolution, is conditioned by careful headphone equalization, and underpins everything from VR and mobile immersive audio to headphone monitoring — including DAM Audio's Binaural Sound Engine. For the loudspeaker counterpart that delivers binaural signals without headphones, continue to Transaural.

References

Blauert, J. (1997). Spatial Hearing: The Psychophysics of Human Sound Localization (revised edition). MIT Press. — The foundational text on localisation cues, the cone of confusion and front/back perception.
Møller, H. (1992). "Fundamentals of binaural technology." Applied Acoustics, 36(3–4), 171–218. — Rigorous treatment of the binaural signal chain, blocked-canal measurement and headphone equalization.
Algazi, V. R., Duda, R. O., Thompson, D. M., & Avendano, C. (2001). "The CIPIC HRTF Database." Proc. IEEE WASPAA, 99–102. — The CIPIC database and its anthropometric parameters.
Begault, D. R. (1994). 3-D Sound for Virtual Reality and Multimedia. Academic Press / NASA. — Classic applied reference on binaural synthesis, externalisation, head-tracking and room cues.
Xie, B. (2013). Head-Related Transfer Function and Virtual Auditory Display (2nd ed.). J. Ross Publishing. — Comprehensive monograph on HRTF theory, measurement, modelling and individualization.
Audio Engineering Society (2015, rev. 2022). AES69: AES standard for file exchange — Spatial acoustic data file format (SOFA). — The SOFA standard for HRTF/BRIR interchange.
Burkhard, M. D., & Sachs, R. M. (1975). "Anthropometric manikin for acoustic research." Journal of the Acoustical Society of America, 58(1), 214–222. — The original KEMAR manikin.
Gardner, W. G., & Martin, K. D. (1995). "HRTF measurements of a KEMAR." Journal of the Acoustical Society of America, 97(6), 3907–3908. — The widely used MIT KEMAR HRTF measurement.
Wightman, F. L., & Kistler, D. J. (1999). "Resolution of front–back ambiguity in spatial hearing by listener and source movement." Journal of the Acoustical Society of America, 105(5), 2841–2853. — Experimental basis for head-tracking's effect on front/back confusion.
