Transaural - Binaural over Loudspeakers

Transaural reproduction is the art of delivering a binaural signal — a pair of pressures designed to arrive, one each, at the listener's two eardrums — using ordinary loudspeakers instead of headphones. It sounds like it should be impossible. Headphones work precisely because each transducer is acoustically sealed to one ear: the left earpiece feeds the left eardrum and essentially nothing else. Loudspeakers have no such isolation. Each speaker radiates into a room and reaches both ears, and the two unwanted paths — left speaker to right ear, right speaker to left ear — are exactly what a binaural signal must not have. Transaural systems solve this by anticipating the leakage and pre-cancelling it: the loudspeaker feeds are computed so that, after the room has done its crosstalk mixing, what remains at each eardrum is the intended binaural signal and only that.

This is the same encode-then-decode structure that organizes every technique in this part of the guide. The encode is binaural authoring: a signal pair carrying the head-related cues for a desired direction, exactly as in binaural reproduction. The decode is new and specific to transaural: a crosstalk-cancellation (XTC) filter network that inverts the acoustic transfer matrix from the loudspeakers to the ears, so the binaural pair survives the trip through the air. Understanding transaural therefore means understanding that transfer matrix, why its inverse is numerically dangerous, and what compromises — regularization, narrow speaker spacing, head tracking — make the dangerous inverse usable in a real room.

This chapter builds the theory from first principles, works through the algebra of a 2×2 cancellation with numbers, then turns to the engineering realities: ill-conditioning, the stereo dipole of Kirkeby and Nelson, the tiny sweet spot, spectral colouration, and the practical systems (desktop 3D audio, soundbar virtualization) that ship transaural today. DAM Audio's work on this topic is collected under XTC / transaural crosstalk cancellation.

The problem: crosstalk destroys the binaural cues

What binaural assumes

A binaural signal is a pair of time- and frequency-dependent pressures, $b_L(t)$ and $b_R(t)$ , constructed so that if $b_L$ reaches only the left eardrum and $b_R$ reaches only the right eardrum, the listener hears a virtual source at the intended location. The construction is direction-dependent filtering by head-related transfer functions (HRTFs). For a source at direction $\theta$ feeding a monophonic signal $s$ , the binaural pair is

b_L = H_L(\theta)\, s, \qquad b_R = H_R(\theta)\, s,

where $H_L$ and $H_R$ are the left- and right-ear HRTFs. The information the brain uses — the interaural time difference (ITD), the interaural level difference (ILD), and the spectral notches and peaks of the pinna — is encoded entirely in the difference and detail of this pair. The fundamentals are developed in psychoacoustics and the authoring in binaural. The crucial property for this chapter is the channel-to-ear isolation assumption baked into the equations above: $b_L \to$ left ear, $b_R \to$ right ear, with no mixing. Headphones honour that assumption. Loudspeakers, by their nature, violate it.

What loudspeakers actually do

Place two loudspeakers in front of a listener. Drive the left speaker with some signal $v_L$ and the right with $v_R$ . The pressure at the left eardrum is not $v_L$ . It is the sum of two contributions: the ipsilateral path from the left speaker (short, direct, same side) and the contralateral path from the right speaker (longer, around the head, opposite side). The same is true at the right ear. In symbols, with $C_{ij}$ denoting the acoustic transfer function from speaker $j$ to ear $i$ ,

p_L = C_{LL}\, v_L + C_{LR}\, v_R, \qquad p_R = C_{RL}\, v_L + C_{RR}\, v_R.

The two unwanted terms, $C_{LR} v_R$ and $C_{RL} v_L$ , are the crosstalk. They are not small. For loudspeakers at $\pm 30°$ the extra path length to the far ear is on the order of 15–20 cm, the level is only a few decibels down from the direct path at low and mid frequencies, and the head shadow that attenuates the far-ear signal is significant only above roughly 1.5 kHz. So if you simply send a binaural signal to two loudspeakers — $v_L = b_L$ , $v_R = b_R$ — each eardrum receives its intended signal plus a delayed, head-filtered copy of the other channel.

Crosstalk collapses the image

That contamination is catastrophic for the binaural cues. The ITD encoded in $b_L$ versus $b_R$ relies on a clean comparison of arrival times at the two ears; the crosstalk superimposes a second arrival at each ear, with its own delay set by the loudspeaker geometry, not by the intended source direction. The ILD is similarly corrupted, and the pinna spectral cues are smeared by comb filtering between the direct and crosstalk paths. The result is a collapsed, frontally-locked, comb-coloured image — the familiar "phantom-centre plus narrow stage" of ordinary stereo, not the open spatial scene the binaural signal described. Crosstalk does not merely degrade binaural-over-speakers; it reduces it to conventional stereo.

The goal stated precisely

Transaural's objective is to choose loudspeaker feeds $v_L, v_R$ such that the eardrum pressures equal the binaural target:

p_L = b_L, \qquad p_R = b_R.

Because the air imposes the mixing $\mathbf{p} = \mathbf{C}\,\mathbf{v}$ , the feeds we must send are $\mathbf{v} = \mathbf{C}^{-1}\mathbf{b}$ . Everything that follows — the conditioning problems, the regularization, the dipole geometry, the tiny sweet spot — flows from the difficulty of building and using that inverse $\mathbf{C}^{-1}$ in a real acoustic system. The encode is binaural; the decode is the inverse of the room's own crosstalk.

The acoustic transfer matrix

The 2×2 plant

Collect the four paths into a frequency-domain matrix. At each frequency $\omega$ ,

\mathbf{C}(\omega) = \begin{bmatrix} C_{LL} & C_{LR} \\ C_{RL} & C_{RR} \end{bmatrix}, \qquad \mathbf{p} = \mathbf{C}\,\mathbf{v}.

This $\mathbf{C}$ is called the plant matrix or the acoustic transfer matrix. Its diagonal entries $C_{LL}, C_{RR}$ are the ipsilateral (direct) paths; its off-diagonal entries $C_{LR}, C_{RL}$ are the contralateral (crosstalk) paths. Each entry is itself a complete transfer function: a delay (the time of flight from speaker to ear), an amplitude (governed by distance attenuation $\propto 1/r$ and head shadowing), and a complex frequency response (the HRTF for the loudspeaker's direction relative to the head, including pinna and torso effects). In other words, the plant entries are themselves HRTFs — specifically, the HRTFs for the loudspeaker positions.

Symmetry and the ideal listening geometry

For a listener seated symmetrically between two mirror-image loudspeakers (the canonical equilateral or near-equilateral arrangement), the geometry is left-right symmetric. Then the ipsilateral paths are equal, $C_{LL} = C_{RR} \equiv C_i$ , and the contralateral paths are equal, $C_{LR} = C_{RL} \equiv C_c$ . The plant simplifies to a symmetric matrix

\mathbf{C} = \begin{bmatrix} C_i & C_c \\ C_c & C_i \end{bmatrix},

with $C_i$ the ipsilateral (direct, near-ear) response and $C_c$ the contralateral (crosstalk, far-ear) response. This symmetry is enormously convenient: it lets us diagonalize the problem into a sum channel and a difference channel, which is the standard way to analyze and to implement XTC, and it makes the conditioning behaviour transparent. We will use it throughout, returning to the asymmetric case only when discussing head tracking and off-centre listeners.

A first-principles model of the two paths

To get intuition and numbers, model the ipsilateral and contralateral paths by their dominant features: a delay and a level. Let the direct path from a speaker to its near ear have length $r_i$ and the cross path to the far ear have length $r_c = r_i + \Delta r$ , where $\Delta r$ is the extra distance the sound travels around the head. With sound speed $c \approx 343$ m/s, the interaural path delay for the loudspeaker is

\tau = \frac{\Delta r}{c}.

A spherical-head approximation for a source at azimuth $\theta$ from the median plane gives the classic Woodworth extra-path estimate

\Delta r \approx a\,(\theta + \sin\theta),

with $a \approx 0.0875$ m the effective head radius and $\theta$ in radians. For loudspeakers at $\theta = 30° = 0.524$ rad: $\sin 30° = 0.5$ , so $\Delta r \approx 0.0875\,(0.524 + 0.5) = 0.0896$ m, about 9 cm, giving

\tau = \frac{0.0896}{343} \approx 261\ \mu\text{s}.

The level difference between the cross and direct paths has two parts: spreading loss $1/r$ and head shadowing. Spreading alone for $r_i \approx 1.0$ m and $r_c \approx 1.09$ m is only $20\log_{10}(1.0/1.09) \approx -0.7$ dB — negligible. Head shadowing is frequency-dependent and dominates above ~1.5 kHz, reaching 5–15 dB at high frequencies but near 0 dB at low frequencies. So at low frequencies the crosstalk arrives only ~0.7 dB quieter than the direct sound and ~0.26 ms later: a strong, nearly-equal interfering copy. This is precisely why crosstalk is so destructive, and why cancelling it is hard at low frequencies and at high frequencies for opposite reasons, as the conditioning analysis will show.

Crosstalk cancellation: inverting the plant

The inverse filter

We want $\mathbf{p} = \mathbf{b}$ with $\mathbf{p} = \mathbf{C}\mathbf{v}$ , so the crosstalk-cancellation filter network is the matrix inverse of the plant:

\mathbf{v} = \mathbf{H}\,\mathbf{b}, \qquad \mathbf{H} = \mathbf{C}^{-1}.

$\mathbf{H}$ is a 2×2 matrix of filters applied to the binaural pair before sending it to the loudspeakers. For the symmetric plant, the inverse has a clean closed form. With $\det \mathbf{C} = C_i^2 - C_c^2$ ,

\mathbf{H} = \mathbf{C}^{-1} = \frac{1}{C_i^2 - C_c^2} \begin{bmatrix} C_i & -C_c \\ -C_c & C_i \end{bmatrix}.

Read this physically. The diagonal term $C_i/(C_i^2 - C_c^2)$ is a frequency-shaping filter that equalizes the direct path. The off-diagonal term $-C_c/(C_i^2 - C_c^2)$ is the active cancellation: into the opposite loudspeaker it injects a signal that is the negative of the crosstalk, scaled and filtered so that when it arrives at the far ear it destructively cancels the unwanted leakage. The "−" sign is the whole idea: to silence the contralateral leakage of the left channel into the right ear, drive the right speaker with an inverted, appropriately delayed and filtered anti-signal.

The recursive cancellation picture

There is a beautifully intuitive way to see crosstalk cancellation that predates the matrix formulation and explains the convergence. Suppose we want only $b_L$ at the left ear and silence at the right ear. Send $b_L$ from the left speaker. It reaches the left ear (good) but also leaks to the right ear via $C_c/C_i$ relative to the direct. To cancel that leakage, send from the right speaker a signal $-(C_c/C_i)\,b_L$ , time-aligned and filtered, so it arrives at the right ear and cancels. But that cancelling signal itself leaks back to the left ear, where it is unwanted, so we must add a correction from the left speaker to cancel that, which leaks again to the right ear, and so on. Each round trip multiplies the residual by the crosstalk-to-direct ratio

\beta(\omega) = \frac{C_c(\omega)}{C_i(\omega)},

so the total cancellation is a geometric series

1 - \beta + \beta^2 - \beta^3 + \cdots = \frac{1}{1 + \beta}.

The series converges whenever $|\beta| < 1$ , i.e. whenever the crosstalk is weaker than the direct sound — which is true for sensible geometries. Summing the geometric series recovers exactly the matrix inverse above. This recursive view is due in spirit to Atal and Schroeder and was the basis of the earliest analog cancellers; it makes plain that XTC is fundamentally an acoustic feedback cancellation whose convergence rate depends on how close the crosstalk level is to the direct level.

A worked conceptual example

Take a single frequency where we model the paths by magnitude and delay only. Let the direct path be unit gain at zero reference delay, and the crosstalk be $\beta = 0.85$ in magnitude with the extra delay $\tau = 0.26$ ms computed earlier; at this frequency suppose the phase of $\beta$ is such that we can treat it as a real attenuation for illustration. We want $b_L = 1$ at the left ear and $0$ at the right ear.

Round 0: left speaker emits $1$ . Left ear receives $1$ (target met); right ear receives $\beta = 0.85$ (unwanted).

Round 1: right speaker emits $-0.85$ to cancel. Right ear now receives $0.85 - 0.85 = 0$ ; but this signal leaks to the left ear as $-0.85 \times 0.85 = -0.7225$ , corrupting the target.

Round 2: left speaker adds $+0.7225$ to restore the left ear; this leaks to the right ear as $+0.7225 \times 0.85 = +0.614$ , breaking the cancellation again.

Continuing, the left-ear signal is $1 - 0.7225 + 0.522 - \cdots = 1/(1-\beta^2) \times \text{(direct terms)}$ and the residual at the right ear tends to zero. The net loudspeaker drive required is the matrix-inverse solution. Plugging $C_i = 1$ , $C_c = 0.85$ into the closed form, $\det = 1 - 0.7225 = 0.2775$ , so the diagonal filter gain is $1/0.2775 = 3.60$ and the off-diagonal is $-0.85/0.2775 = -3.06$ . To play a binaural pair through this geometry we must therefore boost the loudspeaker drive by about $20\log_{10}(3.60) \approx 11$ dB relative to the binaural signal, and inject an anti-phase cross term nearly as large. That large boost — the inverse of a small determinant $C_i^2 - C_c^2$ when $C_c$ approaches $C_i$ — is the seed of every practical problem with transaural, and it is the subject of the next section.

Ill-conditioning and regularization

Why the inverse blows up

The cancellation gain scales with $1/(C_i^2 - C_c^2) = 1/(C_i^2(1 - \beta^2))$ . Whenever $|\beta| \to 1$ — the crosstalk approaches the direct path in magnitude and aligns in phase — the determinant approaches zero and the inverse filter demands enormous gain.

Ill-conditioning amplifies every error

This is ill-conditioning: small acoustic quantities in the denominator produce huge filter responses, and tiny errors in the measured or modelled plant (a head that moved a centimetre, a room reflection, an HRTF mismatch) get amplified by the same huge factor.

Crucially, $\beta(\omega)$ is frequency-dependent, and there are specific frequencies where the direct and cross paths interfere most severely. Because the crosstalk is delayed by $\tau$ , the relative phase between direct and cross is $\omega \tau$ , which sweeps through $0, \pi, 2\pi, \ldots$ as frequency rises. At frequencies where the contralateral path is in phase with what the inverse needs to cancel and nearly equal in magnitude, the plant is near-singular and the required boost spikes. Equivalently, in the sum/difference decomposition the plant's two eigen-responses are $C_i + C_c$ (sum channel) and $C_i - C_c$ (difference channel); the difference channel has deep nulls wherever $C_i \approx C_c$ , and inverting a null means infinite gain. These nulls occur roughly at frequencies where the wavelength makes the head-around path interfere destructively — concentrated at low frequencies (where $\beta \to 1$ because head shadowing vanishes) and at a comb of mid/high frequencies set by $\tau$ .

The condition number

The severity is quantified by the condition number of $\mathbf{C}$ , the ratio of its largest to smallest singular value:

\kappa(\mathbf{C}) = \frac{\sigma_{\max}}{\sigma_{\min}} = \left| \frac{C_i + C_c}{C_i - C_c} \right| \quad \text{(symmetric case)}.

When $\kappa$ is large the inversion is unstable: the system pours energy into the difference channel to no audible benefit, amplifies measurement error, and produces wild loudspeaker excursions. When $\kappa \approx 1$ the plant is well-behaved. Robust XTC design is essentially the project of keeping $\kappa(\omega)$ small across as wide a band as possible — and the two main levers are regularization (this section) and loudspeaker geometry (the stereo dipole, next section).

Regularization: trading cancellation for robustness

The cure for an explosive inverse is to refuse to fully invert near the singularities. Instead of $\mathbf{C}^{-1}$ we use a regularized inverse, the Tikhonov / least-squares solution

\mathbf{H} = \big(\mathbf{C}^{H}\mathbf{C} + \beta_{\text{reg}}(\omega)\,\mathbf{I}\big)^{-1}\mathbf{C}^{H},

where $\mathbf{C}^H$ is the conjugate transpose and $\beta_{\text{reg}}(\omega) \ge 0$ is a regularization parameter (a small "effort penalty"). The term $\beta_{\text{reg}}\mathbf{I}$ keeps the matrix being inverted away from singularity: where the plant is well-conditioned and $\sigma_{\min} \gg \beta_{\text{reg}}$ , the regularized inverse equals the true inverse and cancellation is full; where the plant is near-singular and $\sigma_{\min} \to 0$ , the penalty dominates, the gain is capped at roughly $1/\sqrt{\beta_{\text{reg}}}$ , and the system gracefully gives up trying to cancel rather than blowing up.

Equivalently, each singular value $\sigma_k$ of the plant is inverted not as $1/\sigma_k$ but as the filtered version

\frac{\sigma_k}{\sigma_k^2 + \beta_{\text{reg}}},

which behaves like $1/\sigma_k$ for large $\sigma_k$ and like $\sigma_k/\beta_{\text{reg}} \to 0$ for small $\sigma_k$ . The penalty thus rolls off the inversion exactly in the troublesome bands. This is mathematically identical to the constrained-optimization view of Kirkeby and Nelson, in which one minimizes the reproduction error subject to a penalty on loudspeaker effort, with $\beta_{\text{reg}}$ the Lagrange multiplier balancing the two.

The bandwidth/colouration trade-off

Regularization is not free. Wherever the penalty suppresses the inverse, two things happen: the crosstalk is not fully cancelled there (loss of channel separation, hence weakened spatial cues in that band), and the frequency response of the delivered signal departs from flat (audible colouration). The design choice is therefore a trade-off:

Small $\beta_{\text{reg}}$ : maximal cancellation and the widest theoretical separation, but huge gains at the singular frequencies, severe colouration, large speaker excursion, and extreme sensitivity to head movement and plant error. Brittle.
Large $\beta_{\text{reg}}$ : robust, flatter, gentler on the speakers, but reduced cancellation depth and narrower usable bandwidth where the XTC actually works.

Rule of thumb

A common refinement is to make $\beta_{\text{reg}}(\omega)$ frequency-dependent — small in the mid-band where the plant is naturally well-conditioned and cancellation is most perceptually valuable, larger at the low end and at the high-frequency singular combs where full inversion would be ruinous.

This shapes the achievable XTC into a usable mid-band passband flanked by deliberately under-corrected extremes. The figure of merit Kirkeby and others use is the trade-off between channel separation (how many dB of crosstalk are removed) and spectral flatness / dynamic range; you cannot maximize both, and the regularization profile is where the engineer chooses the balance. The next section shows how geometry can shift the whole trade-off in the designer's favour, so that less regularization is needed.

The stereo dipole (Kirkeby and Nelson)

The insight: bring the speakers together

The singular frequencies of the plant are set by the interaural path delay $\tau$ , which is set by the angular separation of the loudspeakers. Wide-spaced speakers (e.g. $\pm 30°$ ) give a large $\tau$ and therefore a dense comb of ill-conditioned frequencies starting low; the difference channel $C_i - C_c$ nulls early and often, so robust cancellation is confined to a narrow band. Kirkeby, Nelson, Hamada and Orduña-Bustamante's key realization (mid-1990s) was that if you move the two loudspeakers close together, subtending a small angle at the listener — they proposed about $\pm 5°$ , a total span of roughly 10° — the geometry changes the conditioning dramatically. They named this closely-spaced pair the stereo dipole.

With a small span, the extra path $\Delta r$ to the far ear shrinks, $\tau$ shrinks, and the first difference-channel null is pushed up in frequency. The plant stays well-conditioned over a much wider band, so the inverse needs far less regularization to be robust, and the usable XTC bandwidth widens. The cost is borne mostly at low frequencies (discussed below), but across the perceptually critical mid-band the closely-spaced pair is markedly better-behaved than a wide pair.

Why "dipole"

At low frequencies the two closely-spaced speakers are driven largely in anti-phase by the cancellation filters (the difference channel dominates the cross-cancellation), so the pair radiates like an acoustic dipole — two nearly-coincident opposite sources. A dipole's far-field pressure rises with frequency (its radiation is weak at low frequencies, where the opposite sources cancel each other in the far field), which is the physical face of the same low-frequency difficulty seen in the conditioning: delivering separation at low frequencies from closely-spaced sources demands large, opposed excursions for little acoustic output. The name captures both the geometry (two close sources) and the radiation behaviour (dipolar at low frequency).

Geometry and a numeric comparison

Compare the interaural path delay for a wide pair and a stereo dipole using $\Delta r \approx a(\theta + \sin\theta)$ with $a = 0.0875$ m, then the first difference-channel null near $f_1 \approx 1/(2\tau)$ (the lowest frequency at which the cross path is a half-cycle out of step over the band, a rough but instructive estimate).

Configuration	Half-angle $\theta$	$\Delta r$ (m)	$\tau$ ( $\mu$ s)	First null estimate $f_1 \approx 1/(2\tau)$
Wide stereo	30° (0.524 rad)	0.0896	261	≈ 1.9 kHz
Standard narrow	15° (0.262 rad)	0.0459	134	≈ 3.7 kHz
Stereo dipole	5° (0.0873 rad)	0.0153	45	≈ 11.2 kHz

The trend is the point: shrinking the span from 30° to 5° pushes the first deep ill-conditioning roughly from below 2 kHz up past 11 kHz, opening a wide, well-conditioned mid-band where robust, flat crosstalk cancellation is achievable with modest regularization. (These single-null estimates are simplified; the real plant has a comb of features and frequency-dependent shadowing, but the scaling of usable bandwidth with $1/\tau$ , and hence with inverse speaker span, is exactly the mechanism Kirkeby and Nelson exploited.)

Loudspeaker placement in practice

A stereo dipole is built from two small, matched loudspeakers placed close together — often in a single shared enclosure — at the same height as the listener's ears, on the median plane, typically 0.5–1.5 m in front. Matching between the two drivers is critical: any difference between $C_{LL}$ and $C_{RR}$ that is not due to head symmetry shows up as residual crosstalk the filters cannot remove. The narrow span gives a second practical benefit beyond bandwidth: because both speakers are near the median plane, the difference in their paths to a moving head changes more slowly, so the sweet spot, while still small, is a little more forgiving laterally than with wide speakers — though, as the next section shows, "a little more forgiving" is still very demanding.

The sweet spot

Why it is so small

Crosstalk cancellation is a precise interference effect: the anti-signal from the opposite speaker must arrive at the far ear with the right delay and amplitude to cancel the leakage. That cancellation is only exact at the position for which the plant $\mathbf{C}$ was measured or modelled — the sweet spot. Move the head, and every path length $r_{ij}$ changes, so every delay and phase in $\mathbf{C}$ changes, but the filter $\mathbf{H}$ is fixed. The cancellation that was a clean null becomes a partial, mistuned subtraction, and the residual crosstalk reappears.

How small is the tolerance? Cancellation of a contralateral path requires phase accuracy to a fraction of a wavelength. At a target upper frequency $f$ , a positional error $\delta$ that changes a path by $\delta$ produces a phase error $2\pi f \delta / c$ . For the null to remain effective the phase error should stay well under, say, a quarter cycle ( $\pi/2$ ), i.e.

\delta \lesssim \frac{c}{8 f}.

At $f = 6$ kHz, $\delta \lesssim 343/(8\times 6000) \approx 7$ mm. At $f = 10$ kHz it is about 4 mm. In other words, to keep high-frequency cancellation intact the listener's head must hold position to within a few millimetres — less than the width of a fingertip. Lower frequencies tolerate more (at 1 kHz, $\delta \lesssim 43$ mm), which is why the high end of XTC collapses first as you lean. A small rotation of the head is just as damaging, because it asymmetrically changes the four paths and breaks the left-right symmetry the filters assume.

A numeric illustration of misalignment

Return to the single-frequency cancellation with $\beta = 0.85$ . Suppose the head shifts so that the cross-path delay used by the fixed filter is now wrong by a phase of $\phi$ at the frequency of interest. The residual after cancellation is proportional to $|1 - e^{-j\phi}| = 2|\sin(\phi/2)|$ times the crosstalk magnitude. With the design giving, say, 20 dB of cancellation at the sweet spot (residual factor 0.1), a lateral shift producing only $\phi = 30°$ ( $0.52$ rad) gives an extra mismatch term $2\sin(15°) = 0.52$ . The cancellation depth collapses from 20 dB toward roughly $20\log_{10}(0.52\times 0.85) \approx -7$ dB — i.e. from near-silence to crosstalk only 7 dB down: the spatial illusion is essentially gone.

One listener, one position

A 30° phase error at 6 kHz corresponds to a path change of just $30/360 \times (343/6000) = 4.8$ mm. This is the quantitative reason transaural is a one-listener, one-position technology.

Head tracking moves the sweet spot

If the sweet spot is small and fixed, the obvious fix is to make it follow the listener. Head-tracked XTC measures the listener's head position and orientation in real time (camera-based face tracking, infrared markers, or inertial sensors) and continuously recomputes the plant $\mathbf{C}(\text{pose})$ and its regularized inverse $\mathbf{H} = \mathbf{C}^{-1}$ , so the cancellation null stays centred on the moving ears. This is the same idea used in head-tracked binaural rendering, applied to the loudspeaker decode rather than the headphone decode.

Head-tracked transaural transforms the technique from a laboratory curiosity into something usable at a desk, because the listener no longer has to sit frozen. The engineering challenges are latency (the filter update must keep pace with head motion or the null lags behind the ears), filter interpolation (switching filters must not click or zipper), and the need for either a measured HRTF set or a parametric head model to generate $\mathbf{C}$ for arbitrary poses. It does not enlarge the instantaneous sweet spot — it relocates it fast enough to stay under the listener. It also remains fundamentally single-listener: a second person sitting beside the tracked listener is in an anti-sweet-spot and hears scrambled, doubly-crosstalked sound.

Spectral colouration and bandwidth limits in practice

Where the colouration comes from

Even at the sweet spot a transaural system rarely sounds perfectly neutral, and the reasons are structural. First, the inverse filter $\mathbf{H}$ equalizes the loudspeaker HRTFs (the plant), but the regularization deliberately leaves the response un-flattened in the ill-conditioned bands, so those bands are coloured by design. Second, the sum and difference channels are equalized to different effective targets, and any error in the assumed plant (a non-ideal room, a head a little larger or smaller than the model) shows as a frequency-dependent gain error — typically a series of peaks and dips, because plant errors interact with the near-nulls of the difference channel to produce comb-like ripple. Third, the binaural signal itself already carries the intended HRTF colour (the source-direction spectral cues), and any spectral mismatch between the HRTF used for authoring and the head doing the listening shows up as timbral shift, exactly as in headphone binaural.

The low- and high-frequency walls

Transaural has natural band edges. At the low end, the stereo dipole radiates inefficiently (dipole roll-off) and the plant is poorly conditioned because head shadowing vanishes and $\beta \to 1$ ; full cancellation there would demand enormous anti-phase excursion. Practical systems therefore relax XTC below a few hundred hertz, where the wavelength is long, interaural cues are weak, and crosstalk matters least perceptually. Below roughly 100–300 Hz one typically lets the system behave like ordinary stereo or mono, which is acoustically benign because localization at those frequencies is dominated by ITD on slowly varying envelopes, not by the fine structure XTC protects.

At the high end, the sweet spot shrinks as $1/f$ (the few-millimetre tolerance computed above), HRTF detail becomes intensely individual, and tiny plant errors cause large phase errors. Robust cancellation usually fades out somewhere between 6 and 12 kHz depending on geometry and regularization. Between these walls lies the usable XTC passband — broadly a few hundred hertz up to several kilohertz — which fortunately overlaps the band most important for the ITD/ILD localization cues described in psychoacoustics.

Quantifying the trade-off

A compact way to summarize practical XTC performance is the achievable channel separation versus spectral flatness as regularization is varied. The table below gives representative orders of magnitude for a stereo-dipole desktop system; exact figures depend on the plant, the room, and the regularization profile.

Design stance	Peak channel separation	Usable XTC bandwidth	Spectral colouration	Robustness to head motion
Aggressive (low regularization)	20–30 dB	narrow, with spikes	severe (deep combs)	very poor (a few mm)
Balanced (shaped regularization)	12–18 dB	~300 Hz – 7 kHz	moderate	poor (~1 cm)
Conservative (high regularization)	6–10 dB	broad but shallow	mild	fair (a few cm)

The practical takeaway: 10–20 dB of separation over a well-chosen mid-band, kept flat enough to avoid obvious colouration, is a realistic and useful target. That is far less than the 60+ dB of isolation headphones provide for free, but it is enough to externalize images and to pull phantom sources well outside the loudspeaker span — which ordinary stereo cannot do at all.

Implementations and use cases

Desktop 3D audio

The stereo dipole's natural home is the desk: a single listener, roughly fixed, a short distance from two closely-spaced loudspeakers or a pair near a monitor. Historically this is the configuration that revived transaural in the 1990s — Cooper and Bauck's "transaural stereo," Gardner's MIT work on "3-D audio using loudspeakers," and the Kirkeby/Nelson stereo dipole all targeted the seated, single-listener case. Modern desktop implementations combine an HRTF-based binaural renderer (the encode) with a regularized XTC filter network (the decode), optionally with camera head-tracking to keep the null on the listener. The payoff is externalized, out-of-head 3-D imagery from two speakers — virtual sources beside and even behind the listener — that flat stereo cannot deliver. DAM Audio's implementation and measurements are documented under XTC / transaural crosstalk cancellation.

Soundbar and TV virtualization

The most widespread consumer use of transaural is the virtual surround soundbar. A soundbar places several drivers in a single cabinet under the television, very close together — a stereo dipole by construction. By feeding the drivers crosstalk-cancelled binaural renders of surround or height channels, a soundbar can place virtual sources well outside its physical width: at the sides for surround channels, and above for height channels in object-based formats. The narrow driver spacing that makes a soundbar visually convenient is exactly the stereo-dipole geometry that makes XTC robust over a wide mid-band, which is why transaural and soundbars are such a natural pairing. The limitation is equally inherent: the effect is strongest for one centrally-seated viewer and degrades for off-axis seats — the sweet-spot problem in the living room. Many products blend XTC virtualization with acoustic beam-steering and wall reflections to spread the effect, trading cancellation depth for a larger, if vaguer, listening area.

Comparison with headphones

It is worth being explicit about why anyone would do this rather than simply hand the listener a pair of headphones, since headphones deliver the binaural encode with essentially perfect channel isolation and no crosstalk problem at all.

Isolation: headphones give 40–60+ dB of channel separation for free; transaural fights to reach 10–20 dB over a limited band. On pure cue fidelity, headphones win decisively.
Externalization and naturalness: loudspeaker reproduction adds the listener's own outer-ear and room cues to the sound arriving from real external sources, which can make transaural images externalize more naturally and avoid the in-head, frontally-collapsed quality that plagues non-individualized headphone binaural. The sound genuinely comes from out there.
Comfort and sharing: no hardware on the head; in principle multiple people can be in the room (though only one is in the sweet spot).
Robustness: headphones are immune to head translation and to the room; transaural is exquisitely sensitive to both.

Key takeaway

The two are best seen as complementary delivery routes for the same binaural encode: the encode (HRTF authoring) is shared, and only the decode differs — direct to the ears for headphones, or through an inverted plant for loudspeakers. Transaural is the choice when you want binaural spatialization without headphones and can accept a single, mostly-stationary listener.

A worked example: crosstalk smear and what XTC restores

To make the whole chain concrete, follow a single binaural front image through (a) naive playback over speakers and (b) XTC playback, and see what is lost and restored.

The intended image

We author a virtual source at $-30°$ azimuth (front-left) by filtering a mono signal $s$ with the corresponding HRTFs. Suppose at 2 kHz the source-direction HRTFs give the near (left) ear a level of $0$ dB and the far (right) ear a level of $-9$ dB with an ITD of $0.28$ ms (left ear leading). So the intended eardrum signals at 2 kHz are

b_L = 1.00\,e^{j 0}, \qquad b_R = 0.355\, e^{-j\,2\pi f\,(0.28\,\text{ms})}.

At $f = 2$ kHz, $2\pi f \times 0.28\,\text{ms} = 2\pi \times 2000 \times 0.00028 = 3.52$ rad $\approx 202°$ . So the target is a left ear leading by a large phase and 9 dB louder than the right — a clear front-left percept.

Naive playback: what the ears actually get

Send $v_L = b_L$ , $v_R = b_R$ straight to loudspeakers at $\pm 5°$ (a stereo dipole geometry) without XTC. The eardrum pressures are $\mathbf{p} = \mathbf{C}\mathbf{b}$ . Using the symmetric plant with, at 2 kHz, ipsilateral $C_i = 1$ (reference) and contralateral $C_c = 0.7\,e^{-j 2\pi f \tau}$ with $\tau = 45\ \mu$ s (from the dipole table). Here $2\pi f \tau = 2\pi\times 2000\times 45\times 10^{-6} = 0.565$ rad $\approx 32°$ , so $C_c = 0.7\,e^{-j32°}$ . The left eardrum receives

p_L = C_i b_L + C_c b_R = (1)(1.00\angle 0°) + (0.7\angle{-32°})(0.355\angle{-202°}).

The second term has magnitude $0.7\times 0.355 = 0.249$ and angle $-234°$ (i.e. $+126°$ ). Adding $1.00\angle 0° + 0.249\angle 126°$ : real part $1.00 + 0.249\cos126° = 1.00 - 0.146 = 0.854$ ; imaginary part $0.249\sin126° = 0.201$ ; magnitude $\sqrt{0.854^2+0.201^2} = 0.877$ , phase $+13°$ . The right eardrum receives

p_R = C_c b_L + C_i b_R = (0.7\angle{-32°})(1.00\angle 0°) + (1)(0.355\angle{-202°}).

First term $0.7\angle{-32°}$ : real $0.594$ , imag $-0.371$ . Second term $0.355\angle{-202°} = 0.355\angle 158°$ : real $-0.329$ , imag $+0.133$ . Sum: real $0.265$ , imag $-0.238$ ; magnitude $0.356$ , phase $-42°$ .

Now compare delivered to intended. Intended interaural level difference at 2 kHz: $20\log_{10}(1.00/0.355) = 9.0$ dB. Delivered ILD: $20\log_{10}(0.877/0.356) = 7.8$ dB — partly preserved here, but the phases are wrecked: intended interaural phase was $0° - (-202°) = +202°$ (a large ITD-bearing difference), whereas delivered interaural phase is $13° - (-42°) = 55°$ . The crosstalk has injected a spurious second arrival at each ear, collapsing the interaural phase difference from $202°$ toward a much smaller value and adding frequency-dependent comb ripple (work the same arithmetic at 1.5 and 3 kHz and the ILD and phase errors swing differently at each frequency — that is the comb colouration). The percept pulls in toward the loudspeakers (front-centre, narrow) and acquires a hollow timbre. The carefully authored front-left image is smeared.

XTC playback: restoring the target

Now insert the regularized inverse $\mathbf{H} = \mathbf{C}^{-1}$ before the speakers, so $\mathbf{v} = \mathbf{H}\mathbf{b}$ and the eardrum pressures become $\mathbf{p} = \mathbf{C}\mathbf{H}\mathbf{b} = \mathbf{b}$ (exactly, in the unregularized ideal; approximately, with regularization). With the closed-form inverse at 2 kHz: $\det = C_i^2 - C_c^2 = 1 - (0.7\angle{-32°})^2 = 1 - 0.49\angle{-64°}$ . Compute $0.49\angle{-64°}$ : real $0.215$ , imag $-0.440$ , so $\det = 0.785 + j0.440 = 0.900\angle 29°$ . The filter applies diagonal gain $C_i/\det = 1/0.900\angle{-29°} = 1.111\angle{-29°}$ and cross gain $-C_c/\det = -0.7\angle{-32°}/0.900\angle29° = -0.778\angle{-61°} = 0.778\angle119°$ .

The point of the algebra is not the intermediate numbers but the outcome: the delivered eardrum signals return to $p_L = 1.00\angle0°$ and $p_R = 0.355\angle{-202°}$ — the intended binaural pair, with the full $202°$ interaural phase and the full $9$ dB ILD restored. The spurious crosstalk arrival has been actively cancelled by the cross term injected into the opposite speaker, and the front-left image snaps back into focus, externalized outside the narrow dipole. The cost we paid is visible in the filter gains ( $1.11$ and $0.78$ here, modest at this benign frequency, but rising toward the singular bands where regularization caps them) and in the few-millimetre sweet spot within which this cancellation holds. That is transaural in one example: crosstalk smears the binaural front image; the inverted plant restores it, for one listener, in one place, over one band.

Limits and pitfalls

Inherent limits

Single listener, single position. The sweet spot is millimetres wide at high frequencies. Only head tracking relocates it, and even then only for the tracked person; bystanders are in anti-sweet-spots.
Limited cancellation depth. Real systems achieve roughly 10–20 dB of separation over a mid-band, not the 40–60 dB of headphones. The illusion is therefore softer and more fragile than headphone binaural.
Band-limited. XTC fades below a few hundred hertz (dipole roll-off, $\beta\to1$ ) and above 6–12 kHz (millimetre sweet spot, individual HRTFs). The protected band is mid-range.
Room-dependent. The plant model usually assumes the direct paths only; strong early reflections add un-modelled crosstalk that the filters cannot cancel and that re-smears the image. Transaural wants a relatively dead, reflection-controlled listening position — see reverberation and direct, diffuse and envelopment.
HRTF individualization. The binaural encode and the plant both depend on the listener's anatomy; non-individual HRTFs cause front-back confusion and timbral error, just as in headphone binaural.

Common mistakes

Over-aggressive regularization tuning. Chasing maximum separation with tiny $\beta_{\text{reg}}$ yields a measurement that looks great and a listening experience that is brittle, coloured, and exhausting. Shaped, frequency-dependent regularization beats a flat aggressive setting.
Mismatched loudspeakers. XTC assumes plant symmetry; two drivers that differ by even 1 dB or a few microseconds inject residual crosstalk the filters cannot remove. Match and time-align the pair.
Wide speaker placement. Using widely-spaced stereo speakers ( $\pm30°$ ) for XTC starts the ill-conditioning combs below 2 kHz and demands heavy regularization. If transaural is the goal, narrow the span toward a stereo dipole.
Ignoring the room. Placing the dipole near a hard desk surface or a wall introduces a strong reflection that acts as a third, un-cancelled source. Treat the first reflection points or pull the speakers away from boundaries.
No head tracking, then blaming the algorithm. Much disappointing transaural is simply a listener sitting a few centimetres off the design point. Either fix the seating precisely or track the head.
Forgetting it is still binaural. Transaural cannot rescue a bad binaural encode; garbage HRTFs in, garbage spatial image out. The decode only preserves what the encode provides.

When to choose transaural

Choose transaural when you need binaural-style 3-D imagery without headphones, for one listener who can stay roughly put: desktop 3-D audio, a personal monitoring station, a single-seat demo, or a soundbar's virtual surround. Prefer headphones when isolation and robustness matter more than going hands-free. Prefer amplitude-panning methods — stereo, amplitude panning, surround — or Ambisonics and wave field synthesis when you need a large listening area or many listeners, since those tolerate movement that XTC cannot. And remember the unifying view: transaural shares the binaural encode with headphone playback and differs only in the decode — an inverted acoustic plant standing between the loudspeakers and the ears. Related formats and their channel/object structure are catalogued in formats, and the broader case that even two channels already carry spatial information is made in stereo is already spatial.

References

Atal, B. S., and Schroeder, M. R. (1966). "Apparent Sound Source Translator." U.S. Patent 3,236,949 — the originating concept of loudspeaker crosstalk cancellation.
Bauck, J., and Cooper, D. H. (1996). "Generalized Transaural Stereo and Applications." Journal of the Audio Engineering Society, 44(9), 683–705.
Cooper, D. H., and Bauck, J. L. (1989). "Prospects for Transaural Recording." Journal of the Audio Engineering Society, 37(1/2), 3–19.
Kirkeby, O., Nelson, P. A., and Hamada, H. (1998). "The 'Stereo Dipole': A Virtual Source Imaging System Using Two Closely Spaced Loudspeakers." Journal of the Audio Engineering Society, 46(5), 387–395.
Nelson, P. A., Hamada, H., and Elliott, S. J. (1992). "Adaptive Inverse Filters for Stereophonic Sound Reproduction." IEEE Transactions on Signal Processing, 40(7), 1621–1632.
Kirkeby, O., and Nelson, P. A. (1999). "Digital Filter Design for Inversion Problems in Sound Reproduction." Journal of the Audio Engineering Society, 47(7/8), 583–595.
Gardner, W. G. (1998). 3-D Audio Using Loudspeakers. Boston: Kluwer Academic Publishers.
Blauert, J. (1997). Spatial Hearing: The Psychophysics of Human Sound Localization (Revised ed.). Cambridge, MA: MIT Press.
Møller, H. (1992). "Fundamentals of Binaural Technology." Applied Acoustics, 36(3–4), 171–218.

← Back to Spatialization Techniques

The problem: crosstalk destroys the binaural cues​

What binaural assumes​

What loudspeakers actually do​

The goal stated precisely​

The acoustic transfer matrix​

The 2×2 plant​

Symmetry and the ideal listening geometry​

A first-principles model of the two paths​

Crosstalk cancellation: inverting the plant​

The inverse filter​

The recursive cancellation picture​

A worked conceptual example​

Ill-conditioning and regularization​

Why the inverse blows up​

The condition number​

Regularization: trading cancellation for robustness​

The bandwidth/colouration trade-off​

The stereo dipole (Kirkeby and Nelson)​

The insight: bring the speakers together​

Why "dipole"​

Geometry and a numeric comparison​

Loudspeaker placement in practice​

The sweet spot​

Why it is so small​

A numeric illustration of misalignment​

Head tracking moves the sweet spot​

Spectral colouration and bandwidth limits in practice​

Where the colouration comes from​

The low- and high-frequency walls​

Quantifying the trade-off​

Implementations and use cases​

Desktop 3D audio​

Soundbar and TV virtualization​

Comparison with headphones​

A worked example: crosstalk smear and what XTC restores​

The intended image​

Naive playback: what the ears actually get​

XTC playback: restoring the target​

Limits and pitfalls​

Inherent limits​

Common mistakes​

When to choose transaural​

References​