Skip to main content

Transaural - Binaural over Loudspeakers

Transaural reproduction is the art of delivering a binaural signal — a pair of pressures designed to arrive, one each, at the listener's two eardrums — using ordinary loudspeakers instead of headphones. It sounds like it should be impossible. Headphones work precisely because each transducer is acoustically sealed to one ear: the left earpiece feeds the left eardrum and essentially nothing else. Loudspeakers have no such isolation. Each speaker radiates into a room and reaches both ears, and the two unwanted paths — left speaker to right ear, right speaker to left ear — are exactly what a binaural signal must not have. Transaural systems solve this by anticipating the leakage and pre-cancelling it: the loudspeaker feeds are computed so that, after the room has done its crosstalk mixing, what remains at each eardrum is the intended binaural signal and only that.

This is the same encode-then-decode structure that organizes every technique in this part of the guide. The encode is binaural authoring: a signal pair carrying the head-related cues for a desired direction, exactly as in binaural reproduction. The decode is new and specific to transaural: a crosstalk-cancellation (XTC) filter network that inverts the acoustic transfer matrix from the loudspeakers to the ears, so the binaural pair survives the trip through the air. Understanding transaural therefore means understanding that transfer matrix, why its inverse is numerically dangerous, and what compromises — regularization, narrow speaker spacing, head tracking — make the dangerous inverse usable in a real room.

This chapter builds the theory from first principles, works through the algebra of a 2×2 cancellation with numbers, then turns to the engineering realities: ill-conditioning, the stereo dipole of Kirkeby and Nelson, the tiny sweet spot, spectral colouration, and the practical systems (desktop 3D audio, soundbar virtualization) that ship transaural today. DAM Audio's work on this topic is collected under XTC / transaural crosstalk cancellation.

The problem: crosstalk destroys the binaural cues

What binaural assumes

A binaural signal is a pair of time- and frequency-dependent pressures, bL(t)b_L(t) and bR(t)b_R(t), constructed so that if bLb_L reaches only the left eardrum and bRb_R reaches only the right eardrum, the listener hears a virtual source at the intended location. The construction is direction-dependent filtering by head-related transfer functions (HRTFs). For a source at direction θ\theta feeding a monophonic signal ss, the binaural pair is

bL=HL(θ)s,bR=HR(θ)s,b_L = H_L(\theta)\, s, \qquad b_R = H_R(\theta)\, s,

where HLH_L and HRH_R are the left- and right-ear HRTFs. The information the brain uses — the interaural time difference (ITD), the interaural level difference (ILD), and the spectral notches and peaks of the pinna — is encoded entirely in the difference and detail of this pair. The fundamentals are developed in psychoacoustics and the authoring in binaural. The crucial property for this chapter is the channel-to-ear isolation assumption baked into the equations above: bLb_L \to left ear, bRb_R \to right ear, with no mixing. Headphones honour that assumption. Loudspeakers, by their nature, violate it.

What loudspeakers actually do

Place two loudspeakers in front of a listener. Drive the left speaker with some signal vLv_L and the right with vRv_R. The pressure at the left eardrum is not vLv_L. It is the sum of two contributions: the ipsilateral path from the left speaker (short, direct, same side) and the contralateral path from the right speaker (longer, around the head, opposite side). The same is true at the right ear. In symbols, with CijC_{ij} denoting the acoustic transfer function from speaker jj to ear ii,

pL=CLLvL+CLRvR,pR=CRLvL+CRRvR.p_L = C_{LL}\, v_L + C_{LR}\, v_R, \qquad p_R = C_{RL}\, v_L + C_{RR}\, v_R.

The two unwanted terms, CLRvRC_{LR} v_R and CRLvLC_{RL} v_L, are the crosstalk. They are not small. For loudspeakers at ±30°\pm 30° the extra path length to the far ear is on the order of 15–20 cm, the level is only a few decibels down from the direct path at low and mid frequencies, and the head shadow that attenuates the far-ear signal is significant only above roughly 1.5 kHz. So if you simply send a binaural signal to two loudspeakers — vL=bLv_L = b_L, vR=bRv_R = b_R — each eardrum receives its intended signal plus a delayed, head-filtered copy of the other channel.

Crosstalk collapses the image

That contamination is catastrophic for the binaural cues. The ITD encoded in bLb_L versus bRb_R relies on a clean comparison of arrival times at the two ears; the crosstalk superimposes a second arrival at each ear, with its own delay set by the loudspeaker geometry, not by the intended source direction. The ILD is similarly corrupted, and the pinna spectral cues are smeared by comb filtering between the direct and crosstalk paths. The result is a collapsed, frontally-locked, comb-coloured image — the familiar "phantom-centre plus narrow stage" of ordinary stereo, not the open spatial scene the binaural signal described. Crosstalk does not merely degrade binaural-over-speakers; it reduces it to conventional stereo.

The goal stated precisely

Transaural's objective is to choose loudspeaker feeds vL,vRv_L, v_R such that the eardrum pressures equal the binaural target:

pL=bL,pR=bR.p_L = b_L, \qquad p_R = b_R.

Because the air imposes the mixing p=Cv\mathbf{p} = \mathbf{C}\,\mathbf{v}, the feeds we must send are v=C1b\mathbf{v} = \mathbf{C}^{-1}\mathbf{b}. Everything that follows — the conditioning problems, the regularization, the dipole geometry, the tiny sweet spot — flows from the difficulty of building and using that inverse C1\mathbf{C}^{-1} in a real acoustic system. The encode is binaural; the decode is the inverse of the room's own crosstalk.

The acoustic transfer matrix

The 2×2 plant

Collect the four paths into a frequency-domain matrix. At each frequency ω\omega,

C(ω)=[CLLCLRCRLCRR],p=Cv.\mathbf{C}(\omega) = \begin{bmatrix} C_{LL} & C_{LR} \\ C_{RL} & C_{RR} \end{bmatrix}, \qquad \mathbf{p} = \mathbf{C}\,\mathbf{v}.

This C\mathbf{C} is called the plant matrix or the acoustic transfer matrix. Its diagonal entries CLL,CRRC_{LL}, C_{RR} are the ipsilateral (direct) paths; its off-diagonal entries CLR,CRLC_{LR}, C_{RL} are the contralateral (crosstalk) paths. Each entry is itself a complete transfer function: a delay (the time of flight from speaker to ear), an amplitude (governed by distance attenuation 1/r\propto 1/r and head shadowing), and a complex frequency response (the HRTF for the loudspeaker's direction relative to the head, including pinna and torso effects). In other words, the plant entries are themselves HRTFs — specifically, the HRTFs for the loudspeaker positions.

Symmetry and the ideal listening geometry

For a listener seated symmetrically between two mirror-image loudspeakers (the canonical equilateral or near-equilateral arrangement), the geometry is left-right symmetric. Then the ipsilateral paths are equal, CLL=CRRCiC_{LL} = C_{RR} \equiv C_i, and the contralateral paths are equal, CLR=CRLCcC_{LR} = C_{RL} \equiv C_c. The plant simplifies to a symmetric matrix

C=[CiCcCcCi],\mathbf{C} = \begin{bmatrix} C_i & C_c \\ C_c & C_i \end{bmatrix},

with CiC_i the ipsilateral (direct, near-ear) response and CcC_c the contralateral (crosstalk, far-ear) response. This symmetry is enormously convenient: it lets us diagonalize the problem into a sum channel and a difference channel, which is the standard way to analyze and to implement XTC, and it makes the conditioning behaviour transparent. We will use it throughout, returning to the asymmetric case only when discussing head tracking and off-centre listeners.

A first-principles model of the two paths

To get intuition and numbers, model the ipsilateral and contralateral paths by their dominant features: a delay and a level. Let the direct path from a speaker to its near ear have length rir_i and the cross path to the far ear have length rc=ri+Δrr_c = r_i + \Delta r, where Δr\Delta r is the extra distance the sound travels around the head. With sound speed c343c \approx 343 m/s, the interaural path delay for the loudspeaker is

τ=Δrc.\tau = \frac{\Delta r}{c}.

A spherical-head approximation for a source at azimuth θ\theta from the median plane gives the classic Woodworth extra-path estimate

Δra(θ+sinθ),\Delta r \approx a\,(\theta + \sin\theta),

with a0.0875a \approx 0.0875 m the effective head radius and θ\theta in radians. For loudspeakers at θ=30°=0.524\theta = 30° = 0.524 rad: sin30°=0.5\sin 30° = 0.5, so Δr0.0875(0.524+0.5)=0.0896\Delta r \approx 0.0875\,(0.524 + 0.5) = 0.0896 m, about 9 cm, giving

τ=0.0896343261 μs.\tau = \frac{0.0896}{343} \approx 261\ \mu\text{s}.

The level difference between the cross and direct paths has two parts: spreading loss 1/r1/r and head shadowing. Spreading alone for ri1.0r_i \approx 1.0 m and rc1.09r_c \approx 1.09 m is only 20log10(1.0/1.09)0.720\log_{10}(1.0/1.09) \approx -0.7 dB — negligible. Head shadowing is frequency-dependent and dominates above ~1.5 kHz, reaching 5–15 dB at high frequencies but near 0 dB at low frequencies. So at low frequencies the crosstalk arrives only ~0.7 dB quieter than the direct sound and ~0.26 ms later: a strong, nearly-equal interfering copy. This is precisely why crosstalk is so destructive, and why cancelling it is hard at low frequencies and at high frequencies for opposite reasons, as the conditioning analysis will show.

Crosstalk cancellation: inverting the plant

The inverse filter

We want p=b\mathbf{p} = \mathbf{b} with p=Cv\mathbf{p} = \mathbf{C}\mathbf{v}, so the crosstalk-cancellation filter network is the matrix inverse of the plant:

v=Hb,H=C1.\mathbf{v} = \mathbf{H}\,\mathbf{b}, \qquad \mathbf{H} = \mathbf{C}^{-1}.

H\mathbf{H} is a 2×2 matrix of filters applied to the binaural pair before sending it to the loudspeakers. For the symmetric plant, the inverse has a clean closed form. With detC=Ci2Cc2\det \mathbf{C} = C_i^2 - C_c^2,

H=C1=1Ci2Cc2[CiCcCcCi].\mathbf{H} = \mathbf{C}^{-1} = \frac{1}{C_i^2 - C_c^2} \begin{bmatrix} C_i & -C_c \\ -C_c & C_i \end{bmatrix}.

Read this physically. The diagonal term Ci/(Ci2Cc2)C_i/(C_i^2 - C_c^2) is a frequency-shaping filter that equalizes the direct path. The off-diagonal term Cc/(Ci2Cc2)-C_c/(C_i^2 - C_c^2) is the active cancellation: into the opposite loudspeaker it injects a signal that is the negative of the crosstalk, scaled and filtered so that when it arrives at the far ear it destructively cancels the unwanted leakage. The "−" sign is the whole idea: to silence the contralateral leakage of the left channel into the right ear, drive the right speaker with an inverted, appropriately delayed and filtered anti-signal.

The recursive cancellation picture

There is a beautifully intuitive way to see crosstalk cancellation that predates the matrix formulation and explains the convergence. Suppose we want only bLb_L at the left ear and silence at the right ear. Send bLb_L from the left speaker. It reaches the left ear (good) but also leaks to the right ear via Cc/CiC_c/C_i relative to the direct. To cancel that leakage, send from the right speaker a signal (Cc/Ci)bL-(C_c/C_i)\,b_L, time-aligned and filtered, so it arrives at the right ear and cancels. But that cancelling signal itself leaks back to the left ear, where it is unwanted, so we must add a correction from the left speaker to cancel that, which leaks again to the right ear, and so on. Each round trip multiplies the residual by the crosstalk-to-direct ratio

β(ω)=Cc(ω)Ci(ω),\beta(\omega) = \frac{C_c(\omega)}{C_i(\omega)},

so the total cancellation is a geometric series

1β+β2β3+=11+β.1 - \beta + \beta^2 - \beta^3 + \cdots = \frac{1}{1 + \beta}.

The series converges whenever β<1|\beta| < 1, i.e. whenever the crosstalk is weaker than the direct sound — which is true for sensible geometries. Summing the geometric series recovers exactly the matrix inverse above. This recursive view is due in spirit to Atal and Schroeder and was the basis of the earliest analog cancellers; it makes plain that XTC is fundamentally an acoustic feedback cancellation whose convergence rate depends on how close the crosstalk level is to the direct level.

A worked conceptual example

Take a single frequency where we model the paths by magnitude and delay only. Let the direct path be unit gain at zero reference delay, and the crosstalk be β=0.85\beta = 0.85 in magnitude with the extra delay τ=0.26\tau = 0.26 ms computed earlier; at this frequency suppose the phase of β\beta is such that we can treat it as a real attenuation for illustration. We want bL=1b_L = 1 at the left ear and 00 at the right ear.

Round 0: left speaker emits 11. Left ear receives 11 (target met); right ear receives β=0.85\beta = 0.85 (unwanted).

Round 1: right speaker emits 0.85-0.85 to cancel. Right ear now receives 0.850.85=00.85 - 0.85 = 0; but this signal leaks to the left ear as 0.85×0.85=0.7225-0.85 \times 0.85 = -0.7225, corrupting the target.

Round 2: left speaker adds +0.7225+0.7225 to restore the left ear; this leaks to the right ear as +0.7225×0.85=+0.614+0.7225 \times 0.85 = +0.614, breaking the cancellation again.

Continuing, the left-ear signal is 10.7225+0.522=1/(1β2)×(direct terms)1 - 0.7225 + 0.522 - \cdots = 1/(1-\beta^2) \times \text{(direct terms)} and the residual at the right ear tends to zero. The net loudspeaker drive required is the matrix-inverse solution. Plugging Ci=1C_i = 1, Cc=0.85C_c = 0.85 into the closed form, det=10.7225=0.2775\det = 1 - 0.7225 = 0.2775, so the diagonal filter gain is 1/0.2775=3.601/0.2775 = 3.60 and the off-diagonal is 0.85/0.2775=3.06-0.85/0.2775 = -3.06. To play a binaural pair through this geometry we must therefore boost the loudspeaker drive by about 20log10(3.60)1120\log_{10}(3.60) \approx 11 dB relative to the binaural signal, and inject an anti-phase cross term nearly as large. That large boost — the inverse of a small determinant Ci2Cc2C_i^2 - C_c^2 when CcC_c approaches CiC_i — is the seed of every practical problem with transaural, and it is the subject of the next section.

Ill-conditioning and regularization

Why the inverse blows up

The cancellation gain scales with 1/(Ci2Cc2)=1/(Ci2(1β2))1/(C_i^2 - C_c^2) = 1/(C_i^2(1 - \beta^2)). Whenever β1|\beta| \to 1 — the crosstalk approaches the direct path in magnitude and aligns in phase — the determinant approaches zero and the inverse filter demands enormous gain.

Ill-conditioning amplifies every error

This is ill-conditioning: small acoustic quantities in the denominator produce huge filter responses, and tiny errors in the measured or modelled plant (a head that moved a centimetre, a room reflection, an HRTF mismatch) get amplified by the same huge factor.

Crucially, β(ω)\beta(\omega) is frequency-dependent, and there are specific frequencies where the direct and cross paths interfere most severely. Because the crosstalk is delayed by τ\tau, the relative phase between direct and cross is ωτ\omega \tau, which sweeps through 0,π,2π,0, \pi, 2\pi, \ldots as frequency rises. At frequencies where the contralateral path is in phase with what the inverse needs to cancel and nearly equal in magnitude, the plant is near-singular and the required boost spikes. Equivalently, in the sum/difference decomposition the plant's two eigen-responses are Ci+CcC_i + C_c (sum channel) and CiCcC_i - C_c (difference channel); the difference channel has deep nulls wherever CiCcC_i \approx C_c, and inverting a null means infinite gain. These nulls occur roughly at frequencies where the wavelength makes the head-around path interfere destructively — concentrated at low frequencies (where β1\beta \to 1 because head shadowing vanishes) and at a comb of mid/high frequencies set by τ\tau.

The condition number

The severity is quantified by the condition number of C\mathbf{C}, the ratio of its largest to smallest singular value:

κ(C)=σmaxσmin=Ci+CcCiCc(symmetric case).\kappa(\mathbf{C}) = \frac{\sigma_{\max}}{\sigma_{\min}} = \left| \frac{C_i + C_c}{C_i - C_c} \right| \quad \text{(symmetric case)}.

When κ\kappa is large the inversion is unstable: the system pours energy into the difference channel to no audible benefit, amplifies measurement error, and produces wild loudspeaker excursions. When κ1\kappa \approx 1 the plant is well-behaved. Robust XTC design is essentially the project of keeping κ(ω)\kappa(\omega) small across as wide a band as possible — and the two main levers are regularization (this section) and loudspeaker geometry (the stereo dipole, next section).

Regularization: trading cancellation for robustness

The cure for an explosive inverse is to refuse to fully invert near the singularities. Instead of C1\mathbf{C}^{-1} we use a regularized inverse, the Tikhonov / least-squares solution

H=(CHC+βreg(ω)I)1CH,\mathbf{H} = \big(\mathbf{C}^{H}\mathbf{C} + \beta_{\text{reg}}(\omega)\,\mathbf{I}\big)^{-1}\mathbf{C}^{H},

where CH\mathbf{C}^H is the conjugate transpose and βreg(ω)0\beta_{\text{reg}}(\omega) \ge 0 is a regularization parameter (a small "effort penalty"). The term βregI\beta_{\text{reg}}\mathbf{I} keeps the matrix being inverted away from singularity: where the plant is well-conditioned and σminβreg\sigma_{\min} \gg \beta_{\text{reg}}, the regularized inverse equals the true inverse and cancellation is full; where the plant is near-singular and σmin0\sigma_{\min} \to 0, the penalty dominates, the gain is capped at roughly 1/βreg1/\sqrt{\beta_{\text{reg}}}, and the system gracefully gives up trying to cancel rather than blowing up.

Equivalently, each singular value σk\sigma_k of the plant is inverted not as 1/σk1/\sigma_k but as the filtered version

σkσk2+βreg,\frac{\sigma_k}{\sigma_k^2 + \beta_{\text{reg}}},

which behaves like 1/σk1/\sigma_k for large σk\sigma_k and like σk/βreg0\sigma_k/\beta_{\text{reg}} \to 0 for small σk\sigma_k. The penalty thus rolls off the inversion exactly in the troublesome bands. This is mathematically identical to the constrained-optimization view of Kirkeby and Nelson, in which one minimizes the reproduction error subject to a penalty on loudspeaker effort, with βreg\beta_{\text{reg}} the Lagrange multiplier balancing the two.

The bandwidth/colouration trade-off

Regularization is not free. Wherever the penalty suppresses the inverse, two things happen: the crosstalk is not fully cancelled there (loss of channel separation, hence weakened spatial cues in that band), and the frequency response of the delivered signal departs from flat (audible colouration). The design choice is therefore a trade-off:

  • Small βreg\beta_{\text{reg}}: maximal cancellation and the widest theoretical separation, but huge gains at the singular frequencies, severe colouration, large speaker excursion, and extreme sensitivity to head movement and plant error. Brittle.
  • Large βreg\beta_{\text{reg}}: robust, flatter, gentler on the speakers, but reduced cancellation depth and narrower usable bandwidth where the XTC actually works.
Rule of thumb

A common refinement is to make βreg(ω)\beta_{\text{reg}}(\omega) frequency-dependent — small in the mid-band where the plant is naturally well-conditioned and cancellation is most perceptually valuable, larger at the low end and at the high-frequency singular combs where full inversion would be ruinous.

This shapes the achievable XTC into a usable mid-band passband flanked by deliberately under-corrected extremes. The figure of merit Kirkeby and others use is the trade-off between channel separation (how many dB of crosstalk are removed) and spectral flatness / dynamic range; you cannot maximize both, and the regularization profile is where the engineer chooses the balance. The next section shows how geometry can shift the whole trade-off in the designer's favour, so that less regularization is needed.

The stereo dipole (Kirkeby and Nelson)

The insight: bring the speakers together

The singular frequencies of the plant are set by the interaural path delay τ\tau, which is set by the angular separation of the loudspeakers. Wide-spaced speakers (e.g. ±30°\pm 30°) give a large τ\tau and therefore a dense comb of ill-conditioned frequencies starting low; the difference channel CiCcC_i - C_c nulls early and often, so robust cancellation is confined to a narrow band. Kirkeby, Nelson, Hamada and Orduña-Bustamante's key realization (mid-1990s) was that if you move the two loudspeakers close together, subtending a small angle at the listener — they proposed about ±5°\pm 5°, a total span of roughly 10° — the geometry changes the conditioning dramatically. They named this closely-spaced pair the stereo dipole.

With a small span, the extra path Δr\Delta r to the far ear shrinks, τ\tau shrinks, and the first difference-channel null is pushed up in frequency. The plant stays well-conditioned over a much wider band, so the inverse needs far less regularization to be robust, and the usable XTC bandwidth widens. The cost is borne mostly at low frequencies (discussed below), but across the perceptually critical mid-band the closely-spaced pair is markedly better-behaved than a wide pair.

Why "dipole"

At low frequencies the two closely-spaced speakers are driven largely in anti-phase by the cancellation filters (the difference channel dominates the cross-cancellation), so the pair radiates like an acoustic dipole — two nearly-coincident opposite sources. A dipole's far-field pressure rises with frequency (its radiation is weak at low frequencies, where the opposite sources cancel each other in the far field), which is the physical face of the same low-frequency difficulty seen in the conditioning: delivering separation at low frequencies from closely-spaced sources demands large, opposed excursions for little acoustic output. The name captures both the geometry (two close sources) and the radiation behaviour (dipolar at low frequency).

Geometry and a numeric comparison

Compare the interaural path delay for a wide pair and a stereo dipole using Δra(θ+sinθ)\Delta r \approx a(\theta + \sin\theta) with a=0.0875a = 0.0875 m, then the first difference-channel null near f11/(2τ)f_1 \approx 1/(2\tau) (the lowest frequency at which the cross path is a half-cycle out of step over the band, a rough but instructive estimate).

ConfigurationHalf-angle θ\thetaΔr\Delta r (m)τ\tau (μ\mus)First null estimate f11/(2τ)f_1 \approx 1/(2\tau)
Wide stereo30° (0.524 rad)0.0896261≈ 1.9 kHz
Standard narrow15° (0.262 rad)0.0459134≈ 3.7 kHz
Stereo dipole5° (0.0873 rad)0.015345≈ 11.2 kHz

The trend is the point: shrinking the span from 30° to 5° pushes the first deep ill-conditioning roughly from below 2 kHz up past 11 kHz, opening a wide, well-conditioned mid-band where robust, flat crosstalk cancellation is achievable with modest regularization. (These single-null estimates are simplified; the real plant has a comb of features and frequency-dependent shadowing, but the scaling of usable bandwidth with 1/τ1/\tau, and hence with inverse speaker span, is exactly the mechanism Kirkeby and Nelson exploited.)

Loudspeaker placement in practice

A stereo dipole is built from two small, matched loudspeakers placed close together — often in a single shared enclosure — at the same height as the listener's ears, on the median plane, typically 0.5–1.5 m in front. Matching between the two drivers is critical: any difference between CLLC_{LL} and CRRC_{RR} that is not due to head symmetry shows up as residual crosstalk the filters cannot remove. The narrow span gives a second practical benefit beyond bandwidth: because both speakers are near the median plane, the difference in their paths to a moving head changes more slowly, so the sweet spot, while still small, is a little more forgiving laterally than with wide speakers — though, as the next section shows, "a little more forgiving" is still very demanding.

The sweet spot

Why it is so small

Crosstalk cancellation is a precise interference effect: the anti-signal from the opposite speaker must arrive at the far ear with the right delay and amplitude to cancel the leakage. That cancellation is only exact at the position for which the plant C\mathbf{C} was measured or modelled — the sweet spot. Move the head, and every path length rijr_{ij} changes, so every delay and phase in C\mathbf{C} changes, but the filter H\mathbf{H} is fixed. The cancellation that was a clean null becomes a partial, mistuned subtraction, and the residual crosstalk reappears.

How small is the tolerance? Cancellation of a contralateral path requires phase accuracy to a fraction of a wavelength. At a target upper frequency ff, a positional error δ\delta that changes a path by δ\delta produces a phase error 2πfδ/c2\pi f \delta / c. For the null to remain effective the phase error should stay well under, say, a quarter cycle (π/2\pi/2), i.e.

δc8f.\delta \lesssim \frac{c}{8 f}.

At f=6f = 6 kHz, δ343/(8×6000)7\delta \lesssim 343/(8\times 6000) \approx 7 mm. At f=10f = 10 kHz it is about 4 mm. In other words, to keep high-frequency cancellation intact the listener's head must hold position to within a few millimetres — less than the width of a fingertip. Lower frequencies tolerate more (at 1 kHz, δ43\delta \lesssim 43 mm), which is why the high end of XTC collapses first as you lean. A small rotation of the head is just as damaging, because it asymmetrically changes the four paths and breaks the left-right symmetry the filters assume.

A numeric illustration of misalignment

Return to the single-frequency cancellation with β=0.85\beta = 0.85. Suppose the head shifts so that the cross-path delay used by the fixed filter is now wrong by a phase of ϕ\phi at the frequency of interest. The residual after cancellation is proportional to 1ejϕ=2sin(ϕ/2)|1 - e^{-j\phi}| = 2|\sin(\phi/2)| times the crosstalk magnitude. With the design giving, say, 20 dB of cancellation at the sweet spot (residual factor 0.1), a lateral shift producing only ϕ=30°\phi = 30° (0.520.52 rad) gives an extra mismatch term 2sin(15°)=0.522\sin(15°) = 0.52. The cancellation depth collapses from 20 dB toward roughly 20log10(0.52×0.85)720\log_{10}(0.52\times 0.85) \approx -7 dB — i.e. from near-silence to crosstalk only 7 dB down: the spatial illusion is essentially gone.

One listener, one position

A 30° phase error at 6 kHz corresponds to a path change of just 30/360×(343/6000)=4.830/360 \times (343/6000) = 4.8 mm. This is the quantitative reason transaural is a one-listener, one-position technology.

Head tracking moves the sweet spot

If the sweet spot is small and fixed, the obvious fix is to make it follow the listener. Head-tracked XTC measures the listener's head position and orientation in real time (camera-based face tracking, infrared markers, or inertial sensors) and continuously recomputes the plant C(pose)\mathbf{C}(\text{pose}) and its regularized inverse H=C1\mathbf{H} = \mathbf{C}^{-1}, so the cancellation null stays centred on the moving ears. This is the same idea used in head-tracked binaural rendering, applied to the loudspeaker decode rather than the headphone decode.

Head-tracked transaural transforms the technique from a laboratory curiosity into something usable at a desk, because the listener no longer has to sit frozen. The engineering challenges are latency (the filter update must keep pace with head motion or the null lags behind the ears), filter interpolation (switching filters must not click or zipper), and the need for either a measured HRTF set or a parametric head model to generate C\mathbf{C} for arbitrary poses. It does not enlarge the instantaneous sweet spot — it relocates it fast enough to stay under the listener. It also remains fundamentally single-listener: a second person sitting beside the tracked listener is in an anti-sweet-spot and hears scrambled, doubly-crosstalked sound.

Spectral colouration and bandwidth limits in practice

Where the colouration comes from

Even at the sweet spot a transaural system rarely sounds perfectly neutral, and the reasons are structural. First, the inverse filter H\mathbf{H} equalizes the loudspeaker HRTFs (the plant), but the regularization deliberately leaves the response un-flattened in the ill-conditioned bands, so those bands are coloured by design. Second, the sum and difference channels are equalized to different effective targets, and any error in the assumed plant (a non-ideal room, a head a little larger or smaller than the model) shows as a frequency-dependent gain error — typically a series of peaks and dips, because plant errors interact with the near-nulls of the difference channel to produce comb-like ripple. Third, the binaural signal itself already carries the intended HRTF colour (the source-direction spectral cues), and any spectral mismatch between the HRTF used for authoring and the head doing the listening shows up as timbral shift, exactly as in headphone binaural.

The low- and high-frequency walls

Transaural has natural band edges. At the low end, the stereo dipole radiates inefficiently (dipole roll-off) and the plant is poorly conditioned because head shadowing vanishes and β1\beta \to 1; full cancellation there would demand enormous anti-phase excursion. Practical systems therefore relax XTC below a few hundred hertz, where the wavelength is long, interaural cues are weak, and crosstalk matters least perceptually. Below roughly 100–300 Hz one typically lets the system behave like ordinary stereo or mono, which is acoustically benign because localization at those frequencies is dominated by ITD on slowly varying envelopes, not by the fine structure XTC protects.

At the high end, the sweet spot shrinks as 1/f1/f (the few-millimetre tolerance computed above), HRTF detail becomes intensely individual, and tiny plant errors cause large phase errors. Robust cancellation usually fades out somewhere between 6 and 12 kHz depending on geometry and regularization. Between these walls lies the usable XTC passband — broadly a few hundred hertz up to several kilohertz — which fortunately overlaps the band most important for the ITD/ILD localization cues described in psychoacoustics.

Quantifying the trade-off

A compact way to summarize practical XTC performance is the achievable channel separation versus spectral flatness as regularization is varied. The table below gives representative orders of magnitude for a stereo-dipole desktop system; exact figures depend on the plant, the room, and the regularization profile.

Design stancePeak channel separationUsable XTC bandwidthSpectral colourationRobustness to head motion
Aggressive (low regularization)20–30 dBnarrow, with spikessevere (deep combs)very poor (a few mm)
Balanced (shaped regularization)12–18 dB~300 Hz – 7 kHzmoderatepoor (~1 cm)
Conservative (high regularization)6–10 dBbroad but shallowmildfair (a few cm)

The practical takeaway: 10–20 dB of separation over a well-chosen mid-band, kept flat enough to avoid obvious colouration, is a realistic and useful target. That is far less than the 60+ dB of isolation headphones provide for free, but it is enough to externalize images and to pull phantom sources well outside the loudspeaker span — which ordinary stereo cannot do at all.

Implementations and use cases

Desktop 3D audio

The stereo dipole's natural home is the desk: a single listener, roughly fixed, a short distance from two closely-spaced loudspeakers or a pair near a monitor. Historically this is the configuration that revived transaural in the 1990s — Cooper and Bauck's "transaural stereo," Gardner's MIT work on "3-D audio using loudspeakers," and the Kirkeby/Nelson stereo dipole all targeted the seated, single-listener case. Modern desktop implementations combine an HRTF-based binaural renderer (the encode) with a regularized XTC filter network (the decode), optionally with camera head-tracking to keep the null on the listener. The payoff is externalized, out-of-head 3-D imagery from two speakers — virtual sources beside and even behind the listener — that flat stereo cannot deliver. DAM Audio's implementation and measurements are documented under XTC / transaural crosstalk cancellation.

Soundbar and TV virtualization

The most widespread consumer use of transaural is the virtual surround soundbar. A soundbar places several drivers in a single cabinet under the television, very close together — a stereo dipole by construction. By feeding the drivers crosstalk-cancelled binaural renders of surround or height channels, a soundbar can place virtual sources well outside its physical width: at the sides for surround channels, and above for height channels in object-based formats. The narrow driver spacing that makes a soundbar visually convenient is exactly the stereo-dipole geometry that makes XTC robust over a wide mid-band, which is why transaural and soundbars are such a natural pairing. The limitation is equally inherent: the effect is strongest for one centrally-seated viewer and degrades for off-axis seats — the sweet-spot problem in the living room. Many products blend XTC virtualization with acoustic beam-steering and wall reflections to spread the effect, trading cancellation depth for a larger, if vaguer, listening area.

Comparison with headphones

It is worth being explicit about why anyone would do this rather than simply hand the listener a pair of headphones, since headphones deliver the binaural encode with essentially perfect channel isolation and no crosstalk problem at all.

  • Isolation: headphones give 40–60+ dB of channel separation for free; transaural fights to reach 10–20 dB over a limited band. On pure cue fidelity, headphones win decisively.
  • Externalization and naturalness: loudspeaker reproduction adds the listener's own outer-ear and room cues to the sound arriving from real external sources, which can make transaural images externalize more naturally and avoid the in-head, frontally-collapsed quality that plagues non-individualized headphone binaural. The sound genuinely comes from out there.
  • Comfort and sharing: no hardware on the head; in principle multiple people can be in the room (though only one is in the sweet spot).
  • Robustness: headphones are immune to head translation and to the room; transaural is exquisitely sensitive to both.
Key takeaway

The two are best seen as complementary delivery routes for the same binaural encode: the encode (HRTF authoring) is shared, and only the decode differs — direct to the ears for headphones, or through an inverted plant for loudspeakers. Transaural is the choice when you want binaural spatialization without headphones and can accept a single, mostly-stationary listener.

A worked example: crosstalk smear and what XTC restores

To make the whole chain concrete, follow a single binaural front image through (a) naive playback over speakers and (b) XTC playback, and see what is lost and restored.

The intended image

We author a virtual source at 30°-30° azimuth (front-left) by filtering a mono signal ss with the corresponding HRTFs. Suppose at 2 kHz the source-direction HRTFs give the near (left) ear a level of 00 dB and the far (right) ear a level of 9-9 dB with an ITD of 0.280.28 ms (left ear leading). So the intended eardrum signals at 2 kHz are

bL=1.00ej0,bR=0.355ej2πf(0.28ms).b_L = 1.00\,e^{j 0}, \qquad b_R = 0.355\, e^{-j\,2\pi f\,(0.28\,\text{ms})}.

At f=2f = 2 kHz, 2πf×0.28ms=2π×2000×0.00028=3.522\pi f \times 0.28\,\text{ms} = 2\pi \times 2000 \times 0.00028 = 3.52 rad 202°\approx 202°. So the target is a left ear leading by a large phase and 9 dB louder than the right — a clear front-left percept.

Naive playback: what the ears actually get

Send vL=bLv_L = b_L, vR=bRv_R = b_R straight to loudspeakers at ±5°\pm 5° (a stereo dipole geometry) without XTC. The eardrum pressures are p=Cb\mathbf{p} = \mathbf{C}\mathbf{b}. Using the symmetric plant with, at 2 kHz, ipsilateral Ci=1C_i = 1 (reference) and contralateral Cc=0.7ej2πfτC_c = 0.7\,e^{-j 2\pi f \tau} with τ=45 μ\tau = 45\ \mus (from the dipole table). Here 2πfτ=2π×2000×45×106=0.5652\pi f \tau = 2\pi\times 2000\times 45\times 10^{-6} = 0.565 rad 32°\approx 32°, so Cc=0.7ej32°C_c = 0.7\,e^{-j32°}. The left eardrum receives

pL=CibL+CcbR=(1)(1.000°)+(0.732°)(0.355202°).p_L = C_i b_L + C_c b_R = (1)(1.00\angle 0°) + (0.7\angle{-32°})(0.355\angle{-202°}).

The second term has magnitude 0.7×0.355=0.2490.7\times 0.355 = 0.249 and angle 234°-234° (i.e. +126°+126°). Adding 1.000°+0.249126°1.00\angle 0° + 0.249\angle 126°: real part 1.00+0.249cos126°=1.000.146=0.8541.00 + 0.249\cos126° = 1.00 - 0.146 = 0.854; imaginary part 0.249sin126°=0.2010.249\sin126° = 0.201; magnitude 0.8542+0.2012=0.877\sqrt{0.854^2+0.201^2} = 0.877, phase +13°+13°. The right eardrum receives

pR=CcbL+CibR=(0.732°)(1.000°)+(1)(0.355202°).p_R = C_c b_L + C_i b_R = (0.7\angle{-32°})(1.00\angle 0°) + (1)(0.355\angle{-202°}).

First term 0.732°0.7\angle{-32°}: real 0.5940.594, imag 0.371-0.371. Second term 0.355202°=0.355158°0.355\angle{-202°} = 0.355\angle 158°: real 0.329-0.329, imag +0.133+0.133. Sum: real 0.2650.265, imag 0.238-0.238; magnitude 0.3560.356, phase 42°-42°.

Now compare delivered to intended. Intended interaural level difference at 2 kHz: 20log10(1.00/0.355)=9.020\log_{10}(1.00/0.355) = 9.0 dB. Delivered ILD: 20log10(0.877/0.356)=7.820\log_{10}(0.877/0.356) = 7.8 dB — partly preserved here, but the phases are wrecked: intended interaural phase was 0°(202°)=+202°0° - (-202°) = +202° (a large ITD-bearing difference), whereas delivered interaural phase is 13°(42°)=55°13° - (-42°) = 55°. The crosstalk has injected a spurious second arrival at each ear, collapsing the interaural phase difference from 202°202° toward a much smaller value and adding frequency-dependent comb ripple (work the same arithmetic at 1.5 and 3 kHz and the ILD and phase errors swing differently at each frequency — that is the comb colouration). The percept pulls in toward the loudspeakers (front-centre, narrow) and acquires a hollow timbre. The carefully authored front-left image is smeared.

XTC playback: restoring the target

Now insert the regularized inverse H=C1\mathbf{H} = \mathbf{C}^{-1} before the speakers, so v=Hb\mathbf{v} = \mathbf{H}\mathbf{b} and the eardrum pressures become p=CHb=b\mathbf{p} = \mathbf{C}\mathbf{H}\mathbf{b} = \mathbf{b} (exactly, in the unregularized ideal; approximately, with regularization). With the closed-form inverse at 2 kHz: det=Ci2Cc2=1(0.732°)2=10.4964°\det = C_i^2 - C_c^2 = 1 - (0.7\angle{-32°})^2 = 1 - 0.49\angle{-64°}. Compute 0.4964°0.49\angle{-64°}: real 0.2150.215, imag 0.440-0.440, so det=0.785+j0.440=0.90029°\det = 0.785 + j0.440 = 0.900\angle 29°. The filter applies diagonal gain Ci/det=1/0.90029°=1.11129°C_i/\det = 1/0.900\angle{-29°} = 1.111\angle{-29°} and cross gain Cc/det=0.732°/0.90029°=0.77861°=0.778119°-C_c/\det = -0.7\angle{-32°}/0.900\angle29° = -0.778\angle{-61°} = 0.778\angle119°.

The point of the algebra is not the intermediate numbers but the outcome: the delivered eardrum signals return to pL=1.000°p_L = 1.00\angle0° and pR=0.355202°p_R = 0.355\angle{-202°} — the intended binaural pair, with the full 202°202° interaural phase and the full 99 dB ILD restored. The spurious crosstalk arrival has been actively cancelled by the cross term injected into the opposite speaker, and the front-left image snaps back into focus, externalized outside the narrow dipole. The cost we paid is visible in the filter gains (1.111.11 and 0.780.78 here, modest at this benign frequency, but rising toward the singular bands where regularization caps them) and in the few-millimetre sweet spot within which this cancellation holds. That is transaural in one example: crosstalk smears the binaural front image; the inverted plant restores it, for one listener, in one place, over one band.

Limits and pitfalls

Inherent limits

  • Single listener, single position. The sweet spot is millimetres wide at high frequencies. Only head tracking relocates it, and even then only for the tracked person; bystanders are in anti-sweet-spots.
  • Limited cancellation depth. Real systems achieve roughly 10–20 dB of separation over a mid-band, not the 40–60 dB of headphones. The illusion is therefore softer and more fragile than headphone binaural.
  • Band-limited. XTC fades below a few hundred hertz (dipole roll-off, β1\beta\to1) and above 6–12 kHz (millimetre sweet spot, individual HRTFs). The protected band is mid-range.
  • Room-dependent. The plant model usually assumes the direct paths only; strong early reflections add un-modelled crosstalk that the filters cannot cancel and that re-smears the image. Transaural wants a relatively dead, reflection-controlled listening position — see reverberation and direct, diffuse and envelopment.
  • HRTF individualization. The binaural encode and the plant both depend on the listener's anatomy; non-individual HRTFs cause front-back confusion and timbral error, just as in headphone binaural.

Common mistakes

Common mistakes
  • Over-aggressive regularization tuning. Chasing maximum separation with tiny βreg\beta_{\text{reg}} yields a measurement that looks great and a listening experience that is brittle, coloured, and exhausting. Shaped, frequency-dependent regularization beats a flat aggressive setting.
  • Mismatched loudspeakers. XTC assumes plant symmetry; two drivers that differ by even 1 dB or a few microseconds inject residual crosstalk the filters cannot remove. Match and time-align the pair.
  • Wide speaker placement. Using widely-spaced stereo speakers (±30°\pm30°) for XTC starts the ill-conditioning combs below 2 kHz and demands heavy regularization. If transaural is the goal, narrow the span toward a stereo dipole.
  • Ignoring the room. Placing the dipole near a hard desk surface or a wall introduces a strong reflection that acts as a third, un-cancelled source. Treat the first reflection points or pull the speakers away from boundaries.
  • No head tracking, then blaming the algorithm. Much disappointing transaural is simply a listener sitting a few centimetres off the design point. Either fix the seating precisely or track the head.
  • Forgetting it is still binaural. Transaural cannot rescue a bad binaural encode; garbage HRTFs in, garbage spatial image out. The decode only preserves what the encode provides.

When to choose transaural

Choose transaural when you need binaural-style 3-D imagery without headphones, for one listener who can stay roughly put: desktop 3-D audio, a personal monitoring station, a single-seat demo, or a soundbar's virtual surround. Prefer headphones when isolation and robustness matter more than going hands-free. Prefer amplitude-panning methods — stereo, amplitude panning, surround — or Ambisonics and wave field synthesis when you need a large listening area or many listeners, since those tolerate movement that XTC cannot. And remember the unifying view: transaural shares the binaural encode with headphone playback and differs only in the decode — an inverted acoustic plant standing between the loudspeakers and the ears. Related formats and their channel/object structure are catalogued in formats, and the broader case that even two channels already carry spatial information is made in stereo is already spatial.

References

  1. Atal, B. S., and Schroeder, M. R. (1966). "Apparent Sound Source Translator." U.S. Patent 3,236,949 — the originating concept of loudspeaker crosstalk cancellation.
  2. Bauck, J., and Cooper, D. H. (1996). "Generalized Transaural Stereo and Applications." Journal of the Audio Engineering Society, 44(9), 683–705.
  3. Cooper, D. H., and Bauck, J. L. (1989). "Prospects for Transaural Recording." Journal of the Audio Engineering Society, 37(1/2), 3–19.
  4. Kirkeby, O., Nelson, P. A., and Hamada, H. (1998). "The 'Stereo Dipole': A Virtual Source Imaging System Using Two Closely Spaced Loudspeakers." Journal of the Audio Engineering Society, 46(5), 387–395.
  5. Nelson, P. A., Hamada, H., and Elliott, S. J. (1992). "Adaptive Inverse Filters for Stereophonic Sound Reproduction." IEEE Transactions on Signal Processing, 40(7), 1621–1632.
  6. Kirkeby, O., and Nelson, P. A. (1999). "Digital Filter Design for Inversion Problems in Sound Reproduction." Journal of the Audio Engineering Society, 47(7/8), 583–595.
  7. Gardner, W. G. (1998). 3-D Audio Using Loudspeakers. Boston: Kluwer Academic Publishers.
  8. Blauert, J. (1997). Spatial Hearing: The Psychophysics of Human Sound Localization (Revised ed.). Cambridge, MA: MIT Press.
  9. Møller, H. (1992). "Fundamentals of Binaural Technology." Applied Acoustics, 36(3–4), 171–218.

← Back to Spatialization Techniques