Transaural - Binaural over Loudspeakers
Transaural reproduction is the art of delivering a binaural signal — a pair of pressures designed to arrive, one each, at the listener's two eardrums — using ordinary loudspeakers instead of headphones. It sounds like it should be impossible. Headphones work precisely because each transducer is acoustically sealed to one ear: the left earpiece feeds the left eardrum and essentially nothing else. Loudspeakers have no such isolation. Each speaker radiates into a room and reaches both ears, and the two unwanted paths — left speaker to right ear, right speaker to left ear — are exactly what a binaural signal must not have. Transaural systems solve this by anticipating the leakage and pre-cancelling it: the loudspeaker feeds are computed so that, after the room has done its crosstalk mixing, what remains at each eardrum is the intended binaural signal and only that.
This is the same encode-then-decode structure that organizes every technique in this part of the guide. The encode is binaural authoring: a signal pair carrying the head-related cues for a desired direction, exactly as in binaural reproduction. The decode is new and specific to transaural: a crosstalk-cancellation (XTC) filter network that inverts the acoustic transfer matrix from the loudspeakers to the ears, so the binaural pair survives the trip through the air. Understanding transaural therefore means understanding that transfer matrix, why its inverse is numerically dangerous, and what compromises — regularization, narrow speaker spacing, head tracking — make the dangerous inverse usable in a real room.
This chapter builds the theory from first principles, works through the algebra of a 2×2 cancellation with numbers, then turns to the engineering realities: ill-conditioning, the stereo dipole of Kirkeby and Nelson, the tiny sweet spot, spectral colouration, and the practical systems (desktop 3D audio, soundbar virtualization) that ship transaural today. DAM Audio's work on this topic is collected under XTC / transaural crosstalk cancellation.
The problem: crosstalk destroys the binaural cues
What binaural assumes
A binaural signal is a pair of time- and frequency-dependent pressures, $b_L$ and $b_R$, constructed so that if $b_L$ reaches only the left eardrum and $b_R$ reaches only the right eardrum, the listener hears a virtual source at the intended location. The construction is direction-dependent filtering by head-related transfer functions (HRTFs). For a source at direction $\theta_s$ feeding a monophonic signal $s$, the binaural pair is, in the frequency domain,
$$b_L = H_L(\theta_s)\,s, \qquad b_R = H_R(\theta_s)\,s,$$
where $H_L$ and $H_R$ are the left- and right-ear HRTFs. The information the brain uses (the interaural time difference (ITD), the interaural level difference (ILD), and the spectral notches and peaks of the pinna) is encoded entirely in the difference and detail of this pair. The fundamentals are developed in psychoacoustics and the authoring in binaural. The crucial property for this chapter is the channel-to-ear isolation assumption baked into the equations above: $b_L$ to the left ear, $b_R$ to the right ear, with no mixing. Headphones honour that assumption. Loudspeakers, by their nature, violate it.
What loudspeakers actually do
Place two loudspeakers in front of a listener. Drive the left speaker with some signal $x_L$ and the right with $x_R$. The pressure at the left eardrum is not $x_L$. It is the sum of two contributions: the ipsilateral path from the left speaker (short, direct, same side) and the contralateral path from the right speaker (longer, around the head, opposite side). The same is true at the right ear. In symbols, with $C_{ij}$ denoting the acoustic transfer function from speaker $j$ to ear $i$,
$$p_L = C_{LL}\,x_L + C_{LR}\,x_R, \qquad p_R = C_{RL}\,x_L + C_{RR}\,x_R.$$
The two unwanted terms, $C_{LR}\,x_R$ and $C_{RL}\,x_L$, are the crosstalk. They are not small. For loudspeakers at $\pm 30^\circ$ the extra path length to the far ear is on the order of 9 cm, the level is only a few decibels down from the direct path at low and mid frequencies, and the head shadow that attenuates the far-ear signal is significant only above roughly 1.5 kHz. So if you simply send a binaural signal to two loudspeakers ($x_L = b_L$, $x_R = b_R$), each eardrum receives its intended signal plus a delayed, head-filtered copy of the other channel.
That contamination is catastrophic for the binaural cues. The ITD encoded in $b_L$ versus $b_R$ relies on a clean comparison of arrival times at the two ears; the crosstalk superimposes a second arrival at each ear, with its own delay set by the loudspeaker geometry, not by the intended source direction. The ILD is similarly corrupted, and the pinna spectral cues are smeared by comb filtering between the direct and crosstalk paths. The result is a collapsed, frontally-locked, comb-coloured image: the familiar "phantom-centre plus narrow stage" of ordinary stereo, not the open spatial scene the binaural signal described. Crosstalk does not merely degrade binaural-over-speakers; it reduces it to conventional stereo.
The goal stated precisely
Transaural's objective is to choose loudspeaker feeds $x_L$, $x_R$ such that the eardrum pressures equal the binaural target:
$$p_L = b_L, \qquad p_R = b_R.$$
Because the air imposes the mixing $\mathbf{p} = \mathbf{C}\,\mathbf{x}$ (with $\mathbf{C}$ the matrix of the four speaker-to-ear paths), the feeds we must send are $\mathbf{x} = \mathbf{C}^{-1}\,\mathbf{b}$. Everything that follows (the conditioning problems, the regularization, the dipole geometry, the tiny sweet spot) flows from the difficulty of building and using that inverse in a real acoustic system. The encode is binaural; the decode is the inverse of the room's own crosstalk.
The acoustic transfer matrix
The 2×2 plant
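Collect the four paths into a frequency-domain matrix. At each frequency $f$,
$$\begin{bmatrix} p_L \\ p_R \end{bmatrix} = \begin{bmatrix} C_{LL} & C_{LR} \\ C_{RL} & C_{RR} \end{bmatrix} \begin{bmatrix} x_L \\ x_R \end{bmatrix}, \qquad \text{i.e. } \mathbf{p} = \mathbf{C}\,\mathbf{x}.$$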
This is called the plant matrix or the acoustic transfer matrix. Its diagonal entries are the ipsilateral (direct) paths; its off-diagonal entries are the contralateral (crosstalk) paths. Each entry is itself a complete transfer function: a delay (the time of flight from speaker to ear), an amplitude (governed by distance attenuation and head shadowing), and a complex frequency response (the HRTF for the loudspeaker's direction relative to the head, including pinna and torso effects). In other words, the plant entries are themselves HRTFs — specifically, the HRTFs for the loudspeaker positions.
Symmetry and the ideal listening geometry
For a listener seated symmetrically between two mirror-image loudspeakers (the canonical equilateral or near-equilateral arrangement), the geometry is left-right symmetric. Then the ipsilateral paths are equal, $C_{LL} = C_{RR} = H_i$, and the contralateral paths are equal, $C_{LR} = C_{RL} = H_c$. The plant simplifies to a symmetric matrix
$$\mathbf{C} = \begin{bmatrix} H_i & H_c \\ H_c & H_i \end{bmatrix},$$
with $H_i$ the ipsilateral (direct, near-ear) response and $H_c$ the contralateral (crosstalk, far-ear) response. This symmetry is enormously convenient: it lets us diagonalize the problem into a sum channel and a difference channel, which is the standard way to analyze and to implement XTC, and it makes the conditioning behaviour transparent. We will use it throughout, returning to the asymmetric case only when discussing head tracking and off-centre listeners.
A first-principles model of the two paths
To get intuition and numbers, model the ipsilateral and contralateral paths by their dominant features: a delay and a level. Let the direct path from a speaker to its near ear have length $d$ and the cross path to the far ear have length $d + \Delta d$, where $\Delta d$ is the extra distance the sound travels around the head. With sound speed $c = 343$ m/s, the interaural path delay for the loudspeaker is
$$\tau = \frac{\Delta d}{c}.$$
A spherical-head approximation for a source at azimuth $\theta$ from the median plane gives the classic Woodworth extra-path estimate
$$\Delta d = a\,(\theta + \sin\theta),$$
with $a = 0.0875$ m the effective head radius and $\theta$ in radians. For loudspeakers at $\pm 30^\circ$, $\theta = 0.524$ rad: $\theta + \sin\theta = 0.524 + 0.500 = 1.024$, so $\Delta d = 0.0875 \times 1.024 \approx 0.0896$ m, about 9 cm, giving
$$\tau = \frac{0.0896}{343} \approx 0.26\ \text{ms}.$$
The level difference between the cross and direct paths has two parts: spreading loss and head shadowing. Spreading alone, for example for $d = 1.0$ m and $d + \Delta d \approx 1.09$ m, is only $20\log_{10}(1.0/1.09) \approx -0.7$ dB, which is negligible. Head shadowing is frequency-dependent and dominates above ~1.5 kHz, reaching 5–15 dB at high frequencies but near 0 dB at low frequencies. So at low frequencies the crosstalk arrives only ~0.7 dB quieter than the direct sound and ~0.26 ms later: a strong, nearly-equal interfering copy. This is precisely why crosstalk is so destructive, and why cancelling it is hard at low frequencies and at high frequencies for opposite reasons, as the conditioning analysis will show.
Crosstalk cancellation: inverting the plant
The inverse filter
We want $\mathbf{p} = \mathbf{b}$ with $\mathbf{p} = \mathbf{C}\,\mathbf{x}$, so the crosstalk-cancellation filter network is the matrix inverse of the plant:
$$\mathbf{x} = \mathbf{F}\,\mathbf{b}, \qquad \mathbf{F} = \mathbf{C}^{-1}.$$
$\mathbf{F}$ is a 2×2 matrix of filters applied to the binaural pair before sending it to the loudspeakers. For the symmetric plant, the inverse has a clean closed form. With $D = H_i^2 - H_c^2$,
$$\mathbf{F} = \frac{1}{D} \begin{bmatrix} H_i & -H_c \\ -H_c & H_i \end{bmatrix}.$$
Read this physically. The diagonal term $H_i/D$ is a frequency-shaping filter that equalizes the direct path. The off-diagonal term $-H_c/D$ is the active cancellation: into the opposite loudspeaker it injects a signal that is the negative of the crosstalk, scaled and filtered so that when it arrives at the far ear it destructively cancels the unwanted leakage. The "−" sign is the whole idea: to silence the contralateral leakage of the left channel into the right ear, drive the right speaker with an inverted, appropriately delayed and filtered anti-signal.
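The closed form can be checked numerically at a single frequency. The sketch below is a minimal illustration, not a production filter design: the helper names (`xtc_inverse_symmetric`, `matmul_2x2`) and the plant values are assumptions chosen for the example, and it simply confirms that plant times inverse gives the identity matrix.

```python
import cmath

def xtc_inverse_symmetric(h_i, h_c):
    """Closed-form inverse of the symmetric plant [[h_i, h_c], [h_c, h_i]]."""
    det = h_i * h_i - h_c * h_c          # D = H_i^2 - H_c^2
    if det == 0:
        raise ZeroDivisionError("plant is singular at this frequency")
    return [[h_i / det, -h_c / det],
            [-h_c / det, h_i / det]]

def matmul_2x2(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[a[r][0] * b[0][c] + a[r][1] * b[1][c] for c in range(2)]
            for r in range(2)]

# Illustrative plant values (assumed, not measured): unit direct path,
# crosstalk about 3 dB down with 0.8 rad of extra phase lag.
h_i = 1.0 + 0.0j
h_c = 0.7 * cmath.exp(-0.8j)

plant = [[h_i, h_c], [h_c, h_i]]
filters = xtc_inverse_symmetric(h_i, h_c)
product = matmul_2x2(plant, filters)     # should be the identity matrix

for row in product:
    print([f"{abs(value):.3f}" for value in row])
```

In a real system the same inversion would be repeated per frequency bin on measured or modelled responses; the single-bin version is only meant to make the sign structure of the cross term concrete.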
The recursive cancellation picture
There is a beautifully intuitive way to see crosstalk cancellation that predates the matrix formulation and explains the convergence. Suppose we want only $b_L$ at the left ear and silence at the right ear. Send $b_L$ from the left speaker (taking the direct path as the unit reference). It reaches the left ear (good) but also leaks to the right ear via $g$ relative to the direct. To cancel that leakage, send from the right speaker a signal $-g\,b_L$, time-aligned and filtered, so it arrives at the right ear and cancels. But that cancelling signal itself leaks back to the left ear, where it is unwanted, so we must add a correction from the left speaker to cancel that, which leaks again to the right ear, and so on. Each leak multiplies the residual by the crosstalk-to-direct ratio
$$g = \frac{H_c}{H_i},$$
so the total cancellation is a geometric series
$$x_L = b_L\,(1 + g^2 + g^4 + \cdots) = \frac{b_L}{1 - g^2}, \qquad x_R = -g\,b_L\,(1 + g^2 + g^4 + \cdots) = \frac{-g\,b_L}{1 - g^2}.$$
The series converges whenever $|g| < 1$, i.e. whenever the crosstalk is weaker than the direct sound, which is true for sensible geometries. Summing the geometric series recovers exactly the matrix inverse above. This recursive view is due in spirit to Atal and Schroeder and was the basis of the earliest analog cancellers; it makes plain that XTC is fundamentally an acoustic feedback cancellation whose convergence rate depends on how close the crosstalk level is to the direct level.
A worked conceptual example
Take a single frequency where we model the paths by magnitude and delay only. Let the direct path be unit gain at zero reference delay, and the crosstalk be $g < 1$ in magnitude with the extra delay $\tau \approx 0.26$ ms computed earlier; at this frequency suppose the phase of $H_c$ is such that we can treat it as a real attenuation for illustration. We want $b_L$ at the left ear and $0$ at the right ear.
Round 0: left speaker emits $b_L$. Left ear receives $b_L$ (target met); right ear receives $g\,b_L$ (unwanted).
Round 1: right speaker emits $-g\,b_L$ to cancel. Right ear now receives $g\,b_L - g\,b_L = 0$; but this signal leaks to the left ear as $-g^2\,b_L$, corrupting the target.
Round 2: left speaker adds $+g^2\,b_L$ to restore the left ear; this leaks to the right ear as $g^3\,b_L$, breaking the cancellation again.
Continuing, the left-ear signal converges to $b_L$ and the residual at the right ear tends to zero. The net loudspeaker drive required is the matrix-inverse solution. Plugging $H_i = 1$, $H_c = g$ into the closed form, $D = 1 - g^2$, so the diagonal filter gain is $1/(1 - g^2)$ and the off-diagonal is $-g/(1 - g^2)$. For $g = 0.9$, for instance, $D = 0.19$ and the gains are about $5.26$ and $-4.74$. To play a binaural pair through this geometry we must therefore boost the loudspeaker drive by about $20\log_{10}(5.26) \approx 14$ dB relative to the binaural signal, and inject an anti-phase cross term nearly as large. That large boost (the inverse of a small determinant when $g$ approaches $1$) is the seed of every practical problem with transaural, and it is the subject of the next section.
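The round-by-round picture can be simulated directly. The following is a small sketch under the same simplifications as the worked example (real-valued crosstalk ratio, unit direct path, unit left-ear target); the value `g = 0.9` and the function name `recursive_drive` are assumptions made for illustration. It accumulates the corrections each speaker emits and compares them with the closed-form gains.

```python
import math

def recursive_drive(g, rounds):
    """Accumulate speaker drive for a unit left-ear target, round by round.

    Even rounds add a correction to the left speaker, odd rounds to the
    right speaker; each correction is -g times the previous one.
    """
    left_drive = 0.0
    right_drive = 0.0
    correction = 1.0                 # round 0: left speaker emits the target
    for n in range(rounds):
        if n % 2 == 0:
            left_drive += correction
        else:
            right_drive += correction
        correction *= -g             # next round cancels this round's leak
    return left_drive, right_drive

g = 0.9                              # assumed crosstalk-to-direct ratio
left, right = recursive_drive(g, rounds=400)

closed_left = 1.0 / (1.0 - g * g)    # diagonal gain 1 / (1 - g^2)
closed_right = -g / (1.0 - g * g)    # cross gain  -g / (1 - g^2)

left_ear = left + g * right          # direct from left + leak from right
right_ear = g * left + right         # leak from left + direct from right
boost_db = 20.0 * math.log10(closed_left)

print(f"recursive: {left:.3f}, {right:.3f}")
print(f"closed form: {closed_left:.3f}, {closed_right:.3f}")
print(f"ears: left {left_ear:.6f}, right {right_ear:.6f}, boost {boost_db:.1f} dB")
```

The closer `g` is to 1, the more rounds are needed before the partial sums settle, which is the convergence-rate point made above.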
Ill-conditioning and regularization
Why the inverse blows up
The cancellation gain scales with $1/D$, where $D = H_i^2 - H_c^2$. Whenever $H_c \to H_i$ (the crosstalk approaches the direct path in magnitude and aligns in phase), the determinant approaches zero and the inverse filter demands enormous gain.
This is ill-conditioning: small acoustic quantities in the denominator produce huge filter responses, and tiny errors in the measured or modelled plant (a head that moved a centimetre, a room reflection, an HRTF mismatch) get amplified by the same huge factor.
Crucially, $D(f)$ is frequency-dependent, and there are specific frequencies where the direct and cross paths interfere most severely. Because the crosstalk is delayed by $\tau$, the relative phase between direct and cross is $\phi = 2\pi f \tau$, which sweeps through $0, \pi, 2\pi, \ldots$ as frequency rises. At frequencies where the contralateral path is in phase with what the inverse needs to cancel and nearly equal in magnitude, the plant is near-singular and the required boost spikes. Equivalently, in the sum/difference decomposition the plant's two eigen-responses are $H_i + H_c$ (sum channel) and $H_i - H_c$ (difference channel); the difference channel has deep nulls wherever $H_c \approx H_i$ (and the sum channel wherever $H_c \approx -H_i$), and inverting a null means infinite gain. These nulls occur roughly at frequencies where the wavelength makes the head-around path interfere destructively: concentrated at low frequencies (where $|H_c| \approx |H_i|$ because head shadowing vanishes) and at a comb of mid/high frequencies set by $\tau$.
The condition number
The severity is quantified by the condition number of $\mathbf{C}$, the ratio of its largest to smallest singular value:
$$\kappa(\mathbf{C}) = \frac{\sigma_{\max}(\mathbf{C})}{\sigma_{\min}(\mathbf{C})}.$$
When $\kappa$ is large the inversion is unstable: the system pours energy into the difference channel to no audible benefit, amplifies measurement error, and produces wild loudspeaker excursions. When $\kappa \approx 1$ the plant is well-behaved. Robust XTC design is essentially the project of keeping $\kappa$ small across as wide a band as possible, and the two main levers are regularization (this section) and loudspeaker geometry (the stereo dipole, next section).
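For the symmetric plant the singular values are simply the magnitudes of the sum and difference responses, $|H_i + H_c|$ and $|H_i - H_c|$, so the condition number is easy to sweep. The sketch below uses a deliberately crude delay-and-gain model: the constant ratio `G = 0.9` is an assumption (it ignores head shadowing), `TAU` is the ±30° path delay from the text, and the function names are illustrative.

```python
import cmath
import math

def condition_number(h_i, h_c):
    """Condition number of [[h_i, h_c], [h_c, h_i]] from its two modes."""
    s_sum = abs(h_i + h_c)           # sum-channel singular value
    s_diff = abs(h_i - h_c)          # difference-channel singular value
    smallest = min(s_sum, s_diff)
    return math.inf if smallest == 0 else max(s_sum, s_diff) / smallest

def delay_gain_plant(f_hz, tau_s, g):
    """Unit direct path and a crosstalk path that is a pure gain and delay."""
    return 1.0 + 0.0j, g * cmath.exp(-2j * math.pi * f_hz * tau_s)

TAU = 261e-6    # interaural path delay for +/-30 degree speakers (from the text)
G = 0.9         # assumed frequency-independent crosstalk-to-direct ratio

for label, f in (("near DC", 1.0),
                 ("1/(4 tau)", 1.0 / (4.0 * TAU)),
                 ("1/(2 tau)", 1.0 / (2.0 * TAU))):
    kappa = condition_number(*delay_gain_plant(f, TAU, G))
    print(f"{label:>10}: f = {f:7.0f} Hz, kappa = {kappa:5.2f}")
```

Under this model the plant is best conditioned where the cross path is a quarter cycle behind the direct path and worst where it is in phase or a half cycle out, which is the comb structure described above.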
Regularization: trading cancellation for robustness
The cure for an explosive inverse is to refuse to fully invert near the singularities. Instead of $\mathbf{C}^{-1}$ we use a regularized inverse, the Tikhonov / least-squares solution
$$\mathbf{F}_\beta = \left(\mathbf{C}^{H}\mathbf{C} + \beta\,\mathbf{I}\right)^{-1}\mathbf{C}^{H},$$
where $\mathbf{C}^{H}$ is the conjugate transpose and $\beta \ge 0$ is a regularization parameter (a small "effort penalty"). The term $\beta\,\mathbf{I}$ keeps the matrix being inverted away from singularity: where the plant is well-conditioned and $\sigma^2 \gg \beta$, the regularized inverse approaches the true inverse and cancellation is full; where the plant is near-singular and $\sigma^2 \ll \beta$, the penalty dominates, the gain is capped at roughly $1/(2\sqrt{\beta})$, and the system gracefully gives up trying to cancel rather than blowing up.
Equivalently, each singular value $\sigma$ of the plant is inverted not as $1/\sigma$ but as the filtered version
$$\frac{\sigma}{\sigma^2 + \beta},$$
which behaves like $1/\sigma$ for large $\sigma$ and like $\sigma/\beta$ for small $\sigma$. The penalty thus rolls off the inversion exactly in the troublesome bands. This is mathematically identical to the constrained-optimization view of Kirkeby and Nelson, in which one minimizes the reproduction error subject to a penalty on loudspeaker effort, with the Lagrange multiplier $\beta$ balancing the two.
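The behaviour of the filtered singular value is easy to tabulate. This short sketch (the function name `tikhonov_gain` and the value of `beta` are assumptions for illustration) compares the plain inverse $1/\sigma$ with $\sigma/(\sigma^2 + \beta)$ and shows the gain ceiling of $1/(2\sqrt{\beta})$, reached at $\sigma = \sqrt{\beta}$.

```python
import math

def tikhonov_gain(sigma, beta):
    """Regularized inverse of one singular value: sigma / (sigma^2 + beta)."""
    return sigma / (sigma * sigma + beta)

beta = 1e-2                                  # assumed regularization parameter
gain_cap = 1.0 / (2.0 * math.sqrt(beta))     # maximum of the filtered inverse

for sigma in (10.0, 1.0, math.sqrt(beta), 1e-2, 1e-3):
    plain = 1.0 / sigma
    filtered = tikhonov_gain(sigma, beta)
    print(f"sigma = {sigma:8.4f}  1/sigma = {plain:9.2f}  regularized = {filtered:7.4f}")

print(f"gain cap 1/(2*sqrt(beta)) = {gain_cap:.2f}")
```

For large singular values the two columns agree closely; for small ones the regularized gain falls back toward zero instead of diverging.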
The bandwidth/colouration trade-off
Regularization is not free. Wherever the penalty suppresses the inverse, two things happen: the crosstalk is not fully cancelled there (loss of channel separation, hence weakened spatial cues in that band), and the frequency response of the delivered signal departs from flat (audible colouration). The design choice is therefore a trade-off:
- Small $\beta$: maximal cancellation and the widest theoretical separation, but huge gains at the singular frequencies, severe colouration, large speaker excursion, and extreme sensitivity to head movement and plant error. Brittle.
- Large $\beta$: robust, flatter, gentler on the speakers, but reduced cancellation depth and narrower usable bandwidth where the XTC actually works.
A common refinement is to make $\beta$ frequency-dependent, $\beta(f)$: small in the mid-band where the plant is naturally well-conditioned and cancellation is most perceptually valuable, larger at the low end and at the high-frequency singular combs where full inversion would be ruinous.
This shapes the achievable XTC into a usable mid-band passband flanked by deliberately under-corrected extremes. The figure of merit Kirkeby and others use is the trade-off between channel separation (how many dB of crosstalk are removed) and spectral flatness / dynamic range; you cannot maximize both, and the regularization profile is where the engineer chooses the balance. The next section shows how geometry can shift the whole trade-off in the designer's favour, so that less regularization is needed.
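The separation-versus-effort trade-off can be seen numerically for the symmetric plant, where the Tikhonov inverse acts independently on the sum and difference modes. The sketch below is an illustration under stated assumptions: a delay-and-gain plant with an assumed ratio of 0.9, the stereo-dipole delay of 45 µs, a single low frequency of 200 Hz, and hypothetical helper names. It reports the channel separation of the overall system and the largest modal filter gain as the regularization parameter grows.

```python
import cmath
import math

def regularized_xtc(h_i, h_c, beta):
    """Separation (dB) and peak modal filter gain (dB) at one frequency."""
    lam_sum, lam_diff = h_i + h_c, h_i - h_c            # plant eigen-responses
    f_sum = lam_sum.conjugate() / (abs(lam_sum) ** 2 + beta)
    f_diff = lam_diff.conjugate() / (abs(lam_diff) ** 2 + beta)
    r_sum = (lam_sum * f_sum).real                      # delivered sum-mode gain
    r_diff = (lam_diff * f_diff).real                   # delivered difference-mode gain
    direct = 0.5 * (r_sum + r_diff)                     # wanted path after XTC
    cross = 0.5 * (r_sum - r_diff)                      # residual crosstalk after XTC
    if cross == 0:
        separation_db = math.inf
    else:
        separation_db = 20.0 * math.log10(abs(direct) / abs(cross))
    peak_gain_db = 20.0 * math.log10(max(abs(f_sum), abs(f_diff)))
    return separation_db, peak_gain_db

F_HZ, TAU, G = 200.0, 45e-6, 0.9        # G and F_HZ are assumed; TAU from the text
h_i = 1.0 + 0.0j
h_c = G * cmath.exp(-2j * math.pi * F_HZ * TAU)

betas = (1e-4, 1e-3, 1e-2, 1e-1)
results = [regularized_xtc(h_i, h_c, beta) for beta in betas]
for beta, (sep, gain) in zip(betas, results):
    print(f"beta = {beta:7.4f}: separation = {sep:5.1f} dB, peak gain = {gain:5.1f} dB")
```

Running the same function across a frequency grid with a frequency-dependent `beta` is one way to prototype a regularization profile before committing to filter design.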
The stereo dipole (Kirkeby and Nelson)
The insight: bring the speakers together
The singular frequencies of the plant are set by the interaural path delay $\tau$, which is set by the angular separation of the loudspeakers. Wide-spaced speakers (e.g. $\pm 30^\circ$) give a large $\tau$ and therefore a dense comb of ill-conditioned frequencies starting low; the plant's modal nulls arrive early and often, so robust cancellation is confined to a narrow band. Kirkeby, Nelson, Hamada and Orduña-Bustamante's key realization (mid-1990s) was that if you move the two loudspeakers close together, subtending a small angle at the listener (they proposed about $\pm 5^\circ$, a total span of roughly 10°), the geometry changes the conditioning dramatically. They named this closely-spaced pair the stereo dipole.
With a small span, the extra path $\Delta d$ to the far ear shrinks, $\tau$ shrinks, and the first null above DC is pushed up in frequency. The plant stays well-conditioned over a much wider band, so the inverse needs far less regularization to be robust, and the usable XTC bandwidth widens. The cost is borne mostly at low frequencies (discussed below), but across the perceptually critical mid-band the closely-spaced pair is markedly better-behaved than a wide pair.
Why "dipole"
At low frequencies the two closely-spaced speakers are driven largely in anti-phase by the cancellation filters (the difference channel dominates the cross-cancellation), so the pair radiates like an acoustic dipole — two nearly-coincident opposite sources. A dipole's far-field pressure rises with frequency (its radiation is weak at low frequencies, where the opposite sources cancel each other in the far field), which is the physical face of the same low-frequency difficulty seen in the conditioning: delivering separation at low frequencies from closely-spaced sources demands large, opposed excursions for little acoustic output. The name captures both the geometry (two close sources) and the radiation behaviour (dipolar at low frequency).
Geometry and a numeric comparison
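Compare the interaural path delay for a wide pair and a stereo dipole using $\Delta d = a\,(\theta + \sin\theta)$ with $a = 0.0875$ m and $\tau = \Delta d / c$, then the first plant null above DC near $f_1 \approx 1/(2\tau)$ (the lowest frequency at which the cross path is a half-cycle out of step with the direct path, so that the sum channel $H_i + H_c$ nulls; a rough but instructive estimate).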
| Configuration | Half-angle $\theta$ | $\Delta d$ (m) | $\tau$ (µs) | First null estimate $1/(2\tau)$ |
|---|---|---|---|---|
| Wide stereo | 30° (0.524 rad) | 0.0896 | 261 | ≈ 1.9 kHz |
| Standard narrow | 15° (0.262 rad) | 0.0456 | 133 | ≈ 3.8 kHz |
| Stereo dipole | 5° (0.0873 rad) | 0.0153 | 45 | ≈ 11.2 kHz |
The trend is the point: shrinking the span from 30° to 5° pushes the first deep ill-conditioning roughly from below 2 kHz up past 11 kHz, opening a wide, well-conditioned mid-band where robust, flat crosstalk cancellation is achievable with modest regularization. (These single-null estimates are simplified; the real plant has a comb of features and frequency-dependent shadowing, but the scaling of usable bandwidth with $1/\tau$, and hence with inverse speaker span, is exactly the mechanism Kirkeby and Nelson exploited.)
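The comparison can be recomputed from the Woodworth estimate in a few lines. This is a sketch under the same simplifications as the table (spherical head of radius 0.0875 m, sound speed 343 m/s, a single null estimate at $1/(2\tau)$); the function name `dipole_table_row` is an assumption for illustration.

```python
import math

HEAD_RADIUS_M = 0.0875      # effective head radius a used in the text
SPEED_OF_SOUND = 343.0      # sound speed c in m/s

def dipole_table_row(half_angle_deg):
    """Return (extra path in m, path delay in s, first-null estimate in Hz)."""
    theta = math.radians(half_angle_deg)
    delta_d = HEAD_RADIUS_M * (theta + math.sin(theta))   # Woodworth estimate
    tau = delta_d / SPEED_OF_SOUND
    first_null_hz = 1.0 / (2.0 * tau)
    return delta_d, tau, first_null_hz

for name, angle in (("Wide stereo", 30.0),
                    ("Standard narrow", 15.0),
                    ("Stereo dipole", 5.0)):
    delta_d, tau, f_null = dipole_table_row(angle)
    print(f"{name:<16} {angle:4.0f} deg  {delta_d:.4f} m  "
          f"{tau * 1e6:5.1f} us  {f_null / 1e3:5.2f} kHz")
```

Because the extra path is nearly proportional to the half-angle for small angles, halving the span roughly doubles the first-null estimate.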
Loudspeaker placement in practice
A stereo dipole is built from two small, matched loudspeakers placed close together, often in a single shared enclosure, at the same height as the listener's ears, straddling the median plane, typically 0.5–1.5 m in front. Matching between the two drivers is critical: any difference between $C_{LL}$ and $C_{RR}$ that the symmetric plant model does not account for shows up as residual crosstalk the filters cannot remove. The narrow span gives a second practical benefit beyond bandwidth: because both speakers are near the median plane, the difference in their paths to a moving head changes more slowly, so the sweet spot, while still small, is a little more forgiving laterally than with wide speakers. As the next section shows, though, "a little more forgiving" is still very demanding.
The sweet spot
Why it is so small
Crosstalk cancellation is a precise interference effect: the anti-signal from the opposite speaker must arrive at the far ear with the right delay and amplitude to cancel the leakage. That cancellation is only exact at the position for which the plant was measured or modelled — the sweet spot. Move the head, and every path length changes, so every delay and phase in changes, but the filter is fixed. The cancellation that was a clean null becomes a partial, mistuned subtraction, and the residual crosstalk reappears.
How small is the tolerance? Cancellation of a contralateral path requires phase accuracy to a fraction of a wavelength. At a target upper frequency $f_{\max}$, a positional error that changes a path by $\delta$ produces a phase error $\Delta\phi = 2\pi f_{\max}\,\delta / c$. For the null to remain effective the phase error should stay well under, say, a quarter cycle ($\Delta\phi < \pi/2$), i.e.
$$\delta < \frac{c}{4 f_{\max}}.$$
At $f_{\max} = 10$ kHz, $\delta < 343/(4 \times 10^{4})\ \text{m} \approx 8.6$ mm. At 20 kHz it is about 4 mm. In other words, to keep high-frequency cancellation intact the listener's head must hold position to within a few millimetres, less than the width of a fingertip. Lower frequencies tolerate more (at 1 kHz, $\delta < 86$ mm), which is why the high end of XTC collapses first as you lean. A small rotation of the head is just as damaging, because it asymmetrically changes the four paths and breaks the left-right symmetry the filters assume.
A numeric illustration of misalignment
Return to the single-frequency cancellation with crosstalk magnitude $g$. Suppose the head shifts so that the cross-path delay used by the fixed filter is now wrong by a phase of $\Delta\phi$ at the frequency of interest. The residual after cancellation is proportional to $|1 - e^{j\Delta\phi}| = 2\sin(\Delta\phi/2)$ times the crosstalk magnitude. With the design giving, say, 20 dB of cancellation at the sweet spot (residual factor 0.1), a lateral shift producing only $\Delta\phi = 30^\circ$ ($0.524$ rad) gives an extra mismatch term $2\sin(15^\circ) \approx 0.52$. The cancellation depth collapses from 20 dB toward roughly $-7$ dB (for instance, $g = 0.9$ gives $20\log_{10}(0.9 \times 0.52) \approx -6.6$ dB), i.e. from near-silence to crosstalk only 7 dB down: the spatial illusion is essentially gone.
A 30° phase error at 6 kHz corresponds to a path change of just $(30/360) \times 343/6000\ \text{m} \approx 4.8$ mm. This is the quantitative reason transaural is a one-listener, one-position technology.
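The tolerance figures follow from two one-line formulas, sketched here so they can be re-evaluated for other frequencies. The function names are illustrative assumptions, the quarter-cycle criterion is the one stated above, and `G = 0.9` is an assumed crosstalk magnitude.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s

def quarter_cycle_tolerance(f_hz, c=SPEED_OF_SOUND):
    """Largest path error (m) keeping the phase error under a quarter cycle."""
    return c / (4.0 * f_hz)

def mismatch_factor(phase_error_rad):
    """|1 - exp(j*phi)| = 2*sin(phi/2): residual left by a mistuned canceller."""
    return 2.0 * math.sin(phase_error_rad / 2.0)

def path_change_for_phase(phase_error_deg, f_hz, c=SPEED_OF_SOUND):
    """Path-length change (m) that produces the given phase error at f_hz."""
    return (phase_error_deg / 360.0) * c / f_hz

G = 0.9                                              # assumed crosstalk magnitude
residual = G * mismatch_factor(math.radians(30.0))
residual_db = 20.0 * math.log10(residual)

for f in (1e3, 10e3, 20e3):
    print(f"{f / 1e3:4.0f} kHz: tolerance = {quarter_cycle_tolerance(f) * 1e3:5.1f} mm")
print(f"30 deg mismatch: residual = {residual:.3f} ({residual_db:.1f} dB)")
print(f"30 deg at 6 kHz: path change = {path_change_for_phase(30.0, 6e3) * 1e3:.1f} mm")
```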
Head tracking moves the sweet spot
If the sweet spot is small and fixed, the obvious fix is to make it follow the listener. Head-tracked XTC measures the listener's head position and orientation in real time (camera-based face tracking, infrared markers, or inertial sensors) and continuously recomputes the plant $\mathbf{C}$ and its regularized inverse $\mathbf{F}_\beta$, so the cancellation null stays centred on the moving ears. This is the same idea used in head-tracked binaural rendering, applied to the loudspeaker decode rather than the headphone decode.
Head-tracked transaural transforms the technique from a laboratory curiosity into something usable at a desk, because the listener no longer has to sit frozen. The engineering challenges are latency (the filter update must keep pace with head motion or the null lags behind the ears), filter interpolation (switching filters must not click or zipper), and the need for either a measured HRTF set or a parametric head model to generate $\mathbf{C}$ for arbitrary poses. It does not enlarge the instantaneous sweet spot; it relocates it fast enough to stay under the listener. It also remains fundamentally single-listener: a second person sitting beside the tracked listener is in an anti-sweet-spot and hears scrambled, doubly-crosstalked sound.
Spectral colouration and bandwidth limits in practice
Where the colouration comes from
Even at the sweet spot a transaural system rarely sounds perfectly neutral, and the reasons are structural. First, the inverse filter equalizes the loudspeaker HRTFs (the plant), but the regularization deliberately leaves the response un-flattened in the ill-conditioned bands, so those bands are coloured by design. Second, the sum and difference channels are equalized to different effective targets, and any error in the assumed plant (a non-ideal room, a head a little larger or smaller than the model) shows as a frequency-dependent gain error — typically a series of peaks and dips, because plant errors interact with the near-nulls of the difference channel to produce comb-like ripple. Third, the binaural signal itself already carries the intended HRTF colour (the source-direction spectral cues), and any spectral mismatch between the HRTF used for authoring and the head doing the listening shows up as timbral shift, exactly as in headphone binaural.
The low- and high-frequency walls
Transaural has natural band edges. At the low end, the stereo dipole radiates inefficiently (dipole roll-off) and the plant is poorly conditioned because head shadowing vanishes and $|H_c| \approx |H_i|$; full cancellation there would demand enormous anti-phase excursion. Practical systems therefore relax XTC below a few hundred hertz, where the wavelength is long, interaural cues are weak, and crosstalk matters least perceptually. Below roughly 100–300 Hz one typically lets the system behave like ordinary stereo or mono, which is acoustically benign because localization at those frequencies is dominated by ITD on slowly varying envelopes, not by the fine structure XTC protects.
At the high end, the sweet spot shrinks as $1/f$ (the few-millimetre tolerance computed above), HRTF detail becomes intensely individual, and tiny plant errors cause large phase errors. Robust cancellation usually fades out somewhere between 6 and 12 kHz depending on geometry and regularization. Between these walls lies the usable XTC passband, broadly a few hundred hertz up to several kilohertz, which fortunately overlaps the band most important for the ITD/ILD localization cues described in psychoacoustics.
Quantifying the trade-off
A compact way to summarize practical XTC performance is the achievable channel separation versus spectral flatness as regularization is varied. The table below gives representative orders of magnitude for a stereo-dipole desktop system; exact figures depend on the plant, the room, and the regularization profile.
| Design stance | Peak channel separation | Usable XTC bandwidth | Spectral colouration | Robustness to head motion |
|---|---|---|---|---|
| Aggressive (low regularization) | 20–30 dB | narrow, with spikes | severe (deep combs) | very poor (a few mm) |
| Balanced (shaped regularization) | 12–18 dB | ~300 Hz – 7 kHz | moderate | poor (~1 cm) |
| Conservative (high regularization) | 6–10 dB | broad but shallow | mild | fair (a few cm) |
The practical takeaway: 10–20 dB of separation over a well-chosen mid-band, kept flat enough to avoid obvious colouration, is a realistic and useful target. That is far less than the 40–60+ dB of isolation headphones provide for free, but it is enough to externalize images and to pull phantom sources well outside the loudspeaker span, which ordinary stereo cannot do at all.
Implementations and use cases
Desktop 3D audio
The stereo dipole's natural home is the desk: a single listener, roughly fixed, a short distance from two closely-spaced loudspeakers or a pair near a monitor. Historically this is the configuration that revived transaural in the 1990s — Cooper and Bauck's "transaural stereo," Gardner's MIT work on "3-D audio using loudspeakers," and the Kirkeby/Nelson stereo dipole all targeted the seated, single-listener case. Modern desktop implementations combine an HRTF-based binaural renderer (the encode) with a regularized XTC filter network (the decode), optionally with camera head-tracking to keep the null on the listener. The payoff is externalized, out-of-head 3-D imagery from two speakers — virtual sources beside and even behind the listener — that flat stereo cannot deliver. DAM Audio's implementation and measurements are documented under XTC / transaural crosstalk cancellation.
Soundbar and TV virtualization
The most widespread consumer use of transaural is the virtual surround soundbar. A soundbar places several drivers in a single cabinet under the television, very close together — a stereo dipole by construction. By feeding the drivers crosstalk-cancelled binaural renders of surround or height channels, a soundbar can place virtual sources well outside its physical width: at the sides for surround channels, and above for height channels in object-based formats. The narrow driver spacing that makes a soundbar visually convenient is exactly the stereo-dipole geometry that makes XTC robust over a wide mid-band, which is why transaural and soundbars are such a natural pairing. The limitation is equally inherent: the effect is strongest for one centrally-seated viewer and degrades for off-axis seats — the sweet-spot problem in the living room. Many products blend XTC virtualization with acoustic beam-steering and wall reflections to spread the effect, trading cancellation depth for a larger, if vaguer, listening area.
Comparison with headphones
It is worth being explicit about why anyone would do this rather than simply hand the listener a pair of headphones, since headphones deliver the binaural encode with essentially perfect channel isolation and no crosstalk problem at all.
- Isolation: headphones give 40–60+ dB of channel separation for free; transaural fights to reach 10–20 dB over a limited band. On pure cue fidelity, headphones win decisively.
- Externalization and naturalness: loudspeaker reproduction adds the listener's own outer-ear and room cues to the sound arriving from real external sources, which can make transaural images externalize more naturally and avoid the in-head, frontally-collapsed quality that plagues non-individualized headphone binaural. The sound genuinely comes from out there.
- Comfort and sharing: no hardware on the head; in principle multiple people can be in the room (though only one is in the sweet spot).
- Robustness: headphones are immune to head translation and to the room; transaural is exquisitely sensitive to both.
The two are best seen as complementary delivery routes for the same binaural encode: the encode (HRTF authoring) is shared, and only the decode differs — direct to the ears for headphones, or through an inverted plant for loudspeakers. Transaural is the choice when you want binaural spatialization without headphones and can accept a single, mostly-stationary listener.
A worked example: crosstalk smear and what XTC restores
To make the whole chain concrete, follow a single binaural front image through (a) naive playback over speakers and (b) XTC playback, and see what is lost and restored.
The intended image
We author a virtual source at azimuth $\theta_s$ (front-left) by filtering a mono signal with the corresponding HRTFs. Suppose at 2 kHz the source-direction HRTFs give the near (left) ear a level $A_L$ and the far (right) ear a level $A_R$ that is 9 dB lower, with an ITD of $T$ (left ear leading). So the intended eardrum signals at 2 kHz are
$$b_L = A_L, \qquad b_R = A_R\,e^{-j 2\pi f T}, \qquad 20\log_{10}(A_L / A_R) = 9\ \text{dB}.$$
At $f = 2$ kHz the interaural phase is $\phi_T = 2\pi f T$ rad. So the target is a left ear leading by a large phase and 9 dB louder than the right: a clear front-left percept.
Naive playback: what the ears actually get
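Send $x_L = b_L$, $x_R = b_R$ straight to loudspeakers at $\pm 5^\circ$ (a stereo dipole geometry) without XTC. The eardrum pressures are $\mathbf{p} = \mathbf{C}\,\mathbf{b}$. Write the target as $b_L = A_L$ and $b_R = A_R\,e^{-j\phi_T}$, with levels $A_L$, $A_R$ and ITD phase $\phi_T$, and use the symmetric plant with, at 2 kHz, ipsilateral $H_i = 1$ (reference) and contralateral $H_c = g\,e^{-j 2\pi f \tau}$ with $\tau = 45\ \mu$s (from the dipole table). Here $\phi_\tau = 2\pi f \tau = 2\pi \times 2000 \times 45 \times 10^{-6} \approx 0.565$ rad $\approx 32.4^\circ$, so $H_c = g\,e^{-j0.565}$. The left eardrum receives
$$p_L = H_i\,b_L + H_c\,b_R = A_L + g\,A_R\,e^{-j(\phi_\tau + \phi_T)}.$$
The second term has magnitude $g\,A_R$ and angle $-(\phi_\tau + \phi_T)$. Adding $b_L = A_L$: real part $A_L + g\,A_R\cos(\phi_\tau + \phi_T)$; imaginary part $-g\,A_R\sin(\phi_\tau + \phi_T)$; the magnitude $|p_L|$ and phase $\arg p_L$ follow from these. The right eardrum receives
$$p_R = H_c\,b_L + H_i\,b_R = g\,A_L\,e^{-j\phi_\tau} + A_R\,e^{-j\phi_T}.$$
First term $g\,A_L\,e^{-j\phi_\tau}$: real $g\,A_L\cos\phi_\tau$, imag $-g\,A_L\sin\phi_\tau$. Second term $A_R\,e^{-j\phi_T}$: real $A_R\cos\phi_T$, imag $-A_R\sin\phi_T$. Sum: real $g\,A_L\cos\phi_\tau + A_R\cos\phi_T$, imag $-(g\,A_L\sin\phi_\tau + A_R\sin\phi_T)$; the magnitude $|p_R|$ and phase $\arg p_R$ follow.
Now compare delivered to intended. Intended interaural level difference at 2 kHz: 9 dB. Delivered ILD: $20\log_{10}(|p_L| / |p_R|)$ dB, partly preserved here, but the phases are wrecked: intended interaural phase was $\phi_T$ (a large ITD-bearing difference), whereas delivered interaural phase is $\arg p_L - \arg p_R$. The crosstalk has injected a spurious second arrival at each ear, collapsing the interaural phase difference from $\phi_T$ toward a much smaller value and adding frequency-dependent comb ripple (work the same arithmetic at 1.5 and 3 kHz and the ILD and phase errors swing differently at each frequency; that is the comb colouration). The percept pulls in toward the loudspeakers (front-centre, narrow) and acquires a hollow timbre. The carefully authored front-left image is smeared.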
XTC playback: restoring the target
Now insert the regularized inverse before the speakers, so and the eardrum pressures become (exactly, in the unregularized ideal; approximately, with regularization). With the closed-form inverse at 2 kHz: . Compute : real , imag , so . The filter applies diagonal gain and cross gain .
The point of the algebra is not the intermediate numbers but the outcome: the delivered eardrum signals return to and , the intended binaural pair, with the full interaural phase and the full 9 dB ILD restored. The spurious crosstalk arrival has been actively cancelled by the cross term injected into the opposite speaker, and the front-left image snaps back into focus, externalized outside the narrow dipole. The cost we paid is visible in the filter gains ( and here, modest at this benign frequency, but rising toward the singular bands where regularization caps them) and in the few-millimetre sweet spot within which this cancellation holds. That is transaural in one example: crosstalk smears the binaural front image; the inverted plant restores it, for one listener, in one place, over one band.
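The restoration step can be checked with a few lines of code. This is a sketch under stated assumptions, not the chapter's own calculation: the plant values (contralateral gain 0.9, delay 60 µs), the ITD (0.15 ms) and the regularization value 0.01 are placeholders, and the helper names are hypothetical. The filter is the Tikhonov-regularized inverse (C^H C + beta I)^-1 C^H, which reduces to the closed-form inverse of the symmetric plant when beta is zero.

```python
import cmath
import math


def plant(freq_hz, cross_gain=0.9, cross_delay_s=60e-6):
    """Symmetric 2x2 plant: rows are ears, columns are loudspeakers."""
    c = cross_gain * cmath.exp(-2j * math.pi * freq_hz * cross_delay_s)
    return [[1 + 0j, c], [c, 1 + 0j]]


def matmul(a, b):
    """Product of two 2x2 complex matrices."""
    return [[a[i][0] * b[0][j] + a[i][1] * b[1][j] for j in range(2)]
            for i in range(2)]


def hermitian(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]


def inverse(a):
    """Inverse of a 2x2 matrix via the adjugate."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def xtc_filter(c_mat, beta=0.0):
    """Regularized inverse (C^H C + beta I)^-1 C^H; beta=0 gives the exact inverse."""
    c_h = hermitian(c_mat)
    gram = matmul(c_h, c_mat)
    gram[0][0] += beta
    gram[1][1] += beta
    return matmul(inverse(gram), c_h)


def deliver(c_mat, h_mat, target):
    """Eardrum pressures for a target pair sent through the filter, then the plant."""
    chain = matmul(c_mat, h_mat)
    return (chain[0][0] * target[0] + chain[0][1] * target[1],
            chain[1][0] * target[0] + chain[1][1] * target[1])


freq = 2000.0
c_mat = plant(freq)
# Target: left ear as reference, right ear 9 dB down (from the text), ITD assumed.
target = (1 + 0j, 10 ** (-9 / 20) * cmath.exp(-2j * math.pi * freq * 0.15e-3))

for beta in (0.0, 0.01):
    h_mat = xtc_filter(c_mat, beta)
    ears = deliver(c_mat, h_mat, target)
    err = max(abs(ears[0] - target[0]), abs(ears[1] - target[1]))
    print(f"beta={beta}: diagonal gain {abs(h_mat[0][0]):.3f}, "
          f"cross gain {abs(h_mat[0][1]):.3f}, worst ear error {err:.2e}")
```

With beta set to zero the eardrum pair matches the target to rounding error. With a small positive beta the match is approximate, which is the trade the regularized design accepts in exchange for bounded filter gain.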
Limits and pitfalls
Inherent limits
- Single listener, single position. The sweet spot is millimetres wide at high frequencies. Only head tracking relocates it, and even then only for the tracked person; bystanders are in anti-sweet-spots.
- Limited cancellation depth. Real systems achieve roughly 10–20 dB of separation over a mid-band, not the 40–60 dB of headphones. The illusion is therefore softer and more fragile than headphone binaural.
- Band-limited. XTC fades below a few hundred hertz (dipole roll-off) and above 6–12 kHz (millimetre sweet spot, individual HRTFs). The protected band is mid-range.
- Room-dependent. The plant model usually assumes the direct paths only; strong early reflections add un-modelled crosstalk that the filters cannot cancel and that re-smears the image. Transaural wants a relatively dead, reflection-controlled listening position — see reverberation and direct, diffuse and envelopment.
- HRTF individualization. The binaural encode and the plant both depend on the listener's anatomy; non-individual HRTFs cause front-back confusion and timbral error, just as in headphone binaural.
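The band limits listed above can be made visible with a small calculation. The sketch below assumes the same simplified symmetric plant used for illustration elsewhere (ipsilateral path 1, contralateral gain 0.9 and delay 60 µs, all placeholder values) and a hypothetical regularization value of 0.01. For that plant the two eigenvalues are 1 + c (both speakers in phase) and 1 - c (speakers out of phase), and the regularized inverse applies a gain of |eigenvalue| / (|eigenvalue|^2 + beta) to each mode.

```python
import cmath
import math


def mode_gains_db(freq_hz, beta, cross_gain=0.9, cross_delay_s=60e-6):
    """Filter gain (dB) for the in-phase and out-of-phase modes of a symmetric plant."""
    c = cross_gain * cmath.exp(-2j * math.pi * freq_hz * cross_delay_s)
    gains = []
    for eigenvalue in (1 + c, 1 - c):   # in-phase mode, out-of-phase mode
        mag = abs(eigenvalue)
        gains.append(20 * math.log10(mag / (mag * mag + beta)))
    return tuple(gains)


def peak_gain_db(freq_hz, beta):
    """Largest gain the XTC filter applies at this frequency."""
    return max(mode_gains_db(freq_hz, beta))


for f in (100.0, 500.0, 2000.0, 4000.0, 8000.0):
    raw = peak_gain_db(f, beta=0.0)
    capped = peak_gain_db(f, beta=0.01)
    print(f"{f:6.0f} Hz  unregularized {raw:6.1f} dB   beta=0.01 {capped:6.1f} dB")
```

In this model the unregularized gain climbs at low frequencies, where the out-of-phase mode radiates weakly, and again near the frequency where the contralateral delay equals half a period. Regularization bounds the gain: whatever the plant, the expression above cannot exceed 1 / (2 sqrt(beta)), about 14 dB for beta = 0.01. A real head and real loudspeakers will move these bands, so the numbers are indicative only.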
Common mistakes
- Over-aggressive regularization tuning. Chasing maximum separation with a tiny regularization parameter yields a measurement that looks great and a listening experience that is brittle, coloured, and exhausting. Shaped, frequency-dependent regularization beats a flat aggressive setting.
- Mismatched loudspeakers. XTC assumes plant symmetry; two drivers that differ by even 1 dB or a few microseconds inject residual crosstalk the filters cannot remove. Match and time-align the pair.
- Wide speaker placement. Using widely spaced stereo speakers for XTC starts the ill-conditioning combs below 2 kHz and demands heavy regularization. If transaural is the goal, narrow the span toward a stereo dipole.
- Ignoring the room. Placing the dipole near a hard desk surface or a wall introduces a strong reflection that acts as a third, un-cancelled source. Treat the first reflection points or pull the speakers away from boundaries.
- No head tracking, then blaming the algorithm. Much disappointing transaural is simply a listener sitting a few centimetres off the design point. Either fix the seating precisely or track the head.
- Forgetting it is still binaural. Transaural cannot rescue a bad binaural encode; garbage HRTFs in, garbage spatial image out. The decode only preserves what the encode provides.
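The loudspeaker-mismatch pitfall is easy to quantify in a toy model. The sketch below designs the exact inverse for a perfectly symmetric plant, then plays it through a plant whose right loudspeaker is slightly low in level, and reports the resulting channel separation at the left ear. The plant values (contralateral gain 0.9, delay 60 µs) are assumed placeholders, the function names are hypothetical, and timing mismatch and regularization are deliberately left out.

```python
import cmath
import math


def plant(freq_hz, right_speaker_db=0.0, cross_gain=0.9, cross_delay_s=60e-6):
    """2x2 plant (rows: ears, columns: speakers) with an optional right-speaker level error."""
    c = cross_gain * cmath.exp(-2j * math.pi * freq_hz * cross_delay_s)
    g = 10 ** (right_speaker_db / 20)   # linear gain of the right speaker
    return [[1 + 0j, c * g], [c, g + 0j]]


def exact_inverse(a):
    """Inverse of a 2x2 matrix via the adjugate."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def matmul(a, b):
    """Product of two 2x2 complex matrices."""
    return [[a[i][0] * b[0][j] + a[i][1] * b[1][j] for j in range(2)]
            for i in range(2)]


def separation_db(freq_hz, right_speaker_db):
    """Left-ear separation when filters designed for a matched pair drive a mismatched one."""
    h_mat = exact_inverse(plant(freq_hz))                     # design assumes symmetry
    chain = matmul(plant(freq_hz, right_speaker_db), h_mat)   # what the ears actually get
    wanted = abs(chain[0][0])
    leak = max(abs(chain[0][1]), 1e-15)   # floor avoids log(0) in the perfectly matched case
    return 20 * math.log10(wanted / leak)


for err_db in (0.0, -0.5, -1.0, -2.0):
    print(f"right speaker {err_db:+.1f} dB -> "
          f"separation {separation_db(2000.0, err_db):6.1f} dB")
```

In the matched case the separation is limited only by floating-point rounding. Any level error makes it finite, and it shrinks as the error grows, which is why matching and time-aligning the pair matters more than refining the filter design.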
When to choose transaural
Choose transaural when you need binaural-style 3-D imagery without headphones, for one listener who can stay roughly put: desktop 3-D audio, a personal monitoring station, a single-seat demo, or a soundbar's virtual surround. Prefer headphones when isolation and robustness matter more than listening without them. Prefer amplitude-panning methods (stereo, amplitude panning, surround), or Ambisonics and wave field synthesis, when you need a large listening area or many listeners, since those tolerate movement that XTC cannot. And remember the unifying view: transaural shares the binaural encode with headphone playback and differs only in the decode, an inverted acoustic plant standing between the loudspeakers and the ears. Related formats and their channel/object structure are catalogued in formats, and the broader case that even two channels already carry spatial information is made in stereo is already spatial.
References
- Atal, B. S., and Schroeder, M. R. (1966). "Apparent Sound Source Translator." U.S. Patent 3,236,949 — the originating concept of loudspeaker crosstalk cancellation.
- Bauck, J., and Cooper, D. H. (1996). "Generalized Transaural Stereo and Applications." Journal of the Audio Engineering Society, 44(9), 683–705.
- Cooper, D. H., and Bauck, J. L. (1989). "Prospects for Transaural Recording." Journal of the Audio Engineering Society, 37(1/2), 3–19.
- Kirkeby, O., Nelson, P. A., and Hamada, H. (1998). "The 'Stereo Dipole': A Virtual Source Imaging System Using Two Closely Spaced Loudspeakers." Journal of the Audio Engineering Society, 46(5), 387–395.
- Nelson, P. A., Hamada, H., and Elliott, S. J. (1992). "Adaptive Inverse Filters for Stereophonic Sound Reproduction." IEEE Transactions on Signal Processing, 40(7), 1621–1632.
- Kirkeby, O., and Nelson, P. A. (1999). "Digital Filter Design for Inversion Problems in Sound Reproduction." Journal of the Audio Engineering Society, 47(7/8), 583–595.
- Gardner, W. G. (1998). 3-D Audio Using Loudspeakers. Boston: Kluwer Academic Publishers.
- Blauert, J. (1997). Spatial Hearing: The Psychophysics of Human Sound Localization (Revised ed.). Cambridge, MA: MIT Press.
- Møller, H. (1992). "Fundamentals of Binaural Technology." Applied Acoustics, 36(3–4), 171–218.