Spatial Psychoacoustics
The Introduction made a promise: that every spatial audio system, from a humble stereo pair to a 64-channel dome, is in the end a machine for fooling two eardrums. This page makes good on that promise. It is the longest and most demanding chapter of Part I, because everything that follows in the guide, every panning law, every microphone array, every binaural filter, is an attempt to manufacture the cues described here. If you understand spatial hearing, you understand why the techniques work, not merely that they work.
We build the subject from first principles. Each mechanism gets an intuitive picture, a governing equation, and a worked number you can check on a calculator. By the end you should be able to look at a loudspeaker layout or a processing block and predict which localization cue it manipulates and how convincingly.
The problem the brain solves
Stand in a forest. Birds call from the canopy, a stream runs to your left, footsteps approach from behind. Your auditory system effortlessly assigns each sound a direction and a distance, separates overlapping sources, and tracks them as you turn your head. Yet the only data it receives are two scalar functions of time: the acoustic pressure at the left eardrum, , and at the right, . Two one-dimensional signals. From them the brain reconstructs a three-dimensional scene populated by multiple objects.
This is a profoundly underdetermined inverse problem. Consider what is lost. A direction in space needs (at least) two angles, azimuth (left/right and front/back) and elevation (up/down), plus a distance . That is three numbers per source, and there may be many sources, all summed together into the same two pressure waveforms:
where is the -th source signal, and are the acoustic transfer functions from that source to each ear (which depend on its direction and distance), and denotes convolution. The brain never observes , or the directions directly. It must infer them.
What makes the problem tractable is that the two ear signals are not independent. They are two views of the same world, captured by two sensors separated by roughly the width of a head and partially shadowed by it. The differences between and , in arrival time and in level, encode the lateral angle. The spectral shape imposed on each signal by the outer ear encodes elevation and resolves front/back. The ratio of direct to reflected energy and the absolute level encode distance. The brain has, over evolutionary and developmental time, learned the statistics of these transformations for its own head and ears, and it runs the inverse.
The remainder of this chapter dissects those cues one by one. We organize them as the field traditionally does: the two binaural (between-the-ears) cues first, ITD and ILD; then the theory that unifies them; then the monaural spectral cues that the binaural cues cannot supply; then the ambiguities that remain and how they are resolved; and finally the temporal and distance cues that place a source in depth.
A note on coordinates. Throughout we use a head-centred system: azimuth is measured in the horizontal plane, straight ahead, positive to the right; elevation is measured up from the horizontal plane; distance from the centre of the head. The interaural axis is the line through the two ear canals.
Interaural Time Difference (ITD)
Intuition. Your ears are about 18 cm apart. A sound coming from your right reaches the right ear before the left, because the path to the left ear is longer, it has to go around the head. That arrival-time difference is the single most reliable cue for the lateral position of a sound, and the auditory system is exquisitely tuned to it.
Geometry. Imagine the head as a rigid sphere of radius with the two ears at opposite ends of a diameter. For a source far enough away that the incoming wavefronts are effectively planar, the extra distance the sound must travel to the far ear has two parts: a straight-line component across the front of the head and a curved component wrapping around it. Working this geometry out gives Woodworth's classic low-frequency approximation for the path-length difference, and dividing by the speed of sound gives the time difference:
with in radians, head radius , and speed of sound (at about 20 °C). The term is the straight portion (the projection of the interaural axis onto the wave direction); the bare term is the arc the wave creeps around the sphere to reach the shadowed ear. The formula is exact in spirit for a sphere in the low-frequency, far-field limit and is accurate enough for design work across the audible range.
The maximum. ITD is greatest when the source is directly to one side, . Then
That is about 660 microseconds, the familiar quoted ceiling for the human ITD.
No natural free-field source can produce a larger interaural delay, which is why a delay of more than 0.66 ms between two otherwise identical signals does not push the image further out; it begins instead to be heard as a separate echo (we return to this under the precedence effect).
Sensitivity. Against that 660 microsecond full-scale range, the auditory system can resolve changes of roughly 10 microseconds near the midline, a temporal acuity finer than one cycle of the highest audible frequency and far finer than the firing jitter of any single neuron. It is achieved by coincidence detection: neurons in the medial superior olive fire maximally when spikes from the two ears arrive simultaneously, and arrays of them, fed by delay lines, effectively cross-correlate the two ear signals. Ten microseconds at the midline corresponds to a change in azimuth of only a degree or so, consistent with the minimum audible angle discussed later.
Worked example. Where does a source at sit? Convert: rad, . Then
So a sound 30° to the right reaches the right ear about a quarter of a millisecond before the left. Notice the function is roughly linear in near the front (small azimuths give proportional ITDs) and flattens toward the side, where adding more azimuth changes the delay very little, geometry alone tells us lateral resolution is best in front of us.
Phase ambiguity above ~1.5 kHz. The brain reads ITD largely from the phase of the waveform: it lines up the fine structure of and . That works only while the interaural delay is less than half a period, so that the matching peaks are unambiguous. The half-period of a sinusoid equals the maximum ITD, 0.66 ms, at a frequency of about , and ambiguity becomes serious by the time the wavelength approaches the head size, around 1.5 kHz. Above that the wavelength is shorter than the interaural path, several cycles fit in the delay, and the auditory system can no longer tell which cycle in one ear corresponds to which in the other. Fine-structure ITD therefore fails in the highs. (It survives in a weaker form as an envelope ITD, the delay between the slowly varying amplitude envelopes of high-frequency sounds, but the dominant high-frequency cue is something else entirely, which motivates the next section.)
Interaural Level Difference (ILD)
Intuition. Put your hand between a lamp and a wall and you cast a shadow. The head does the same to sound, but only when the sound's wavelength is small compared to the head. A high-frequency tone from the right is loud at the right ear and noticeably quieter at the left, because the head obstructs it. A low-frequency tone, whose wavelength is metres long, simply diffracts around the head as if it were not there, arriving at both ears at nearly equal level. So ILD is strongly frequency dependent: small at low frequencies, large at high frequencies.
The acoustic shadow. The relevant dimensionless quantity is , the head radius measured in wavelengths (with wavenumber ). When (long wavelengths, low frequencies) the sphere is acoustically transparent and the ILD is near zero. When the shadow deepens and the ILD grows. The crossover occurs at
so head shadowing starts to bite a little above 600 Hz and dominates by a few kilohertz. A useful rough model for the level difference of a far-field source is
where the slope rises from essentially dB at low frequency to large values in the highs. Empirically the maximum ILD (source fully to the side) climbs from a fraction of a dB at 200 Hz to perhaps 5–8 dB at 1 kHz and up to ~20 dB above 5–10 kHz.
Why it dominates the highs. Recall that fine-structure ITD becomes ambiguous above ~1.5 kHz. Conveniently, that is exactly where head shadowing has become strong, so the level cue takes over precisely where the timing cue fails. The two cues are complementary in frequency, the observation that organizes the whole of binaural hearing into the duplex theory below.
Worked example. Take a 6 kHz tone at and suppose the measured high-frequency slope is dB. Then . The right ear receives a signal more than 17 dB louder than the left, a ratio of about in pressure. By contrast a 300 Hz tone at the same angle, with of perhaps 1–2 dB, yields under 2 dB of difference: at low frequency the head barely shadows and the brain must rely on timing instead. This frequency split is the single most important fact about binaural localization, and it has a name.
The duplex theory and the 1.5 kHz crossover
Intuition. Lord Rayleigh, around 1907, put the two cues together into what is now called the duplex theory: lateral localization is served by interaural time differences at low frequencies and interaural level differences at high frequencies, with a transition region in between. Each cue covers the band where it is physically available and reliable, and the brain switches emphasis accordingly.
Why the crossover sits near 1.5 kHz. It is fixed by one length scale, the size of the head, viewed two ways.
- From the ITD side: phase-based timing is unambiguous only while the interaural delay is under half a wavelength. That condition breaks down as the wavelength shrinks toward the interaural path length, around 1.3–1.5 kHz. Above it, ITD in the fine structure is unusable.
- From the ILD side: head shadowing requires the wavelength to be comparable to or smaller than the head, i.e. , which begins around 600 Hz and yields robust level differences by 1–2 kHz.
The two transitions overlap in a band centred near . We can see the coincidence in a single relation, the wavelength equal to the interaural distance :
while the half-wavelength-equals-max-ITD criterion gives 760 Hz; the human system's effective changeover lies between these, near 1.5 kHz.
The practical consequence: the mid-frequency band around 1.5–3 kHz is the worst for lateralization, the timing cue has become ambiguous and the level cue is not yet at full strength. Localization blur is measurably largest there. Keep this in mind when reading the stereo and amplitude-panning chapters: panning laws implicitly trade on whichever cue is dominant in the signal's band.
Worked example of the handover. A broadband click at presents both cues at once. Its low-frequency content (say a 500 Hz component) carries an ITD of , which the brain reads from phase. Its high-frequency content (say 8 kHz) carries little usable ITD but an ILD of perhaps dB. Both point to the same 25° to the right, and because they agree, the image is crisp and stable. When a synthesis system makes them disagree, the listener hears a smeared, unstable, or "phantom-in-the-head" image, a failure mode every spatial engineer learns to recognize.
Spectral cues and the pinna: the HRTF
Intuition. ITD and ILD between them locate a sound on the left/right axis, but they say almost nothing about elevation or about front versus back. A source 30° up and a source 30° down on the same side produce nearly identical interaural time and level differences. Something else must disambiguate the vertical and front/back dimensions, and that something is the shape of your outer ear. The folds of the pinna, together with the head and torso, filter incoming sound in a way that depends on the direction it comes from, imprinting direction-specific peaks and notches on the spectrum. Your brain has learned to read that spectral fingerprint as an elevation and front/back cue.
The Head-Related Transfer Function. Formally, the complete acoustic transformation from a source in direction to each eardrum is captured by the Head-Related Transfer Function (HRTF), one for each ear:
where is the complex pressure spectrum at the ear and is the pressure that would exist at the head's centre with the head absent. Its time-domain counterpart, the Head-Related Impulse Response (HRIR), is what you convolve a dry signal with to synthesize a virtual source over headphones, this is the entire basis of binaural rendering. The HRTF is not a separate cue from ITD and ILD; it contains them. The interaural phase of versus is the ITD; the interaural magnitude ratio is the ILD. What the HRTF adds on top is the monaural spectral shape: the direction-dependent comb of peaks and notches imposed by the pinna.
Elevation from notches. The dominant pinna feature is a spectral notch in the 5–12 kHz region whose centre frequency moves upward as the source rises. Physically the notch arises from interference between the direct sound entering the ear canal and a copy reflected off a pinna ridge; if the reflection is delayed by relative to the direct path, destructive interference (a notch) occurs at
As elevation changes, the geometry changes the reflection delay , and so the notch frequency sweeps. The brain reads "notch near 7 kHz, therefore source somewhat below ear level," and so on. Because these features lie above 5 kHz, elevation perception depends on high-frequency content: low-pass-filtered or narrowband sounds are hard to place vertically.
Front/back from spectral balance. A source behind you is shadowed by the pinna (whose opening faces forward) differently from one in front: rear sources have relatively attenuated high frequencies and a characteristic notch pattern. This is a weak, easily overridden cue, which is why front/back confusions are common (see the cone of confusion next).
Individuality. Here is the catch for engineering. Pinna geometry is as individual as a fingerprint, so the fine structure of the HRTF, the exact notch frequencies, differs from person to person.
Listening through someone else's HRTF (a generic dummy-head set, for instance) degrades elevation accuracy and increases front/back confusions and in-head localization.
Personalization of HRTFs is a live problem in binaural audio, addressed in depth in the binaural chapter.
Worked example. Suppose a pinna reflection arrives after the direct sound for a source at ear level. The first notch sits at . If the source rises and the geometry shortens the reflection delay to , the notch climbs to . The brain interprets the 2 kHz upward shift of the notch as a rise in elevation, no interaural information required, which is exactly why a single ear, plugged on the other side, can still judge elevation while being nearly blind to azimuth.
The cone of confusion
Intuition. Because the two main binaural cues depend essentially on a single geometric quantity, the angle to the interaural axis, all the directions that share that angle share (nearly) the same ITD and ILD. Those directions form a cone whose apex is at the centre of the head and whose axis is the interaural axis. Anywhere on that cone of confusion, the binaural cues are ambiguous; a source in front and one behind, or one above and one below, can map to the same delay and level difference.
Geometry. Idealize the head as a sphere and the ITD as depending only on the angle between the source direction and the interaural axis. The ITD is then constant on each cone :
so a fixed ITD selects a fixed , i.e. a whole cone of directions, not a single point. The classic confusions are special cases: front/back pairs (mirror across the frontal plane) and up/down pairs both lie on the same cone of constant interaural angle. ITD and ILD alone cannot break the tie.
How the ambiguity is resolved. Two mechanisms rescue us.
- Spectral cues. The pinna notches of the previous section differ between front and back and between high and low, even when the interaural cues are identical. They collapse the cone to a point, when they are present and when the listener's own HRTF is in play. With a foreign HRTF, the cone reasserts itself and front/back errors multiply.
- Head movement. This is the elegant one, first analyzed by Wallach (1940s). Rotate your head a few degrees and watch how the ITD changes. For a source in front, turning to the right decreases the right-ear lead; for a source behind, the same head turn increases it. The two move in opposite directions, so a tiny voluntary rotation instantly disambiguates front from back, and the rate and sign of ITD change with head angle also pins down elevation. The brain correlates self-motion with the resulting cue changes, a dynamic cue that is far more robust than the static spectral one.
Worked example. A source directly to the right (, ) gives the maximum ITD, 660 microseconds. Now consider a source at but elevated 40°. It still lies on the cone of small , so its ITD is almost the maximum, the interaural cue barely distinguishes it from the source on the horizon. If the listener tilts or turns the head by 10°, the ITDs of the two candidate positions diverge by tens of microseconds, well above the 10 microsecond threshold, and the ambiguity dissolves.
This is why head-tracking transforms binaural audio: a static binaural render leaves listeners stuck on the cone of confusion; a head-tracked one lets them use the very mechanism evolution gave them. The implication recurs throughout the binaural and transaural chapters.
The precedence effect (Haas)
Intuition. Real rooms are full of reflections. Within a few milliseconds of the direct sound, dozens of delayed, attenuated copies arrive off walls, floor and ceiling, each from a different direction. If the auditory system localized each copy separately, every word in a room would seem to come from everywhere at once. It does not, because of the precedence effect (also called the law of the first wavefront, or the Haas effect): when a sound and its near-copies arrive within a short window, they are fused into a single perceived event whose location is set by the first arrival, the direct sound, while the later copies merely add loudness and spaciousness.
The phenomena, and their timescales. Let a direct sound be followed by a single copy delayed by .
- : summing localization. The two are fused and the perceived direction is a weighted average of both, this is the regime stereo loudspeakers exploit. A pan between two speakers works because delays and level differences under about a millisecond sum into one phantom image (covered in stereo and amplitude-panning).
- : precedence proper. Still fused into one event, but now the location is dominated by the first wavefront. The lagging copy can be considerably louder, by up to roughly 10–12 dB (this is Haas's specific finding), and the listener still localizes toward the leading source. The lag contributes timbre and loudness, not direction.
- : beyond the echo threshold. Fusion breaks; the copy is heard as a separate echo with its own location. The threshold is not a hard line, it depends on signal type (transients break fusion earlier than sustained sounds) and rises as the room "primes" the listener.
A compact way to state localization dominance: if the leading sound carries cue value and the lag , the perceived direction during precedence is
where the weight is near in the summing regime (both contribute) and climbs toward 1 as the lag falls in the precedence window (the lead dominates).
Worked example: Haas's classic figure. A talker stands in front; a loudspeaker to the side reproduces the same speech delayed by ms. To make the side speaker capture the image, you must raise its level by roughly 10 dB over the direct sound, and even then the perceived direction is only partly pulled aside. Below that level the listener localizes the original talker despite the loudspeaker being plainly audible. Sound reinforcement design lives on this fact: delay-and-fill loudspeakers are time-aligned so that the nearest (and earliest) source wins the localization, keeping voices anchored on the stage even when distant speakers supply most of the energy. Conversely, in reverberation and room chapters we will see how reflections arriving outside the precedence window create echo, coloration, and the sense of a space.
Why it matters here. The precedence effect is the bridge between the clean free-field cues above and the messy reality of rooms. It is the reason a single, correct first wavefront can survive a storm of reflections, and the reason spatial systems can place an image even in reverberant venues.
It also sets a hard constraint on multi-speaker systems: get the timing of the first arrival right, or no amount of correct level will fix the localization.
Distance perception
Intuition. Direction is only two of the three numbers we owe the inverse problem; the third is how far. Distance is the weakest and most context-dependent of the spatial percepts, because no single cue maps cleanly onto metres. The brain fuses several: overall loudness, the ratio of direct to reverberant energy, high-frequency air absorption, the delay before the first reflection, and (up close) curvature of the wavefront. None is individually reliable; together they yield a usable, though compressed, sense of depth. This section is a primer; the dedicated treatment is in Distance and Air.
Level and the inverse law. In a free field, sound pressure from a point source falls off inversely with distance:
Doubling the distance halves the pressure, a drop of dB, the familiar −6 dB per doubling (inverse-square in intensity, inverse in pressure). But loudness is an ambiguous distance cue on its own: a quiet sound nearby and a loud sound far away can reach the ear at the same level. The brain only converts level to distance when it has a prior expectation of the source's power (a whisper is intrinsically quiet; a shout is not).
Direct-to-reverberant ratio (DRR). In a room the direct sound obeys the inverse law, but the reverberant field is roughly uniform throughout the space. So as a source recedes, the direct energy falls while the reverberant energy stays put, and their ratio drops monotonically with distance:
Because the reverberant field is fixed by the room and largely independent of source power, DRR is a far more absolute distance cue than loudness, it does not require knowing how loud the source "should" be. This is why the same dry signal sounds nearer than its reverberant version, and why reverb sends are the engineer's primary depth control. Detailed in Reverberation.
Air absorption and the initial time gap. Over longer distances the atmosphere preferentially absorbs high frequencies (a few dB per 100 m at 4 kHz, more above), so distant sources sound duller; the brain uses spectral tilt as a (weak) distance cue. And the initial time gap between the direct sound and the first strong reflection shrinks as a source moves away from the listener toward a wall, supplying another depth hint.
The compressive power law. Crucially, perceived distance is not veridical. Pooling many studies, Zahorik and colleagues found that apparent distance follows a compressive power law of physical distance :
With an exponent below 1, listeners overestimate near distances and underestimate far ones, pulling the whole auditory world toward a comfortable middle distance.
A worked case: take , . A source physically at m is judged at m; one at m is judged at m. The 4 m and 9 m sources, separated by 5 physical metres, are perceived only 1 m apart, the depth axis is heavily compressed. Spatial systems that try to render large distance ranges must contend with this nonlinearity; it is far easier to convey that one source is farther than another than to convey how many metres.
Resolution limits: minimum audible angle and JNDs
Every cue has a finite acuity, and those limits set the resolution of the whole perceptual system, and therefore the precision any reproduction system needs to achieve.
Minimum audible angle (MAA). The smallest change in source azimuth a listener can reliably detect. It is best straight ahead, around 1° for broadband sounds, and degrades toward the sides, rising to perhaps 10° or more at . The pattern follows directly from the ITD geometry: recall , whose slope is largest at (value 2) and smallest at (value 1, but with the level cue also flattening). A given just-noticeable change in ITD therefore corresponds to a smaller angular change in front, finer angular resolution where the geometry is most sensitive. The MAA also worsens in the mid-frequency duplex gap around 1.5–3 kHz, exactly as the duplex theory predicts.
JNDs for the binaural cues. The just-noticeable differences are remarkably small:
- ITD JND: about 10–20 microseconds near the midline (rising toward the sides), against a 660 microsecond full range.
- ILD JND: roughly 0.5–1 dB, against a high-frequency range of up to ~20 dB.
A quick consistency check ties MAA to the ITD JND. Near the front the ITD slope is . A 10 microsecond ITD JND therefore corresponds to about , in excellent agreement with the measured ~1° frontal MAA. The geometry and the neural threshold predict the behavioural limit.
The table summarizes the cue inventory.
| Cue | Physical quantity | Dominant band | Full range | JND / acuity | Codes |
|---|---|---|---|---|---|
| ITD | Interaural time delay | < ~1.5 kHz | ~0–660 µs | ~10 µs (front) | Left/right (lateral) |
| ILD | Interaural level diff. | > ~1.5 kHz | ~0–20 dB | ~0.5–1 dB | Left/right (lateral) |
| Spectral (HRTF notches) | Monaural spectrum | > ~5 kHz | notch sweep 5–12 kHz | individual | Elevation, front/back |
| Head movement | ITD/ILD change vs. motion | broadband | — | — | Resolves cone of confusion |
| Precedence | First-arrival timing | broadband | fusion to ~30–40 ms | echo threshold | Localization in rooms |
| Loudness / DRR | Level, direct:reverb | broadband | room-dependent | several dB | Distance |
Implications for spatial system design
Pulling the threads together, the psychoacoustics dictates the engineering. A spatial reproduction system succeeds to the exact degree that it delivers consistent, correct versions of these cues to the two eardrums.
- Lateral imaging needs the right cue in the right band. Below ~1.5 kHz, get the interaural time/phase relationship right; above it, get the interaural level right. Stereo and amplitude panning generate ILD-like (and incidental ITD-like) differences between phantom-image contributions, Ambisonics reconstructs a sound field whose decoded ear signals approximate the correct cues over a sweet spot, and WFS physically rebuilds wavefronts so the cues are correct over an extended area. Whatever the method, the band-dependence of the duplex theory is the yardstick.
- Cue consistency is everything. When ITD and ILD disagree, or when a generic HRTF imposes the wrong spectral notches, the image smears, collapses into the head, or jumps front-to-back. Much of the art of binaural and transaural rendering is keeping all the cues mutually consistent.
- Elevation and front/back are fragile and high-frequency. They live in the pinna spectrum above 5 kHz and in head motion. Systems that ignore individualized HRTFs or omit head-tracking forfeit vertical accuracy and invite front/back confusion, the cone of confusion is the default failure mode.
- The first wavefront wins. Precedence means timing of the direct sound dominates localization even amid reflections; multi-loudspeaker and reinforcement systems must time-align so the intended source arrives first. Reflections then do their proper job, conveying distance, reverberation, and envelopment.
- Distance is compressed and cue-poor. Use DRR and reverberation as the primary depth tools, expect the perceived-distance power law to compress your range, and do not rely on loudness alone.
- Match the system's precision to perception. There is no point synthesizing ITDs to nanosecond accuracy when the ear's JND is ~10 microseconds, nor in ignoring a 1 dB ILD error in the highs when that is near the ILD JND. The resolution limits tell you where the engineering budget should go.
Every technique in Part II is, at bottom, a different strategy for delivering these cues, and every limitation you will meet, sweet-spot size, timbral coloration, front/back error, traces back to a cue that the method approximates rather than reproduces. With the perceptual machinery now in hand, the next chapter surveys the formats and representations, channels, objects, scenes, that spatial audio uses to carry these cues from production to playback.
References
- Blauert, J. (1997). Spatial Hearing: The Psychophysics of Human Sound Localization (revised ed.). MIT Press. The standard reference; comprehensive treatment of ITD, ILD, spectral cues, and the cone of confusion.
- Rayleigh, Lord (Strutt, J. W.) (1907). "On our perception of sound direction." Philosophical Magazine, 13(74), 214–232. Origin of the duplex theory.
- Woodworth, R. S. (1938). Experimental Psychology. Henry Holt. Source of the low-frequency ITD path-length approximation .
- Wallach, H., Newman, E. B., & Rosenzweig, M. R. (1949). "The precedence effect in sound localization." American Journal of Psychology, 62(3), 315–336. Also Wallach's earlier work on head movement and the cone of confusion.
- Wightman, F. L., & Kistler, D. J. (1989). "Headphone simulation of free-field listening. I & II." Journal of the Acoustical Society of America, 85(2), 858–878. Foundational HRTF measurement and validation.
- Haas, H. (1951). "Über den Einfluss eines Einfachechos auf die Hörsamkeit von Sprache." Acustica, 1, 49–58. The Haas (precedence) effect and the ~10–12 dB level allowance.
- Litovsky, R. Y., Colburn, H. S., Yost, W. A., & Guzman, S. J. (1999). "The precedence effect." Journal of the Acoustical Society of America, 106(4), 1633–1654. Modern review of fusion, echo threshold, and localization dominance.
- Zahorik, P., Brungart, D. S., & Rabinowitz, W. M. (2005). "Auditory distance perception in humans: A summary of past and present research." Acta Acustica united with Acustica, 91(3), 409–420. Source of the compressive perceived-distance power law.