Direct, Diffuse & Envelopment

A loudspeaker, or a pair of headphones, can put energy into the air. What it cannot do directly is decide whether you will hear that energy as a sharp point in space or as a sound that wraps around you and fills the room. That decision is made by your auditory system, and it hinges on a single property of the signals reaching your two ears: how similar they are to each other. Sound that is nearly identical at the two ears is fused into a compact, localizable image. Sound that differs from ear to ear, moment to moment, refuses to fuse and is instead heard as spacious, wide, enveloping.

This chapter is about that split. It is the deepest expression of the theme running through this whole guide: a spatializer does not merely place a source at an angle, it must reproduce the perceptual cues that make a placement believable. Width and envelopment are not angles at all. They are statistical properties of the sound field — coherence, correlation, the ratio of organized to disorganized energy — and they must be engineered deliberately. Get them wrong and a source that is technically "in the right place" still sounds flat, dead, pasted on. Get them right and even a modest layout breathes.

We will build the picture from first principles: the perceptual division between direct and diffuse sound, the two great spatial impression descriptors (apparent source width and listener envelopment), the coherence metrics that quantify them, the binaural cues underneath, and finally decorrelation as the engineering tool that turns a handful of loudspeakers into a convincing diffuse field. Along the way we meet the comb-filtering trap that punishes the naive approach, and we connect throughout to the psychoacoustics chapter where the underlying hearing mechanisms were established.

The perceptual split: coherent versus incoherent sound

Stand in any real room and clap once. The first thing your ears receive is the direct sound, travelling in a straight line from your hands. A few milliseconds later come the early reflections off the nearest surfaces, then a rising thicket of later reflections, and finally a smooth, dense tail in which individual reflections can no longer be distinguished. That tail is the diffuse field: energy arriving from effectively all directions at once, with no preferred direction and no recoverable structure.

These two regimes differ in a way that matters more to spatial hearing than their loudness or their timing. The direct sound is coherent: the waveform at your left ear and the waveform at your right ear are the same signal, merely shifted in time and tilted in level by the geometry of your head. The diffuse field is incoherent: at any instant the pressure at your left ear is the sum of thousands of reflections with random phases, the pressure at your right ear is a different random sum, and the two are statistically independent. The auditory system treats these two situations completely differently.

Why coherence decides fusion

Localization, as developed in the psychoacoustics chapter, runs on interaural time differences (ITD) and interaural level differences (ILD). To extract an ITD the brain must find the time shift that best aligns the left-ear and right-ear signals — it is, in effect, performing a running cross-correlation. This only yields a sharp, unambiguous answer when the two signals are genuinely shifted copies of one another, i.e. when they are coherent. A clean direct sound produces a tall, narrow correlation peak at one specific delay, and the brain reads that delay as a single, confident direction. The image is compact, the source is "a point".

When the two ear signals are incoherent, the cross-correlation is low and flat — there is no dominant peak, no single best delay, no consistent ILD. The auditory system cannot collapse this into a point. Instead it perceives the sound as spatially extended, diffuse, coming from a region rather than a place. The remarkable consequence is that the same total energy can be heard as a pinpoint or as an all-around wash, depending only on the correlation between the ear signals.

Key takeaway

Coherence is the master control of spatial impression.

The balance is the percept

Real listening is never purely one or the other. Every natural sound reaches you as a mixture: a coherent direct component that localizes it, plus an incoherent diffuse component contributed by the room. The balance between them is itself a powerful perceptual variable. A high proportion of coherent direct sound means a close, dry, sharply located source. A high proportion of incoherent diffuse sound means a distant, reverberant, spatially diffuse one. The brain reads this ratio continuously and uses it, among other things, to judge distance — a link we develop fully in the section on the direct/diffuse ratio and in the distance and air chapter.

So the job of a spatializer divides cleanly. It must deliver a coherent direct component, correctly shaped in ITD/ILD (or in inter-channel level/time for loudspeakers), to place the source. And it must deliver an incoherent diffuse component, spread across the reproduction system and decorrelated, to give the source size and to wrap the listener in space. The first is panning. The second is the subject of most of this chapter, and it is where naive implementations most often fail.

Apparent Source Width (ASW)

The first of the two classic spatial-impression descriptors is apparent source width, usually abbreviated ASW. It is the perceived horizontal extent of the sound source itself — whether a violin is heard as a thin vertical line in space or as a broad, present body of sound roughly the width of the real instrument plus its immediate acoustic surroundings. ASW is sometimes called "early spatial impression" because of what drives it.

Early lateral reflections are the engine

The single most important physical cause of ASW is the arrival of strong early lateral reflections: reflections that reach the listener from the sides within roughly the first $80$ ms after the direct sound. A reflection arriving from the side is largely decorrelated from the direct sound at the two ears, because the side path imposes a different ITD and ILD than the frontal direct path. The reflection reaches the nearer ear earlier and louder than the farther ear, so the sum of direct sound and reflection differs between the ears and fluctuates over time: a partially incoherent combination. Within the temporal "fusion window" (the roughly $50$ ms window that also produces the precedence effect, see psychoacoustics) these early reflections are not heard as separate echoes. They are fused with the direct sound into a single, broadened auditory event. The source grows wider.

The crucial word is lateral. Reflections arriving from the front or from directly overhead add the same (or nearly the same) signal to both ears and therefore keep coherence high; they make the sound louder and perhaps a bit fuller but they do not broaden it. Only the lateral component — energy arriving from the sides, generating interaural differences — drives width. This is why concert-hall acoustics prizes narrow, tall, side-reflecting shoebox geometries: the side walls launch strong lateral energy at the audience and the orchestra is heard as wide and enveloping rather than as a flat panel at the far end.

The lateral energy fraction and partial decorrelation

Acousticians quantify this with the lateral energy fraction, LF (sometimes LFE in older texts, not to be confused with the low-frequency effects channel). It is the ratio of sound energy arriving laterally in the early window to the total early energy:

$\mathrm{LF} = \frac{\text{early energy from the sides, roughly 5 to 80 ms}}{\text{total early energy, 0 to 80 ms}}$

The lateral component is typically captured with a figure-of-eight microphone oriented with its null pointing at the source, so it responds to sideways arrivals and rejects the frontal direct sound, while the total is captured with an omnidirectional microphone. LF is standardized as a room-acoustic descriptor in ISO 3382-1. Typical good concert halls run an LF of about $0.15$ to $0.25$; values much below that sound narrow and "in front of you", values much above sound diffuse and spacious.

What ASW really reflects is partial decorrelation. If the early field were perfectly coherent with the direct sound, the source would be a point. If it were perfectly incoherent, the source would dissolve into diffuseness and lose its location. ASW lives in between: enough decorrelation to broaden the image, not so much that it stops fusing into a single object with a definable centre. This is the same balance we will engineer deliberately with decorrelators later.

Worked example: estimating the width contribution of one reflection

Consider a frontal direct sound and a single early reflection arriving from $60$ degrees to the left, $12$ ms after the direct sound, attenuated by $4$ dB relative to it. Is this likely to broaden the image, and by roughly how much energy?

First, timing. $12$ ms is well inside the roughly $50$ ms fusion window, so the reflection fuses with the direct sound rather than being heard as an echo: it contributes to width, not to echo. Second, level. A $4$ dB attenuation means the reflection carries $10^{-4/10} \approx 0.40$ times the energy of the direct sound. Third, laterality. A $60$-degree arrival is strongly lateral, so for this rough estimate we count all of its energy as lateral (a figure-of-eight microphone would weight it by $\sin^2 60^\circ = 0.75$). If the direct sound carries $1.0$ energy unit and this lone reflection $0.40$, and we crudely treat the direct sound as non-lateral, the early lateral fraction would be about $\frac{0.40}{1.0 + 0.40} \approx 0.29$. That is a high LF: this reflection alone would produce a noticeably wide, spacious image. In a real room there would be many such reflections from both sides; their summed lateral energy, partially decorrelated from the direct sound and from each other, is what sets the final perceived width.
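The same estimate can be written as a few lines of Python. This is a minimal sketch, not a standards-compliant LF measurement: the function names, the tuple layout and the `lateral_weight` parameter are illustrative assumptions, and the direct sound is treated as entirely non-lateral, as in the hand calculation.

```python
def db_to_energy(level_db):
    """Convert a level in dB (relative to the direct sound) to linear energy."""
    return 10.0 ** (level_db / 10.0)


def lateral_fraction(direct_energy, reflections):
    """Crude early lateral fraction for a direct sound plus discrete reflections.

    reflections: list of (delay_ms, level_db, lateral_weight) tuples, where
    lateral_weight (0..1) is the share of that reflection's energy counted
    as lateral. Assumption of this sketch: the direct sound is non-lateral.
    """
    total = direct_energy
    lateral = 0.0
    for delay_ms, level_db, lateral_weight in reflections:
        if delay_ms > 80.0:
            continue  # outside the early window, not counted at all
        e = db_to_energy(level_db)
        total += e
        if delay_ms >= 5.0:  # lateral energy is counted from 5 ms onward
            lateral += lateral_weight * e
    return lateral / total


# One reflection: 12 ms late, 4 dB down, all of its energy counted as lateral.
lf_crude = lateral_fraction(1.0, [(12.0, -4.0, 1.0)])
# Same reflection with a figure-of-eight style weight of sin^2(60 deg) = 0.75.
lf_weighted = lateral_fraction(1.0, [(12.0, -4.0, 0.75)])
print(f"crude LF: {lf_crude:.3f}, weighted LF: {lf_weighted:.3f}")
```

With the unrounded energy the crude figure comes out slightly under the hand estimate, which used the rounded value $0.40$; the weighted figure shows how much the "all lateral" simplification inflates the result.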

Practical lesson

The practical lesson for a spatializer is that ASW is built by feeding partially decorrelated early energy into the side channels (or the lateral HRTF directions in binaural). Simply turning up the reverb send does not do it if that reverb arrives coherently or frontally.

Listener Envelopment (LEV)

The second descriptor is listener envelopment, LEV — the sense of being inside the sound rather than facing it. Where ASW is about the size of the source, LEV is about the listener's own immersion: the feeling that sound surrounds you and arrives from every direction, the feeling a great hall gives when the reverberant tail seems to hold you within it. LEV is widely regarded as the single most important contributor to the subjective impression of "being there", and it is the hardest thing to fake with a small number of loudspeakers.

The late, diffuse, all-directional field

LEV is driven by the late part of the energy — reflections arriving after roughly $80$ ms — provided that this late energy is strong, arrives roughly evenly from all directions (including the sides and ideally above), and is decorrelated at the two ears. This is precisely the diffuse field defined at the start of the chapter. The key requirements are three:

Lateness: the energy must come after the early-reflection window. Late arrivals no longer fuse with the direct sound into a single broadened object; instead they form a separate, sustained spatial impression that surrounds the listener.
Direction: the energy must arrive from many directions, with significant lateral and rear content. Energy arriving only from the front produces reverberation you face, not envelopment you sit inside.
Incoherence: the left-ear and right-ear signals must be decorrelated. Coherent late energy would pull back toward a localized phantom; incoherent late energy stays unlocalizable and therefore enveloping.

Bradley and Soulodre, followed by Beranek and others, isolated late lateral energy as the principal physical correlate of LEV. A useful descriptor is the late lateral energy level, often written $G_{LL}$ or $L_J$, which measures the strength of laterally arriving sound in the late window relative to a free-field reference at $10$ m. The stronger and more lateral the late energy, the greater the envelopment.

ASW and LEV are genuinely different

It is worth stressing that ASW and LEV are separable percepts with separate physical causes, even though both depend on lateral, decorrelated energy. ASW depends on early lateral energy and broadens the source; LEV depends on late lateral energy and surrounds the listener. You can have one without the other. A close-miked, lightly reverberated pop vocal can be made wide (high ASW via early decorrelated energy) while sitting in an almost non-enveloping space (low LEV). A distant orchestra in a long-reverberation hall can be enveloping (high LEV) while each instrument remains fairly compact (moderate ASW).

Don't conflate the two

A spatializer that conflates the two — for instance by using the same reverb send to try to control both — will never get both right. The two must be addressed with different time windows and different routing, a point we return to when we discuss reverberation.

Worked example: early/late split of a reverberation tail

Suppose a synthetic room impulse response delivers, at the listener, the following energy budget relative to the direct sound (taken as $0$ dB): early reflections ($0$ to $80$ ms) summing to $-3$ dB, and a late tail (beyond $80$ ms) summing to $-6$ dB. Of the early energy, suppose $25$ percent is lateral; of the late energy, suppose $60$ percent is lateral and well distributed in direction.

Converting to linear energy with the direct sound at $1.0$: early total $10^{-3/10} \approx 0.50$, of which lateral is $0.25 \times 0.50 = 0.125$. Late total $10^{-6/10} \approx 0.25$, of which lateral is $0.60 \times 0.25 = 0.15$. So the late lateral energy ($0.15$) actually slightly exceeds the early lateral energy ($0.125$). This room will read as fairly enveloping (substantial decorrelated late lateral energy) while offering only moderate source widening. If a mix engineer complained that the result "surrounds me but every instrument is still a bit narrow", the fix would be to push more decorrelated energy into the early lateral window (raise early lateral content) rather than lengthening the tail. The numbers tell you which knob to turn.
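The budget above is easy to recompute for other rooms. The following sketch simply repeats the arithmetic; the function name and its parameters are hypothetical, chosen for this illustration.

```python
def lateral_budget(early_db, late_db, early_lateral_share, late_lateral_share):
    """Return (early_lateral, late_lateral) as linear energy.

    Levels are in dB relative to the direct sound (direct = 1.0 linear);
    shares are the fractions of each window's energy that arrive laterally.
    """
    early_total = 10.0 ** (early_db / 10.0)
    late_total = 10.0 ** (late_db / 10.0)
    return early_lateral_share * early_total, late_lateral_share * late_total


early_lateral, late_lateral = lateral_budget(-3.0, -6.0, 0.25, 0.60)
print(f"early lateral energy: {early_lateral:.3f}")
print(f"late lateral energy:  {late_lateral:.3f}")
if late_lateral > early_lateral:
    print("late lateral dominates: expect more envelopment than width")
else:
    print("early lateral dominates: expect more width than envelopment")
```

Comparing the two returned numbers is the "which knob to turn" test in executable form: raise whichever window is short of lateral energy.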

Coherence metrics: IACC and ICC

Everything above rests on the idea of coherence between two signals. To engineer it we need to measure it. Two closely related metrics dominate: one defined at the ears (binaural) and one defined between channels (loudspeaker or microphone).

Interaural cross-correlation (IACC)

The interaural cross-correlation coefficient, IACC, measures the similarity of the signals at the two ears. It is computed as the maximum of the normalized cross-correlation function between the left and right ear signals, searched over the physiologically plausible delay range of plus or minus $1$ ms (because the head can impose at most about that much ITD). In words:

$\mathrm{IACC} = \max_{\tau \in [-1\,\mathrm{ms},\, +1\,\mathrm{ms}]} \text{(normalized cross-correlation of left-ear and right-ear pressure)}$

The normalization (dividing by the square root of the product of the two signals' energies) confines the cross-correlation function to the range $-1$ to $+1$; because IACC is conventionally taken as the maximum of its absolute value, it runs between $0$ and $1$. An IACC of $1.0$ means the ear signals are identical apart from a delay: perfectly coherent, a tight central image. An IACC near $0$ means the ear signals are statistically independent: fully diffuse, maximally enveloping. ISO 3382-1 standardizes IACC and its frequency-band variants. A common derived quantity is $(1 - \mathrm{IACC})$, which rises with spatial impression, and the early/late split $\mathrm{IACC_{early}}$ (tied to ASW) versus $\mathrm{IACC_{late}}$ (tied to LEV).
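Written out under the convention used in ISO 3382-1, the normalized interaural cross-correlation function and the coefficient derived from it are as follows. The symbols $p_L$ and $p_R$ for the ear-signal pressures and $t_1$, $t_2$ for the integration limits are notation introduced here for illustration.

$\mathrm{IACF}_{t_1,t_2}(\tau) = \frac{\int_{t_1}^{t_2} p_L(t)\, p_R(t+\tau)\, dt}{\sqrt{\int_{t_1}^{t_2} p_L^2(t)\, dt \, \int_{t_1}^{t_2} p_R^2(t)\, dt}}$

$\mathrm{IACC}_{t_1,t_2} = \max_{|\tau| \le 1\,\mathrm{ms}} \left| \mathrm{IACF}_{t_1,t_2}(\tau) \right|$

Choosing $t_1 = 0$ and $t_2 = 80$ ms gives the early variant; starting at $t_1 = 80$ ms with a long upper limit gives the late variant.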

Frequency matters. IACC is usually reported in mid-frequency octave bands (around $500$ Hz to $2$ kHz) for spatial impression, because the ITD-based fusion that drives width and envelopment is strongest at low and mid frequencies. At high frequencies the head shadow decorrelates the ears for almost any field, so high-band IACC is less diagnostic of spaciousness.
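As an illustration of the broadband definition, here is a small pure-Python sketch that searches the normalized cross-correlation over plus or minus $1$ ms. It is not a measurement tool: there is no octave-band filtering and no early/late windowing, the function name `iacc` and the $8$ kHz sample rate are arbitrary choices for the demo, and white noise stands in for real ear signals.

```python
import math
import random


def iacc(left, right, sample_rate, max_lag_ms=1.0):
    """Peak magnitude of the normalized cross-correlation within +/- max_lag_ms.

    left and right must have the same length.
    """
    max_lag = int(round(sample_rate * max_lag_ms / 1000.0))
    norm = math.sqrt(sum(v * v for v in left) * sum(v * v for v in right))
    n = len(left)
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        acc = 0.0
        # Only sum over indices where both left[i] and right[i + lag] exist.
        for i in range(max(0, -lag), min(n, n - lag)):
            acc += left[i] * right[i + lag]
        best = max(best, abs(acc) / norm)
    return best


rng = random.Random(1)
fs = 8000  # a low rate keeps the pure-Python loops fast
noise = [rng.gauss(0.0, 1.0) for _ in range(2000)]
delayed = [0.0] * 4 + noise[:-4]  # the same signal, 4 samples (0.5 ms) later
independent = [rng.gauss(0.0, 1.0) for _ in range(2000)]

coherent_iacc = iacc(noise, delayed, fs)     # shifted copy: close to 1
diffuse_iacc = iacc(noise, independent, fs)  # unrelated noise: close to 0
print(f"coherent pair: {coherent_iacc:.3f}, independent pair: {diffuse_iacc:.3f}")
```

The shifted copy scores near $1$ because one lag inside the search range realigns the two signals exactly; the independent pair has no such lag, so its peak stays near zero.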

Inter-channel coherence (ICC)

When we work with loudspeaker channels rather than ears, the analogous metric is the inter-channel coherence, ICC (also called inter-channel cross-correlation, ICCC, or simply the channel correlation). It measures how similar two channel signals are. A correlated stereo pair (ICC near $1$ ) collapses to a phantom centre and a narrow image; a decorrelated pair (ICC near $0$ ) spreads into a wide, diffuse field with no strong phantom. ICC is the engineering target a spatializer actually manipulates, because it can set the channel signals directly, whereas IACC is what the listener ultimately experiences after the channels mix acoustically at the ears. The relationship between the two is geometry-dependent: a given ICC between two front loudspeakers produces a particular IACC at the listening position depending on speaker angle, room, and head position.

ICC is central to parametric spatial audio coding (MPEG Surround, Spatial Audio Object Coding and their descendants), where a stereo or multichannel image is transmitted as a downmix plus a small set of spatial parameters — inter-channel level difference (ICLD), inter-channel time difference (ICTD), and inter-channel coherence (ICC). That ICC parameter, recovered at the decoder by synthesizing decorrelated signal, is exactly the "diffuseness" control that restores width and envelopment. It is a direct industrial confirmation that coherence is the right variable to parameterize spatial impression.

A table relating coherence to percept

The following table summarizes how coherence (whether measured as IACC at the ears or ICC between channels) maps onto perception. Treat the boundaries as soft; perception is continuous.

Coherence (IACC or ICC)	$1 - \text{coherence}$	Source image	Spatial impression	Typical cause
$0.95$ to $1.0$	$0.0$ to $0.05$	Tight point, dead centre	None; "in your head" or pinned to one spot	Identical signal on both channels; mono fold
$0.8$ to $0.95$	$0.05$ to $0.2$	Compact, slightly rounded	Subtle width	Strong direct sound, weak early reflections
$0.5$ to $0.8$	$0.2$ to $0.5$	Broadened, present body	Clear ASW, source has size	Healthy early lateral reflections; mild decorrelation
$0.2$ to $0.5$	$0.5$ to $0.8$	Wide, edges soften	Strong width tending to envelopment	Strong decorrelated early + late field
$0.0$ to $0.2$	$0.8$ to $1.0$	No definable location	Full envelopment; "inside the sound"	Dense diffuse field, fully decorrelated

Working region

Two practical readings of this table. First, the most useful working region for a source is the middle — coherence around $0.5$ to $0.8$ — where the image is broad and alive but still has a centre you can point at. Second, full envelopment requires driving coherence almost to zero, which is why a believable diffuse field demands genuine, deep decorrelation and not a token amount.

The relationship to binaural cues and to stereo encoding

Coherence is not a new mechanism bolted onto hearing; it is the statistical face of the very same ITD/ILD machinery covered in the psychoacoustics chapter. Tying the two together makes the engineering choices obvious rather than arbitrary.

Why incoherence defeats localization

Recall that the brain localizes by finding a consistent ITD and ILD. With coherent ear signals there is one delay that aligns them and one level ratio between them, so the cue is single-valued and the image is sharp. Feed the ears incoherent signals and the situation changes qualitatively: the best-aligning delay now wanders from instant to instant because the random fine structure keeps shifting, and the level ratio fluctuates too. The localization estimator receives a smear of momentary ITDs and ILDs spanning a range of directions. The brain's response to "many simultaneous directions" is not to pick one but to report spatial extent — width if the smear is moderate and early, envelopment if it is broad, late and all-around. So decorrelation does not add a width cue; it removes the consistency that would have produced a point, and width is what is left.

This also explains the fluctuation quality of spaciousness. Griesinger emphasized that envelopment is associated not with steady decorrelation alone but with continuously fluctuating interaural differences — the ear signals must keep changing their relationship over time. A static phase offset between two channels gives a fixed, if odd, image; a time-varying decorrelation gives the living, breathing sense of a real diffuse field. Good decorrelators are therefore designed to produce time-varying, frequency-dependent interaural differences, not a single frozen phase twist.

Width and diffuseness are already encoded in stereo

The companion chapter, "stereo is already spatial", makes the point that a stereo signal carries far more than left/right balance; it carries width and diffuseness in its correlation structure. We can now state that precisely. The sum (mid, $L+R$) signal is the coherent, common part: it survives a mono fold and it localizes. The difference (side, $L-R$) signal is the incoherent part: it vanishes in mono and it carries the width and the diffuse field. The ratio of side energy to mid energy is, in effect, a measure of $(1 - \mathrm{ICC})$: for channels of equal level it equals $\frac{1 - \mathrm{ICC}}{1 + \mathrm{ICC}}$, so more side energy means lower correlation means wider, more diffuse, more enveloping.

This is why mid/side processing is the natural toolkit for spatial impression, and why a recording engineer who widens a mix is really lowering its inter-channel coherence. It is also why correlation meters and goniometers live on every mastering desk: they are coherence meters. A goniometer that collapses to a vertical line shows ICC near $1$ (mono, narrow); a goniometer that fills a circular cloud shows ICC near $0$ (decorrelated, wide). Everything in this chapter is visible on that little display. The spatializer's diffuse-field engineering is, from the stereo point of view, simply the deliberate generation of side energy that is incoherent with the mid and across frequency.
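The link between side energy and correlation can be checked numerically. The sketch below assumes equal-level channels and uses zero-lag correlation as the ICC; the signal model (a shared component plus independent noise per channel) and all names are illustrative assumptions.

```python
import math
import random


def energy(x):
    return sum(v * v for v in x)


def correlation(x, y):
    """Zero-lag normalized correlation between two channels."""
    return sum(a * b for a, b in zip(x, y)) / math.sqrt(energy(x) * energy(y))


def ms_correlation(left, right):
    """Correlation estimated from mid/side energies: (E_mid - E_side) / (E_mid + E_side).

    Matches the zero-lag correlation when both channels carry equal energy.
    """
    e_mid = energy([l + r for l, r in zip(left, right)])
    e_side = energy([l - r for l, r in zip(left, right)])
    return (e_mid - e_side) / (e_mid + e_side)


rng = random.Random(3)
n = 5000
common = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Each channel = shared part + its own independent part of equal power,
# which gives an expected correlation of about 0.5.
left = [c + rng.gauss(0.0, 1.0) for c in common]
right = [c + rng.gauss(0.0, 1.0) for c in common]

direct_rho = correlation(left, right)
ms_rho = ms_correlation(left, right)
mono_rho = ms_correlation(common, common)  # no side energy at all
print(f"direct: {direct_rho:.3f}, from mid/side: {ms_rho:.3f}, mono: {mono_rho:.3f}")
```

The mid/side estimate tracks the directly computed correlation closely, which is why a correlation meter and a mid/side energy comparison tell the same story about width.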

Decorrelation as an engineering tool

If coherence is the master control of spatial impression, then decorrelation — taking one signal and producing two or more output signals that sound the same but are mutually incoherent — is the master tool. A good decorrelator must satisfy three demands at once, and the tension between them is what makes the design interesting. The outputs must be (1) mutually incoherent across the relevant frequency range, (2) timbrally identical to the input, so no audible coloration is introduced, and (3) free of any single audible delay or echo. Meeting all three is genuinely hard, and several techniques exist.

All-pass networks

The workhorse decorrelator is the all-pass filter network. An all-pass filter passes every frequency at unity magnitude — it changes nothing about the spectrum, hence "all-pass" — but it imposes a frequency-dependent phase shift and therefore a frequency-dependent group delay. Two signals derived from the same source through two different all-pass networks have identical magnitude spectra (no coloration) but different phase relationships at every frequency, so their cross-correlation drops well below $1$ while their timbre is preserved. This is exactly what we want: same sound, different waveform, low coherence.

Practical decorrelators cascade many all-pass sections, often with randomized or mutually orthogonal coefficient sets per output channel, so that the phase trajectories of different outputs diverge as much as possible. Velvet-noise-based decorrelators, which convolve the signal with a sparse, randomly signed pulse sequence, are a modern, efficient variant that achieves a nearly flat magnitude response (low coloration) with a very short impulse response (low echo), giving strong decorrelation cheaply. The DAM Audio High Space Resolution research, DAM HSR, is built around exactly this concern: distributing a source across many drivers while preserving the coherence relationships that the ear needs (deep decorrelation where the field should be diffuse, tight coherence where the image should localize) rather than letting the two states bleed into each other.
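The all-pass principle can be demonstrated with a toy two-channel decorrelator. This is a minimal sketch under stated assumptions, not a production design and not the DAM HSR method: it uses four first-order all-pass sections per channel with hand-picked coefficients ($+0.6$ for one channel, $-0.6$ for the other), white noise as the test source, and zero-lag correlation as the coherence measure.

```python
import math
import random


def allpass_section(x, a):
    """First-order all-pass, H(z) = (-a + z^-1) / (1 - a z^-1), with |a| < 1."""
    y = []
    x_prev = 0.0
    y_prev = 0.0
    for xn in x:
        yn = -a * xn + x_prev + a * y_prev
        y.append(yn)
        x_prev, y_prev = xn, yn
    return y


def allpass_cascade(x, coeffs):
    """Run the signal through one all-pass section per coefficient."""
    for a in coeffs:
        x = allpass_section(x, a)
    return x


def energy(x):
    return sum(v * v for v in x)


def correlation(x, y):
    """Zero-lag normalized correlation."""
    return sum(a * b for a, b in zip(x, y)) / math.sqrt(energy(x) * energy(y))


rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(4096)]

# Illustrative coefficient sets; a real design would use more sections
# and carefully chosen (often randomized) coefficients per channel.
left = allpass_cascade(source, [0.6, 0.6, 0.6, 0.6])
right = allpass_cascade(source, [-0.6, -0.6, -0.6, -0.6])

rho = correlation(left, right)
energy_ratio = energy(left) / energy(source)
print(f"inter-channel correlation: {rho:.3f}")
print(f"output/input energy ratio: {energy_ratio:.4f}")
```

The energy ratio stays essentially at unity (only the filter tail beyond the last sample is lost), while the correlation between the two outputs drops close to zero: same spectrum, different waveform.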

Short randomized delays

A cruder but intuitive approach is to feed each output a copy of the source delayed by a different short, random amount — a few milliseconds, varied per channel and ideally varied per frequency band. Different delays per ear or per channel produce different ITD/ICC, lowering coherence. The danger is obvious and central to the next section: a single fixed delay between two correlated copies is precisely the recipe for comb filtering. So practical delay-based decorrelation must use many short delays, randomized and frequency-dependent, so that the comb notches of one band fall on the comb peaks of another and average out to a flat perceived spectrum. The boundary between "delay decorrelation" and "all-pass decorrelation" is blurry — a dense network of frequency-dependent micro-delays is an all-pass-like structure.

Frequency-dependent phase and band splitting

The most controllable decorrelators split the signal into frequency bands and apply independent phase randomization or independent delays per band, often with the amount of decorrelation varied across frequency. This matters because the ear's sensitivity to coherence is itself frequency-dependent (strongest in the low-mids, per the IACC discussion). It also matters because over-decorrelating the low bass smears the mono foundation of a mix and can cause phase cancellation on a mono fold, whereas the high bands tolerate and benefit from heavy decorrelation.

Rule of thumb

A well-designed diffuse-field generator therefore applies little decorrelation below, say, $200$ Hz (keep the bass coherent and centred) and progressively more above it. This frequency-dependent strategy is the difference between a diffuse field that sounds rich and one that sounds hollow and phasey.

Making many speakers carry a mutually incoherent field

To wrap a listener requires not two but many output signals (the ring or dome of an immersive system), and they must be mutually incoherent: every pair of channels uncorrelated, not just adjacent pairs. The standard construction is a bank of decorrelators, one per output, each driven by the same source but with a different (ideally orthogonal) phase/delay realization. With $N$ outputs you want all $\frac{N(N-1)}{2}$ channel pairs to have low ICC. All-pass networks with mutually orthogonal coefficient sets, or velvet-noise sequences seeded independently per channel, achieve this. The payoff is a diffuse field that the listener cannot localize from any seat, because energy arrives from all around with no coherent pair to form a phantom. This is exactly the LEV recipe from earlier, now constructed deliberately rather than borrowed from a real room.
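The pairwise requirement can be checked mechanically. The sketch below builds a bank of independently seeded sparse pulse sequences in the spirit of velvet noise and reports the worst correlation over all $\frac{N(N-1)}{2}$ pairs. The sizes ($8$ channels, $30$ pulses on a $48$-sample grid, about $30$ ms at $48$ kHz) are illustrative assumptions, and the decay envelope that practical velvet-noise decorrelators usually apply is omitted.

```python
import itertools
import math
import random


def velvet_sequence(num_pulses, grid, rng):
    """Sparse filter: one +1 or -1 pulse at a random position in each grid cell."""
    h = [0.0] * (num_pulses * grid)
    for m in range(num_pulses):
        pos = m * grid + rng.randrange(grid)
        h[pos] = rng.choice((-1.0, 1.0))
    return h


def correlation(x, y):
    """Zero-lag normalized correlation between two impulse responses.

    For a white input this equals the zero-lag correlation of the
    corresponding filtered outputs.
    """
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


num_channels = 8
bank = [velvet_sequence(30, 48, random.Random(seed)) for seed in range(num_channels)]

pairs = list(itertools.combinations(range(num_channels), 2))
worst = max(abs(correlation(bank[i], bank[j])) for i, j in pairs)
print(f"{len(pairs)} channel pairs, worst |correlation| = {worst:.3f}")
```

Checking every pair, rather than only neighbours, is the point: a single coherent pair anywhere in the ring is enough to form a phantom image.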

The comb-filtering trap

Common mistake

The most common and most damaging mistake in building a diffuse field is the one that seems most obvious: take the source and send the same signal to several adjacent loudspeakers to "spread it out". This does not spread the sound. It wrecks it, through comb filtering and image collapse, and understanding why is essential.

Why duplicating a correlated signal combs

When two loudspeakers radiate the same (correlated) signal and a listener is not exactly equidistant from both, the signal from the farther speaker arrives slightly later. Two copies of one waveform separated by a delay $t$ add constructively at frequencies where $t$ is a whole number of periods and destructively where $t$ is an odd number of half-periods. The result is a comb filter: a spectrum stippled with deep regular notches at frequencies $f = \frac{2k+1}{2t}$ for $k = 0, 1, 2, \ldots$ and peaks at $f = \frac{k}{t}$. The spacing between notches is $\frac{1}{t}$ hertz. The sound takes on a hollow, metallic, "through a tube" coloration, and worse, the coloration changes as the listener moves, because $t$ changes with position. The diffuse field you wanted becomes a position-dependent tone control.
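The notch and peak positions follow from summing a signal with a copy of itself delayed by $t$. A short derivation, assuming both paths arrive at equal level (an idealization; unequal levels make the notches shallower without moving them):

$|H(f)| = \left| 1 + e^{-j 2\pi f t} \right| = 2 \left| \cos(\pi f t) \right|$

The magnitude vanishes when $\pi f t = (2k+1)\frac{\pi}{2}$, that is at $f = \frac{2k+1}{2t}$, and reaches its maximum of $2$ (a $+6$ dB peak) when $\pi f t = k\pi$, that is at $f = \frac{k}{t}$.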

There is a second, equally bad consequence: image collapse. Two correlated speakers do not produce a spread source; by the summing-localization mechanism behind amplitude panning (see amplitude panning) they produce a single phantom image somewhere between them. So duplicating a correlated signal across the ring gives you the worst of both worlds — a narrow phantom image plus moving comb coloration. It is the opposite of envelopment.

Worked example: the comb from a 1.5 ms arrival-time difference

Place a listener slightly off-centre between two loudspeakers carrying identical signal, such that the path lengths differ by $0.5$ m. Sound travels about $343$ m/s, so the time difference is $t = \frac{0.5}{343} \approx 1.46$ ms. The comb filter this produces has its first notch at $f = \frac{1}{2t} = 343$ Hz, with further notches at $1029$ Hz, $1715$ Hz, $2401$ Hz and so on, every $\frac{1}{t} = 686$ Hz. Because the notches are evenly spaced in hertz, they crowd together in octave terms as frequency rises: about an octave and a half separates the first two, but only about half an octave separates the third and fourth. The midrange and treble are riddled with them: gross, audible coloration. Now move the listener further off-centre, so that the path difference grows to $0.7$ m and $t \approx 2.04$ ms: the first notch drops to $245$ Hz and the spacing tightens to $490$ Hz. The whole comb pattern slides as the head moves, which is the unmistakable signature of correlated duplication and an immediate diagnostic when a system sounds "phasey".
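The worked example can be reproduced for any path difference with a few lines of Python. This is a sketch assuming two equal-level copies and a speed of sound of $343$ m/s; the function names are made up for this illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, the value used in the worked example


def comb_notches(path_difference_m, count=4, c=SPEED_OF_SOUND):
    """First `count` notch frequencies (Hz) for two equal-level copies."""
    t = path_difference_m / c
    return [(2 * k + 1) / (2.0 * t) for k in range(count)]


def comb_gain_db(freq_hz, path_difference_m, c=SPEED_OF_SOUND):
    """Gain of the summed pair relative to one copy alone, in dB."""
    t = path_difference_m / c
    magnitude = abs(2.0 * math.cos(math.pi * freq_hz * t))
    return 20.0 * math.log10(max(magnitude, 1e-12))  # clamp to avoid log(0)


for d in (0.5, 0.7):
    notches = [round(f) for f in comb_notches(d)]
    print(f"path difference {d} m: notches at {notches} Hz")

peak_gain = comb_gain_db(686.0, 0.5)  # 686 Hz is a comb peak for 0.5 m
print(f"gain at 686 Hz for 0.5 m: {peak_gain:.2f} dB")
```

Sweeping `path_difference_m` over the range of seats in a room gives a quick picture of how far the notches wander as the listener moves.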

How decorrelation escapes the trap

Now send each speaker a decorrelated copy instead of an identical one. Because the two signals no longer share a common waveform, there is no fixed phase relationship to comb: at each frequency the two contributions add with a random relative phase, so on average their energies sum (a smooth $+3$ dB summation) rather than alternately reinforcing and cancelling. The spectrum stays flat, the coloration vanishes, and — because there is no coherent pair — no phantom forms. The energy is heard as coming from a broad region or, with enough channels, from all around. This is the whole point of decorrelation, and it is why a believable diffuse field must be decorrelated, never duplicated. The contrast is so sharp it is worth stating as a rule: correlated copies comb and collapse; decorrelated copies spread and envelop.
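The difference between coherent and incoherent summation shows up directly in the broadband level. In the sketch below, independent Gaussian noise stands in for an ideally decorrelated copy (an assumption; a real decorrelator only approximates this), and the duplicated case is summed with no delay, so it sits on the $+6$ dB comb peak at every frequency.

```python
import math
import random


def level_gain_db(a, b):
    """Level of the sum (a + b) relative to a alone, in dB."""
    e_sum = sum((x + y) ** 2 for x, y in zip(a, b))
    e_one = sum(x * x for x in a)
    return 10.0 * math.log10(e_sum / e_one)


rng = random.Random(2)
n = 20000
a = [rng.gauss(0.0, 1.0) for _ in range(n)]
b = [rng.gauss(0.0, 1.0) for _ in range(n)]  # stand-in for a decorrelated copy

duplicated_db = level_gain_db(a, a)    # identical copies: pressures add
decorrelated_db = level_gain_db(a, b)  # independent copies: energies add
print(f"duplicated:   {duplicated_db:+.2f} dB")
print(f"decorrelated: {decorrelated_db:+.2f} dB")
```

With a delay between the duplicated copies, the first figure would swing between a deep notch and $+6$ dB depending on frequency; the decorrelated figure stays near $+3$ dB regardless, which is the flat summation described above.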

This is also where the DAM HSR approach earns its name. Spreading a source across a high-density array of drivers is only safe if the relationships between driver signals are managed — coherent where a tight image is wanted, decorrelated where a diffuse field is wanted. Letting correlated signal leak across adjacent drivers in a dense array would multiply the comb-filtering problem by the number of drivers. HSR's coherence preservation is precisely the discipline that keeps a high-resolution array from collapsing into a combed mess.

How the direct/diffuse ratio encodes distance, and how late reverb builds envelopment

Two of the most important uses of the direct/diffuse split are distance perception and envelopment. Both deserve a clear mechanical account, and both are developed further in neighbouring chapters.

The direct-to-reverberant ratio as a distance cue

In any reverberant space the direct sound obeys the inverse-distance law (its pressure falls as $\frac{1}{r}$, so its energy falls as $\frac{1}{r^2}$) while the reverberant (diffuse) energy is, to a first approximation, uniform throughout the room, set by the total power and the room's absorption rather than by where you stand. Therefore as a source moves away, the direct energy drops but the diffuse energy stays roughly constant, and the direct-to-reverberant ratio (DRR, sometimes D/R) falls. The auditory system uses this falling ratio as a robust distance cue, one that, unlike loudness, does not require knowing the source's intrinsic power. This is the mechanism explored in detail in distance and air; here we note only that it is the same direct/diffuse balance introduced at the top of this chapter, now read as distance.

The critical distance (or reverberation radius) is the distance at which direct and diffuse energy are equal, i.e. $\mathrm{DRR} = 0$ dB. Inside it the source reads as close and dry; beyond it, as distant and reverberant.
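Under the two idealizations just stated (direct energy falling as $\frac{1}{r^2}$, diffuse energy independent of position), the ratio can be written in closed form. The symbols $E_{\mathrm{dir}}$, $E_{\mathrm{diff}}$ and $r_c$ (the critical distance) are introduced here only for this derivation. Since $E_{\mathrm{dir}}(r_c) = E_{\mathrm{diff}}$ by definition, $E_{\mathrm{dir}}(r) = E_{\mathrm{diff}} \left(\frac{r_c}{r}\right)^2$, and therefore

$$
\mathrm{DRR}(r) = 10\log_{10}\frac{E_{\mathrm{dir}}(r)}{E_{\mathrm{diff}}} = 10\log_{10}\left(\frac{r_c}{r}\right)^2 = 20\log_{10}\frac{r_c}{r}\ \text{dB}.
$$

In this idealized model every doubling of distance lowers the DRR by about $6$ dB, and the sign of the DRR tells you which side of the critical distance the source is on.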

Worked example: distance from the direct/diffuse ratio

Take a room whose critical distance is $2$ m. At the critical distance, direct and diffuse energies are equal, $\mathrm{DRR} = 0$ dB. At $1$ m (half the critical distance) the direct energy is $\left(\frac{2}{1}\right)^2 = 4$ times its critical-distance value while the diffuse energy is unchanged, so $\mathrm{DRR} = 10\log_{10}(4) = +6$ dB — clearly close and dry. At $4$ m (twice the critical distance) the direct energy is $\left(\frac{2}{4}\right)^2 = 0.25$ of its critical value, so $\mathrm{DRR} = 10\log_{10}(0.25) = -6$ dB — clearly distant and reverberant. So moving from $1$ m to $4$ m, a factor of four in distance, swings the direct/diffuse ratio by $12$ dB. A spatializer that wants to portray that move convincingly must change the DRR by about that much — lowering the direct level relative to a constant diffuse send — not merely turn the source down, which would only change loudness and leave the source sounding close but quiet.
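The arithmetic of this example, and the difference between changing the DRR and merely turning the source down, can be sketched in a few lines. This is an illustrative model only: it assumes the idealized $\frac{1}{r}$ direct law and a constant diffuse send, and the function names `drr_db`, `distance_gains` and `drr_from_gains` are hypothetical, not taken from any tool.

```python
import math

def drr_db(distance_m, critical_distance_m):
    """Direct-to-reverberant ratio in dB for the idealized model:
    direct energy ~ 1/r^2, diffuse energy constant, equal at the critical distance."""
    return 20.0 * math.log10(critical_distance_m / distance_m)

def distance_gains(distance_m, critical_distance_m, diffuse_gain=1.0):
    """Linear amplitude gains (direct, diffuse) that portray a source at distance_m.
    Assumes the direct and diffuse paths carry equal power at unity gain."""
    direct_gain = diffuse_gain * critical_distance_m / distance_m
    return direct_gain, diffuse_gain

def drr_from_gains(direct_gain, diffuse_gain):
    """DRR in dB implied by a pair of linear amplitude gains."""
    return 20.0 * math.log10(direct_gain / diffuse_gain)

r_c = 2.0  # critical distance from the worked example, in metres
for r in (1.0, 2.0, 4.0):
    direct, diffuse = distance_gains(r, r_c)
    print(f"r = {r} m: DRR = {drr_db(r, r_c):+.1f} dB, "
          f"direct gain = {direct:.2f}, diffuse gain = {diffuse:.2f}")

# Turning the whole source down by 12 dB scales both paths equally,
# so the DRR (and hence the distance cue) does not move at all.
near_direct, near_diffuse = distance_gains(1.0, r_c)
fader = 10 ** (-12 / 20)
print(drr_from_gains(near_direct * fader, near_diffuse * fader))
```

The last line prints the same DRR as the untouched $1$ m source, which is the numerical form of "close but quiet".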

Late reverb is where envelopment is manufactured

The diffuse half of that ratio is supplied by reverberation, and specifically by the late reverberant tail. As established in the LEV section, envelopment requires late, decorrelated, all-directional energy. A reverberation algorithm built for spatial work therefore does two things beyond simply lengthening a decay: it generates multiple decorrelated output channels from the tail (so the late field is incoherent across the reproduction system), and it distributes those channels around the listener (so the late field arrives from all directions, with strong lateral content). A feedback delay network or scattering reverb naturally produces a dense, decorrelated late field; the spatial design is in how its many outputs are decorrelated from one another and routed to the surround/height channels. The early-reflection section of the same reverb, meanwhile, is routed and decorrelated to build ASW. One algorithm, two jobs, two time windows — the practical embodiment of the ASW/LEV distinction.

Managing the diffuse field across layouts

The same perceptual targets — controlled coherence for width, near-zero coherence and all-direction coverage for envelopment — must be met on systems ranging from two loudspeakers to a hundred-channel dome. The strategy changes with the number and placement of outputs.

Stereo

With only two front loudspeakers, true envelopment is impossible: there is no way to put physically decorrelated energy behind or beside the listener. What stereo can do is control the coherence between its two channels to manage width and a sense of spaciousness. Lowering inter-channel coherence (raising side energy) widens the image and, via crosstalk at the ears, produces a modest sense of spaciousness in which the field seems to extend beyond the speakers. The limits are real: the spaciousness is frontal, it collapses on a mono fold if the decorrelation is careless, and it cannot wrap the listener. The crosstalk-cancellation techniques covered in the transaural chapter push two speakers further, but even they cannot synthesize a genuinely all-around late field from two frontal sources. Stereo's diffuse field is a frontal illusion, well worth doing but inherently bounded.
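The link between side energy, inter-channel correlation and the mono fold can be made concrete with a mid/side width control. The sketch below is one common way to build such a control, not a description of any specific product; the test signal (a shared component plus independent per-channel noise) and the names `set_width`, `correlation` and `mono_fold_db` are assumptions made for illustration.

```python
import numpy as np

def set_width(left, right, width):
    """Scale the side (L-R) component; width=1 leaves the signal unchanged, 0 gives mono."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right)
    return mid + width * side, mid - width * side

def correlation(a, b):
    """Zero-lag normalized correlation, as a stereo correlation meter would show."""
    return float(np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b)))

def mono_fold_db(left, right):
    """Power of the mono sum (L+R)/2 relative to the average power of the two channels."""
    mono = 0.5 * (left + right)
    ref = 0.5 * (np.mean(left ** 2) + np.mean(right ** 2))
    return float(10 * np.log10(np.mean(mono ** 2) / ref))

rng = np.random.default_rng(1)
n = 100_000
common = rng.standard_normal(n)                 # shared (coherent) part
left = common + 0.5 * rng.standard_normal(n)    # plus independent content per channel
right = common + 0.5 * rng.standard_normal(n)

for width in (0.0, 1.0, 2.0):
    l, r = set_width(left, right, width)
    print(f"width {width}: correlation {correlation(l, r):.2f}, "
          f"mono fold {mono_fold_db(l, r):.2f} dB")
```

Raising the width lowers the correlation, and the mono sum keeps only the mid component, so the more of the stereo energy lives in the side signal, the more is lost on the fold.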

Surround

A 5.1 or 7.1 layout adds side and rear loudspeakers and so, for the first time, can deliver late energy from behind and beside the listener, which is the precondition for genuine LEV. The discipline becomes keeping the surround signals decorrelated, both from the fronts and from each other. The classic surround mistake is to feed the same reverb return to all surround channels: that is correlated duplication, and it combs and partially collapses exactly as described in the comb-filtering section, producing a "phantom rear" and a hollow tone instead of envelopment. The fix is a decorrelator bank: every surround channel gets its own mutually incoherent version of the diffuse field. Done right, a surround system produces strong, stable envelopment; done with correlated sends, it produces a small, combed, unstable rear image despite having all the speakers it needs. See the surround and matrix chapter for the channel mechanics.
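A decorrelator bank can be sketched as one random-phase, flat-magnitude FIR filter per output. This is only one simple construction, assumed here for illustration; practical designs more often use cascaded all-pass sections or velvet-noise sequences. The function names, the 1024-tap length and the four-channel count are hypothetical choices. Note that the magnitude is exactly flat only at the FFT bin frequencies, and that a filter this long smears transients, which is acceptable for a reverb return but not for a direct source.

```python
import numpy as np

def random_phase_allpass(n_taps, rng):
    """FIR filter with unit magnitude and random phase at every FFT bin frequency."""
    n_bins = n_taps // 2 + 1
    phase = rng.uniform(-np.pi, np.pi, n_bins)
    phase[0] = 0.0            # DC bin must be real
    if n_taps % 2 == 0:
        phase[-1] = 0.0       # Nyquist bin must be real for an even length
    return np.fft.irfft(np.exp(1j * phase), n=n_taps)

def decorrelator_bank(x, n_channels, n_taps=1024, seed=0):
    """Return n_channels filtered copies of x, each through its own random-phase filter."""
    rng = np.random.default_rng(seed)
    filters = [random_phase_allpass(n_taps, rng) for _ in range(n_channels)]
    return np.stack([np.convolve(x, h) for h in filters])

rng = np.random.default_rng(42)
reverb_return = rng.standard_normal(48_000)   # stand-in for a mono diffuse send

feeds = decorrelator_bank(reverb_return, n_channels=4)
corr = np.corrcoef(feeds)
max_off_diagonal = np.max(np.abs(corr - np.eye(len(corr))))
print("largest pairwise correlation between feeds:", round(float(max_off_diagonal), 3))
```

Feeding the unprocessed `reverb_return` to all four channels would give a pairwise correlation of exactly 1; after the bank, every pair sits close to zero while each feed keeps roughly the power of the original.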

Immersive and high-density layouts

Immersive layouts (7.1.4 and larger, ambisonic decodes, wavefield arrays) add height and many more outputs, and they make the mutual incoherence requirement acute. With a dozen or more diffuse-field outputs, every pair must be decorrelated, or some pairs will form coherent phantoms and comb. The construction is the decorrelator bank scaled up, with orthogonal phase/delay realizations per output and frequency-dependent decorrelation depth. Object-based and scene-based systems handle this in different places: in object-based audio the renderer must generate the decorrelated diffuse field at playout for whatever layout is present; in ambisonics the diffuse field can be encoded as a spatially-spread, low-order or decorrelated component and decoded to all speakers. Dense-array approaches such as wavefield synthesis and the DAM HSR philosophy take this furthest: many drivers, each carrying a precisely managed coherence relationship to its neighbours, so that the array can present a localizable image and a decorrelated diffuse field at the same time without the two interfering. The more drivers, the higher the spatial resolution available — but only if coherence is governed; ungoverned, more drivers means more comb filters.

The following table summarizes the diffuse-field situation per layout.

Layout	Outputs around listener	Genuine envelopment (LEV) possible	Main risk	Diffuse-field strategy
Stereo	2 frontal	No (frontal spaciousness only)	Mono-fold collapse; over-widening	Control inter-channel coherence; protect the bass; keep mono-compatible
Surround 5.1 / 7.1	5 to 7, horizontal	Yes (horizontal)	Correlated surround sends combing	Decorrelator bank; incoherent surround feeds
Immersive 7.1.4+	11+, with height	Yes, including vertical	Coherent pairs among many outputs	Mutually orthogonal decorrelators per output; frequency-dependent depth
High-density / HSR / WFS	Tens to hundreds	Yes, very stable	Comb filtering multiplied by driver count	Per-driver coherence management: tight where localizing, deep decorrelation where diffuse

How tools expose this, common mistakes, and limits

The actual parameters

Spatial-audio tools rarely use the words IACC or ICC on their front panels; they expose the same physics under friendlier names. Knowing the mapping lets you reason about what a knob really does.

Width / Size / Spread. Almost always a coherence control. Turning it up lowers the inter-channel coherence of the source's output (raises side energy, engages a decorrelator), widening the image. At the extreme it tips into diffuseness and the source loses its centre.
Diffusion / Diffuseness / Spaciousness. Usually controls the amount and decorrelation depth of the diffuse field relative to the direct sound — effectively the $(1 - \text{coherence})$ of the late/spread component. In reverbs it often also controls echo density.
Decorrelation / De-correlate / Mono-to-stereo. Direct exposure of the decorrelator: how aggressively the all-pass/velvet-noise network diverges the outputs. Often with a frequency or low-cut control to protect the bass.
Direct / Reverb balance, Dry/Wet, Distance. The direct-to-reverberant ratio, i.e. the distance cue. A "distance" macro typically lowers the dry level while holding the diffuse send roughly constant, as the worked example above prescribes.
Early / Late or ER/Tail balance. The ASW-versus-LEV split: early controls source width, late controls envelopment.
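Correlation meter / Goniometer. The measurement side: a live readout of inter-channel correlation, as a number on the meter and as a picture on the goniometer. Use it constantly; a reading that drifts toward zero or below is the earliest warning that a "wide" mix will not survive a mono fold.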

Common mistakes and pitfalls

Over-decorrelation smears localization. The most frequent error once an engineer discovers decorrelation is to apply too much, to too wide a band, on too many sources. Driving a source's coherence to near zero does not make it impressively wide; it dissolves its location entirely, so a sound that should sit at a definite place becomes an unanchored wash. Reserve near-zero coherence for the diffuse field; keep sources in the productive $0.5$ to $0.8$ region where they are broad but still localizable.

Coloration from shallow decorrelators. A poor decorrelator — too few all-pass sections, a single fixed delay, a non-flat velvet-noise spectrum — colors the sound, adding a phasey, metallic, or hollow quality. This is the comb-filtering trap in miniature, occurring inside the decorrelator itself. The cure is a deeper, denser, magnitude-flat network with randomized per-channel realizations.

Bass decorrelation and mono incompatibility. Decorrelating low frequencies smears the mix's foundation and, worse, can cause the bass to partially cancel when the mix is summed to mono (for broadcast, club systems, phone speakers). Always protect the bass: little or no decorrelation below roughly $150$ to $250$ Hz.
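Bass protection amounts to making the decorrelation depth a function of frequency. The sketch below is an offline illustration under simplifying assumptions: it randomizes phase over a whole buffer with one FFT (which smears the signal across the entire buffer and is circular), leaves everything below a cutoff untouched, and ramps the depth up over the following octave. A real-time design would more likely place short all-pass or velvet-noise filters behind a crossover. The names `bass_protected_decorrelate` and `band_correlation`, and the $200$ Hz cutoff, are choices made for this example.

```python
import numpy as np

def bass_protected_decorrelate(x, fs, cutoff_hz=200.0, seed=0):
    """Randomize phase above cutoff_hz only; depth ramps from 0 to 1 over one octave."""
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    depth = np.clip((freqs - cutoff_hz) / cutoff_hz, 0.0, 1.0)  # 0 below cutoff, 1 above 2x
    phase = depth * rng.uniform(-np.pi, np.pi, len(freqs))
    if len(x) % 2 == 0:
        phase[-1] = 0.0  # keep the Nyquist bin real
    return np.fft.irfft(spectrum * np.exp(1j * phase), n=len(x))

def band_correlation(a, b, fs, f_lo, f_hi):
    """Zero-lag normalized correlation of a and b restricted to [f_lo, f_hi)."""
    spec_a, spec_b = np.fft.rfft(a), np.fft.rfft(b)
    freqs = np.fft.rfftfreq(len(a), d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs < f_hi)
    num = np.real(np.sum(spec_a[band] * np.conj(spec_b[band])))
    den = np.sqrt(np.sum(np.abs(spec_a[band]) ** 2) * np.sum(np.abs(spec_b[band]) ** 2))
    return float(num / den)

fs = 48_000
rng = np.random.default_rng(3)
dry = rng.standard_normal(fs)                      # one second of test noise
wet = bass_protected_decorrelate(dry, fs, cutoff_hz=200.0)

low = band_correlation(dry, wet, fs, 0.0, 200.0)       # protected band
high = band_correlation(dry, wet, fs, 400.0, fs / 2)   # fully decorrelated band
print(f"correlation below 200 Hz: {low:.3f}, above 400 Hz: {high:.3f}")
```

Because the low band stays fully correlated with the original, summing `dry` and `wet` to mono leaves the bass intact while the upper band adds as incoherent energy.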

Correlated diffuse sends. Feeding the same reverb/diffuse signal to multiple surround or height channels — the comb-and-collapse error of the dedicated section above. Every diffuse output needs its own incoherent feed.

Confusing width with envelopment. Using one early/coherent control to chase both percepts. They need different time windows and different routing; treat them separately.

Limits

The techniques here have hard boundaries. Geometric: with too few outputs around the listener, genuine envelopment is physically unavailable; stereo can imply spaciousness but cannot wrap you, and no amount of decorrelation invents a rear loudspeaker. Perceptual: the relationship between a synthesized coherence and the perceived width or envelopment is non-linear and content-dependent; transients, tonal sounds, and noise respond differently, and the same ICC sounds wider on broadband content than on a pure tone. Spectral: decorrelation trades coherence against coloration; you can always push coherence lower, but past a point the timbre suffers, so there is a practical floor. Robustness: spatial impression is strongest at the sweet spot and degrades off-axis; in rooms, the listener's own room adds its (uncontrolled) diffuse field on top of the synthesized one, which can help envelopment but muddies precise control. Binaural: headphone delivery has its own ceiling, because decorrelation must be applied through the correct HRTF directions or the diffuse field collapses inside the head rather than surrounding the listener, a problem treated in the binaural chapter. None of these limits invalidates the approach; they define the envelope within which the direct/diffuse balance can be sculpted, and recognizing them is what separates a believable spatial field from an over-processed one.

Worked example: decorrelating a mono source across a surround ring

To pull the whole chapter together, follow a single mono source as we progressively decorrelate copies of it around an eight-channel horizontal ring (channels at $0, 45, 90, 135, 180, 225, 270, 315$ degrees), and watch where each perceptual transition occurs. Assume a listener at the centre and a competent decorrelator bank available.

Stage 0 — one channel, full coherence. Send the mono source to a single front channel ( $0$ degrees). Inter-channel coherence is undefined (one channel) and the image is a hard point at $0$ degrees. ASW is minimal, LEV zero. This is the pure direct-sound case from the opening of the chapter.

Stage 1 — two correlated front channels (the trap). Now naively duplicate the identical signal onto the two front channels at $315$ and $45$ degrees. ICC between them is $1.0$ . The result is not a wider source: summing localization fuses them into a single phantom at $0$ degrees, and because the listener is rarely perfectly equidistant, a path difference of even $1.46$ ms (our earlier number) stipples the spectrum with notches every $685$ Hz. The source is narrow and combed — the worst outcome, and the textbook demonstration of why duplication fails.

Stage 2 — two decorrelated front channels. Replace the duplicate with a decorrelated copy on the $45$ -degree channel, driving the front pair's ICC down to about $0.6$ . Now the comb vanishes (random-phase summation, flat spectrum) and the source broadens: it has a definite centre near $0$ degrees but a clear width, an ASW you could measure. We are in the productive middle of the coherence table. This is a wide source, not yet envelopment.

Stage 3 — four decorrelated channels, front and sides. Add decorrelated copies on the side channels ( $90$ and $270$ degrees), each incoherent from the fronts and from each other, all pairwise ICC around $0.4$ . Lateral energy is now arriving from the true sides with low coherence. The source's width grows until its edges become hard to place; the percept begins to shift from "a wide thing in front of me" toward "sound is starting to be around me". This is the transition point between ASW and LEV — the moment added lateral decorrelated energy stops broadening a frontal object and starts surrounding the listener. It occurs, roughly, when substantial decorrelated energy first arrives from at or behind $90$ degrees.

Stage 4 — eight decorrelated channels, full ring. Drive all eight channels with mutually decorrelated copies, every pairwise ICC near $0.1$ , energy arriving evenly from all around. There is now no coherent pair anywhere on the ring, so no phantom can form and no comb can establish itself. The source has no location; it is a diffuse field. The listener experiences full envelopment — the LEV target — being inside the sound rather than facing it. Critically, this required deep, mutual decorrelation across all eight outputs; had any pair stayed correlated, that pair would have pulled a localizable phantom out of the wash and, off-centre, combed.

Reading the transition. Notice the trajectory in coherence terms: ICC fell from $1.0$ (point/comb) through about $0.6$ (wide source) and $0.4$ (edge of envelopment) to about $0.1$ (full envelopment), exactly tracking the coherence-to-percept table. Notice too where comb filtering threatens: only at Stage 1, and anywhere later that a correlated pair sneaks back in. The entire art is to keep the diffuse copies mutually incoherent so that adding speakers adds envelopment instead of adding comb filters — which is precisely the coherence-preservation discipline at the heart of DAM HSR, now applied to a humble eight-channel ring. A spatializer that understands this does not "spread a source by sending it everywhere"; it decorrelates a source into a field, and in doing so reproduces the perceptual cue — incoherence — that makes the listener believe they are no longer outside the sound, but within it.
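The coherence trajectory of the four stages can be reproduced with a purely statistical model. The sketch below does not decorrelate a real source; it assumes Gaussian noise and builds each channel from a shared component plus a private one, mixed so that every pair has a chosen correlation. It serves only to show what the stated ICC values mean across the ring. The names `ring_feeds` and `mean_pairwise_icc` are hypothetical.

```python
import numpy as np

def ring_feeds(n_channels, target_icc, n_samples=200_000, seed=0):
    """Channels = sqrt(rho) * shared + sqrt(1 - rho) * private noise,
    which gives every pair a zero-lag correlation of rho (statistical model only)."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal(n_samples)
    private = rng.standard_normal((n_channels, n_samples))
    return np.sqrt(target_icc) * shared + np.sqrt(1.0 - target_icc) * private

def mean_pairwise_icc(feeds):
    """Average of all off-diagonal entries of the channel correlation matrix."""
    corr = np.corrcoef(feeds)
    off_diagonal = corr[~np.eye(len(corr), dtype=bool)]
    return float(off_diagonal.mean())

# (label, number of active channels, target pairwise ICC) for Stages 1 to 4.
stages = [
    ("Stage 1: two correlated fronts", 2, 1.0),
    ("Stage 2: two decorrelated fronts", 2, 0.6),
    ("Stage 3: fronts and sides", 4, 0.4),
    ("Stage 4: full ring", 8, 0.1),
]

measured = {}
for label, n_channels, target in stages:
    measured[label] = mean_pairwise_icc(ring_feeds(n_channels, target))
    print(f"{label}: target ICC {target:.1f}, measured {measured[label]:.2f}")
```

Mixing a shared and a private component in this way is also a convenient mental model for a width control: the shared part is what can still form a phantom image, the private part is what spreads.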

References

Kuttruff, H. Room Acoustics, 6th ed. CRC Press, 2016. (Diffuse field theory, reverberation, critical distance.)
Beranek, L. Concert Halls and Opera Houses: Music, Acoustics, and Architecture, 2nd ed. Springer, 2004. (Lateral energy, spatial impression, and listener envelopment in halls.)
Blauert, J. Spatial Hearing: The Psychophysics of Human Sound Localization, revised ed. MIT Press, 1997. (Binaural cues, interaural cross-correlation, and the perception of spatial extent.)
Griesinger, D. "The Psychoacoustics of Apparent Source Width, Spaciousness and Envelopment in Performance Spaces." Acta Acustica united with Acustica, 83, 1997. (Fluctuating interaural differences; ASW versus LEV.)
Jot, J.-M. "Real-time spatial processing of sounds for music, multimedia and interactive human-computer interfaces." Multimedia Systems, 7(1), 1999. (Feedback delay networks, decorrelation, and reverberation for spatial rendering.)
Rumsey, F. Spatial Audio. Focal Press, 2001. (Source width, envelopment, decorrelation and channel coherence in reproduction systems.)
Hidaka, T., Beranek, L., and Okano, T. "Interaural cross-correlation, lateral fraction, and low- and high-frequency sound levels as measures of acoustical quality in concert halls." Journal of the Acoustical Society of America, 98(2), 1995. (IACC and lateral fraction as objective spatial measures.)
ISO 3382-1:2009, Acoustics — Measurement of room acoustic parameters — Part 1: Performance spaces. (Standard definitions of IACC, early and late lateral energy fraction.)
Faller, C., and Baumgarte, F. "Binaural Cue Coding — Part II: Schemes and applications." IEEE Transactions on Speech and Audio Processing, 11(6), 2003. (Inter-channel coherence ICC as a transmitted spatial parameter and its decoder synthesis by decorrelation.)
