Aller au contenu principal

Reverberation & Room Models

A dry signal panned to an angle is a label, not a place. The ear accepts a virtual source as real only when the cues that always accompany a real source in a real space are present and mutually consistent. The single largest of those cues — larger than directivity, larger than air absorption, often larger than the angular pan itself — is reverberation: the dense, decaying cloud of reflected energy that every enclosed space wraps around every sound made inside it. This chapter is about that cloud: its structure in time, the physics that produces it, the numbers we use to describe it, the two great families of algorithms that synthesise it, the modern perceptual approach that controls it, and — the recurring theme of this entire guide — how a spatializer must do far more than feed one reverb tail to many loudspeakers if the result is to be believed.

Everything here connects back to the listening machinery described in psychoacoustics and forward to direct, diffuse & envelopment and distance & air. Reverberation is where the room stops being a backdrop and becomes part of the source.

Why reverberation is inseparable from spatial realism

The ear never hears the direct sound alone

Outside an anechoic chamber, you have never heard a sound without its room. From the moment you could hear, every voice, footstep, and instrument arrived as a direct wavefront plus hundreds of reflected copies. The auditory system is therefore not a naive level meter; it is a machine tuned by a lifetime of experience to decode rooms. It uses the reflected energy to estimate how far away a source is, how large the surrounding space is, what the walls are made of, and whether the scene is plausible at all. Strip the reverberation away and the brain does not hear "a clean source" — it hears something wrong, a source with no place to be.

This is why a perfectly panned dry source can still sound like it is glued to the loudspeaker rather than floating in the room. Amplitude panning (see amplitude panning) gets the direction right while leaving the distance and space undefined. The room information has to be added, and added in a way that agrees with the direction the panner asserts.

Reverberation as a bundle of cues, not one effect

It is tempting to treat reverb as a single "wetness" knob, but a room delivers at least four perceptually distinct pieces of information, each carried by a different part of the response:

  • Distance — carried mainly by the ratio of direct to reverberant energy.
  • Room size — carried by the time gap before the first reflections and by the decay time.
  • Source clarity / intelligibility — carried by how much energy arrives early versus late.
  • Apparent source width and listener envelopment — carried by the direction and correlation of the reflected energy at the two ears.

A spatializer that controls these four independently is doing real spatial work. One that only crossfades between dry and a stereo reverb plug-in is, at best, decorating. The rest of this chapter unpacks each cue and shows where in the room response it lives.

The structure of a room impulse response

Direct sound, early reflections, late tail

Fire an ideal impulse (a perfect click) from a source in a room and record the pressure at a listening point. The result is the room impulse response (RIR), and it has a universal three-part shape ordered in time:

  1. Direct sound. The first arrival, travelling in a straight line. Its delay is simply distance over the speed of sound: at c343c \approx 343 m/s, a source 55 m away arrives 5/34314.65 / 343 \approx 14.6 ms after the impulse. Its level falls with distance as p1/rp \propto 1/r (the inverse-distance law from distance & air). The direct sound carries the primary localisation cues — interaural time and level differences — because it arrives first and the precedence effect gives the first wavefront priority for "where".

  2. Early reflections. A sparse set of discrete, identifiable arrivals — one bounce off each major surface (floor, ceiling, side walls, front and back walls), then two-bounce combinations. They begin a few milliseconds after the direct sound and thin out over roughly the first 50–80 ms. They are deterministic: their timing and direction depend exactly on the room geometry and source/listener positions. Perceptually they are fused with the direct sound (precedence) but colour it, widen it, and tell the ear the room's size and the listener's position in it.

  3. Late reverberation. As reflections accumulate higher and higher orders, they become too numerous and too close together to resolve individually. The response collapses into a dense, noise-like, exponentially decaying tail. This is the diffuse field: statistically isotropic, decorrelated, and described not by individual echoes but by statistics — a decay rate and a spectrum. It carries envelopment and the sense of being inside a space.

The time scales and the perceptual division of labour

The transition times matter because the ear treats early and late energy differently. Reflections that arrive within roughly 50 ms of the direct sound are integrated with it by the auditory system: they raise loudness, improve clarity, and are not heard as separate echoes. This 50 ms figure is the basis of the D50 definition measure and underlies the C80 clarity measure (80 ms for music, which tolerates a longer fusion window). Energy arriving much later — beyond about 80–100 ms — that is also loud enough can be heard as a distinct echo or as muddiness.

RegionOnset (typical)DensityCharacterPrimary percept
Direct sound0 ms (after time-of-flight)single arrivaldeterministicdirection, distance
Early reflections~5–80 mssparse, discretedeterministic, geometry-dependentroom size, source width, clarity
Late reverberation~80 ms onwardvery dense, noise-likestatistical, exponential decayenvelopment, "inside-ness", decay

The boundary between early and late is not a hard wall; it is the mixing time, discussed below, and it depends on room size. The crucial design insight for spatial audio is that these regions should be synthesised by different machinery: early reflections by a geometric or directional model, late reverberation by a statistical one. We return to this in the section on spatial reverberation.

Early reflections in detail

The image-source idea

The cleanest way to compute early reflections is the image-source method, which dates to Allen and Berkley's 1979 model and has a beautifully simple physical basis. A specular reflection from a flat wall behaves exactly as if the sound came from a mirror image of the source, reflected across the plane of that wall, radiating through the wall as if it were not there. The path length of the reflection equals the straight-line distance from that image source to the listener.

Consider a source in a rectangular room. Reflecting it across each of the six surfaces produces six first-order image sources. Reflecting those across the surfaces produces second-order images (floor-then-wall, wall-then-ceiling, and so on), and the construction continues outward into an infinite three-dimensional lattice of images filling all of space. Each image, attenuated by the absorption coefficients of the surfaces its path crossed and by its own 1/r1/r distance law, contributes one reflection to the RIR. The method is exact for rectangular rooms with rigid flat walls and is the standard way real-time spatializers generate early reflections.

A numeric example: the first reflections in a small room

Take a room 6 m long, 4 m wide, 3 m high. Put a source and a listener both at 1.5 m height, the source at one end and the listener 3 m away along the centre line.

  • Direct path: 3.0 m, arriving at 3.0/343=8.73.0 / 343 = 8.7 ms.
  • Floor reflection: source and listener are both 1.5 m up, separated by 3 m horizontally. The image source is 1.5 m below the floor. Path length =3.02+3.02=18=4.24= \sqrt{3.0^2 + 3.0^2} = \sqrt{18} = 4.24 m, arriving at 4.24/343=12.44.24 / 343 = 12.4 ms — only 3.7 ms after the direct sound.
  • Ceiling reflection: by symmetry (ceiling is also 1.5 m away), identical 4.24 m, 12.4 ms.
  • Side-wall reflection: the listener is on the centre line, 2 m from each side wall. The image of the source across a side wall is 4 m to the side. Horizontal path =3.02+4.02=5.0= \sqrt{3.0^2 + 4.0^2} = 5.0 m, arriving at 5.0/343=14.65.0 / 343 = 14.6 ms, i.e. 5.9 ms after the direct.

So in this small room the entire first-order reflection set arrives within about 6 ms of the direct sound — well inside the 50 ms fusion window. The ear fuses them all into one event but reads the pattern (short delays, strong floor bounce) as "small room". Compare this with the hall example at the end of the chapter, where the same first reflections arrive tens of milliseconds later.

Lateral reflections, source width, and the floor/ceiling

Not all early reflections are perceptually equal, and this is the heart of why early reflections are a spatial tool, not just a tonal one.

Lateral (side-wall) reflections arrive from the side, so they reach the two ears with a large interaural time and level difference relative to the frontal direct sound. The auditory system, unable to separate them from the direct sound, instead broadens the perceived source. This is apparent source width (ASW), and it is driven specifically by early lateral energy. The descriptor for it, the lateral energy fraction (LF), is part of ISO 3382-style measurement and correlates strongly with the "spaciousness" that makes a concert hall sound good. A spatializer that wants a wide, enveloping source must therefore place early reflections to the sides, not pile them up frontally.

Floor and ceiling reflections arrive from above and below. Because the human head is roughly left–right symmetric, vertical reflections produce little interaural difference; they do not widen the source the way lateral reflections do. Instead they cause comb filtering of the direct sound — a periodic series of peaks and notches in the frequency response, spaced at 1/delay1 / \text{delay} Hz. The floor reflection in our example, delayed 3.7 ms, produces notches at 1/0.00372701 / 0.0037 \approx 270 Hz and odd multiples. This is the familiar tonal "floor bounce" colouration; it is why microphone and loudspeaker placement near reflective floors is fussy. Perceptually, the early floor reflection is mostly a colouration and a subtle distance/ground cue, not a width cue.

The design consequence: early reflections must be directional. Their value is not "some early energy" but "early energy arriving from specific directions". Feeding early reflections as a mono blob to all speakers throws away exactly the lateral/vertical distinction the ear cares about. We make this concrete in the spatial-reverberation section.

The late reverberant field

The statistical / diffuse field

After enough bounces, the reflection density grows faster than linearly — in fact the number of reflections arriving per second grows as the square of time in a three-dimensional room, because the radius of the expanding image-source lattice grows linearly and the surface area it sweeps grows as the square. Within a few tens of milliseconds in a normal room, reflections overlap so densely that no individual echo can be identified. At this point the field becomes diffuse: energy arrives from all directions with equal probability, the sound field is statistically homogeneous and isotropic, and the only meaningful description is statistical.

In an ideal diffuse field the two-ear signals are decorrelated above a certain frequency, which is precisely what produces listener envelopment (LEV) — the sensation of being surrounded by, and immersed in, sound rather than facing it.

Most common pitfall

This decorrelation is the single most important property to preserve when reproducing late reverb over many loudspeakers, and the single property most often destroyed by naive implementations.

Exponential decay

Once the field is diffuse, energy decays exponentially: each unit of time, a fixed fraction of the remaining energy is absorbed at the boundaries, so the energy follows E(t)=E0et/τE(t) = E_0 \, e^{-t / \tau} and the level in decibels falls linearly with time. This straight-line decay on a dB scale is the signature of a well-behaved room and the basis of every decay-time measurement. The slope of that line defines RT60 (next section). Real rooms often show a slightly curved or double-sloped decay — a faster early slope and a slower late slope — when the field is not perfectly diffuse or when coupled spaces are involved; this is itself a perceptual cue and a known modelling challenge.

Mixing time and Schroeder frequency

Two numbers mark the boundaries of the diffuse regime.

The mixing time is the instant at which the response stops being a set of discrete reflections and becomes a Gaussian-noise-like diffuse tail — the early/late boundary. A widely used estimate is tmixVt_{\text{mix}} \approx \sqrt{V} in milliseconds, with VV the room volume in cubic metres (Polack's relation). For a 1200012\,000 m³ concert hall, tmix12000110t_{\text{mix}} \approx \sqrt{12000} \approx 110 ms; for a 7272 m³ room (our small example, 6×4×36 \times 4 \times 3), tmix728.5t_{\text{mix}} \approx \sqrt{72} \approx 8.5 ms. Larger rooms take longer to "mix", which is why halls have a long, audibly evolving early section while small rooms jump almost immediately to a dense tail.

The Schroeder frequency marks the boundary in frequency between the regime where room modes (discrete standing-wave resonances) are individually resolvable and the regime where they overlap into a statistical continuum. Schroeder's relation is fS2000RT60/Vf_S \approx 2000 \sqrt{\text{RT60} / V} Hz, with RT60 in seconds and VV in cubic metres. Below fSf_S the room is "modal" — individual resonances dominate, and the response is sensitive to exact position; above it the statistical/diffuse description applies. For a living room of 5050 m³ with RT60=0.5\text{RT60} = 0.5 s, fS20000.5/50=2000×0.1=200f_S \approx 2000 \sqrt{0.5/50} = 2000 \times 0.1 = 200 Hz. For a 1200012\,000 m³ hall with RT60=2.0\text{RT60} = 2.0 s, fS20002/1200026f_S \approx 2000 \sqrt{2/12000} \approx 26 Hz — essentially the whole audible band is statistical, which is why concert-hall reverb is so smooth, while small rooms are dominated by audible modal colouration in the bass.

Room-acoustics descriptors and ISO 3382-1

The international standard ISO 3382-1 defines how to measure the acoustic parameters of performance spaces from a measured impulse response. The descriptors below are the working vocabulary of room acoustics and of any serious reverb design.

RT60 — reverberation time

RT60 is the time for the sound energy level to decay by 60 dB after the source stops — historically chosen because 60 dB was roughly the dynamic range between a loud orchestra and the noise floor of a quiet hall. In practice the full 60 dB is rarely clean, so ISO 3382 measures the slope over a smaller range (T20 over a 20 dB drop, T30 over 30 dB) and extrapolates to 60 dB. The decay curve itself is best obtained by Schroeder's backward integration of a measured impulse response, which yields a smooth decay function from a single measurement.

The Sabine equation (Wallace Clement Sabine, ~1898, the founding result of architectural acoustics) gives RT60 from room volume and total absorption:

RT60=0.161VA\text{RT60} = 0.161 \, \frac{V}{A}

where VV is volume in m³, AA is the total absorption in sabins (square metres of equivalent perfect absorber), A=SiαiA = \sum S_i \, \alpha_i over all surfaces of area SiS_i and absorption coefficient αi\alpha_i, and 0.1610.161 is a constant (24ln(10)/c24 \ln(10) / c) for SI units at room temperature.

The Eyring equation corrects Sabine for highly absorptive rooms, where Sabine wrongly predicts a non-zero RT60 even for a totally absorbing room (α=1\alpha = 1). Eyring replaces the absorption term with Sln(1αmean)-S \ln(1 - \alpha_{\text{mean}}):

RT60=0.161VSln(1αmean)\text{RT60} = 0.161 \, \frac{V}{-S \ln(1 - \alpha_{\text{mean}})}

where SS is the total surface area and αmean=A/S\alpha_{\text{mean}} = A / S is the average absorption coefficient. For small αmean\alpha_{\text{mean}}, ln(1αmean)αmean-\ln(1 - \alpha_{\text{mean}}) \approx \alpha_{\text{mean}} and Eyring reduces to Sabine; the two diverge as the room gets more absorptive.

Numeric example: RT60 of a room, both equations

Take a rehearsal room 8 m×6 m×4 m8 \text{ m} \times 6 \text{ m} \times 4 \text{ m}.

  • Volume: V=8×6×4=192V = 8 \times 6 \times 4 = 192 m³.
  • Surface area: two walls 8×48 \times 4 (=32= 32 each), two walls 6×46 \times 4 (=24= 24 each), floor and ceiling 8×68 \times 6 (=48= 48 each). S=2×32+2×24+2×48=64+48+96=208S = 2 \times 32 + 2 \times 24 + 2 \times 48 = 64 + 48 + 96 = 208 m².
  • Assume an average absorption coefficient αmean=0.15\alpha_{\text{mean}} = 0.15 (a moderately furnished room). Total absorption A=S×αmean=208×0.15=31.2A = S \times \alpha_{\text{mean}} = 208 \times 0.15 = 31.2 sabins.

Sabine: RT60=0.161×192/31.2=30.9/31.2=0.99\text{RT60} = 0.161 \times 192 / 31.2 = 30.9 / 31.2 = 0.99 s.

Eyring: ln(10.15)=ln(0.85)=0.1625-\ln(1 - 0.15) = -\ln(0.85) = 0.1625. Effective absorption =208×0.1625=33.8= 208 \times 0.1625 = 33.8 sabins. RT60=0.161×192/33.8=30.9/33.8=0.91\text{RT60} = 0.161 \times 192 / 33.8 = 30.9 / 33.8 = 0.91 s.

Eyring predicts about 8 percent shorter, the difference growing if we make the room more absorptive. Now add heavy acoustic treatment raising αmean\alpha_{\text{mean}} to 0.450.45: Sabine gives 0.161×192/(208×0.45)=30.9/93.6=0.330.161 \times 192 / (208 \times 0.45) = 30.9 / 93.6 = 0.33 s, while Eyring gives ln(0.55)=0.598-\ln(0.55) = 0.598, 0.161×192/(208×0.598)=30.9/124.4=0.250.161 \times 192 / (208 \times 0.598) = 30.9 / 124.4 = 0.25 s — now a 24 percent difference.

Rule of thumb

Eyring is the better choice for dead rooms; Sabine is adequate and simpler for live ones.

EDT — early decay time

EDT is the decay time derived from the slope of only the first 10 dB of decay, extrapolated to 60 dB. Because the early part of the decay is dominated by early reflections rather than the fully diffuse tail, EDT correlates much better with the perceived reverberance than RT60 does. A hall can have a long RT60 but a short EDT (strong early energy, rapid initial decay) and will be perceived as relatively "clear" and intimate.

Key takeaway

When designing or judging reverb, EDT is often the more honest number.

C80 (clarity) and D50 (definition)

These two descriptors quantify the early/late energy split that governs intelligibility and clarity.

C80 (clarity) is the ratio, in dB, of energy in the first 80 ms to energy after 80 ms:

C80=10log10(E(080ms)E(80msend))C80 = 10 \log_{10}\left( \frac{E(0\ldots80\,\text{ms})}{E(80\,\text{ms}\ldots\text{end})} \right)

Positive C80 means early energy dominates — clear, articulate, good for fast music. Negative C80 means the late tail dominates — reverberant, blended, potentially muddy. Concert-hall C80 typically runs from about 2-2 to +4+4 dB depending on repertoire.

D50 (definition / Deutlichkeit) is the fraction of energy in the first 50 ms:

D50=E(050ms)E(0end)D50 = \frac{E(0\ldots50\,\text{ms})}{E(0\ldots\text{end})}

expressed 0 to 1 (or as a percentage). D50 above about 0.50.5 (50 percent) is associated with good speech intelligibility. It is essentially the same early/late idea as C80 with a 50 ms boundary and a ratio-to-total rather than a dB-difference formulation.

Critical distance

The critical distance DcD_c is the distance from the source at which the direct sound energy equals the reverberant (diffuse) energy. Closer than DcD_c, the direct sound dominates and the source is clear and well localised; farther than DcD_c, the diffuse field dominates and the source becomes distant, blurred, and hard to localise. It is the physical anchor of the distance percept and is given by:

Dc0.057VRT60D_c \approx 0.057 \sqrt{\frac{V}{\text{RT60}}}

(in metres, VV in m³, RT60 in s, for an omnidirectional source)

For our rehearsal room (V=192V = 192, RT60=0.99\text{RT60} = 0.99 s): Dc0.057192/0.99=0.057194=0.057×13.9=0.79D_c \approx 0.057 \sqrt{192/0.99} = 0.057 \sqrt{194} = 0.057 \times 13.9 = 0.79 m. So in that room, anything beyond about 0.8 m is heard predominantly through its reverberant field — which is why even a small reverberant room makes a source one or two metres away sound clearly "in the room". A directive source (see source directivity) projects more direct energy forward and effectively increases DcD_c along its axis by a factor of Q\sqrt{Q}, where QQ is the directivity factor. This is the link between directivity and distance perception, and it is why a trumpet pointed at you sounds closer than the same trumpet pointed away.

DescriptorWhat it measuresTime/quantityGoverns
RT60 (T20, T30)full decay ratetime for 60-60 dBroom size, "liveness"
EDTearly decay rateslope of first 10-10 dBperceived reverberance
C80early/late energydB ratio, 80 ms splitclarity of music
D50early energy fraction0–1, 50 ms splitspeech definition
LFlateral early energyfractionapparent source width
DcD_cdirect = diffusedistance (m)distance perception

Convolution reverb

The principle

A room is, to a good approximation, a linear time-invariant (LTI) system: the impulse response captures everything the room does to any signal passing through it. By the convolution theorem, the room's output for an arbitrary input is the convolution of that input with the room impulse response. So if you measure the RIR of a real cathedral and convolve a dry signal with it, you get exactly what that dry signal would have sounded like played in that cathedral, at that source position, heard at that microphone position. This is convolution reverb, and it is the most realistic reverberation technology available, because it reproduces a real room rather than approximating one.

Capturing impulse responses

You cannot literally fire a perfect impulse — a real impulsive source (a balloon pop, a starter pistol) has limited and uneven energy. Two robust measurement methods are standard:

  • Sine sweep (exponential / logarithmic chirp). Play a slow sweep from low to high frequency through the space, record it, and deconvolve (using the known inverse sweep) to recover the impulse response. The swept-sine method, developed by Farina and others, gives an excellent signal-to-noise ratio and — crucially — pushes harmonic distortion of the loudspeaker out of the recovered linear impulse response, separating the distortion products into a time region that can be discarded. It is the de facto standard for high-quality IR capture.
  • MLS (maximum-length sequence). Excite the room with a pseudo-random binary sequence (deterministic white-noise-like signal) and cross-correlate the recording with the known sequence to recover the impulse response. MLS handles continuous noise well and was historically popular, but it is more sensitive to time variance and to loudspeaker non-linearity than the swept-sine method.

For spatial work, IRs are captured not with one microphone but with a multi-channel array — a stereo pair, a surround array, or an Ambisonic microphone (see Ambisonics) — so the directional structure of the early reflections and the diffuse field is preserved. A single mono IR convolved into many speakers is precisely the mistake the spatial-reverb section warns against.

Pros, cons, and CPU

Pros: unmatched realism for the captured condition; the room "is" the recording; no parameter tuning needed to sound authentic; ideal for matching a virtual source to a specific real space (post-production, virtual acoustics of a known hall).

Cons: the IR is fixed. It captures one source position, one listener position, one room state. You cannot move the source within the room, change the room size, or adjust the decay time without re-measuring or resorting to interpolation between many IRs. Modifying a captured IR (stretching its decay, EQ-ing the tail) quickly breaks the realism that justified using it. And convolution with a long IR is expensive: a 3-second stereo IR at 48 kHz is  ⁣144000\sim\!144\,000 taps per channel; direct-form convolution is hopeless, so implementations use partitioned (block) FFT convolution to keep latency low while amortising cost. CPU and memory scale with IR length times the number of input/output channel pairs — for a full object-based mix with per-object spatial IRs, this becomes the dominant cost. This fixedness is exactly why interactive and spatial systems often prefer the algorithmic and perceptual approaches below.

Algorithmic reverb

Schroeder's combs and allpasses

The founding architecture of artificial reverberation is Manfred Schroeder's, from the early 1960s at Bell Labs. Schroeder's insight was that a believable reverberant decay needs two properties — a dense echo pattern in time and a flat (uncoloured) frequency response — and that these can be built from two elementary recirculating filters:

  • A comb filter: a delay line of length DD fed back on itself with gain gg, producing a decaying train of echoes spaced DD apart. The decay time of one comb is set by DD and gg: the loop loses 20log10(g)20\log_{10}(g) dB every DD seconds, so RT60=D60/(20log10(g))=3D/(log10(g))\text{RT60} = D \cdot 60 / (-20\log_{10}(g)) = 3 D / (-\log_{10}(g)). A single comb is coloured — its frequency response is a series of regular peaks (the comb's teeth).
  • An allpass filter: a delay with a feedforward/feedback pair arranged so the magnitude response is flat at all frequencies while the phase (and hence the time/echo structure) is dispersed. Allpasses add echo density without adding colouration.

Schroeder's classic reverberator runs several parallel combs (with mutually prime delay lengths, so their teeth do not align and the combined response is smoother) into a series of allpasses (which multiply the echo density). With four combs and two allpasses he produced the first convincing artificial reverb. Moorer later improved it by adding a tapped early-reflection section and frequency-dependent feedback (a low-pass in the comb loop) so that high frequencies decay faster than lows — mimicking air and surface absorption, the natural behaviour every real room exhibits.

Feedback Delay Networks (FDN)

The modern generalisation, developed by Stautner and Puckette and refined extensively by Jot and Chaigne, is the Feedback Delay Network. An FDN is a bank of NN delay lines whose outputs are recombined by an N×NN \times N feedback matrix and fed back to the inputs. The structure generalises the parallel-comb idea: instead of NN independent loops, the feedback matrix mixes the delay outputs so that energy from each delay is scattered into all the others on every pass. This is what produces a genuinely dense, decorrelated, noise-like tail from a modest number of delays.

Three design choices govern an FDN:

  1. Delay lengths. Chosen mutually prime and spread over a range so the modal density is high and no periodicity is audible. The total of the delay lengths roughly sets the echo build-up time.
  2. The feedback matrix. Must be lossless (unitary / orthogonal) so that, on its own, it neither adds nor removes energy — it only scatters it. A unitary matrix guarantees stability and uniform mixing. Common choices are Hadamard or Householder matrices, which mix all delays into all others maximally.
  3. Attenuation / damping filters. Per-delay gains (and per-delay low-pass filters) set the decay time. Jot's key contribution was to make the decay frequency-dependent and exactly specifiable: by inserting an "absorptive" filter on each delay whose gain implements the desired RT60(f)\text{RT60}(f) curve for that delay's length, you can dial in any decay-time-versus-frequency profile independently of the network's mixing structure. This decouples what the decay does (a perceptual target) from how the density is built (the network) — the conceptual bridge to the perceptual approach in the next section.

Numeric example: tuning a comb/FDN delay for a target RT60

Suppose one delay line in an FDN is D=50D = 50 ms and we want this loop to contribute an RT60=1.5\text{RT60} = 1.5 s. The required feedback gain is found from RT60=3D/(log10(g))\text{RT60} = 3 D / (-\log_{10}(g)), so log10(g)=3D/RT60=3×0.050/1.5=0.10-\log_{10}(g) = 3 D / \text{RT60} = 3 \times 0.050 / 1.5 = 0.10, giving g=100.10=0.794g = 10^{-0.10} = 0.794. Each pass loses about 2 dB and the energy drops 60 dB after 30 loops, i.e. 30×50 ms=1.530 \times 50 \text{ ms} = 1.5 s, as intended. To make highs decay faster — say RT60=0.75\text{RT60} = 0.75 s above a crossover — we insert a low-pass whose high-frequency gain is ghf=103×0.050/0.75=100.2=0.631g_{hf} = 10^{-3 \times 0.050/0.75} = 10^{-0.2} = 0.631 per pass. This per-band gain shaping is exactly how Jot's absorbent filters produce a natural, air-like high-frequency roll-off in the tail (connect to distance & air, where air absorption does the same thing physically).

Pros and cons

Pros: cheap relative to long convolutions; fully parametric (change RT60, size, damping in real time); naturally produces decorrelated multichannel outputs (one tap per output channel from different points in the network); ideal for interactive and game audio where the room must change. Cons: harder to make sound like a specific real room; poorly tuned FDNs ring, flutter, or sound metallic (delay lengths too short or too few, matrix not properly mixing); the early-reflection section usually has to be built separately (a tapped delay or image-source model) because FDNs model the diffuse tail, not the geometry-specific early arrivals.

The perceptual room-model approach

Controlling reverb by what you hear, not by filters

Both convolution and raw algorithmic reverb expose the wrong controls for a creator who is thinking spatially. A convolution reverb offers you a fixed IR; an FDN offers you delay lengths, matrix coefficients, and feedback gains. Neither corresponds to anything a listener perceives. The major modern idea — developed by Jean-Marc Jot and colleagues at IRCAM in the Spatialisateur ("Spat") project in the 1990s, building on Griesinger's perceptual studies — is to invert this: expose perceptual factors as the controls and have the engine compute the underlying delays, gains, and filters automatically.

Jot's model rests on a statistical decomposition of the room response into four temporal/energetic segments — direct sound, early reflections, a "cluster" of late early reflections, and the reverberant tail — and a set of perceptual factors derived from psychoacoustic experiments, each mapped to a measurable acoustic quantity. The factors are designed to be mutually independent (orthogonal), so that adjusting one does not unintentionally change another. The principal factors are:

  • Source presence — the energy and prominence of the direct sound (intimacy, "in your face" versus distant). Tied to direct-sound level and direct/early ratio; a primary distance control.
  • Room presence / warmth of the room — the energy of the early room response relative to the direct sound; how strongly the room "answers" the source.
  • Running reverberance — the perceived liveliness during continuous sound, governed by the early decay time (EDT), not the full RT60. This is what you hear while music is playing, as opposed to in the gaps.
  • Late reverberance / heaviness — liveness — the length and weight of the tail in the gaps between sounds, governed by the late RT60\text{RT60} and its frequency balance.
  • Envelopment — the energy and decorrelation of the late, lateral reverberant field; how much the listener feels surrounded. Tied directly to direct-diffuse-and-envelopment.
  • Warmth — the ratio of low-frequency to mid-frequency reverberation time (bass richness of the decay).
  • Brilliance — the ratio of high-frequency to mid-frequency reverberation time (brightness and air of the decay).

Why this is powerful for spatial work

Three reasons this approach matters so much for spatial audio.

First, the controls match intent. A sound designer wants "this voice closer and the hall more enveloping", not "shorten delay tap 3 by 7 ms and rotate the feedback matrix". Mapping perceptual factors to engine parameters lets the creator work in the language of perception — exactly the language this guide has used since psychoacoustics.

Second, the factors are independent, so a mix is editable: you can increase envelopment without changing clarity, or add warmth without changing distance. Raw IRs and naive algorithmic reverbs couple these together, so every adjustment is a compromise.

Third — and this is the spatial payoff — the model separates the directional early reflections from the diffuse late field by construction. Because the engine builds the early reflections geometrically/directionally and the tail statistically, it can render them to different spatial layouts: early reflections panned by direction, late reverb decorrelated across the whole array. This is the architecture that DAM Audio's VRA (parametric spatial reverb) adopts — controlling a spatial reverberation field through perceptual, room-level parameters rather than per-tap filter coefficients, so that the same scene can be auditioned and rendered across formats while keeping the perceptual factors intact. See the VRA research page: https://dam-audio.com/research/vra-parametric-spatial-reverb .

The perceptual approach is, in short, the reconciliation of the two engines: convolution gives realism but no control, FDNs give control but not perceptual control. A perceptual room model gives perceptual control over an FDN-style engine, with measured-room data used to set the targets when realism is required.

Spatial reverberation

This section is the crux of the whole chapter and the place where the recurring theme of the guide is sharpest: a spatializer does not place a source — it reproduces the cues. For reverberation, that means reproducing the directional and correlation structure of the room response, not merely its level and decay.

Early reflections placed by direction

As established above, the perceptual value of an early reflection is bound to its direction of arrival: lateral reflections widen the source (ASW), vertical ones colour it, frontal ones reinforce localisation. A spatial reverberator must therefore render each early reflection as a positioned virtual source — panned (by amplitude panning, Ambisonics, or binaural HRTF, depending on the system) to the angle from which the corresponding image source would arrive. Done this way, the early reflections genuinely widen and place the source in a room. Done as an undifferentiated early "wash" sent equally to all outputs, they collapse to the centre and the spatial benefit is lost.

Late reverb decorrelated across the array

The late tail must be diffuse, and diffuse means decorrelated between every pair of reproduction channels. The physical diffuse field has decorrelated signals at the two ears above the Schroeder/binaural frequency; to reproduce that, every loudspeaker (or every Ambisonic component, or the two binaural channels) must carry a statistically independent version of the tail. An FDN provides this naturally by tapping different internal nodes for each output; a convolution approach provides it by using a genuinely multichannel IR with decorrelated channels; a perceptual engine provides it by design.

Why a mono tail fed to many speakers is wrong

Here is the failure mode in physical terms. Take one mono reverb tail and send the same signal to several loudspeakers. At the listening position, the copies arrive with slightly different path delays, and identical signals at different delays sum as a comb filter — a dense series of frequency-dependent reinforcements and cancellations. Instead of a smooth, enveloping diffuse field you get a coloured, position-sensitive, collapsed sound that pulls toward the nearest speaker and changes timbre as you move your head. The decorrelation that produces envelopment is exactly what was thrown away.

This is the multichannel generalisation of a point made in stereo is already spatial: correlation between channels controls width and image, and decorrelation is what creates space. Two channels of identical reverb collapse to a point in the middle; two channels of decorrelated reverb open into a wide, enveloping field. Scale that to a surround or 3D array and the principle is the same and more important: the late tail must be decorrelated across all outputs, or envelopment dies and comb-colouration appears. The relationship between interaural cross-correlation (IACC), decorrelation, and envelopment is developed fully in direct, diffuse & envelopment; the takeaway here is that spatial reverb is defined precisely by getting the correlation structure right, not just the decay time.

Reverberation as the dominant distance cue

Why the direct/reverberant ratio beats loudness

The most reliable distance cue in a room is the ratio of direct to reverberant energy (D/R), and the reason is elegant. As a source moves away, the direct sound obeys the inverse-distance law and drops about 6 dB per doubling of distance (see distance & air). But the reverberant field is, to first order, roughly uniform throughout the room — the diffuse energy depends on the room's total absorption and the source power, not on where in the room you stand. So as the source recedes, the direct sound falls while the reverberant level stays put, and the D/R ratio falls steadily with distance. The auditory system reads this falling ratio directly as increasing distance — and it works even when overall loudness is ambiguous, because absolute level depends on the source's effort (a whisper versus a shout) while the ratio does not. This robustness is why reverberation, not loudness, is the dominant distance cue indoors.

Critical distance as the natural unit of indoor distance

The critical distance DcD_c calculated earlier is the natural yardstick: at DcD_c, D/R is 0 dB (equal energy); at half DcD_c, the direct dominates by  ⁣6\sim\!6 dB; at twice DcD_c, the reverberant dominates by  ⁣6\sim\!6 dB. A spatializer rendering distance must therefore manipulate D/R relative to the modelled room's critical distance, not merely turn down the dry level.

Common mistake

Lowering only the dry signal without raising or holding the reverberant level moves the source "away" and makes the whole scene quieter — wrong. Holding reverberant energy roughly constant while attenuating direct sound (and adding pre-delay and high-frequency air loss) is the correct recipe, and it is exactly the recipe the perceptual "source presence" factor encodes.

Note also the directivity coupling from earlier: a more directive source has a larger on-axis DcD_c, so the same physical distance reads as closer — which is why a spatializer that models source directivity (source directivity) and reverb together produces far more convincing distance than one that models either alone.

How tools expose reverb, common mistakes, and limits

How tools expose this — the actual parameters

Across convolution, algorithmic, and perceptual reverbs, the parameters you will actually meet fall into recognisable groups. Knowing which underlying mechanism each one drives is half the battle.

Parameter (common names)What it really controlsUnderlying mechanism
Decay / Reverb time / RT60length of the diffuse tailfeedback gain / IR length
Size / Room sizedelay lengths, early-reflection spacing, mixing timedelay-line lengths; image-source room dimensions
Pre-delaygap between direct sound and first reverba delay before the reverb input
Early/Late mix, ER levelbalance of early reflections vs tailtap gains vs network output
Damping / HF decay / Hi cuthow fast highs decay vs lows (brilliance, air)low-pass in feedback loops (Jot absorbent filters)
Diffusionecho density / smoothness of build-upallpass count, matrix mixing
Dry/Wet, Mixdirect-to-reverberant ratiooutput gains — your distance control
Width / Stereo / Spreaddecorrelation between output channelsindependent taps / decorrelation filters
Modulation / Chorusslow delay-length modulation to break periodicityLFO on delay taps

In a perceptual engine such as Spat or DAM VRA, the top layer instead exposes presence, room presence, running reverberance, envelopment, warmth, brilliance, with the table above computed underneath.

astuce

Recognising that "Dry/Wet" is your distance fader, and "Width" is your decorrelation/envelopment control, lets you use any reverb spatially regardless of how it is labelled.

Common mistakes and pitfalls

  • Sending one mono tail to many loudspeakers. The comb-filtering / collapse failure described above. Always decorrelate the tail across outputs. This is the single most common and most damaging spatial-reverb error.
  • Using "Dry/Wet" as the only distance control. Turning down dry without holding reverberant energy makes the source quieter, not farther. Manage D/R, add pre-delay, add HF air loss.
  • Omnidirectional early reflections. Early energy thrown equally to all channels does not widen the source — only lateral early energy does. Place early reflections by direction.
  • Ignoring frequency-dependent decay. Real rooms (and air) damp highs faster than lows. A tail with a flat decay across frequency sounds artificial, metallic, and "stuck". Use damping / brilliance controls.
  • Stretching or EQ-ing a captured IR aggressively. Convolution's realism comes from the IR being a faithful record; large manipulations destroy exactly that. If you need to change the room substantially, use an algorithmic or perceptual engine instead.
  • Too-short or too-few FDN delays. Produces audible flutter, ringing, and metallic colouration because the echo density and modal density are insufficient. Use enough mutually prime delays.
  • Mismatched early and late spaces. Early reflections modelling a small room glued to a tail modelling a cathedral give contradictory size cues; the ear notices. Keep size cues consistent across the whole response.
  • Reverb that contradicts the pan. A source panned hard left whose early reflections and reverb arrive symmetrically from the right of the room, or whose D/R says "close" while the pan says "far corner". The cues must agree, or the illusion breaks — the central thesis of this guide.

Limits

Several limits bound what any reverb model can do.

The LTI assumption underlying convolution fails for moving sources (the room response changes as the source moves — see Doppler), for time-varying rooms (opening doors, moving people), and for non-linear sources. A single IR is a snapshot.

Diffuse-field idealisation underlies the statistical descriptors (RT60, Schroeder frequency, critical distance) and the FDN tail. Real rooms are not perfectly diffuse: non-uniform absorption, coupled spaces, and strong single reflections produce non-exponential, double-sloped, or position-dependent decays that simple models miss. Below the Schroeder frequency the room is modal, not statistical, and no statistical reverb model is valid there — the bass behaviour is dominated by individual room modes that depend on exact geometry and position.

Reproduction limits. Decorrelating a tail across NN loudspeakers approximates a diffuse field only over a limited listening area and frequency range; outside the sweet spot, or below the array's spatial aliasing frequency, true diffuseness cannot be reconstructed (this couples to the limits discussed under WFS and Ambisonics). Binaural reproduction of envelopment depends on individual HRTFs and head movement (see binaural). And the mixing time / early-reflection budget is bounded by CPU: a full geometric early-reflection model for many moving objects in a complex room is expensive, so real systems truncate the image-source order and hand off to the statistical tail earlier than physics would.

Finally, perceptual orthogonality is approximate. Jot's factors are designed to be independent, but at extreme settings they interact, and the mapping from factor to engine parameter is calibrated for typical rooms; push beyond that envelope and the independence weakens.

Worked example: small dry room versus large hall

To pull the chapter's numbers together, follow the same dry source — a voice — placed first in a small dry room and then in a large concert hall, both at a listening distance of 8 m, and see how the room transforms it.

The two spaces

Small dry room: a treated studio live room, V=150V = 150 m³ (e.g. 7.5×5×47.5 \times 5 \times 4), surface area S145S \approx 145 m², average absorption αmean=0.30\alpha_{\text{mean}} = 0.30 (treated).

  • Absorption A=145×0.30=43.5A = 145 \times 0.30 = 43.5 sabins.
  • Sabine RT60=0.161×150/43.5=24.15/43.5=0.56\text{RT60} = 0.161 \times 150 / 43.5 = 24.15 / 43.5 = 0.56 s. Eyring: ln(0.70)=0.357-\ln(0.70) = 0.357, Aeyr=145×0.357=51.7A_{\text{eyr}} = 145 \times 0.357 = 51.7, RT60=24.15/51.7=0.47\text{RT60} = 24.15/51.7 = 0.47 s. Call it  ⁣0.5\sim\!0.5 s.
  • Critical distance Dc=0.057150/0.5=0.057300=0.057×17.3=0.99D_c = 0.057 \sqrt{150/0.5} = 0.057 \sqrt{300} = 0.057 \times 17.3 = 0.99 m — about 1 m.
  • Schroeder frequency fS=20000.5/150=2000×0.0577=115f_S = 2000 \sqrt{0.5/150} = 2000 \times 0.0577 = 115 Hz.
  • Mixing time tmix150=12t_{\text{mix}} \approx \sqrt{150} = 12 ms.
  • Earliest reflections (room a few metres across, source/listener mid-room): first reflections within  ⁣515\sim\!5\text{–}15 ms of the direct sound; lateral reflections present but close.

Large hall: V=15000V = 15\,000 m³, surface area S4000S \approx 4000 m², average absorption αmean=0.12\alpha_{\text{mean}} = 0.12 (mostly hard surfaces plus audience).

  • Absorption A=4000×0.12=480A = 4000 \times 0.12 = 480 sabins.
  • Sabine RT60=0.161×15000/480=2415/480=5.0\text{RT60} = 0.161 \times 15000 / 480 = 2415 / 480 = 5.0 s — high; a real hall with audience absorption would be tuned nearer 2 s, but with the surfaces alone this bare-shell figure illustrates the contrast. Take a furnished-and-occupied RT602.0\text{RT60} \approx 2.0 s for the perceptual comparison.
  • Critical distance Dc=0.05715000/2.0=0.0577500=0.057×86.6=4.9D_c = 0.057 \sqrt{15000/2.0} = 0.057 \sqrt{7500} = 0.057 \times 86.6 = 4.9 m — about 5 m.
  • Schroeder frequency fS=20002.0/15000=2000×0.0115=23f_S = 2000 \sqrt{2.0/15000} = 2000 \times 0.0115 = 23 Hz — essentially the entire audible band is statistical and smooth.
  • Mixing time tmix15000=122t_{\text{mix}} \approx \sqrt{15000} = 122 ms — a long, audibly evolving early section.
  • Earliest reflections (walls tens of metres away): first side-wall and ceiling reflections may arrive 30–80 ms after the direct sound, with a long sparse early sequence before the tail mixes.

Contrasting the percepts at 8 m

QuantitySmall dry roomLarge hall
RT60 ⁣0.5\sim\!0.5 s ⁣2.0\sim\!2.0 s
Critical distance DcD_c ⁣1\sim\!1 m ⁣5\sim\!5 m
Listener at 8 m, D/R ⁣18\sim\!18 dB below DcD_c boundary (deep in reverberant field) ⁣4\sim\!4 dB below boundary (reverberant but direct still present)
Schroeder frequency ⁣115\sim\!115 Hz (audible modal bass) ⁣23\sim\!23 Hz (smooth throughout)
Mixing time ⁣12\sim\!12 ms (jumps to dense tail) ⁣122\sim\!122 ms (long evolving early field)
First reflection delay ⁣515\sim\!5\text{–}15 ms ⁣3080\sim\!30\text{–}80 ms
Perceived sizesmall, intimate, slightly boxy bassvast, grand
Perceived distance of sourceclearly "in the room" but not far (short tail caps the sense of distance)distant, set back, immersed
Apparent source widthmodest (close lateral reflections)large (strong, delayed lateral early energy)

Read the distance cue carefully. In the small room, the listener at 8 m is eight times the critical distance, so the direct sound is far below the reverberant level — yet because the tail is short (0.5 s) and the early reflections arrive within  ⁣15\sim\!15 ms, the room reads as small and close: the source is "in a small room with you", reverberant but not far, because a small room cannot make anything sound far. In the hall, 8 m is only  ⁣1.6\sim\!1.6 times DcD_c, so proportionally more direct sound survives — but the 30–80 ms gap before the first reflections, the 2-second tail, and the strong lateral early energy read unmistakably as a large space with the source set back into it. The same dry voice is, in one case, a person an arm's length into a small room, and in the other, a singer on a distant stage in a cathedral — and the only thing that changed was the reverberation.

The whole argument in one line

The dry source was identical. Direction, if we had panned it, would have been identical. Everything that made the two scenes different places lived in the reflections and the tail — their timing, their decay, their direction, and their decorrelation.

That is the entire argument of this chapter in one experiment. A spatializer that reproduces those four things faithfully, with perceptual factors that stay mutually consistent and consistent with the pan, places the source in a believable world. One that feeds a single reverb tail to many speakers and calls a wet/dry knob "distance" places it nowhere. DAM's VRA approach — parametric, perceptual, and spatial by construction — exists precisely to keep those four cues coherent across formats, from a stereo pair to a full 3D array.

For the cues that precede reverberation in the signal chain — the inverse-distance law, pre-delay, and air absorption — see distance & air; for the diffuse-field and envelopment physics this chapter leans on, see direct, diffuse & envelopment; and for how reverberation interacts with a source's own radiation pattern, see source directivity.

References

  1. Kuttruff, H. Room Acoustics, 6th ed., CRC Press, 2017. The standard textbook on the physics of rooms: image sources, diffuse fields, decay, statistical acoustics.
  2. Sabine, W. C. Collected Papers on Acoustics, Harvard University Press, 1922. The founding work on reverberation time and the Sabine equation.
  3. Schroeder, M. R. "Natural Sounding Artificial Reverberation." Journal of the Audio Engineering Society, 10(3), 1962. The origin of comb/allpass artificial reverberation and of backward-integration decay measurement.
  4. Jot, J.-M., and Chaigne, A. "Digital Delay Networks for Designing Artificial Reverberators." 90th AES Convention, 1991. Feedback Delay Networks with specifiable frequency-dependent decay.
  5. Jot, J.-M. "Real-time spatial processing of sounds for music, multimedia and interactive human-computer interfaces." Multimedia Systems, 7(1), 1999. The IRCAM Spatialisateur and the perceptual room-model approach.
  6. Griesinger, D. "The Psychoacoustics of Apparent Source Width, Spaciousness and Envelopment in Performance Spaces." Acta Acustica united with Acustica, 83, 1997. Perceptual factors of spatial impression.
  7. Allen, J. B., and Berkley, D. A. "Image method for efficiently simulating small-room acoustics." Journal of the Acoustical Society of America, 65(4), 1979. The image-source method for early reflections.
  8. ISO 3382-1:2009, Acoustics — Measurement of room acoustic parameters — Part 1: Performance spaces. Definitions and measurement of RT60, EDT, C80, D50, LF, and related descriptors.
  9. Beranek, L. Concert Halls and Opera Houses: Music, Acoustics, and Architecture, 2nd ed., Springer, 2004. Real-hall acoustic data and the perceptual interpretation of room descriptors.

← Back to the Sound Field & the Room