Glossary

This glossary collects the key terms used throughout the guide, grouped by theme. Each entry is a precise, self-contained definition; where a formula clarifies the concept it is given in math. Use it as a reference while reading the chapters — the deeper treatment of most of these terms lives in Spatial Psychoacoustics, Formats and Channel Layouts, the Techniques part, and the Field and Room part.

Perception

Azimuth. The horizontal angle of a sound source relative to the listener's forward direction, usually measured in degrees, positive to one side (commonly the left or right depending on convention). Together with elevation and distance it specifies a direction in space.

Elevation. The vertical angle of a source above or below the horizontal plane through the ears, typically ranging from $-90°$ (directly below) through $0°$ (ear level) to $+90°$ (directly overhead).

Distance. The perceived radial range from listener to source, inferred from loudness, the direct-to-reverberant ratio, high-frequency air absorption, and parallax cues rather than from any single dedicated channel of the auditory system.

ITD (Interaural Time Difference). The difference in arrival time of a sound at the two ears, the dominant lateral cue below roughly $1.5$ kHz. A useful low-frequency approximation is the Woodworth model $ITD \approx \frac{r}{c}\,(\theta + \sin\theta)$ , with head radius $r$ , sound speed $c$ , and azimuth $\theta$ in radians.

ILD (Interaural Level Difference). The difference in sound pressure level between the two ears caused by head shadowing, the dominant lateral cue above roughly $1.5$ kHz where the head is large relative to the wavelength.

Duplex theory. Lord Rayleigh's principle that lateral localization is served by two complementary mechanisms across frequency: ITD at low frequencies and ILD at high frequencies, with a transition region around $1$ – $1.5$ kHz where neither is fully reliable.

Rule of thumb

The duplex split tracks wavelength versus head size: ITD dominates below $\approx 1.5$ kHz, ILD above it, with the crossover near $1$ – $1.5$ kHz where neither cue is fully reliable.

Pinna. The visible outer ear, whose folds introduce frequency-dependent reflections and notches that vary with direction; these spectral cues are essential for resolving elevation and front/back ambiguity.

HRTF (Head-Related Transfer Function). The frequency-domain filter describing how a given ear transforms a free-field sound from a specific direction, encoding ITD, ILD, and pinna/torso spectral cues into a single direction-dependent response.

HRIR (Head-Related Impulse Response). The time-domain counterpart of the HRTF; convolving a mono signal with the left and right HRIRs for a direction renders that source at that direction over headphones.

SOFA (Spatially Oriented Format for Acoustics). The AES69 standard file format for storing HRTF/HRIR sets and other space-dependent acoustic data, enabling portable, interchangeable measured-ear datasets.

Cone of confusion. The set of directions sharing nearly identical ITD and ILD (approximately a cone about the interaural axis), within which front/back and up/down confusions arise unless resolved by spectral cues or head movement.

note

Inside a cone of confusion, ITD and ILD alone cannot disambiguate a direction — spectral (pinna) cues or head movement are what break the front/back and up/down tie.

Precedence effect (Haas effect). The perceptual dominance of the first-arriving wavefront in localization: when a sound is followed within roughly $1$ – $40$ ms by a copy, the pair is heard as a single event localized toward the leader. See reverberation.

Echo threshold. The delay at which a lagging copy of a sound ceases to fuse with the leader and is heard as a separate echo, typically tens of milliseconds for speech and shorter for transients.

Externalization. The perception that a sound originates outside the head in the surrounding space rather than inside it; strong externalization requires plausible HRTF cues, often reverberation, and ideally head-tracking. Central to binaural reproduction.

Minimum audible angle (MAA). The smallest change in source direction a listener can reliably detect; it is finest near the front (around $1°$ in azimuth) and degrades toward the sides and for elevation.

Phantom image. A virtual source perceived between two or more real loudspeakers when they emit coherent, related signals, the basis of stereo and amplitude-panned imaging.

Summing localization. The mechanism by which coherent signals from multiple loudspeakers arriving within about $1$ ms are summed by the auditory system into a single phantom image whose position depends on their relative level and timing.

Apparent source width (ASW). The perceived broadening of a source beyond its physical extent, increased by early lateral reflections and low inter-channel coherence; a key component of spaciousness. See direct, diffuse and envelopment.

Listener envelopment (LEV). The sense of being surrounded and immersed in a sound field, driven mainly by late-arriving lateral and diffuse energy reaching the listener from many directions.

IACC (Interaural Cross-Correlation Coefficient). A measure of the similarity of the signals at the two ears, $IACC = \max_{\tau}\big|\,\rho_{LR}(\tau)\,\big|$ over a short lag window; low IACC correlates with wider, more enveloping perception.

Reproduction concepts

Sweet spot. The listening position (or small region) where a loudspeaker system's intended imaging, balance, and timing align; image quality degrades as the listener moves away from it.

Decorrelation. The deliberate reduction of similarity between channels (via delays, all-pass filters, or phase randomization) to widen images, increase envelopment, and reduce comb filtering between loudspeakers.

Crosstalk. In loudspeaker reproduction, the unintended path by which the left-ear signal also reaches the right ear and vice versa, which corrupts binaural cues delivered over speakers.

Crosstalk cancellation (XTC). A filtering technique that pre-conditions loudspeaker feeds so the unwanted crosstalk paths are destructively cancelled at the ears, enabling binaural delivery over speakers. The basis of transaural reproduction.

Inter-channel coherence. The normalized correlation between two reproduction channels; high coherence yields a tight central phantom, low coherence yields width and diffuseness. Distinct from IACC, which is measured at the ears.

Don't confuse the two

Inter-channel coherence is measured between the reproduction channels, while IACC is measured at the listener's ears — related ideas, but not the same quantity.

Renderer. The software or hardware that converts a scene description (objects plus metadata, or an ambisonic signal) into loudspeaker or headphone feeds for a specific playback layout.

Decoder. The stage that maps an encoded spatial representation — most often Ambisonics — onto a concrete loudspeaker layout, computing the gains that reconstruct the encoded sound field.

Upmix. The process of deriving more output channels than exist in the source, for example creating a surround or immersive presentation from a stereo recording by extracting ambience and steering content.

Downmix. The process of combining a higher-channel-count signal into fewer channels (for example $7.1.4$ to stereo) using a defined set of summing coefficients, ideally preserving balance and avoiding cancellation.

Representations and formats

Channel-based. A representation in which each signal is tied to a specific loudspeaker position, so the layout is fixed at production time. The classic model behind stereo and surround. See Formats.

Object-based. A representation in which sources are stored as mono (or few-channel) audio plus positional and behavioural metadata, rendered to the actual playback layout at runtime. Detailed in object-based audio.

Scene-based. A layout-independent representation that encodes the whole sound field as a set of basis functions (notably Ambisonics) rather than as discrete sources or channels.

Bed. A channel-based foundation layer in an immersive mix (for example a $7.1.4$ bed) over which dynamic objects are placed, carrying ambience and content that need not move.

LFE (Low-Frequency Effects). A dedicated band-limited channel for low-frequency content (the ".1" in surround notation), reproduced by a subwoofer and intended as an additive effects channel rather than a redirected-bass channel.

Common mistake

The LFE channel is an additive effects channel, not the place where the main channels' bass is redirected. That redirection is the job of bass management — keep the two roles separate.

Bass management. The system that high-pass-filters main channels and routes their low frequencies (summed with any LFE) to a subwoofer, so full-range content plays correctly on speakers of limited extension.

X.Y.Z notation. The shorthand for an immersive loudspeaker layout: $X$ full-range channels at ear level, $Y$ LFE channels, $Z$ height channels; for example $7.1.4$ means seven, one, and four. See Formats.

ADM (Audio Definition Model). An ITU-R metadata model (BS.2076) that describes the content, format, and position of channels, objects, and scene-based audio, providing a standard way to author and exchange immersive programmes.

BW64 (Broadcast Wave 64-bit). A WAV-based file format (ITU-R BS.2088) supporting files larger than $4$ GB and embedding ADM metadata, the common container for object-based deliverables.

B-format. The first-order Ambisonics signal set, classically the components $W$ (omnidirectional pressure) and $X, Y, Z$ (figure-of-eight gradients), describing the sound field at a point.

ACN (Ambisonic Channel Number). The standard ordering scheme for ambisonic components, indexing each spherical-harmonic channel by $ACN = n^2 + n + m$ for degree $n$ and order $m$ .

SN3D (Schmidt semi-normalized). The component normalization convention paired with ACN in the AmbiX format, defining the relative scaling of the spherical-harmonic channels.

AmbiX. The de facto interchange convention for Ambisonics combining ACN channel ordering with SN3D normalization, widely used in VR and production tools.

Ambisonic order. The truncation degree $N$ of the spherical-harmonic expansion; higher order gives finer spatial resolution at the cost of more channels, with $(N+1)^2$ channels for full-sphere (periphonic) signals.

Matrix encoding. A technique that folds extra spatial channels into a smaller channel count using amplitude/phase relationships (for example $4$ : $2$ : $4$ systems), to be recovered by a matched decoder. See surround and matrix.

Spatialization techniques

Panning. Distributing a single source's signal across two or more loudspeakers to place a phantom image between them, most commonly by adjusting relative gains. See amplitude panning.

Panning law. The rule relating a pan position to the per-speaker gains, chosen so perceived loudness stays constant across the pan; common laws apply a $-3$ dB or $-4.5$ dB centre attenuation, with constant-power panning using $g_L = \cos\theta,\; g_R = \sin\theta$ so that $g_L^2 + g_R^2 = 1$ .

VBAP (Vector Base Amplitude Panning). A geometric panning method that places a source by computing gains for the two or three loudspeakers spanning its direction, expressing the unit source vector as a non-negative combination of the speaker base vectors.

DBAP (Distance-Based Amplitude Panning). A panning method that assigns gains from the distances between a virtual source position and every loudspeaker, working without assumptions about a central sweet spot or a regular layout.

MDAP (Multiple-Direction Amplitude Panning). An extension of VBAP that spreads a source over several nearby directions to keep its perceived width and energy more constant as it moves across and between speakers.

Ambisonics. A scene-based technique that encodes a sound field into spherical-harmonic components and decodes it to any loudspeaker layout or to binaural, decoupling production from playout. See Ambisonics.

Binaural. Reproduction that recreates the exact pressure signals at a listener's two ears, usually over headphones via HRTF filtering, to evoke full three-dimensional localization. See binaural.

Transaural. The delivery of binaural signals over loudspeakers using crosstalk cancellation so that each ear receives predominantly its intended signal. See transaural.

Wave field synthesis (WFS). A technique that uses a dense array of loudspeakers to physically reconstruct a target wavefront over an extended area, producing sources with consistent localization independent of listening position. See WFS.

Huygens-Fresnel principle. The principle that any wavefront can be reconstructed by treating every point on a surface as a secondary source; it provides the physical foundation of wave field synthesis.

Spatial aliasing. In array-based reproduction (WFS, ambisonic decoding), the high-frequency artefacts that appear when the spacing of secondary sources is larger than half a wavelength, $d > \tfrac{c}{2 f}$ , blurring imaging above the aliasing frequency.

Pitfall

Once the secondary-source spacing exceeds half a wavelength, $d > \tfrac{c}{2 f}$ , spatial aliasing sets in and imaging blurs above that aliasing frequency. Loudspeaker spacing therefore sets a hard ceiling on the usable bandwidth of an array.

Focused source. In WFS, a virtual source rendered in front of the loudspeaker array (between array and listener), formed by a converging wavefront that focuses at the intended point before diverging.

The room and the field

Sound field. The spatial-temporal distribution of acoustic pressure and particle velocity in a region; spatial audio aims to recreate or evoke a desired sound field at the listener.

Near-field. The region close to a source where the field is complex and pressure and velocity are not in phase, and where level falls faster than the simple inverse-distance law; beyond it lies the far field where $p \propto 1/r$ holds.

Direct-to-reverberant ratio (DRR). The energy ratio of the direct sound to the reverberant sound at the listener, a primary distance cue; it decreases with distance and is often expressed in dB. See distance and air.

Critical distance. The distance from a source at which direct and reverberant energies are equal, $r_c \approx 0.057\sqrt{V/RT_{60}}$ (with room volume $V$ in m³); beyond it the reverberant field dominates.

RT60. The reverberation time, the duration for the sound energy to decay by $60$ dB after a source stops; Sabine's estimate is $RT_{60} = \frac{0.161\,V}{A}$ with volume $V$ in m³ and total absorption $A$ in m² sabins. See reverberation.

EDT (Early Decay Time). The reverberation time derived from the initial $10$ dB of the decay and scaled to $60$ dB; it correlates better than $RT_{60}$ with the perceived reverberance of a room.

C80 (Clarity). The ratio of early ( $0$ – $80$ ms) to late (after $80$ ms) energy in the impulse response, $C_{80} = 10\log_{10}\!\frac{\int_0^{80}p^2\,dt}{\int_{80}^{\infty}p^2\,dt}$ dB, indicating the clarity of musical detail.

Early reflections. The first discrete reflections arriving after the direct sound (roughly within $50$ – $80$ ms), which influence perceived source width, timbre, and the impression of room size.

Diffuse field. A field in which acoustic energy arrives uniformly from all directions with random phase, approximated by the late reverberant tail of a room.

Schroeder frequency. The frequency above which room modes overlap densely enough to be treated statistically rather than individually, $f_S \approx 2000\sqrt{RT_{60}/V}$ (Hz, with $V$ in m³); below it the room behaves modally.

Feedback delay network (FDN). An efficient reverberation algorithm built from several delay lines interconnected by a feedback mixing matrix, generating a dense, controllable late-reverberation tail.

Convolution reverb. Reverberation produced by convolving a signal with a measured room impulse response, reproducing the captured space's reflections and decay with high realism.

Doppler effect. The shift in observed frequency caused by relative motion between source and listener, $f' = f\,\frac{c}{c - v_s}$ for a source approaching at speed $v_s$ ; relevant to moving objects in dynamic spatial scenes.

Air absorption. The frequency-dependent attenuation of sound by the atmosphere, rising with frequency, humidity, and distance; it dulls distant sources and acts as a distance cue. See distance and air.

Units and metrics

Inverse-distance law. The far-field relation that sound pressure falls inversely with distance, $p \propto 1/r$ , so level decreases by $-6$ dB per doubling of distance for a point source in a free field.

Directivity. The variation of a source's (or microphone's) radiated/received level with direction, described by a polar pattern; it shapes how much direct versus reflected energy reaches a listener.

Directivity index (DI). A single-number measure of how concentrated a source's output is on-axis relative to an omnidirectional source, $DI = 10\log_{10} Q$ dB, where $Q$ is the directivity factor.

Decibel (dB). A logarithmic ratio unit; for pressure $L = 20\log_{10}(p/p_{0})$ dB and for power/energy $L = 10\log_{10}(P/P_{0})$ dB, with $p_0 = 20\ \mu$ Pa the reference for SPL.

Sound pressure level (SPL). The decibel measure of acoustic pressure relative to $20\ \mu$ Pa, the standard scale for loudness of a sound field.

Wavelength. The spatial period of a sound wave, $\lambda = c/f$ ; at $c \approx 343$ m/s a $1$ kHz tone has $\lambda \approx 0.34$ m, comparable to head size, which is why the duplex transition sits near there.

Sound speed. The propagation speed of sound, about $343$ m/s in air at $20°$ C, increasing roughly by $0.6$ m/s per degree Celsius; it sets the time scale for ITD, reflections, and array spacing.

← Back to Fundamentals

Perception​

Reproduction concepts​

Representations and formats​

Spatialization techniques​

The room and the field​

Units and metrics​