Wave Field Synthesis

Every spatialization technique covered so far in Part II shares a common trick: it manufactures an illusion of a source somewhere in space without ever physically reconstructing the sound field that a real source would create. Stereo and amplitude panning build a phantom image by feeding correlated signals to two or more loudspeakers, exploiting the way the auditory system fuses them into one perceived direction. Ambisonics encodes a spherical-harmonic approximation of the field and reconstructs it correctly only in a small region around the centre. Binaural synthesis sidesteps loudspeakers entirely and delivers the two ear signals directly. All of these are perceptual methods: they are engineered around the listener's ears and brain, and they work best at one privileged location — the sweet spot.

Wave Field Synthesis (WFS) takes a fundamentally different stance. Instead of asking "what two signals will fool the ears at the sweet spot?", it asks "what set of loudspeaker signals will physically recreate, throughout a whole region of space, the same acoustic wavefront that a real source would produce?" If you can rebuild the actual pressure field, then there is no sweet spot at all: anyone standing anywhere inside the reconstructed region hears a correct, physically consistent wavefront, with correct interaural time and level differences for their head position, correct motion parallax as they walk, and a source localisation that holds up across the whole audience. This is the headline promise of WFS, and it is also the reason WFS is so demanding in hardware: physical reconstruction of a continuous wavefront requires a dense, large array of loudspeakers driven by individually computed signals.

This chapter develops WFS from first principles. We begin with Huygens' insight that a wavefront is itself a collection of sources, formalise it through the Kirchhoff-Helmholtz and Rayleigh integrals, and from there derive the loudspeaker driving functions that place virtual sources behind the array and focused sources in front of it. We then confront the two great practical limits — spatial aliasing set by loudspeaker spacing, and the amplitude errors of so-called 2.5D synthesis with horizontal-only arrays — before examining truncation, room interaction, history, and finally the elegant observation that WFS is, at bottom, a delay-and-gain engine that contains amplitude panning as a degenerate special case.

The distinguishing idea: reconstruct the wavefront, not the image

Phantom image versus physical field

Consider what happens when a real trumpet plays three metres in front of you and slightly to your left. The trumpet radiates a roughly spherical wave. By the time that wave reaches you, it is a smoothly curved wavefront sweeping across your head: it reaches your left ear a fraction of a millisecond before your right, with a slightly higher level there, and the curvature of the wavefront encodes the distance. If you take a step to the side, the geometry updates continuously and instantly — the relationships are governed by physics, not by any assumption about where your head is.

Now consider a stereo phantom image of that same trumpet. Two loudspeakers each radiate their own spherical wave; the two waves superpose at your ears. At the sweet spot, the summed inter-ear differences happen to match a trumpet in the intended direction, and you localise it there. But step to the side and the illusion collapses: the precedence effect pulls the image toward the nearer speaker, the level balance shifts, and the "trumpet" jumps. The phantom was never a wavefront — it was a coincidence engineered for one point.

WFS aims to make the loudspeaker array radiate a combined wavefront indistinguishable, over an extended area, from the trumpet's. The array becomes a kind of acoustic "window": you look (listen) through it into a virtual scene, and the sources in that scene behave like real ones because the field on your side of the window is genuinely the field they would have produced.

Why this matters: from sweet spot to sweet area

The practical payoff is the elimination of the single sweet spot. In a cinema using amplitude-panned surround, a listener in the front-left seats hears dialogue panned to centre as coming from the left wall, because for them the centre and left channels no longer balance. In a WFS cinema, the dialogue is rendered as a virtual source at a fixed position behind the screen, and every seat reconstructs a wavefront emanating from that position — so dialogue stays glued to the actor's mouth no matter where you sit. The cost of this fidelity is that the "window" must be sampled finely enough in space, which is the entire engineering challenge we develop below.

This is why WFS is sometimes described as the most object-native of the loudspeaker methods. Like object-based audio, WFS thinks in terms of sources at positions; unlike channel-based panning, it computes a true physical rendering of each object for the array geometry actually installed. DAM Audio's WFS-X renderer and the RIPL spatial engine both adopt this position-first viewpoint.

The physics: Huygens, Kirchhoff-Helmholtz, and secondary sources

Huygens-Fresnel: a wavefront is made of sources

The conceptual seed of WFS is more than three centuries old. In 1690 Christiaan Huygens proposed that every point on a propagating wavefront can be regarded as the source of a tiny secondary spherical wavelet, and that the wavefront at a later instant is the envelope of all these wavelets. Augustin-Jean Fresnel later added the crucial refinement that the wavelets interfere, their amplitudes and phases summing coherently, which explains diffraction. Together this is the Huygens-Fresnel principle.

The implication for sound reproduction is direct and striking. Suppose a virtual source somewhere behind a surface radiates toward us. Wherever its wavefront crosses that surface, we could — in principle — replace the source entirely by placing real secondary sources on the surface, each emitting a wavelet with exactly the amplitude and phase the original wavefront had there. Downstream of the surface, the superposition of all those secondary wavelets would rebuild precisely the field the original source would have produced. The original source can be discarded; the surface of secondary sources stands in for it perfectly. A line or plane of loudspeakers is that surface of secondary sources. This is the whole idea of WFS in one sentence.

The Kirchhoff-Helmholtz integral

Huygens' principle becomes quantitative through the Kirchhoff-Helmholtz integral, which is an exact consequence of the wave equation. It states that the sound pressure $P(\mathbf{r}, \omega)$ at any point $\mathbf{r}$ inside a closed, source-free volume $V$ is completely determined by the pressure and its normal gradient on the bounding surface $S$ :

P(\mathbf{r}, \omega) = \oint_{S} \left[ G(\mathbf{r}\,|\,\mathbf{r}_S) \, \frac{\partial P(\mathbf{r}_S, \omega)}{\partial \mathbf{n}} - P(\mathbf{r}_S, \omega) \, \frac{\partial G(\mathbf{r}\,|\,\mathbf{r}_S)}{\partial \mathbf{n}} \right] dS .

Here $\mathbf{r}_S$ runs over the surface, $\mathbf{n}$ is the inward surface normal, $\omega = 2\pi f$ is angular frequency, and $G(\mathbf{r}\,|\,\mathbf{r}_S)$ is the free-field Green's function — the pressure at $\mathbf{r}$ due to a unit point (monopole) source at $\mathbf{r}_S$ :

G(\mathbf{r}\,|\,\mathbf{r}_S) = \frac{e^{-jk\,|\mathbf{r} - \mathbf{r}_S|}}{4\pi\,|\mathbf{r} - \mathbf{r}_S|}, \qquad k = \frac{\omega}{c},

with $c \approx 343\ \text{m/s}$ the speed of sound and $k$ the wavenumber. The two terms inside the integral have a clean physical meaning. The first term is a distribution of monopole secondary sources whose strength is set by the normal pressure gradient $\partial P / \partial \mathbf{n}$ (equivalently, by the normal particle velocity, via Euler's equation). The second term is a distribution of dipole secondary sources whose strength is set by the pressure $P$ itself. In words: if you control both the pressure and the velocity everywhere on a closed surface, you can recreate any interior field exactly. This is the rigorous form of Huygens' principle, and it is the theoretical foundation of WFS.

From a closed surface to a single layer: the Rayleigh integrals

Driving every secondary source as a combined monopole and dipole over a fully closed surface is impractical. Two simplifications make WFS buildable. First, if the secondary-source surface is an infinite plane and the listening area lies entirely on one side of it, the problem reduces to one of the two Rayleigh integrals. The Rayleigh I integral synthesises the field using monopole secondary sources only, driven by the normal particle velocity of the desired field on the plane:

P(\mathbf{r}, \omega) = -2 \int_{S} G(\mathbf{r}\,|\,\mathbf{r}_S)\, j\omega\rho\, v_n(\mathbf{r}_S, \omega)\, dS ,

where $\rho$ is the air density and $v_n$ the normal component of the desired particle velocity on the plane. The Rayleigh II integral uses dipole secondary sources driven by the desired pressure. The factor of two appears because, by restricting attention to a half-space, each secondary source effectively radiates into a half-space and we drop one of the two Kirchhoff-Helmholtz terms.

The Rayleigh I form is the one most WFS systems implement, because real loudspeakers — closed cabinets radiating omnidirectionally at low and mid frequencies — behave approximately as monopoles. The practical recipe is therefore: take the desired virtual source field, evaluate its particle velocity normal to the loudspeaker plane, and use that, point by point, as the driving signal for a continuous sheet of monopole loudspeakers. Two more steps remain to reach a real system. We must sample the continuous sheet into discrete loudspeakers (which introduces spatial aliasing, §5), and we must collapse the vertical dimension because we cannot afford a full 2-D plane of drivers (which introduces 2.5D errors, §6).

Real source positions over a large area

The geometric driving function

Let us make the driving function concrete for the case that matters most: a virtual point source at position $\mathbf{r}_s$ emitting a signal $S(\omega)$ , reproduced by a linear array of loudspeakers at positions $\mathbf{r}_n$ along a line. Carrying the Rayleigh I integral through the stationary-phase approximation (valid for frequencies well above the lowest, and listeners not too close to the array) yields a driving function for each loudspeaker of the canonical WFS form:

D_n(\omega) = S(\omega)\, a(\mathbf{r}_n)\, \sqrt{\frac{j k}{2\pi}}\; \frac{\cos\varphi_n}{\sqrt{r_n}}\; e^{-jk r_n},

where $r_n = |\mathbf{r}_n - \mathbf{r}_s|$ is the distance from the virtual source to loudspeaker $n$ , $\varphi_n$ is the angle between the source-to-speaker vector and the array normal, and $a(\mathbf{r}_n)$ is a selection/taper window discussed in §7. Three ingredients carry all the physics:

A delay $e^{-jk r_n}$ , equivalent in the time domain to a propagation delay $\tau_n = r_n / c$ . The loudspeaker nearest the virtual source fires first; those farther along the array fire progressively later, so that the emitted wavelets line up into a single coherent wavefront with the correct curvature.
An amplitude weighting $1/\sqrt{r_n}$ , which controls the level each loudspeaker contributes.
A directivity/obliquity factor $\cos\varphi_n$ , which suppresses loudspeakers seen at a glancing angle from the source, because they contribute little to the desired wavefront.

The term $\sqrt{jk/2\pi}$ is a frequency-dependent filter, rising at $+3\ \text{dB/octave}$ (a "half-differentiator"), that corrects the spectral tilt introduced by the 2.5D geometry and the stationary-phase reduction.

Worked example: the wavefront delays

Take a virtual source $2.0\ \text{m}$ behind a straight array, centred, and consider three loudspeakers: one directly in front of the source (perpendicular distance $2.0\ \text{m}$ ), and two at $\pm 1.5\ \text{m}$ laterally. The distances from the source to these speakers are:

Central speaker: $r_0 = 2.0\ \text{m}$ .
Side speakers: $r = \sqrt{2.0^2 + 1.5^2} = \sqrt{4.0 + 2.25} = \sqrt{6.25} = 2.5\ \text{m}$ .

The propagation delays are $\tau_0 = 2.0 / 343 = 5.83\ \text{ms}$ and $\tau = 2.5 / 343 = 7.29\ \text{ms}$ . In the rendered signal the central speaker therefore fires $7.29 - 5.83 = 1.46\ \text{ms}$ before the side speakers. The relative gains scale as $1/\sqrt{r}$ : taking the central speaker as the reference, the side speakers are weighted by $\sqrt{2.0/2.5} = \sqrt{0.8} = 0.894$ , i.e. $20\log_{10}(0.894) = -0.97\ \text{dB}$ . The obliquity factor for a source behind the centre seen from a side speaker is $\cos\varphi = 2.0/2.5 = 0.8$ , a further $-1.94\ \text{dB}$ . These delays and gains, computed independently for every loudspeaker and every virtual source, are the entire content of a WFS renderer.

Why localisation holds across the whole area

Because the array reproduces the wavefront curvature, the cue that the auditory system uses for direction — the angle of the incoming wavefront at the listener's two ears — is correct wherever the listener stands inside the array's coverage. Contrast this with amplitude panning, where the inter-channel level difference is fixed once and for all and only resolves to the intended angle at the sweet spot (see the geometry in amplitude panning). In WFS there is no single inter-channel ratio; there is a position-dependent set of delays that, by construction, assembles the right wavefront everywhere. A listener walking parallel to the array experiences correct motion parallax: nearby virtual sources sweep past quickly, distant ones drift slowly, exactly as in reality. This robust, walk-around localisation over a large area is the single most important reason to choose WFS.

Virtual sources behind the array and focused sources in front

Sources behind the array (the natural case)

When the virtual source sits behind the loudspeaker array — on the opposite side from the listeners — the geometry is benign. Every loudspeaker lies on the path between the source and the listening area, the propagation delays $\tau_n = r_n/c$ are all positive and causal, and the synthesized wavefront is diverging (convex toward the listeners) just like a real distant source. The special limiting case of a source at infinite distance is a plane wave: all delays collapse onto a single linear ramp across the array (a constant inter-element delay set by the incidence angle), and the obliquity factor and $1/\sqrt{r}$ weighting become uniform. Plane waves are extremely robust and are often used for ambience, music beds, or "infinitely distant" effects.

Focused sources in front of the array

WFS can do something no phantom-image system can: it can place a virtual source in front of the array, inside the room, between the loudspeakers and the listeners. This is a focused source. The idea is to make the array emit a converging wavefront that collapses to a point at the desired focus position and then diverges again beyond it, so that a listener behind the focus point hears a source apparently radiating from that in-room location. A flying insect, a voice that seems to come from the middle of the audience, a helicopter passing overhead through the seating — these are the dramatic effects focused sources enable.

The mathematics is the same Rayleigh-derived driving function, but with a sign change in the delay logic. For a source behind the array the nearest loudspeaker fires first; for a focused source the loudspeaker nearest the focus must fire last, so that all wavelets arrive simultaneously at the focus point. Concretely, if $r_n$ is the distance from loudspeaker $n$ to the focus point $\mathbf{r}_f$ , the driving delay becomes

\tau_n = \frac{r_{\max} - r_n}{c},

where $r_{\max}$ is the largest source-to-speaker distance over the active array, added as a constant to keep all delays causal (non-negative).

Focused sources have a valid zone

Because the wavefront must converge before it reaches the focus, focused sources are only valid in the region behind the focus point (relative to the array); listeners between the array and the focus hear an unnatural converging wavefront and broken localisation. Focused sources are also far more sensitive to spatial aliasing and to array truncation than sources behind the array, because the converging wavefront concentrates errors at the focus.

Worked example: a focused source

Place a focused source $1.2\ \text{m}$ in front of a horizontal array, centred. Consider again a central loudspeaker (perpendicular distance $1.2\ \text{m}$ to the focus) and side speakers at $\pm 1.0\ \text{m}$ lateral offset:

Central speaker to focus: $r_0 = 1.2\ \text{m}$ , time $1.2/343 = 3.50\ \text{ms}$ .
Side speaker to focus: $r = \sqrt{1.2^2 + 1.0^2} = \sqrt{1.44 + 1.00} = \sqrt{2.44} = 1.562\ \text{m}$ , time $1.562/343 = 4.55\ \text{ms}$ .

Now invert the timing so all wavelets meet at the focus. Using $r_{\max} = 1.562\ \text{m}$ as reference, the driving delays are $\tau_0 = (1.562 - 1.2)/343 = 1.06\ \text{ms}$ for the central speaker and $\tau = (1.562 - 1.562)/343 = 0\ \text{ms}$ for the side speakers. The central speaker — closest to the focus — now fires latest, exactly the reverse of the behind-the-array case in §3. The wavelets converge at the focus $1.2\ \text{m}$ inside the room and then diverge toward listeners seated farther back, who localise a source floating in the middle of the room.

Spatial aliasing

Why sampling the array creates aliasing

The Rayleigh integral assumes a continuous sheet of secondary sources. A real array is a discrete set of loudspeakers spaced $\Delta x$ apart, so we are spatially sampling the continuous driving function. Exactly as time sampling at rate $f_s$ aliases any signal above the Nyquist frequency $f_s/2$ , spatial sampling at interval $\Delta x$ aliases any wave whose spatial detail is finer than the sampling can capture. The relevant "spatial frequency" is the trace wavelength of the wavefront along the array. Below a certain frequency the discrete loudspeakers approximate the continuous sheet well and the synthesized wavefront is faithful; above it, the periodic gaps between loudspeakers radiate spurious additional wavefronts — aliased waves arriving from wrong directions and at wrong times.

The aliasing frequency

The condition for alias-free synthesis is that the loudspeaker spacing be smaller than half the trace wavelength of the most oblique wave component being reproduced. In the worst case (waves travelling nearly along the array, or wide listening angles), this gives the often-quoted rule of thumb that the spatial-aliasing frequency is approximately the speed of sound divided by twice the spacing:

f_{\text{alias}} \approx \frac{c}{2\,\Delta x}.

More generally, accounting for the incidence angle $\alpha$ of the virtual wavefront on the array and the listening angle, the limit is

f_{\text{alias}} = \frac{c}{2\,\Delta x\,(\,|\sin\alpha_{\text{ref}}| + |\sin\alpha_{\max}|\,)},

so steeper-incidence and wider listening geometries push $f_{\text{alias}}$ lower than the simple formula. The simple form is the convenient upper bound that engineers quote.

Numeric example: 12 cm spacing

With a typical compact spacing $\Delta x = 0.12\ \text{m}$ :

f_{\text{alias}} \approx \frac{343}{2 \times 0.12} = \frac{343}{0.24} \approx 1430\ \text{Hz}.

Above roughly $1.4\ \text{kHz}$ , this array no longer reconstructs the wavefront faithfully; aliased contributions appear. The following table shows how $f_{\text{alias}}$ scales with spacing, and the number of loudspeakers a $10\ \text{m}$ array would require:

Spacing $\Delta x$	$f_{\text{alias}} \approx c/(2\Delta x)$	Speakers per 10 m
30 cm	572 Hz	34
20 cm	858 Hz	51
12 cm	1430 Hz	84
8 cm	2144 Hz	126
4 cm	4288 Hz	251
2 cm	8575 Hz	501

To push aliasing above the upper limit of hearing ( $\approx 20\ \text{kHz}$ ) at $c = 343\ \text{m/s}$ would require $\Delta x \approx 0.86\ \text{cm}$ — about 1160 drivers across a 10 m front. That is why true full-band WFS is effectively impossible to build, and why every practical system makes a deliberate trade.

What happens above the aliasing frequency, and why it is tolerable

It would seem that aliasing above $1.4\ \text{kHz}$ should ruin WFS, since most of the spectrum lies there. In practice the degradation is far milder than the numbers suggest, for two perceptual reasons rooted in psychoacoustics. First, the primary wavefront — the correct one — still arrives first and at the right time, and the auditory system's localisation is dominated by this leading wavefront (the precedence effect). The aliased contributions arrive slightly later and from spread directions; they degrade timbre (adding a slight roughness or colouration) more than they degrade direction. Second, above a few kilohertz the ear localises mainly by envelopes and level rather than fine waveform structure, so phase errors in the aliased region matter less. The upshot is that a well-designed WFS system with $\Delta x$ around 10-20 cm delivers excellent, stable localisation across a large area, with a residual high-frequency colouration that experienced listeners can hear but most find unobtrusive.

The central engineering bargain

This is the central engineering bargain of WFS: spend as many channels as the budget allows to push $f_{\text{alias}}$ as high as possible, and accept graceful aliasing above it.

2.5D synthesis with horizontal arrays

The dimensional compromise

The Kirchhoff-Helmholtz integral demands a closed surface, or at least an infinite plane, of secondary sources surrounding (or bounding) the listening volume. A genuine 2-D plane of loudspeakers covering a cinema's front wall — hundreds of rows stacked floor to ceiling — is economically and physically out of the question. Real WFS installations therefore use a single horizontal line of loudspeakers at roughly ear height, encircling or fronting the audience. Synthesizing a 3-D field with a 1-D line of sources is called 2.5D synthesis: it correctly renders the horizontal geometry (azimuth, distance, wavefront curvature in the horizontal plane) but cannot reproduce elevation, and it introduces systematic amplitude errors because line sources and point sources have different distance laws.

Why the amplitude is wrong, and the reference line

The physical issue is a mismatch of geometric spreading. A true point (monopole) source's pressure falls off as $1/r$ (the inverse-distance law, discussed under distance and air). A horizontal line of monopoles, by contrast, behaves more like a line/cylindrical source whose pressure falls off as $1/\sqrt{r}$ . When we use a line array to mimic a point source, the radial decay law cannot be matched everywhere simultaneously. The standard fix is to enforce correctness along one chosen locus — the reference line (or reference point) — typically a line running parallel to the array through the centre of the seating area. The 2.5D driving function includes a correction factor of the form

g_{2.5\text{D}} = \sqrt{\frac{r_{\text{ref}}}{\,r_{\text{ref}} + r_n\,}} \quad\text{(schematically)},

chosen so that amplitude is exactly right on the reference line. Closer to the array than the reference line, sources are reproduced slightly too loud; farther away, slightly too quiet.

Localisation survives, loudness drifts

The error grows with distance from the reference line but, crucially, it is a smooth level error, not a localisation error — the direction and timing of the wavefront remain correct everywhere. Localisation, the cue that matters most, is preserved; only absolute loudness drifts by a decibel or two as you move toward or away from the array.

Worked example: the 2.5D level drift

Suppose the reference line is $r_{\text{ref}} = 3\ \text{m}$ from the array and a virtual source is rendered correctly there. Compare a listener at $1.5\ \text{m}$ from the array with one at $6\ \text{m}$ . A true point source's level relative to the reference distance would change by $20\log_{10}(r_{\text{ref}}/r)$ : at $1.5\ \text{m}$ that is $20\log_{10}(3/1.5) = +6.0\ \text{dB}$ , and at $6\ \text{m}$ it is $20\log_{10}(3/6) = -6.0\ \text{dB}$ . A line source's $1/\sqrt{r}$ law instead gives $10\log_{10}(3/1.5) = +3.0\ \text{dB}$ and $10\log_{10}(3/6) = -3.0\ \text{dB}$ . The 2.5D synthesis sits between these laws and is pinned to be correct at $3\ \text{m}$ ; the residual error away from the reference line is the difference between the intended point-source decay and the achieved line-source decay, on the order of $\pm 3\ \text{dB}$ over a 4:1 distance range — audible as a gentle loudness tilt but with localisation intact. This is generally an acceptable price for a horizontal-only array.

Truncation and diffraction effects; tapering

Finite arrays and edge diffraction

The Rayleigh integral assumes an infinite secondary-source plane. Real arrays are finite — they end at the edges of the screen or the walls of the room. Abruptly truncating the integral is mathematically equivalent to multiplying the ideal continuous driving function by a rectangular window, and a rectangular window has consequences. The sharp ends of the array act like the edges of a slit, producing diffraction wavefronts: secondary wavefronts that appear to emanate from the two endpoints of the array. In an impulse response measured off-axis, these show up as distinct, delayed "edge" arrivals after the main wavefront, and they can be audible as a faint echo or as a comb-filter colouration, particularly for virtual sources whose wavefront sweeps obliquely across the array so that one edge is energised strongly.

Tapering as a window

The remedy is borrowed directly from FIR filter and antenna-array design: instead of switching loudspeakers fully on at the array edge, taper their gains smoothly to zero over the last several elements, using the window function $a(\mathbf{r}_n)$ that appeared in the driving function of §3. A raised-cosine (Hann/Tukey) taper over, say, the outermost 10-20% of active loudspeakers softens the truncation, trading a sharp spectral edge-diffraction artefact for a slightly broader main wavefront and a small loss of effective aperture. The taper is also source-dependent: for any given virtual source, only the loudspeakers whose obliquity factor $\cos\varphi_n$ is positive (those "facing" the relevant direction) are active, and the active sub-array is itself tapered at its moving edges. This dynamic windowing — choosing which speakers participate for each source and fading their contribution at the boundaries — is one of the more delicate parts of a WFS renderer, and a place where implementations differ markedly in quality.

The aperture-resolution trade

Tapering interacts with the array's aperture. A larger active aperture (more loudspeakers participating) sharpens the synthesized wavefront and widens the alias-free listening angle, but it also energises the array edges more and demands a longer taper. A narrower aperture is gentler on diffraction but blurs the wavefront and shrinks the usable area. WFS design is thus a continuous negotiation between aperture, taper length, spacing (aliasing), and channel count — there is no single optimum, only a Pareto front set by the room, the budget, and the programme material.

Room interaction and practical scale

WFS wants a dry room

There is a deep tension between WFS and room acoustics. WFS works by reproducing a precisely controlled wavefront; the room's own reflections superimpose additional, uncontrolled wavefronts that the renderer knows nothing about.

Reverberation fights the synthesized field

In a reverberant space these reflections smear the synthesized field, degrade localisation, and partially defeat the whole point of physically reconstructing the wavefront. WFS therefore performs best in fairly dry rooms, where the direct, rendered field dominates and the venue contributes little of its own — much as discussed under reverberation and direct, diffuse and envelopment.

The intended room acoustics — the "virtual" reverberation and envelopment of the reproduced scene — are then synthesized by the renderer itself, often as a set of additional virtual sources (early reflections as image sources, late reverberation as plane waves or many decorrelated sources), giving the engineer full control over the perceived space rather than inheriting the playback room's signature. This is a strength: WFS can render any acoustic, but only if the host room stays out of the way.

Scale, cost, and channel count

The numbers in §5 make the practical reality plain. A modest installation fronting a 10 m wall at 16 cm spacing already needs about 63 independently driven channels; surrounding a medium auditorium with a few hundred loudspeakers is routine for serious WFS. Each loudspeaker needs its own amplifier channel and its own DSP-computed feed, so the rendering load scales as (number of sources) × (number of loudspeakers). The following table summarises the trade-offs against the other techniques in this part of the guide:

Property	Amplitude panning / surround	Ambisonics (order $N$ )	Wave Field Synthesis
What is reproduced	Phantom image	Truncated spherical-harmonic field	Physical wavefront
Valid listening region	Sweet spot	Region $\sim$ growing with $N$	Large area (whole array coverage)
Channels (typical)	2-24	$(N+1)^2$ (e.g. 16 at 3rd order)	tens to hundreds of speakers
Per-feed processing	Gains	Decode matrix	Per-source delay + gain + filter
In-room (focused) sources	No	Limited	Yes
Main limit	Off-sweet-spot collapse	Order-limited spatial resolution	Spatial aliasing, room, cost

The channel count is the dominant practical obstacle. WFS has consequently found its home in installations where the spatial payoff justifies the infrastructure: planetaria, theme-park attractions, immersive art, research auditoria, and high-end cinemas, rather than in the living room. DAM Audio's WFS-X targets exactly these large-array contexts, and the RIPL engine is designed so that the same object-based scene can be rendered to a WFS array, to ambisonics, or to a binaural mix without re-authoring.

History: Berkhout, TU Delft, and the large systems

Berkhout 1988 and the Delft school

WFS was conceived not in a music studio but in a seismic and acoustics laboratory. A. J. Berkhout, professor at the Delft University of Technology (TU Delft) in the Netherlands, came to sound reproduction from seismic wavefield processing, where reconstructing wavefronts from arrays of sensors was standard practice. In his 1988 paper "A holographic approach to acoustic control," Berkhout reframed loudspeaker reproduction as an acoustic holography problem: record or synthesize the wavefront, then reconstruct it with an array of secondary sources. The deliberate analogy to optical holography — recording and replaying a whole wavefront rather than a flat image — gave WFS its early name of "acoustic holography" and its guiding intuition.

Through the 1990s the Delft group turned the principle into engineering. The landmark 1993 paper by Berkhout, de Vries and Vogel, "Acoustic control by wave field synthesis," laid out the loudspeaker driving functions, the treatment of finite arrays, and the practical approximations (2.5D, tapering) that make WFS realisable. Doctoral work by Diemer de Vries, Edwin Start (whose thesis on the application of WFS is a standard reference) and others filled in the theory of array directivity, truncation, and reduced-bandwidth synthesis. By the late 1990s TU Delft had working multi-channel WFS systems and a mature body of theory.

CARROUSO, IOSONO/Fraunhofer, and the Berlin system

European collaboration spread WFS beyond Delft. The EU CARROUSO project (2001-2003) brought together TU Delft, the Fraunhofer Institute for Digital Media Technology (IDMT) in Ilmenau, IRCAM in Paris and others to develop WFS transmission, authoring and rendering, and to integrate it with the then-new MPEG-4 object-based audio framework. Out of Fraunhofer IDMT came IOSONO, a company that commercialised WFS for cinemas and theme parks, deploying systems with hundreds of loudspeakers.

The most celebrated research installation is at the Technische Universität Berlin, whose large lecture hall (the Hörsaal H 0104) was fitted with a WFS array of more than 2700 individually driven loudspeakers around the room — for years the largest WFS system in the world and a workhorse for perceptual and technical studies. Around the same body of work, Sascha Spors, Rudolf Rabenstein, Jens Ahrens and colleagues advanced the analytic theory: the relationship between WFS and Ambisonics, the spectral analysis of spatial aliasing, and the unifying framework of sound field synthesis that treats both methods as solutions to the same boundary-value problem. Ahrens' book Analytic Methods of Sound Field Synthesis (2012) is the modern theoretical capstone, deriving driving functions for arbitrary array geometries in closed form.

WFS as a delay engine, and unifying it with amplitude panning

Two ways to steer a wavefront

Step back and compare the two families of loudspeaker spatialization at the level of what physical quantity does the steering. Amplitude panning steers a phantom image by level differences between a small number of loudspeakers, with all loudspeakers radiating the same signal at the same time (no relative delay). WFS steers a real wavefront primarily by time differences — the per-loudspeaker delays $\tau_n = r_n/c$ — with level (the $1/\sqrt{r}$ and obliquity weights) playing a secondary, corrective role.

The deepest structural distinction

In slogan form: amplitude panning is gain-based; WFS is delay-based. This is the deepest structural distinction between them, and it explains their complementary strengths. Pure gain differences produce a robust perceived direction at the sweet spot but no physical wavefront curvature, so they fail off-centre. Delays produce genuine wavefront geometry that holds across a large area but require many closely spaced drivers to avoid aliasing.

A single driving-function engine

The two methods are not really separate algorithms; they are points on a continuum, and a well-designed renderer can implement both with one signal path. Every loudspeaker feed, in the most general object renderer, is

y_n(t) = \sum_{i} g_{n,i}\; x_i\!\left(t - \tau_{n,i}\right) \ast h_{n,i}(t),

a sum over virtual sources $i$ of the source signal $x_i$ , delayed by $\tau_{n,i}$ , scaled by gain $g_{n,i}$ , and optionally filtered by $h_{n,i}$ (the $+3\ \text{dB/oct}$ WFS pre-filter, distance air-absorption, etc.). Set the delays $\tau_{n,i}$ to the WFS propagation delays and the gains to the WFS weights, and the engine performs Wave Field Synthesis. Force all delays to zero, $\tau_{n,i} = 0$ , restrict to two or three loudspeakers, and choose the gains by a panning law (tangent law, VBAP, or HSR multi-speaker upmixing), and the same engine performs amplitude panning. Vector Base Amplitude Panning and WFS thus differ only in which terms are active, not in the architecture.

Hybrid and graceful-degradation strategies

This unification has real engineering value. A renderer can blend the two regimes: use full delay-based WFS up to the spatial-aliasing frequency, where it excels, and fall back to gain-based, time-aligned panning above it, where delays would only feed aliasing — an approach sometimes called Optimised Phantom Source Imaging or hybrid WFS. It can also degrade gracefully as a venue's array thins out: a dense front array rendered with full WFS, surround positions filled by amplitude panning across sparser loudspeakers, all from one object-based scene description. Because the scene is authored as objects with positions — the same abstraction as object-based audio — the choice of WFS versus panning versus ambisonic decode becomes a rendering decision made for the array actually installed, not an authoring decision baked into the content. DAM Audio's RIPL is built around exactly this separation of authoring from rendering.

Connecting back to perception

Whether the engine steers by delay or by gain, the target is always the listener's two ears and the cues the auditory system extracts: interaural time difference (ITD), interaural level difference (ILD), and the spectral and curvature cues for distance and elevation. The recurring theme of this whole part of the guide — encode a field into a finite representation, then decode it to the system present — applies to WFS as cleanly as to any other method. The encode step is the authoring of virtual sources (position, signal, directivity, intended room) into the object scene. The decode step is the evaluation of the Rayleigh-derived driving function for the specific loudspeaker array installed: which speakers are active, with what delay, gain and filter. WFS differs from ambisonics or stereo only in that its decode aims to reconstruct the physical wavefront over a large area rather than to optimise a perceptual approximation at a point — and pays for that ambition in loudspeakers.

Limits

It is worth collecting WFS's boundaries in one place, because they define when the method is and is not the right tool.

Spatial aliasing caps faithful synthesis at $f_{\text{alias}} \approx c/(2\Delta x)$ ; above it, timbre degrades even though localisation largely survives. No affordable spacing reaches full audio bandwidth.
2.5D amplitude error is inherent to horizontal-only arrays: levels are correct only on the reference line and drift by a few decibels elsewhere, and elevation cannot be reproduced at all without a 2-D array.
Truncation and diffraction from finite arrays add edge wavefronts, partly tamed by tapering at the cost of aperture.
Room reverberation competes with the synthesized field; WFS needs a dry venue and synthesizes its own acoustics.
Focused sources are valid only behind the focus, are aliasing-sensitive, and break down for listeners between array and focus.
Cost and channel count — tens to hundreds of amplified, individually processed loudspeakers — confine WFS to installations, not consumer playback.
Near-field and very-low-frequency behaviour departs from the stationary-phase approximations used to derive the simple driving functions, requiring more careful treatment close to the array and at the lowest frequencies (where wavelengths exceed the array, true wavefront control is impossible).

None of these negates WFS's defining virtue: within its valid band and area, it delivers stable, walk-around, physically correct localisation that no phantom-image method can match. The art of WFS engineering is choosing spacing, aperture, taper, reference geometry and hybrid fallbacks so that these limits land where the programme and the audience will least notice them.

Common mistakes and pitfalls

A few recurring errors separate disappointing WFS installations from convincing ones.

Common mistake

Over-spacing the array is the most common: stretching the budget over too wide a front with 30-40 cm spacing drops $f_{\text{alias}}$ below 600 Hz, audibly colouring everything; it is usually better to cover a smaller width densely than a large width sparsely.

Ignoring the reference line when setting levels leaves the mix correct only in a band of seats and too loud or too soft elsewhere; the reference geometry must be chosen for the actual audience area. Abrupt array truncation without tapering produces edge-diffraction echoes that engineers often misdiagnose as room reflections. Deploying WFS in a live, reverberant room and then wondering why localisation is mushy inverts the method's requirement for a dry host space. Pushing focused sources too far in front of the array, or seating listeners between array and focus, breaks the converging-wavefront geometry and collapses the effect. And treating WFS as a channel format rather than an object renderer forfeits its main advantage: WFS content should be authored as positioned objects and rendered to the installed array, never mixed as fixed loudspeaker feeds. Avoiding these six mistakes accounts for most of the difference between a WFS system that merely works and one that genuinely dissolves the boundary between the array and the virtual scene.

References

Berkhout, A. J. (1988). "A holographic approach to acoustic control." Journal of the Audio Engineering Society, 36(12), 977-995.
Berkhout, A. J., de Vries, D., & Vogel, P. (1993). "Acoustic control by wave field synthesis." Journal of the Acoustical Society of America, 93(5), 2764-2778.
Start, E. W. (1997). Direct Sound Enhancement by Wave Field Synthesis. Doctoral dissertation, Delft University of Technology.
Spors, S., Rabenstein, R., & Ahrens, J. (2008). "The theory of wave field synthesis revisited." 124th AES Convention, Amsterdam, preprint 7358.
Ahrens, J. (2012). Analytic Methods of Sound Field Synthesis. Springer, Berlin/Heidelberg.
Verheijen, E. N. G. (1997). Sound Reproduction by Wave Field Synthesis. Doctoral dissertation, Delft University of Technology.
Pulkki, V. (1997). "Virtual sound source positioning using vector base amplitude panning." Journal of the Audio Engineering Society, 45(6), 456-466.
Daniel, J., Nicol, R., & Moreau, S. (2003). "Further investigations of high-order Ambisonics and wavefield synthesis for holophonic sound imaging." 114th AES Convention, Amsterdam, preprint 5788.
Blauert, J. (1997). Spatial Hearing: The Psychophysics of Human Sound Localization (revised ed.). MIT Press, Cambridge, MA.

← Back to Spatialization Techniques

The distinguishing idea: reconstruct the wavefront, not the image​

Phantom image versus physical field​

Why this matters: from sweet spot to sweet area​

The physics: Huygens, Kirchhoff-Helmholtz, and secondary sources​

Huygens-Fresnel: a wavefront is made of sources​

The Kirchhoff-Helmholtz integral​

From a closed surface to a single layer: the Rayleigh integrals​

Real source positions over a large area​

The geometric driving function​

Worked example: the wavefront delays​

Why localisation holds across the whole area​

Virtual sources behind the array and focused sources in front​

Sources behind the array (the natural case)​

Focused sources in front of the array​

Worked example: a focused source​

Spatial aliasing​

Why sampling the array creates aliasing​

The aliasing frequency​

Numeric example: 12 cm spacing​

What happens above the aliasing frequency, and why it is tolerable​

2.5D synthesis with horizontal arrays​

The dimensional compromise​

Why the amplitude is wrong, and the reference line​

Worked example: the 2.5D level drift​

Truncation and diffraction effects; tapering​

Finite arrays and edge diffraction​

Tapering as a window​

The aperture-resolution trade​

Room interaction and practical scale​

WFS wants a dry room​

Scale, cost, and channel count​

History: Berkhout, TU Delft, and the large systems​

Berkhout 1988 and the Delft school​

CARROUSO, IOSONO/Fraunhofer, and the Berlin system​

WFS as a delay engine, and unifying it with amplitude panning​

Two ways to steer a wavefront​

A single driving-function engine​

Hybrid and graceful-degradation strategies​

Connecting back to perception​

Limits​

Common mistakes and pitfalls​

References​