The Doppler Effect & Moving Sources

A static source at a fixed angle is, in a sense, the easy case. The moment a virtual source begins to move — an arrow whistling past, a car sweeping across the stage, a helicopter crossing the dome overhead — the spatializer is asked to reproduce a cluster of cues that the ear treats as a single, indivisible impression of motion. Level rises and falls, the interaural cues sweep across the head, the air progressively dulls the high end, and the pitch of the source bends downward as it passes. That last cue is the Doppler effect, and it is the subject of this chapter.

Doppler is special among motion cues because it is not a separate "effect" to be bolted on. It is the automatic consequence of doing distance rendering honestly: if a spatializer models the time it takes sound to travel from a moving source to the listener, the pitch shift falls out for free. Conversely, if a spatializer ignores propagation time and simply ramps a gain, no amount of clever EQ will make a fast pass feel real. The recurring theme of this guide — that a spatializer must reproduce the perceptual cues, not merely set an angle — is nowhere sharper than here. The brain has a lifetime of experience correlating approaching objects with rising pitch and energy. Violate that correlation and the scene feels synthetic, no matter how accurate the panning.

This chapter builds the picture from first principles: the everyday phenomenon, the underlying physics (and the crucial fact that only radial velocity matters), why Doppler is such a strong realism cue, how it is implemented with a variable fractional delay line, the interpolation mathematics that make or break the result, the artefacts that arise and how to defeat them, the editorial question of when to use it at all, and a fully worked aircraft flyover. Throughout, the emphasis is on understanding why each mechanism behaves as it does.

The Phenomenon: Everyday Experience and Wavefronts

What you hear

Everyone has heard it. An ambulance approaches with its siren held on a steady note; as it passes and recedes, the pitch drops audibly — not gradually over many seconds, but with a recognisable downward bend concentrated around the moment of closest approach. A racing car at a circuit does the same thing on a grander scale, the engine note swooping from a high snarl to a lower growl as it flashes past the grandstand. A train horn, a low-flying aircraft, even a fast-thrown object close to the ear: all exhibit the same signature.

The key perceptual observations, which any good implementation must honour, are these. First, the shift is continuous while the source moves, but its rate of change is greatest near the closest point. Second, the pitch is shifted upward during approach and downward during recession, with the transition passing through the source's true (unshifted) pitch at the instant of closest approach. Third, the total amount of shift depends on how fast the source moves relative to the speed of sound — a pedestrian's footsteps produce no audible Doppler, a jet produces a dramatic one.

Wavefront compression and expansion

The physical picture is wavefronts. Imagine the source emitting spherical pressure crests at a fixed rate — one crest every $T$ seconds, where $T$ is the period of the emitted tone. If the source is stationary, these crests form concentric spheres, evenly spaced by one wavelength in every direction. The listener intercepts crests at exactly the emission rate, and hears the true frequency.

Now let the source move. Between emitting one crest and the next, the source has travelled a small distance. Each successive crest is therefore launched from a point slightly closer to a listener who is ahead of the source, and slightly farther from a listener behind it. Ahead of the source, the crests pile up: they are spaced more closely than one wavelength, so the listener intercepts them at a higher rate, hearing a raised pitch. Behind the source, the crests spread apart, are intercepted at a lower rate, and the pitch falls.

It is worth stressing what is not happening. The source has not changed the note it is playing; the emission period $T$ is constant. The medium has not changed the speed of sound. What changes is the geometry of arrival — the distance the sound must travel is shrinking or growing, and a shrinking distance delivers crests early while a growing distance delivers them late. This "delivery timing" framing is exactly the one we will exploit when we implement Doppler as a variable delay: the pitch shift is a side effect of a continuously changing travel time, nothing more.

Rule of thumb

A useful first intuition for magnitude: the fractional change in pitch is, to good approximation, the source's speed-toward-you divided by the speed of sound. A source approaching at $34.3$ m/s (about $123$ km/h), against a sound speed of $343$ m/s, compresses wavelengths by ten percent, raising a $1000$ Hz tone to roughly $1100$ Hz by this rule (the exact source-moving figure is about $1111$ Hz). We will make this precise next.

The Physics: Speed of Sound, Geometry, and Radial Velocity

The speed of sound and its temperature dependence

Doppler is a ratio of velocities, and the reference velocity is the speed of sound in air, $c$ . In dry air at $20$ °C, $c$ is approximately $343$ m/s. It is not a universal constant: it depends almost entirely on temperature (and very weakly on humidity and pressure for normal conditions). A convenient working relationship is

$c \approx 331.3 + 0.606 \cdot \theta$

(metres per second, with $\theta$ in degrees Celsius).

So at $0$ °C, $c \approx 331$ m/s; at $20$ °C, $c \approx 343$ m/s; at $35$ °C on a hot stage, $c \approx 352$ m/s. The variation across a realistic range is a few percent. For Doppler this matters only at the margins — a few percent change in $c$ changes the computed shift by a few percent — but for a system that also renders propagation delay for time alignment (see calibration, discussed in Part V), getting $c$ right to the temperature is more consequential, because a 3 percent error in $c$ is a 3 percent error in every modelled delay.
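
The working relationship above is simple enough to keep as a helper. The sketch below evaluates it at the three temperatures just quoted and shows how the modelled delay over a fixed distance moves with temperature; the function names `speed_of_sound` and `delay_seconds` are illustrative, not taken from any particular renderer.

```python
def speed_of_sound(theta_celsius: float) -> float:
    """Approximate speed of sound in dry air (m/s) from temperature in deg C."""
    return 331.3 + 0.606 * theta_celsius


def delay_seconds(distance_m: float, theta_celsius: float = 20.0) -> float:
    """Propagation delay over distance_m at the given air temperature."""
    return distance_m / speed_of_sound(theta_celsius)


for theta in (0.0, 20.0, 35.0):
    c = speed_of_sound(theta)
    ms = delay_seconds(34.3, theta) * 1000.0
    print(f"{theta:5.1f} degC  c = {c:6.1f} m/s  delay over 34.3 m = {ms:6.2f} ms")
```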

For the rest of this chapter we use $c = 343$ m/s unless stated otherwise.

Three cases: source moving, listener moving, both

The Doppler shift depends on who is moving relative to the medium. There are three textbook cases, and a spatializer may need any of them — a moving virtual source in a static scene, a static source heard from a moving listener (a first-person flythrough), or both at once (a chase).

Let $f_0$ be the emitted frequency and $f$ the received frequency. Let $v_s$ be the speed of the source along the line toward the listener (positive when approaching), and $v_l$ the speed of the listener along the line toward the source (positive when approaching). The general relationship is

$f = f_0 \cdot \frac{c + v_l}{c - v_s}.$

Read the structure carefully, because the asymmetry is physical, not a quirk of notation:

Source moving, listener still ( $v_l = 0$ ): $f = f_0 \cdot \dfrac{c}{c - v_s}$ . The moving source compresses the wavelengths in the medium itself; the denominator does the work.
Listener moving, source still ( $v_s = 0$ ): $f = f_0 \cdot \dfrac{c + v_l}{c}$ . The wavelengths in the medium are unchanged; the listener simply sweeps through them faster or slower; the numerator does the work.
Both moving: the full expression.

For everyday speeds ( $v_s$ , $v_l$ both far below $c$ ) the two single-mover cases give almost identical results; the difference only becomes significant as speeds approach $c$ . A source approaching at $30$ m/s raises a $1000$ Hz tone to $1096$ Hz (source-moving) versus $1087$ Hz (listener-moving): close, but not identical. As $v_s$ reaches and then exceeds $c$ , the source-moving denominator goes to zero and then negative, which is the mathematics announcing a shock front; that regime is outside normal audio rendering.
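
As a quick check of the general relationship, here is a minimal sketch that evaluates it for the two single-mover cases at $30$ m/s. The function name `doppler_frequency` and its keyword arguments are assumptions made for illustration; the formula is the one given above.

```python
def doppler_frequency(f0, v_source=0.0, v_listener=0.0, c=343.0):
    """Received frequency for speeds measured along the source-listener line.

    v_source and v_listener are positive when that party is approaching
    the other, matching the sign convention of the text.
    """
    if v_source >= c:
        raise ValueError("source speed toward the listener must stay below c")
    return f0 * (c + v_listener) / (c - v_source)


print(round(doppler_frequency(1000.0, v_source=30.0), 1))    # source moving
print(round(doppler_frequency(1000.0, v_listener=30.0), 1))  # listener moving
```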

Only radial velocity matters

Most often misunderstood

This is the single most important geometric fact for implementing Doppler, and the one most often misunderstood. The frequency shift depends only on the rate at which the source–listener distance changes — the radial (line-of-sight) component of velocity. The tangential component — motion across your field of view, neither toward nor away — contributes zero shift at the instant it is purely tangential.

The physical reason is direct from the wavefront picture: pitch shift comes from crests being delivered early or late, which happens only when the distance is shrinking or growing. A source sliding past at constant distance (a perfect circle around your head) delivers every crest at the same travel time and produces no Doppler at all, despite moving at full speed.

Decompose the velocity. If the source moves with speed $v$ and the angle between its velocity vector and the line of sight to the listener is $\varphi$ , then the radial speed is $v_r = v \cdot \cos(\varphi)$ . At closest approach $\varphi = 90^\circ$ , $\cos(\varphi) = 0$ , and $v_r = 0$ — the instantaneous shift is zero. This is precisely why the pitch passes through the true value at the moment of closest approach. Long before closest approach, the source is almost heading straight at you, $\varphi$ is small, $\cos(\varphi) \approx 1$ , and you get nearly the full approaching shift. Long after, $\varphi \approx 180^\circ$ , $\cos(\varphi) \approx -1$ , and you get nearly the full receding shift. The transition between them is governed by how $\cos(\varphi)$ swings from $+1$ through $0$ to $-1$ — and how fast it swings depends on the miss distance, as we will quantify in the flyover example.
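
In a renderer the angle $\varphi$ is rarely computed explicitly; projecting the velocity vector onto the line of sight gives the same $v \cdot \cos(\varphi)$ directly. The sketch below assumes 3-D positions and velocities as plain tuples, and the name `radial_approach_speed` is hypothetical.

```python
import math


def radial_approach_speed(source_pos, source_vel, listener_pos=(0.0, 0.0, 0.0)):
    """Speed of the source along the line toward the listener (m/s).

    Positive when the source is approaching, i.e. v * cos(phi).
    """
    # Line of sight: vector from the source to the listener.
    los = [l - s for s, l in zip(source_pos, listener_pos)]
    dist = math.sqrt(sum(x * x for x in los))
    if dist == 0.0:
        return 0.0  # direction is undefined when source and listener coincide
    # Dot product of velocity with the unit line-of-sight vector.
    return sum(v * x for v, x in zip(source_vel, los)) / dist


print(radial_approach_speed((-100.0, 0.0, 0.0), (20.0, 0.0, 0.0)))  # heading straight in
print(radial_approach_speed((0.0, 5.0, 0.0), (20.0, 0.0, 0.0)))     # purely tangential
print(radial_approach_speed((100.0, 0.0, 0.0), (20.0, 0.0, 0.0)))   # heading straight away
```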

A numeric example: a 60 km/h pass

Take a car sounding a horn at $f_0 = 500$ Hz, driving in a straight line on a road, and a listener standing some distance off to the side. The car's speed is $60$ km/h $= 16.67$ m/s. Use $c = 343$ m/s, source-moving case.

Far up the road, while the car is still essentially heading toward the listener ( $\varphi \approx 0$ , $v_r \approx +16.67$ m/s):

$f = 500 \cdot \frac{343}{343 - 16.67} = 500 \cdot \frac{343}{326.33} = 525.5 \text{ Hz}.$

Far down the road, receding ( $v_r \approx -16.67$ m/s):

$f = 500 \cdot \frac{343}{343 + 16.67} = 500 \cdot \frac{343}{359.67} = 476.8 \text{ Hz}.$

So a listener hears the horn swing from about $525$ Hz down to about $477$ Hz — a total drop of roughly $49$ Hz, or about $1.7$ semitones (the ratio $525.5/476.8 = 1.102$ , and $12 \cdot \log_2(1.102) \approx 1.68$ semitones). That is a clearly audible, musically significant bend, and it is concentrated around the moment the car draws level. At the instant of closest approach the radial velocity is zero and the listener momentarily hears the true $500$ Hz.

Note the slight asymmetry baked into the source-moving formula: the upward shift ( $+25.5$ Hz) is a touch larger than the downward shift ( $-23.2$ Hz), because the ratio $c/(c - v_s)$ lies further above $1$ than $c/(c + v_s)$ lies below it. For most audio purposes this asymmetry is negligible, but it is real and a physically correct renderer reproduces it automatically.
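
The arithmetic of the 60 km/h pass, including the conversion to semitones, can be reproduced in a few lines. This is only a sketch of the calculation above; the helper names `source_moving` and `semitones` are illustrative.

```python
import math

C = 343.0        # speed of sound, m/s
F0 = 500.0       # horn frequency, Hz
V = 60.0 / 3.6   # 60 km/h in m/s


def source_moving(f0, v_approach, c=C):
    """Source-moving Doppler; v_approach is negative when receding."""
    return f0 * c / (c - v_approach)


def semitones(ratio):
    """Musical interval, in equal-tempered semitones, for a frequency ratio."""
    return 12.0 * math.log2(ratio)


f_in = source_moving(F0, +V)
f_out = source_moving(F0, -V)
print(f"approach {f_in:.1f} Hz, recede {f_out:.1f} Hz, "
      f"swing {semitones(f_in / f_out):.2f} semitones")
```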

Why Doppler Is a Strong Motion Cue

Doppler as a velocity estimator the brain trusts

The auditory system is extraordinarily good at extracting motion from sound, and Doppler is one of its richest inputs. Unlike a level change — which is ambiguous, since a sound can get louder by approaching or by simply being turned up — a pitch glide that bends downward through a stable centre frequency is an almost unambiguous signature of an object passing by. The brain reads the sign and rate of the glide as a direct report of radial velocity. Combined with the binaural sweep (the source crossing from one side to the other) and the level swell, it produces an irresistible impression of a physical object with mass and speed.

Crucially, Doppler conveys information the other cues cannot. Two passes can have identical level envelopes and identical panning trajectories yet feel utterly different because one has the pitch bend and one does not. Listeners describe the Doppler-less version as "flat", "fake", or "sliding", even when they cannot name what is missing. This is the hallmark of a perceptual cue: it operates below the level of conscious description, but its absence is instantly detectable.

Interaction with level and panning during a pass

Doppler never acts alone. During a real pass, three cues evolve together in a tightly correlated way, and their correlation is itself a cue:

Level follows the inverse-distance law (see distance and air): roughly $-6$ dB per doubling of distance, peaking at closest approach.
Panning / interaural cues sweep through the source's angular position, fastest near closest approach (the angle changes most rapidly when the source is nearest).
Pitch is highest during approach, passes through true pitch at closest approach, and is lowest during recession.

Key takeaway

The decisive perceptual fact is timing. The level peak, the pan centre-crossing, and the Doppler zero-crossing all coincide at the same instant: the moment of closest approach. The brain expects this synchrony. If a sound designer ramps the gain by hand, sweeps the pan by hand, and adds a pitch bend by hand, any misalignment between the three (the pitch passing through its true value half a second after the level peak, say) reads as wrong even if no listener can articulate why.

The great advantage of a physically modelled renderer is that all three cues are computed from the same moving geometry, so they are automatically synchronous. We will see in the implementation section that this is not a happy accident but a structural property: level, delay (hence pitch), and pan all derive from the same source-to-listener vector at each instant.
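
To make the structural point concrete, the sketch below derives gain, propagation delay, and azimuth from a single source-to-listener vector. The 2-D layout, the axis convention ( $+y$ ahead, $+x$ to the right), the reference distance, and the name `motion_cues` are all assumptions chosen for illustration, not a description of any particular renderer.

```python
import math


def motion_cues(source_xy, listener_xy=(0.0, 0.0), c=343.0, ref_distance=1.0):
    """Return (gain, delay_s, azimuth_deg) from one source-to-listener vector."""
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    r = math.hypot(dx, dy)
    # Inverse-distance law, clamped so gain never exceeds 1.0 up close.
    gain = ref_distance / max(r, ref_distance)
    # Propagation delay; its rate of change is what produces the Doppler shift.
    delay_s = r / c
    # 0 degrees = straight ahead, positive = to the right.
    azimuth_deg = math.degrees(math.atan2(dx, dy))
    return gain, delay_s, azimuth_deg


# A source crossing left to right, 10 m in front of the listener.
for x in (-40.0, -10.0, 0.0, 10.0, 40.0):
    gain, delay_s, az = motion_cues((x, 10.0))
    print(f"x={x:6.1f} m  gain={20 * math.log10(gain):6.1f} dB  "
          f"delay={delay_s * 1000:6.1f} ms  azimuth={az:6.1f} deg")
```

Because all three values come from the same `dx`, `dy`, the level peak, the delay minimum, and the centre crossing land on the same position by construction.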

One subtlety worth flagging: near closest approach, the angular rate is highest (fast pan) at exactly the moment the Doppler rate of change is also highest (fast pitch bend) but the Doppler magnitude is zero. The pitch is changing fastest precisely when it is passing through the unshifted value. This is the auditory "whoop" of a close fast pass, and reproducing the steep slope there is what separates a convincing flyby from a lazy one.

Implementing Doppler: The Variable Fractional Delay Line

The central insight: delay is Doppler

Here is the idea that makes everything practical, and it is beautiful in its economy. You do not compute a frequency shift and apply a pitch-shifting algorithm. Instead, you model the propagation delay — the time $\tau = \text{distance}/c$ for sound to travel from source to listener — and you let it change as the source moves. A changing delay, applied to an audio stream, produces a pitch shift automatically, with no spectral processing of any kind.

Why? Consider a source whose distance to the listener is shrinking. At time $t_1$ the propagation delay is $\tau_1$ ; a moment later at $t_2$ it is $\tau_2$ , with $\tau_2 < \tau_1$ because the source got closer. The sound the listener receives at output time $t$ was emitted at source time $t - \tau(t)$ . As $\tau$ decreases with time, the listener is being fed samples from progressively later points in the source signal: the playback is reading through the source faster than real time. Reading a signal faster raises its pitch. When $\tau$ increases (source receding), playback reads slower than real time, lowering the pitch. The fractional pitch shift is exactly the negative of the rate of change of the delay.

We can make this precise. If the source signal is $x(t)$ and the delay is $\tau(t)$ , the output is $y(t) = x(t - \tau(t))$ . The instantaneous playback-rate factor is $1 - d\tau/dt$ . Since $\tau = r/c$ where $r$ is the source–listener distance, $d\tau/dt = (1/c) \cdot dr/dt = v_{r,\text{recede}}/c$ , where $v_{r,\text{recede}}$ is the rate at which distance is growing (positive when receding). The received frequency is therefore

$f = f_0 \cdot \left(1 - \frac{v_{r,\text{recede}}}{c}\right) = f_0 \cdot \left(1 + \frac{v_{r,\text{approach}}}{c}\right).$

This is the listener-moving form of the Doppler equation, and it is what a simple variable-delay renderer produces. (The exact source-moving form $c/(c - v_s)$ differs only at second order in $v_s/c$ ; a renderer that updates $\tau$ from the true instantaneous distance every sample reproduces the correct shift to the accuracy of that distance model. The small approximation is almost always inaudible.)

The headline

Model the delay honestly and the Doppler is correct by construction.

The read/write pointer picture

Concretely, a delay line is a circular buffer. A write pointer advances by one sample every sample period, depositing the incoming source signal into the buffer. A read pointer trails the write pointer by $D = \tau \cdot f_s$ samples, where $f_s$ is the sample rate; whatever it reads is the output.

When the source moves closer, $\tau$ shrinks, so the desired read delay $D$ shrinks: the read pointer must close the gap on the write pointer, which means it advances by more than one sample per sample on average — it reads through the buffer faster, raising pitch. When the source recedes, $D$ grows, the read pointer falls further behind, advancing by less than one sample per sample, lowering pitch. The write pointer always marches at exactly one sample per sample (that is just the incoming audio); all the Doppler action is in the time-varying gap between the two pointers.

A numeric feel for the magnitudes. At $f_s = 48\,000$ Hz and $c = 343$ m/s, one metre of distance is $D = (1/343) \cdot 48000 \approx 140$ samples of delay. A source approaching at $30$ m/s closes $30$ metres of distance per second, i.e. $30 \cdot 140 = 4200$ samples of delay removed per second. Spread over $48\,000$ output samples, that is $4200/48000 = 0.0875$ of a sample removed per output sample, so the read pointer advances $1.0875$ samples per output sample: a playback rate of $1.0875$ , i.e. a pitch shift of $+8.75$ percent (about $+1.45$ semitones, matching $v/c = 30/343 \approx 0.0875$ ). The fractional part (that $0.0875$ of a sample) is why we need fractional delay: the read pointer almost never lands on an integer sample index, and we must interpolate between stored samples. That interpolation is the subject of the next section, and it is where audible quality is won or lost.
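
The pointer picture translates almost line for line into code. The following is a minimal sketch, assuming a circular buffer with linear interpolation for the fractional read (the simplest of the interpolators discussed in the next section); the class `VariableDelayLine` and the function `demo_pitch` are hypothetical names. The demo feeds a $1000$ Hz sine through a delay that shrinks as a source approaches at $30$ m/s, then estimates the output pitch by counting zero crossings.

```python
import math


class VariableDelayLine:
    def __init__(self, max_delay_samples: int):
        self.size = max_delay_samples + 2
        self.buf = [0.0] * self.size
        self.write = 0

    def process(self, x: float, delay_samples: float) -> float:
        """Write one input sample, read one output sample delay_samples behind it."""
        self.buf[self.write] = x
        d = min(max(delay_samples, 0.0), self.size - 2)  # keep the read inside the buffer
        read = self.write - d                 # real-valued read position
        i = math.floor(read)
        frac = read - i
        a = self.buf[i % self.size]           # sample at or before the read position
        b = self.buf[(i + 1) % self.size]     # next newer sample
        self.write = (self.write + 1) % self.size
        return (1.0 - frac) * a + frac * b


def demo_pitch(fs=48000, f0=1000.0, v=30.0, r0=100.0, c=343.0, seconds=1.0):
    """Estimate the output frequency for a source approaching at v m/s from r0 m."""
    n_total = int(fs * seconds)
    line = VariableDelayLine(int(r0 * fs / c) + 1)
    out = []
    for n in range(n_total):
        r = r0 - v * n / fs                   # distance at this sample
        x = math.sin(2.0 * math.pi * f0 * n / fs)
        out.append(line.process(x, r * fs / c))
    start = n_total // 2                      # skip the initial silence of the empty buffer
    crossings = sum(1 for a, b in zip(out[start:-1], out[start + 1:]) if a < 0.0 <= b)
    return crossings * fs / (n_total - start)


print(f"estimated output pitch: {demo_pitch():.0f} Hz")
```

The estimate should sit close to $1000 \cdot (1 + 30/343)$ Hz, within the resolution of a half-second zero-crossing count.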

Fractional-Delay Interpolation in Depth

Why fractional delay is unavoidable

The read pointer position $D = \tau \cdot f_s$ is a real number, almost never an integer. To produce an output sample we must estimate the buffer's value between two stored samples — at index $n + \text{frac}$ , where $\text{frac}$ is between $0$ and $1$ . The quality of that estimate determines the timbre of every moving source. A bad interpolator dulls high frequencies, adds intermodulation, or — worst — produces audible clicks as the fractional part wraps. Three families dominate: linear, all-pass, and polynomial/sinc (Lagrange, windowed sinc).

Linear interpolation

Linear interpolation reads $y = (1 - \text{frac}) \cdot \text{buf}[n] + \text{frac} \cdot \text{buf}[n+1]$ . It is one multiply-add, trivially cheap, and unconditionally stable. Its defect is a frequency-dependent attenuation: linear interpolation acts as a gentle low-pass filter whose cut depends on $\text{frac}$ . At $\text{frac} = 0.5$ (the worst case) it attenuates the Nyquist frequency completely and rolls off the top octave noticeably; at $\text{frac} = 0$ or $1$ it is exact. Because $\text{frac}$ changes continuously as the source moves, this means the high-frequency loss modulates with the motion — a subtle "breathing" of the treble that, on bright sources like cymbals or engine rasp, can be audible. For many game and ambience uses it is perfectly acceptable; for critical music material it is marginal.

A numeric sense of the damage: at $\text{frac} = 0.5$ , a $12$ kHz component at $f_s = 48$ kHz (one quarter of $f_s$ ) is attenuated by roughly $3$ dB; a component at $f_s/2$ is killed. The roll-off is mild in the audible midband but real at the top.
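
These figures follow from the magnitude response of the two-tap interpolator, which the sketch below evaluates directly. The function name `linear_interp_gain_db` is illustrative; the response is written with the second tap one sample away from the first, and the magnitude is the same whichever neighbour carries the `frac` weight.

```python
import cmath
import math


def linear_interp_gain_db(freq_hz, frac, fs=48000.0):
    """Gain in dB of a two-tap linear interpolator at one frequency."""
    w = 2.0 * math.pi * freq_hz / fs
    h = (1.0 - frac) + frac * cmath.exp(-1j * w)
    mag = abs(h)
    return -math.inf if mag == 0.0 else 20.0 * math.log10(mag)


for f in (1000.0, 6000.0, 12000.0, 18000.0):
    print(f"{f:7.0f} Hz  frac=0.5: {linear_interp_gain_db(f, 0.5):6.2f} dB  "
          f"frac=0.1: {linear_interp_gain_db(f, 0.1):6.2f} dB")
```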

All-pass interpolation

A first-order all-pass filter implements fractional delay with — in principle — flat magnitude response at all frequencies, trading the linear interpolator's amplitude error for a frequency-dependent phase (group-delay) error. This is attractive because the ear is far more tolerant of phase error than of treble loss. The catch is transient response: the all-pass has memory (it is recursive), so when the delay changes rapidly its internal state must "catch up", producing transient ringing or chirping during fast delay changes. It also requires care to avoid discontinuities when the integer part of the delay jumps. All-pass interpolation shines for slowly varying delays (slow Dopplers, modulated delays in reverberation and chorus) and is less ideal for the violent delay changes of a close fast pass.
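
A minimal sketch of the first-order all-pass follows, using the common coefficient choice $a = (1 - d)/(1 + d)$ for a fractional delay $d$, which matches the delay at low frequencies. The class and function names are hypothetical, and the sketch omits a detail practical implementations usually add: moving one sample of the integer delay into the all-pass so that $d$ stays away from zero.

```python
import cmath
import math


def allpass_coeff(frac_delay):
    """Coefficient giving a low-frequency delay of frac_delay samples."""
    return (1.0 - frac_delay) / (1.0 + frac_delay)


class AllpassFractionalDelay:
    """y[n] = a*x[n] + x[n-1] - a*y[n-1], i.e. H(z) = (a + z^-1) / (1 + a*z^-1)."""

    def __init__(self):
        self.x1 = 0.0  # previous input
        self.y1 = 0.0  # previous output (the recursive memory)

    def process(self, x, frac_delay):
        a = allpass_coeff(frac_delay)
        y = a * x + self.x1 - a * self.y1
        self.x1, self.y1 = x, y
        return y


def allpass_response(freq_hz, frac_delay, fs=48000.0):
    """Complex frequency response of the filter above for a fixed delay."""
    a = allpass_coeff(frac_delay)
    z1 = cmath.exp(-1j * 2.0 * math.pi * freq_hz / fs)
    return (a + z1) / (1.0 + a * z1)


for f in (100.0, 5000.0, 15000.0):
    h = allpass_response(f, 0.4)
    w = 2.0 * math.pi * f / 48000.0
    print(f"{f:7.0f} Hz  |H| = {abs(h):.6f}  phase delay = {-cmath.phase(h) / w:.3f} samples")
```

The magnitude stays at unity while the phase delay drifts away from the requested value as frequency rises, which is the trade described above. The `y1` state is also why a rapidly changing `frac_delay` produces transients.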

Polynomial and sinc: Lagrange and windowed sinc

Higher-order interpolators fit a polynomial (Lagrange) or a truncated, windowed ideal-reconstruction kernel (sinc) through several surrounding samples. A 3rd- or 4th-order Lagrange interpolator uses four or five taps and dramatically reduces the high-frequency amplitude error of linear interpolation, at a few more multiply-adds per sample. A windowed-sinc interpolator with, say, 8–32 taps approaches the ideal fractional delay (flat magnitude, linear phase) to whatever accuracy you pay for, and is the choice for high-end music spatialisers and sample-rate conversion. The cost is CPU: an $N$ -tap interpolator is $N$ multiply-adds per source per sample, which multiplies across a scene of dozens of moving objects.
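
For reference, the Lagrange tap weights have a closed form: $h_k = \prod_{m \neq k} (d - m)/(k - m)$ for taps $k = 0 \ldots N$. The sketch below computes them and applies them to a short window of samples. The function names are illustrative, and the delay `d` is measured in samples from the newest tap in the window; accuracy is best when `d` sits near the middle of the window (between 1 and 2 for third order).

```python
def lagrange_coeffs(delay, order=3):
    """Tap weights h[0..order] for a fractional delay of `delay` samples."""
    coeffs = []
    for k in range(order + 1):
        h = 1.0
        for m in range(order + 1):
            if m != k:
                h *= (delay - m) / (k - m)
        coeffs.append(h)
    return coeffs


def lagrange_read(window, delay, order=3):
    """Interpolate from window, where window[k] holds x[n - k] (newest first)."""
    return sum(h * s for h, s in zip(lagrange_coeffs(delay, order), window))


print(lagrange_coeffs(0.25, order=1))   # first order reduces to linear interpolation
print(lagrange_coeffs(1.5, order=3))    # third order, read midway between taps 1 and 2
```

With `order=1` the weights collapse to the linear interpolator, which makes the cost comparison direct: each extra order adds one tap and one multiply-add per output sample.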

The table summarises the trade-offs.

Interpolator	Taps / cost	Magnitude error	Phase error	Behaviour under fast delay change	Typical use
Linear	2, trivial	High at top octave, modulates with $\text{frac}$	Low	Stable, but treble "breathes"	Games, ambience, many-object scenes
First-order all-pass	~2, recursive	Flat (ideal)	Frequency-dependent	Transient ringing if delay changes fast	Slow Doppler, modulated delays, reverb
Lagrange (3rd–4th order)	4–5	Low	Low, near-linear	Stable, clean	General-purpose quality Doppler
Windowed sinc (8–32 tap)	8–32	Negligible	Near-ideal linear	Stable, cleanest	Critical music, mastering-grade renderers

A practical renderer often switches interpolators by context: linear for quiet distant objects, Lagrange or sinc for the loud, fast, foreground object the listener is tracking. The DAM Audio spatialiser RIPL takes the position that the foreground moving object deserves the expensive interpolator and the background does not, allocating CPU where the ear is listening.

Artefacts and Their Mitigation

Even with a good interpolator, a naive variable-delay Doppler can sound broken. The artefacts have specific causes and specific cures, and understanding them is the difference between a renderer that flies and one that buzzes.

Zipper noise

Cause. If the delay value $D$ is updated only once per audio block (say every 64 or 128 samples) and held constant within the block, then stepped between blocks, the read pointer jumps discontinuously at each block boundary. Each jump is a small waveform discontinuity, and a regular train of them at the block rate produces a buzzing tone — "zipper noise" — whose pitch is the block rate (e.g. $48000/128 = 375$ Hz). It is most audible on sustained tonal material and during fast motion (large jumps).

Most important detail

Cure. Per-sample (or heavily oversampled) interpolation of the delay value itself. Rather than holding $D$ constant across the block, ramp it smoothly from its old value to its new value across the block's samples, so the read pointer moves continuously. This is the single most important implementation detail in a Doppler renderer: smooth the control (the delay trajectory), not just the signal (the buffer interpolation). The two are different — sample interpolation handles the fractional read; control smoothing handles the evolution of where to read.
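
A minimal sketch of control smoothing: instead of holding one delay value per block, compute a per-sample ramp from the previous block's delay to the new target and feed each value to the delay line's read. The function name `ramp_delay` and the example numbers are assumptions for illustration.

```python
def ramp_delay(d_prev, d_target, block_size):
    """Per-sample delay values that reach d_target on the block's last sample."""
    step = (d_target - d_prev) / block_size
    return [d_prev + step * (i + 1) for i in range(block_size)]


# Delay falls from 1400 to 1390 samples over one 128-sample block.
ramp = ramp_delay(1400.0, 1390.0, 128)
held_jump = abs(1390.0 - 1400.0)        # stepped control: one 10-sample jump
ramp_jump = abs(ramp[1] - ramp[0])      # ramped control: the same change in 128 small steps
print(f"held: {held_jump:.3f} samples per block edge, ramped: {ramp_jump:.5f} samples per sample")
```

The next block then starts its ramp from `ramp[-1]`, so the read pointer never jumps.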

Clicks and discontinuities

Cause. Two common sources. First, when the integer part of the delay changes and the read pointer crosses a sample boundary, a poorly written interpolator can produce a one-sample glitch. Second, and more damaging, when a source's trajectory is defined by automation breakpoints, the velocity (the slope of distance versus time) can jump instantaneously at a breakpoint — a corner in the position curve. A corner in position is a step in velocity, hence a step in pitch, which the ear hears as a click or a sudden pitch jolt.

Cure. Ensure $C^1$ continuity (continuous position and velocity) of the trajectory. Automation curves should be smoothed so that velocity does not step. Many renderers low-pass the position signal, or fit a spline, so that breakpoints become smooth curves rather than corners. The interpolator must also handle integer-boundary crossings without discontinuity — Lagrange and sinc do this naturally because they read a window that slides smoothly.
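
One simple way to low-pass the position signal is a one-pole smoother, sketched below on a distance curve with a deliberate corner (a constant-speed approach that stops dead). The names `smooth_positions` and `max_velocity_step` and the value of `alpha` are illustrative assumptions; a spline fit is the alternative mentioned above. Note that this kind of smoothing also adds lag to the trajectory.

```python
def smooth_positions(positions, alpha=0.05):
    """One-pole low-pass: each output moves a fraction alpha toward the input."""
    y = positions[0]
    smoothed = []
    for x in positions:
        y += alpha * (x - y)
        smoothed.append(y)
    return smoothed


def max_velocity_step(seq):
    """Largest change in per-step velocity (the second difference) in seq."""
    vel = [b - a for a, b in zip(seq, seq[1:])]
    return max(abs(b - a) for a, b in zip(vel, vel[1:]))


# Distance shrinks by 1 unit per step, then stops abruptly: a corner.
raw = [float(200 - n) for n in range(200)] + [1.0] * 200
smooth = smooth_positions(raw, alpha=0.05)
print(f"raw velocity step: {max_velocity_step(raw):.3f}")
print(f"smoothed velocity step: {max_velocity_step(smooth):.3f}")
```

The velocity step of the raw curve is spread over many updates in the smoothed one, so the pitch glides to its new value instead of jolting.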

Transient smearing

Cause. When the delay changes very fast (a close, high-speed pass), the read pointer's rate departs markedly from one sample per output sample. If a sharp transient (a drum hit, a gunshot) happens to be read during a steep delay ramp, it is stretched or compressed in time, smearing its attack. With a long interpolation kernel (windowed sinc), an extreme delay rate can even read the same transient region twice or skip it, producing a flam or a softened attack.

Cure. Mostly this is physically correct — a real transient emitted by a fast-moving source genuinely arrives time-warped — but excessive smearing from kernel length can be mitigated by capping the delay rate, by reducing kernel order during the fastest portions of a pass, or by accepting it as the price of physical accuracy. In practice the smearing of a fast pass is part of what makes it sound fast, so the goal is to keep it clean (no aliasing, no double-reads) rather than to eliminate it.

Trajectory smoothing as the master control

The thread running through all three artefacts is trajectory smoothness. The distance signal $r(t)$ must be continuous in value and slope; the delay $D(t) = r(t) \cdot f_s / c$ inherits that smoothness; the read pointer moves cleanly; and the artefacts vanish. A spatialiser should therefore treat the geometry update path — how often position is updated and how it is interpolated between updates — as a first-class signal-processing problem, not an afterthought. In game and VR engines, where positions may arrive at the visual frame rate (60–120 Hz) rather than the audio rate, this means interpolating positions up to (at least) the block rate and smoothing them, before they ever touch the delay line.

When to Apply, Attenuate, or Disable Doppler

Physical correctness is not always the goal. Doppler is a tool, and like any tool it is right for some jobs and wrong for others. A mature renderer exposes Doppler as a controllable amount, per object, and a mature engineer knows when to dial it down.

Sound design and flythroughs: full Doppler

For anything where the physical fact of motion is the point — vehicle passes, projectiles, flybys, sci-fi craft, first-person VR and game flythroughs — Doppler should be on and accurate. Here the pitch bend is doing essential storytelling: it tells the listener how fast, how close, and in which direction. Underplaying it makes fast things feel slow and heavy things feel weightless. In automotive audio design, where engine-pass realism is scrutinised by people who know exactly what a real car sounds like going by, full physical Doppler (often layered on a granular or sample-based engine model) is the baseline expectation.

Music and dialogue: attenuate or disable

The opposite pole is pitched musical content and speech.

Detuning danger

If a vocalist is panned on a moving trajectory for an immersive mix, full Doppler will detune the voice — bending it sharp on approach and flat on recession — which is musically intolerable. The same applies to any harmonic instrument: a moving piano that goes out of tune as it moves is a defect, not a feature. For dialogue, Doppler on a moving character can pull the voice off pitch enough to sound seasick.

The standard practice is to disable or heavily attenuate Doppler on tonal foreground content while keeping the other motion cues (level, pan, air, reverb) fully active. The source still clearly moves; it simply does not retune.

This is the cleanest illustration of the chapter's theme. Motion is a bundle of cues, and you can choose which members of the bundle to engage. For a sound effect you want all of them; for a singing voice you want every motion cue except the one that changes pitch.

The "Doppler amount" control and per-object choices

The practical interface is a Doppler amount parameter, typically 0 to 100 percent (or 0 to 1, sometimes higher than 1 for exaggeration), that scales the effective radial velocity fed to the delay-rate computation. At 100 percent the renderer is physically accurate. At 50 percent the pitch swing is halved, a useful compromise for content that is "mostly musical but should feel like it moves". At 0 percent the delay line still renders the static propagation delay (so distance and time alignment are preserved) but the rate of change is frozen out, eliminating pitch shift entirely. Note the subtlety: turning Doppler "off" should not mean removing the delay; it should mean stopping the delay from changing pitch, which is usually implemented by smoothing the delay so aggressively that its rate of change stays too small to produce an audible pitch shift, or by computing the output assuming zero radial velocity.
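
One way to realise such a control is to scale each change in distance before it reaches the delay line, as in the sketch below. This is an illustrative design, not a description of any specific product: the names `playback_rate` and `scaled_delay_trajectory` are hypothetical, and with `amount` below 1 the rendered delay gradually drifts away from the true propagation delay, a trade-off a real renderer would need to manage.

```python
def playback_rate(v_approach, amount=1.0, c=343.0):
    """Read-pointer rate for a given approach speed and Doppler amount."""
    return 1.0 + amount * v_approach / c


def scaled_delay_trajectory(distances_m, amount, fs=48000.0, c=343.0):
    """Per-sample delay (in samples) with distance changes scaled by amount."""
    d = distances_m[0] * fs / c          # start from the true static delay
    out = [d]
    for r_prev, r_next in zip(distances_m, distances_m[1:]):
        d += amount * (r_next - r_prev) * fs / c
        out.append(d)
    return out


for amount in (0.0, 0.5, 1.0, 1.5):
    print(f"amount {amount:3.1f}: rate at 30 m/s approach = {playback_rate(30.0, amount):.4f}")
```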

Per-object control is essential because a single immersive scene routinely mixes both kinds of content: a dialogue stem (Doppler off), a passing vehicle effect (Doppler full), and a music bed (Doppler off) may all be live at once. The renderer must let each object carry its own Doppler-amount setting. The object-based workflow makes this natural, since each object is an independent entity with its own metadata; a channel-bed or Ambisonic submix, by contrast, has already baked its spatialisation and is usually left Doppler-free.

Content type	Recommended Doppler	Rationale
Vehicle / projectile / flyby SFX	Full (100%)	Pitch bend is the realism cue
First-person VR / game flythrough	Full	Listener motion must feel physical
Foley, footsteps, impacts (moving)	Full but often small (low speed → small shift)	Physically correct, rarely intrusive
Sung vocal / harmonic instrument (moving)	Off or low (0–25%)	Avoid audible detuning
Dialogue	Off	Avoid pitch instability on speech
Music bed / pre-rendered stems	Off	Already mixed; would detune
Stylised / hyperreal SFX	Above 100%	Deliberate exaggeration for impact

Propagation Delay, Distance Rendering, and Doppler: One Delay Line

It is worth making explicit that Doppler and distance rendering are not two systems but two readings of one. The same variable delay line that produces Doppler also produces the propagation delay that aligns a source in time and supports distance perception.

Static propagation delay

Even a stationary source has a propagation delay $\tau = r/c$ . At $r = 34.3$ m, $\tau = 0.1$ s — a tenth of a second of latency between emission and arrival. In a multi-source scene this time-of-flight matters: two sources at different distances arrive at different times, and reproducing those relative delays contributes to depth and to the precedence effect that locks localisation onto the first arrival. The delay line provides this for free; it is the constant part of $\tau$ .

Doppler as the time-derivative of the same delay

Doppler is simply what happens when that $\tau$ changes. There is no separate "Doppler module" — there is a delay line whose length is set by distance, and the pitch shift is the rate of change of that length. This unity has a profound practical consequence we noted earlier: because level (from inverse-distance), air absorption high-frequency roll-off (from distance), pan (from angle), and Doppler (from rate-of-change of distance) all derive from the same evolving source-to-listener vector, they are automatically consistent and synchronous. A renderer that computes distance once per sample and feeds it to a gain, an air-absorption filter, a panner, and a delay line gets all four cues correlated by construction — which, as we argued in the motion-cue section, is exactly what the brain demands.

The reverberation interaction

Distance is also rendered through the direct-to-reverberant ratio: closer sources have more direct sound relative to reverberation. A complete moving-source renderer therefore drives the reverb send from the same distance signal.

A subtlety worth flagging

The direct path Dopplers (it has a single, well-defined changing length), but the reverberant field generally should not pitch-shift wholesale, because it is the sum of countless paths of differing, slowly varying lengths whose net effect is a diffuse wash rather than a coherent tone. Applying full Doppler to a reverb tail produces an unnatural "swept" reverb. Best practice is to Doppler the direct sound and the early reflections (which have defined geometry) but leave the late diffuse tail unshifted — another case where physical nuance beats naive uniform processing.

A Full Worked Example: Aircraft Passing Overhead

Now we assemble everything into a single concrete scene and quantify it. A light aircraft flies straight and level at altitude $h = 300$ m, ground speed $v = 80$ m/s ( $288$ km/h), engine tone (idealised) at $f_0 = 120$ Hz. A listener stands directly under the flight path. Use $c = 343$ m/s. We track the source as it approaches from far ahead, passes directly overhead, and recedes.

Geometry and the three phases

Let $t = 0$ be the instant of closest approach (directly overhead). At time $t$ the horizontal position of the aircraft is $x = v \cdot t$ , so the slant distance is $r(t) = \sqrt{h^2 + (v \cdot t)^2}$ and the propagation delay is $\tau(t) = r(t)/c$ . The radial velocity (rate of change of distance) is $v_r(t) = dr/dt = v^2 \cdot t / r(t) = v \cdot (x/r)$ , which is $v \cdot \cos(\varphi)$ where $\varphi$ is the angle between the flight direction and the line from the listener to the aircraft. This is exactly the radial-velocity decomposition from the physics section.
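For readers who want the intermediate step, the radial velocity follows from differentiating the squared slant distance with respect to time, using only the symbols already defined:

$r^2 = h^2 + v^2 t^2 \;\Rightarrow\; 2 r \cdot \frac{dr}{dt} = 2 v^2 t \;\Rightarrow\; v_r = \frac{dr}{dt} = \frac{v^2 \cdot t}{r} = v \cdot \frac{x}{r}.$

Since $|x| \le r$ always, $|v_r|$ never exceeds $v$ , and it reaches zero exactly when $x = 0$ .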

Far approach ( $t$ very negative, aircraft far ahead and low on the horizon): $x$ is large and negative, $r \approx |x|$ , so $x/r \approx -1$ and the aircraft is approaching at nearly full speed, $v_r \approx -v = -80$ m/s (distance shrinking). Received frequency:

$f \approx f_0 \cdot (1 + v/c) = 120 \cdot (1 + 80/343) = 120 \cdot 1.233 = 148.0 \text{ Hz}.$

(The exact source-moving form gives $120 \cdot 343/(343-80) = 156.5$ Hz; the two diverge here because $80$ m/s is a non-trivial fraction of $c$ . A per-sample delay line driven by the current distance reproduces the listener-form $148$ Hz; for engine sound this is entirely convincing, and the difference is a sound-design choice.)

Overhead ( $t = 0$ ): the aircraft is directly above, velocity purely tangential, $x = 0$ , so $v_r = 0$ and the listener hears the true $f = 120$ Hz. This is also the moment of peak level (minimum distance $r = 300$ m) and the moment the pan/elevation cue sweeps through the top of the head fastest.

Far recession ( $t$ very positive): $x/r \approx +1$ , $v_r \approx +80$ m/s (distance growing):

$f \approx f_0 \cdot (1 - v/c) = 120 \cdot (1 - 80/343) = 120 \cdot 0.767 = 92.0 \text{ Hz}.$

So the engine tone glides from about $148$ Hz down through $120$ Hz to about $92$ Hz: a total swing of $56$ Hz, a little more than a minor sixth ( $148/92 = 1.61$ , about $8.2$ semitones). It is a big, dramatic bend, exactly what a low fast aircraft sounds like.
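The endpoint arithmetic is easy to check numerically. The short Python sketch below (function names are illustrative, not from any library) computes the two asymptotic listener-form frequencies and the musical size of the swing in semitones.

```python
import math


def doppler_endpoints(f0, v, c=343.0):
    """Asymptotic received frequencies (listener-form) far before and far after a pass."""
    return f0 * (1.0 + v / c), f0 * (1.0 - v / c)


def interval_semitones(f_high, f_low):
    """Size of a frequency ratio in equal-tempered semitones."""
    return 12.0 * math.log2(f_high / f_low)


f_approach, f_recede = doppler_endpoints(120.0, 80.0)
swing = interval_semitones(f_approach, f_recede)
print(f"{f_approach:.1f} Hz -> {f_recede:.1f} Hz, {swing:.1f} semitones")
```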

How fast does the bend happen — the role of miss distance

The steepness of the glide near overhead is governed by the altitude $h$ (the miss distance). The radial velocity changes from near $-80$ to near $+80$ m/s over a time window of order $h/v$ seconds. Here $h/v = 300/80 = 3.75$ s. So the bulk of the $148 \to 92$ Hz transition unfolds over roughly four seconds — a smooth, sweeping "neeeyowww". If the aircraft flew at $h = 30$ m instead (a daredevil low pass), the same swing would compress into $30/80 = 0.375$ s — a violent, almost instantaneous pitch drop, the signature "whip" of a very close fast pass. This is why low, close passes sound so much more aggressive than high ones even at the same speed: the endpoints of the Doppler swing are identical, but the rate is ten times higher. The delay line reproduces this automatically, because at low altitude the read pointer must slew through its delay change ten times faster.
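The "ten times faster" claim can be made quantitative. Differentiating $v_r = v^2 t / r$ at $t = 0$ (where $r = h$ ) gives $dv_r/dt = v^2/h$ , so in the listener-form approximation the pitch slope at closest approach is $-f_0 v^2 / (c \cdot h)$ . The Python sketch below evaluates it for both altitudes; the function name is an assumption for illustration.

```python
def peak_glide_rate(f0, v, h, c=343.0):
    """Pitch slope in Hz/s at closest approach for a straight pass at miss distance h.

    Uses dv_r/dt = v**2 / h at t = 0 and the listener-form shift f = f0 * (1 - v_r / c).
    """
    return -f0 * v * v / (c * h)


high_pass = peak_glide_rate(120.0, 80.0, 300.0)  # the 300 m flyover
low_pass = peak_glide_rate(120.0, 80.0, 30.0)    # the 30 m low pass
print(f"300 m: {high_pass:.2f} Hz/s, 30 m: {low_pass:.2f} Hz/s")
```

The slope scales as $1/h$ , so dividing the miss distance by ten multiplies the steepest part of the glide by ten while leaving the endpoints untouched.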

Simultaneous level and pan behaviour

The table below gives the three cues at five instants. Level uses the inverse-distance law referenced to the overhead distance ( $0$ dB at $r = 300$ m), falling $6$ dB per doubling of distance: $\text{level}(t) = -20 \cdot \log_{10}(r/300)$ dB. Air absorption (treble loss) tracks distance similarly. Pan is described by elevation/azimuth.

Phase	$t$ (s)	$x$ (m)	$r$ (m)	$v_r$ (m/s)	Pitch (Hz)	Level (dB rel. overhead)	Spatial position
Far approach	$-15$	$-1200$	$1237$	$-77.6$	$\sim 147$	$-12.3$	Low, ahead, far
Mid approach	$-3.75$	$-300$	$424$	$-56.6$	$\sim 140$	$-3.0$	Rising ahead
Overhead	$0$	$0$	$300$	$0$	$120$	$0$	Directly above
Mid recession	$+3.75$	$+300$	$424$	$+56.6$	$\sim 100$	$-3.0$	Falling behind
Far recession	$+15$	$+1200$	$1237$	$+77.6$	$\sim 93$	$-12.3$	Low, behind, far

Read across the "overhead" row: pitch passes through $120$ Hz, level peaks ( $0$ dB, its maximum), and the source reaches its highest elevation (straight up), where its angular position sweeps fastest. All three events coincide, and it is that coincidence, delivered by a single distance-driven engine, that makes the flyover read as a real aircraft rather than three separate automation curves that happen to be playing together. Note also that the level column is symmetric about $t = 0$ and the pitch deviations from $120$ Hz are mirror images, because we used the symmetric listener-form shift and a symmetric geometry; with the exact source-form the approach pitch would sit a little higher, mirroring the 60 km/h car example.

A final perceptual note: the air absorption (high-frequency roll-off with distance, covered in distance and air) is most pronounced in the far-approach and far-recession phases and lifts to its brightest at overhead. So as the aircraft approaches it gets louder, brighter, and higher in pitch together, then quieter, duller, and lower together as it leaves — a triple-correlated gestalt the brain reads instantly as "a real thing went over my head".
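To check the figures, or to regenerate them for a different altitude or speed, the formulas of this section can be evaluated directly. The Python sketch below does so at the same five instants, using the listener-form shift $f = f_0 (1 - v_r/c)$ and the inverse-distance level; `flyover_row` is a hypothetical helper name.

```python
import math


def flyover_row(t, h=300.0, v=80.0, f0=120.0, c=343.0):
    """Return (x, r, v_r, pitch_hz, level_db) for a straight, level pass at time t."""
    x = v * t                             # horizontal position, m
    r = math.hypot(h, x)                  # slant distance, m
    v_r = v * x / r                       # radial velocity, m/s (negative = approaching)
    pitch = f0 * (1.0 - v_r / c)          # listener-form Doppler shift
    level_db = 20.0 * math.log10(h / r)   # 0 dB at closest approach
    return x, r, v_r, pitch, level_db


for t in (-15.0, -3.75, 0.0, 3.75, 15.0):
    x, r, v_r, pitch, level = flyover_row(t)
    print(f"t={t:+6.2f} s  x={x:+7.0f} m  r={r:6.0f} m  "
          f"v_r={v_r:+6.1f} m/s  f={pitch:6.1f} Hz  level={level:+5.1f} dB")
```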

Limits and Pitfalls

CPU and many-object scenes

Every Dopplered object needs its own delay line, its own per-sample delay-trajectory smoothing, and its own fractional interpolator. A windowed-sinc interpolator at $16$ taps costs $16$ multiply-adds per sample per object; a scene with $64$ moving objects at $f_s = 48$ kHz costs $64 \cdot 16 \cdot 48000 \approx 49$ million interpolation MACs per second for Doppler alone, before panning, filtering, and reverb. This forces engineering trade-offs: cheaper interpolators on quiet/distant objects, shared delay lines for clustered objects, or capping the number of simultaneously Dopplered sources. Memory matters too — a long maximum delay (for distant sources) means a large circular buffer per object; at $c = 343$ m/s and $f_s = 48$ kHz, a $343$ m maximum range needs $343/343 \cdot 48000 = 48000$ samples (one second) of buffer per object.
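The budget arithmetic above generalises to any scene size. The Python sketch below reproduces it; the 4-byte (32-bit float) sample size used for the memory figure is an assumption, not something the text specifies.

```python
import math


def doppler_budget(num_objects, taps, fs, max_range_m, c=343.0, bytes_per_sample=4):
    """Rough per-scene cost of Doppler rendering.

    Returns (interpolation MACs per second, buffer samples per object, total buffer bytes).
    bytes_per_sample=4 assumes 32-bit float audio, which is an illustrative choice.
    """
    macs_per_second = num_objects * taps * fs
    buffer_samples = math.ceil(max_range_m / c * fs)
    total_bytes = num_objects * buffer_samples * bytes_per_sample
    return macs_per_second, buffer_samples, total_bytes


macs, samples, total_bytes = doppler_budget(64, 16, 48000, 343.0)
print(f"{macs / 1e6:.1f} M MAC/s, {samples} samples per object, {total_bytes / 1e6:.1f} MB total")
```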

Automation smoothness is the silent killer

The most common cause of bad-sounding Doppler in practice is not the interpolator — it is jerky position data. Automation drawn with linear segments produces velocity steps (clicks); position updates arriving only at the video frame rate (e.g. 90 Hz in VR) produce a stair-stepped delay trajectory unless interpolated; and overly aggressive smoothing produces lag (the Doppler arrives late relative to the visual).

The fix is a properly tuned position-smoothing stage: enough smoothing to guarantee $C^1$ continuity and suppress zipper, little enough to keep the audio synchronous with the picture. Getting this latency/smoothness balance right is, in practice, more important to perceived quality than choosing between Lagrange and sinc.
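One simple way to build such a stage, offered as a sketch rather than a recommendation, is to hold each control-rate distance update and pass it through two cascaded one-pole low-pass filters at audio rate. A single one-pole would still leave a slope jump at every update; the second stage removes it. The function name and the 20 ms time constant are illustrative assumptions, and the time constant is precisely the latency/smoothness trade-off described above.

```python
import math


def smooth_distance(updates, hold, fs, time_constant_s=0.02):
    """Upsample held control-rate distance updates to audio rate.

    updates: distances in metres, one per control tick.
    hold: audio samples per control tick (e.g. about 533 for 90 Hz at 48 kHz).
    Two cascaded one-pole low-passes keep the per-sample velocity free of steps.
    """
    a = 1.0 - math.exp(-1.0 / (time_constant_s * fs))
    s1 = s2 = updates[0]
    out = []
    for target in updates:
        for _ in range(hold):
            s1 += a * (target - s1)  # first stage: removes the position steps
            s2 += a * (s1 - s2)      # second stage: removes the velocity steps
            out.append(s2)
    return out


# A source receding 0.5 m per tick at a 90 Hz update rate (about 45 m/s).
updates = [10.0 + 0.5 * k for k in range(20)]
smoothed = smooth_distance(updates, hold=533, fs=48000)
largest_step = max(abs(b - a) for a, b in zip(smoothed, smoothed[1:]))
```

The raw held signal jumps 0.5 m in a single sample at each update; `largest_step` shows how far the smoother reduces that. The smoothed output also lags the input, which is the price paid in synchrony with the picture.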

Very fast sources and aliasing

When radial velocity becomes a large fraction of $c$ , the delay changes so fast that the read pointer slews through many samples per output sample. Reading a band-limited buffer at far-from-unity rates can alias if the interpolator's kernel is not adequate, and extreme up-shifts can push content above Nyquist. Renderers cap the maximum radial velocity, or oversample, or switch to higher-quality interpolation during the fastest passages. Supersonic sources ( $|v_r| \ge c$ ) are unphysical for this framework — the delay would have to run backwards — and must be special-cased (shock-front sound design, not Doppler).
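Capping the radial velocity can be done directly on the delay trajectory. With the delay measured in samples, its change per output sample equals $v_r/c$ , so clamping that change bounds the read rate. The Python sketch below shows the idea; the function name and the default limit of half the speed of sound are assumptions for illustration.

```python
def limit_delay_step(prev_delay, target_delay, max_ratio_dev=0.5):
    """Clamp the per-sample change of the delay (in samples).

    The read rate is 1 - (delay change per sample), so this keeps it within
    [1 - max_ratio_dev, 1 + max_ratio_dev], i.e. |v_r| <= max_ratio_dev * c.
    """
    step = target_delay - prev_delay
    step = max(-max_ratio_dev, min(max_ratio_dev, step))
    return prev_delay + step


# A request to jump 2 samples in one step is limited to 0.5 samples.
limited = limit_delay_step(1000.0, 1002.0)
# A gentle change passes through unchanged.
unchanged = limit_delay_step(1000.0, 999.9)
```

A limiter like this trades physical accuracy for safety: during a capped passage the rendered delay briefly lags the true geometry and catches up afterwards.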

Common mistakes

Block-rate delay updates without per-sample smoothing → zipper noise. The most frequent beginner error.
Linear-segment automation → velocity steps → clicks/pitch jolts. Always smooth to $C^1$ .
Treating Doppler as a separate pitch-shifter bolted onto a static panner → cues drift out of sync (pitch bend not aligned with level peak). Drive everything from one distance signal instead.
Dopplering tonal music or dialogue at full strength → audible detuning. Use per-object Doppler amount.
Dopplering the reverb tail uniformly → unnatural swept reverb. Doppler direct + early reflections, not the diffuse tail.
Forgetting temperature when the same engine must also time-align loudspeakers in calibration → propagation-delay error of a few percent.
Removing the delay entirely to "turn off Doppler" → loses static propagation delay and distance time-of-flight. Freeze the rate of change, keep the delay.
Ignoring tangential motion's zero shift → hand-drawn pitch curves that bottom out at the wrong instant (not at closest approach), which reads as physically wrong.

Relevance across domains

The same machinery serves very different fields. In live sound and immersive concert systems, Doppler matters for moving spot-effect objects and for performers tracked on stage, but is almost always disabled on the musical content itself. In automotive audio (both in-car synthesis and exterior pass-by design), Doppler realism on engine and tyre noise is a headline deliverable, scrutinised by trained ears. In game audio, Doppler on projectiles, vehicles, and the first-person listener is standard, with the "Doppler amount" exposed per emitter and CPU budgets driving interpolator choice. In VR/AR, where the listener moves continuously and head-tracking demands per-sample geometry updates, listener-motion Doppler and ultra-smooth trajectory handling are essential to presence — a stuttering Doppler instantly breaks immersion. In every case the underlying engine is identical: one distance-driven variable fractional delay line, honestly modelled, from which Doppler, propagation delay, level, air, and reverb send all flow as facets of the same moving geometry. The DAM Audio object renderer RIPL follows exactly this architecture, exposing per-object Doppler amount while keeping the distance-driven cues synchronous by construction.

The lesson, once more, is the lesson of the whole guide: a spatializer does not place a source at an angle. It reproduces the bundle of perceptual cues that make the placement believable — and for a moving source, the Doppler bend is the cue the ear trusts most to tell it how fast the world is moving.
