Measurement & Calibration

Everything in the earlier parts of this guide — the precedence effect that lets the psychoacoustics of summing localization work, the direct-to-reverberant balance that governs perceived distance and envelopment, the spectral neutrality that amplitude panning assumes — depends on a physical system actually delivering those cues at the listener's ears. A loudspeaker layout that is geometrically perfect on paper can fail completely because one channel arrives 4 ms early, another is 3 dB hot, and a wall reflection at 6 ms smears the phantom image. You cannot hear these faults reliably, and you certainly cannot fix them by ear alone. Measurement is the instrument that turns "it sounds a bit off" into "the left surround is 2.1 ms late and has a +4 dB peak at 63 Hz from a floor bounce."

This chapter is the engineering core of Part V. It builds the entire practice of acoustic measurement up from a single object — the impulse response — and then shows how level, time, and equalization calibration all flow from it. It is the companion to Time Alignment & Phase, Equalization & Room Correction, and Subwoofers & Bass Management: those chapters tell you what to adjust, this one tells you how to see whether the adjustment worked.

Why Measure: Making the Invisible Visible

Calibration is the act of forcing a real system to deliver the assumptions the content was authored under. Cinema mixes assume a known reference SPL; stereo and surround images assume matched levels and coherent arrival; immersive renderers assume each speaker is timbrally neutral at the listening position. Every one of those assumptions is a measurable physical quantity. The human auditory system, by contrast, is an adaptive, context-driven interpreter — superb at recognizing a voice in a noisy room, hopeless at reporting that the response is 2.5 dB down at 4 kHz. Our hearing normalizes spectral tilt within seconds, masks low-level reflections, and cannot resolve absolute level without a reference. These are exactly the things calibration must control.

What ears miss and microphones catch

Three classes of error are nearly invisible to listening yet trivial to measure. First, small time offsets: a 1 ms inter-channel delay shifts a phantom image fully to one side (the precedence effect operates over roughly 1–5 ms), but you would never identify "1 ms" by ear. Second, comb filtering from reflections: two arrivals separated by $\Delta t$ produce notches at $f_n = (2n-1)/(2\Delta t)$ , a dense ripple that the ear partially ignores but that wrecks a measured response. Third, absolute level: humans have no built-in SPL reference, which is precisely why cinema and broadcast mandate a calibrated monitoring level.

Measurement as feedback loop

Think of calibration as a control loop. You set a target (matched levels, flat magnitude to a defined tolerance, aligned arrivals), measure the current state, compute the error, apply a correction (delay, gain, filter), and measure again to verify. Without the measurement step the loop is open — you are guessing. The discipline of measure, adjust, re-measure is what separates engineering from audio folklore.

Key takeaway

Measurement does not replace listening; it grounds it. The goal is a system that measures correctly and sounds right. When the two disagree, the disagreement itself is information — usually that you measured the wrong thing, at the wrong place, or with the wrong window.

The Impulse Response as the Master Measurement

If you could excite a system with a perfect Dirac impulse $\delta(t)$ and record the result at the listening position, the recording — the impulse response $h(t)$ — would contain everything there is to know about that linear, time-invariant (LTI) path. Every other metric in this chapter is a transformation of $h(t)$ .

Why the IR is complete

For an LTI system the output for any input $x(t)$ is the convolution

$y(t) = (x * h)(t) = \int_{-\infty}^{\infty} x(\tau)\, h(t-\tau)\, d\tau .$

Because $\delta(t)$ contains all frequencies at equal amplitude and zero phase, exciting with it and recording the output yields $h(t)$ directly. Knowing $h(t)$ , you can predict the response to any signal. The frequency response is simply its Fourier transform:

$H(f) = \int_{-\infty}^{\infty} h(t)\, e^{-j 2\pi f t}\, dt = |H(f)|\, e^{j\varphi(f)} .$

So magnitude response, phase response, group delay, and the entire time-domain story all live inside one measurement. This is why the impulse response is called the master measurement: capture it well and you can derive frequency response, reverberation time, clarity, early-reflection structure, and arrival time without re-measuring.

Anatomy of a room impulse response

Read along the time axis of a typical room IR and you see three regions. The direct sound is the first arrival, a sharp peak at time $t_d = d/c$ where $d$ is the source-to-mic distance and $c \approx 343$ m/s at 20 °C. For a mic 3 m from a speaker, $t_d = 3/343 \approx 8.7$ ms. Next come discrete early reflections — floor, ceiling, side walls — each a scaled, delayed, filtered copy of the direct sound, typically arriving within 5–80 ms. A reflection off a wall that adds 1.5 m of path length arrives $1.5/343 \approx 4.4$ ms after the direct sound. Finally the reverberant tail: a dense, exponentially decaying stochastic cloud of late reflections whose statistics determine reverberation time and the diffuse-field level.

The energy ratio between the direct peak and the tail is the measurable form of the direct-to-reverberant ratio that controls perceived distance and envelopment. The whole point of room treatment and speaker aiming is to shape this picture: a strong, clean direct arrival, controlled early reflections, and a smooth decay.

Linearity and time-invariance — the fine print

The IR fully describes a system only if it is LTI. Loudspeakers driven into compression or clipping are nonlinear; a moving listener or a fan changing the air temperature makes the system time-variant. Measurement methods differ chiefly in how gracefully they tolerate these violations, which is the subject of the next section.

Measurement Signals and Methods

You cannot generate a true Dirac impulse with enough energy — a real impulse (a pistol shot, a popped balloon) puts almost no energy into the system and is swamped by noise. Instead we excite with a long, energetic, known signal and recover $h(t)$ mathematically. Three families dominate.

Exponential sine sweep (ESS / log sweep) and deconvolution

The modern reference method, due largely to Farina and analyzed rigorously by Müller and Massarani, plays a sine whose frequency rises exponentially from $f_1$ to $f_2$ over duration $T$ :

$x(t) = \sin\!\left[ \frac{2\pi f_1 T}{\ln(f_2/f_1)} \left( e^{\,t \ln(f_2/f_1)/T} - 1 \right) \right].$

The impulse response is recovered by convolving the recording with an inverse filter — the time-reversed sweep with an amplitude envelope that compensates for the sweep's $-3$ dB/octave pink spectrum (energy falls as $1/f$ because the sweep spends logarithmically less time at high frequencies). Deconvolution in this case is a multiplication in the frequency domain:

$H(f) = \frac{Y(f)}{X(f)} .$

The log sweep has two decisive advantages. First, an enormous signal-to-noise ratio: the matched-filter compression of a long sweep concentrates all its energy into the impulse, and lengthening the sweep raises SNR by roughly 3 dB per doubling of $T$ . A 10-second sweep can deliver more than 60–80 dB of dynamic range. Second, and uniquely, the harmonic distortion products fold to negative time — they appear in the deconvolved result before $t=0$ and can simply be windowed away, leaving a clean linear IR. This lets you measure honest loudspeakers at realistic levels despite their distortion.

The price is full LTI sensitivity: any movement, door slam, or HVAC transient during the sweep corrupts the whole measurement, so the sweep must capture a quiet, static condition.

Maximum Length Sequences (MLS)

An MLS is a deterministic pseudo-random binary sequence with a nearly flat (white) spectrum and an autocorrelation that approximates a delta function. The IR is recovered by cross-correlating the response with the sequence, efficiently implemented via the Fast Hadamard Transform. MLS was the workhorse of the 1990s and is computationally cheap and tolerant of stationary background noise (averaging multiple periods improves SNR). Its weaknesses: distortion does not separate out — it spreads as noise across the whole IR — and any time variance (even slow temperature drift over a long average) smears the result. For most purposes the log sweep has displaced it.

Dual-channel FFT transfer function (the reference-vs-measured method)

This is the method behind Smaart and the "live" measurement world, and the one most used for sound-system tuning. You feed any broadband signal — pink noise, or even the program music itself — into the system. Two signals are captured: a reference taken electrically before the speaker (channel 1) and the measured acoustic signal at the mic (channel 2). The transfer function is their ratio, computed from cross- and auto-spectra:

$H(f) = \frac{S_{xy}(f)}{S_{xx}(f)}, \qquad S_{xy}(f) = \overline{X(f)}\, Y(f).$

Because the reference carries the actual stimulus spectrum, the source signal need not be flat — the division cancels it. Averaging many FFT frames suppresses noise and yields, as a bonus, the coherence function (below). A crucial practical detail: the reference must be delay-aligned to the acoustic arrival, or the transfer function shows enormous phase wrapping. Smaart's delay finder estimates $t_d$ (e.g. 8.7 ms for a 3 m throw) and inserts a compensating delay so the two channels line up. The great strength of dual-FFT is that you can measure with music, in a live show, continuously — ideal for tuning under real conditions. Its weakness is that without a controlled stimulus the SNR and low-frequency resolution are poorer than a dedicated sweep.

Method	Recovers IR via	SNR / dynamic range	Distortion handling	Noise / time-variance	Best for
Log sine sweep (ESS)	Inverse-filter deconvolution	Excellent (60–90 dB)	Folds to negative time, removable	Very sensitive — needs static, quiet room	Lab-grade room/loudspeaker IR, RT60, the master measurement
MLS	Cross-correlation (FHT)	Good with averaging	Spreads as noise	Tolerant of stationary noise; intolerant of drift	Quick checks, cheap hardware, legacy
Dual-channel FFT	$S_{xy}/S_{xx}$ , averaged	Moderate, depends on stimulus	Appears as reduced coherence	Tolerant; averages continuously	Live tuning with music/pink noise, time alignment, in-show monitoring

Which method when

Use a log sweep when the room is quiet and still and you want the cleanest possible IR for RT60, clarity, and waterfall analysis. Use dual-channel FFT when you must tune a live rig, align subs to tops with the band on stage, or watch the response change as the room fills. Many engineers do both: a sweep for the reference baseline, dual-FFT for live verification.

Frequency Response, Phase, Coherence, and Decay

From the captured data, software presents several views. Each answers a different question, and reading them together is the skill.

Magnitude response and the smoothing question

The magnitude $|H(f)|$ in dB is the most familiar plot, but a raw room measurement is a jagged mess of comb-filter ripple from reflections. We apply fractional-octave smoothing — averaging the response over a sliding bandwidth of, say, $1/3$ or $1/6$ octave — to reveal the trends the ear actually integrates. The ear's frequency resolution is roughly the critical bandwidth, near $1/6$ octave at mid frequencies, so $1/6$ -octave smoothing is a reasonable visual proxy for perceived tonal balance. But beware: smoothing hides narrow problems. A deep, narrow notch from a single reflection may be inaudible (the ear fills it in) yet a narrow peak from a room mode is very audible. Use light or no smoothing to diagnose the cause, heavier smoothing to judge the audible balance. This distinction drives the choices in Equalization & Room Correction.

Phase and group delay

The phase response $\varphi(f)$ tells you how different frequencies are delayed relative to one another. The physically intuitive form is group delay:

$\tau_g(f) = -\frac{1}{2\pi}\frac{d\varphi(f)}{df},$

the time delay experienced by the envelope of a narrow band around $f$ . A bulk acoustic delay shows as a phase that wraps linearly with frequency (constant group delay) and is removed by the delay-finder alignment. What remains is the system's own phase: crossover phase shifts, the rising group delay of a bass-reflex woofer near tuning, and all-pass behaviour. Phase matters most where two sources sum — at a crossover, or where a sub meets a top — because relative phase determines whether they add or cancel. We use it heavily in Time Alignment & Phase.

Coherence: the trust meter

Coherence $\gamma^2(f)$ , ranging 0 to 1, reports what fraction of the measured output at each frequency is linearly explained by the input:

$\gamma^2(f) = \frac{|S_{xy}(f)|^2}{S_{xx}(f)\, S_{yy}(f)} .$

It is the single most important "is this data trustworthy?" indicator in dual-FFT work. Coherence drops toward 0 wherever the measurement is corrupted: at low frequencies (insufficient averaging, long room decay), at high frequencies (noise, mic movement), in deep nulls (output near the noise floor), and anywhere a strong reflection or distortion adds energy not predictable from the input. A magnitude curve with low coherence is meaningless — never EQ a peak or fill a dip where coherence is poor. High coherence (say above 0.9) over the band you are adjusting is the green light.

Spectrogram, waterfall, and ETC

Time-frequency views show how the response evolves after the direct arrival. The cumulative spectral decay (waterfall) plots magnitude vs frequency vs time, exposing resonances that "ring" — a room mode or a cabinet resonance shows as a ridge that decays slowly. The Energy-Time Curve (ETC), the envelope (magnitude of the analytic signal) of the IR plotted in dB vs time, is the best tool for spotting individual reflections: each reflection is a spike, and its time and level relative to the direct peak tell you which surface to treat and how strongly it will color the sound. A reflection at $-10$ dB and 8 ms is a serious comb-filter and image problem; one at $-25$ dB and 40 ms is mostly benign.

Time-Domain Metrics from the Impulse Response

Once you have a clean IR, the ISO 3382 family of metrics quantifies the room's temporal behaviour. These connect directly to the perceptual quantities in Reverberation and Direct, Diffuse & Envelopment.

Arrival time and propagation delay

The first task is finding $t=0$ — the direct-sound arrival. The IR peak (or the onset, the point where energy first exceeds the noise) gives the propagation delay $t_d = d/c$ . This is the number you use for delaying one loudspeaker to time-align it with another: if the surround array is 4 m from the listener and the mains are 3 m, the surrounds arrive $4/343 \approx 11.7$ ms versus $8.7$ ms, so they are about 3 ms early relative to where they should be — delay the mains, or accept the offset, per the strategy in Time Alignment & Phase.

Reverberation time and the Schroeder integral

RT60 is derived not from the noisy raw decay but from the backward-integrated energy decay curve (Schroeder integration):

$E(t) = \int_{t}^{\infty} h^2(\tau)\, d\tau ,$

which is then expressed in dB and fitted with a straight line. Because a true 60 dB decay is rarely above the noise floor, we fit over a cleaner range and extrapolate: T20 fits from $-5$ to $-25$ dB, T30 from $-5$ to $-35$ dB, each scaled to 60 dB. EDT (Early Decay Time) fits the first $-10$ dB, scaled by 6; because it weights the early decay it correlates better with perceived reverberance than T30 does.

Clarity, definition, and the early/late split

Several ratios partition the IR energy at a boundary time (50 or 80 ms) to quantify the balance between useful early energy and the reverberant tail:

$C_{80} = 10\log_{10}\frac{\displaystyle\int_0^{80\,\text{ms}} h^2(t)\,dt}{\displaystyle\int_{80\,\text{ms}}^{\infty} h^2(t)\,dt}\ \text{(dB)}, \qquad D_{50} = \frac{\displaystyle\int_0^{50\,\text{ms}} h^2(t)\,dt}{\displaystyle\int_0^{\infty} h^2(t)\,dt}.$

$C_{80}$ (clarity, music) and $C_{50}$ (used for speech) describe whether the room favours articulation or blend; $D_{50}$ (Deutlichkeit, definition) ranges 0–1 and predicts speech intelligibility. The 50 ms boundary is the perceptual integration window for speech — reflections within it reinforce the direct sound, those beyond it are heard as echo or reverberance, exactly the precedence-window physics from psychoacoustics.

Metric	IR integration window	Tells you	Typical "good" range
EDT	first $-10$ dB, ×6	Perceived reverberance	matched to room use
T20 / T30	$-5$ to $-25$ / $-35$ dB, scaled	Reverberation time	control room 0.2–0.4 s; concert hall 1.8–2.2 s
$C_{80}$	early/late at 80 ms	Music clarity vs blend	$-2$ to $+4$ dB (music)
$C_{50}$ / $D_{50}$	early/late at 50 ms	Speech intelligibility	$C_{50} > +2$ dB; $D_{50} > 0.5$

RT60 needs SNR and a diffuse field

A reliable T30 requires at least 45 dB of clean decay above the noise floor, so the room must be quiet during the sweep. And RT60 is a statistical property of a diffuse field — in a small, heavily damped control room the decay is not exponential and "RT60" loses meaning; report EDT and the early-reflection structure instead.

Worked example: reading a control-room IR

Suppose a sweep at the mix position yields: direct peak at 8.7 ms (mic 3 m from speaker, consistent with $3/343$ ); a $-9$ dB spike at 13.2 ms; a smooth decay fitting a straight line that crosses $-35$ dB at 0.32 s into the T30 window. The 13.2 ms spike is 4.5 ms after the direct sound — a path-length difference of $4.5\,\text{ms} \times 343\,\text{m/s} \approx 1.54$ m, almost certainly the desk or floor bounce. At $-9$ dB it will comb-filter the response: first notch at $f_1 = 1/(2\times 4.5\,\text{ms}) \approx 111$ Hz, repeating. T30 extrapolates to about $0.32 \times (60/30) = 0.32$ s — wait, T30 is already scaled to 60 dB, so RT60 ≈ 0.32 s, reasonable for a control room. The action items: treat or reposition to kill the 13.2 ms reflection (the EQ cannot fix a comb filter), then re-measure.

Level Calibration

Time and EQ make the system coherent and neutral; level calibration makes it correct in absolute terms and balanced across channels. This is where a multichannel system earns the right to reproduce a mix as authored.

SPL, weighting, and the reference

Sound Pressure Level is

$L_p = 20\log_{10}\frac{p_{\text{rms}}}{p_{\text{ref}}}\ \text{dB}, \qquad p_{\text{ref}} = 20\ \mu\text{Pa}.$

A measurement microphone and meter report this, but the frequency weighting matters enormously. A-weighting mimics the ear's reduced low-frequency sensitivity at low levels and is used for noise and loudness; C-weighting is nearly flat across the audio band and is the cinema/calibration standard because it includes bass energy; Z-weighting (zero, flat) is the unfiltered reference. Equally important is the time response: slow (1 s averaging) for steady pink noise, fast for transients, and Leq (energy-equivalent level over an interval) for the most repeatable broadband calibration. A cinema channel is calibrated with band-limited pink noise read as C-weighted, slow (historically "85 dB C, slow").

Reference monitoring levels and standards

Different domains fix the absolute monitoring level so that a known electrical reference produces a known SPL:

Cinema (SMPTE / Dolby): each screen channel is set so that a pink-noise signal at $-20$ dBFS RMS (the standard reference level) produces 85 dB SPL, C-weighted, slow at the reference listening position. Surround channels are individually set 3 dB lower in some configurations to account for arrays; subs are set for the LFE in-band gain (+10 dB).
ITU-R BS.1116 (critical listening / broadcast): specifies the room, the listener geometry, and a reference reproduction level per channel (commonly 78–85 dB SPL depending on the number of simultaneous channels and program), with strict background-noise and reverberation limits so that small impairments are audible.
The K-System (Katz): ties metering to monitoring. K-20, K-14, and K-12 place the $0$ VU / reference point at $-20$ , $-14$ , or $-12$ dBFS, and you calibrate the monitor so pink noise at that reference reads 83 dB SPL C per channel. K-20 suits wide-dynamic-range film and orchestral work; K-14 broadcast/pop; K-12 loud broadcast.

The unifying idea: pick a dBFS reference, decide what SPL it should produce, and set each channel's gain to make it so. Then the mix engineer's $-20$ dBFS is everyone's 85 dB, and dialogue mixed "at the right level" plays back at the right level everywhere.

The single most common level error

Calibrating the LFE/sub channel like a main channel. The LFE has a standardized +10 dB in-band gain: when you send the same band-limited pink noise, the sub must read about 6 dB higher on a broadband meter than a single screen channel (the exact number depends on the measurement bandwidth and whether you use band-limited pink noise per the Dolby procedure). Set it flat and the bass is grossly under-level; set it by ear and it is usually 6–10 dB too hot. Follow the published procedure and verify with a real-time analyzer.

Worked example: per-channel SPL calibration of a 5.1 system

Goal: each full-range channel produces 85 dB SPL C, slow, from $-20$ dBFS band-limited pink noise at the listening position; LFE set per the +10 dB convention.

Place the calibrated mic at the reference seat, ear height, pointing up (for a diffuse/omni capture) or per the meter's calibration.
Route pink noise at $-20$ dBFS RMS to the Left channel only. Read the meter, C-weighted, slow. Suppose it reads 82.0 dB. The target is 85.0, so raise the Left channel gain by $+3.0$ dB. Re-measure: 85.0 dB.
Repeat for Center, Right, Left Surround, Right Surround. Suppose Center reads 86.5 → cut 1.5 dB; LS reads 83.5 → add 1.5 dB; and so on. Record every trim.
Verify the matching by ear and meter: with all channels individually at 85, the front three should image a centered phantom that does not pull left or right.
LFE: send the LFE reference pink noise; set the sub so it reads the specified amount higher (commonly the in-band level corresponds to the +10 dB LFE gain; using the Dolby band-limited procedure this is typically the channel level read in the 20–80 Hz band plus the offset). Confirm with the analyzer that the sub-to-mains transition is smooth, not a hump.

The end state is a system where a known digital level maps to a known acoustic level on every channel, within ±0.5 dB. That tolerance is audible in imaging: a 1 dB inter-channel error shifts a center phantom noticeably, a 0.5 dB error is the practical floor.

Level-match before you EQ

Always set inter-channel levels before applying any EQ, then re-check levels after, because broadband EQ changes the integrated SPL. The order is: time-align, level-match, EQ, then re-verify level and re-verify time. Iterate until all three are stable.

Microphone Choice and Placement

The microphone is the eye of the whole process; a poor mic or a poor position invalidates everything downstream.

Choosing the microphone

For acoustic measurement you want a calibrated omnidirectional condenser with a flat, known magnitude response and, ideally, a supplied individual calibration file that the software applies to flatten residual deviations. Omni because you want to capture the true sound field including reflections (a cardioid would suppress rear energy and misrepresent the room). Free-field calibrated mics (flat for on-axis sound, like 0° incidence) suit measuring a single loudspeaker on-axis; random-incidence/diffuse-field calibration suits reverberant-field and RT60 work. A measurement mic costs a fraction of a studio mic precisely because it trades character for honesty. Always check it against an acoustic calibrator (a piston-phone producing, e.g., 94 dB at 1 kHz) before SPL-critical work — this sets the absolute reference the entire level calibration depends on.

Placement and orientation

Mic position is geometry. For a single-speaker on-axis response, place the mic at the design listening distance on the acoustic axis. For room/system calibration, place it at the reference listening position, at ear height (about 1.2 m seated). Keep the mic on a stand well away from the floor, console, and your own body — you are a large reflector. Orientation should match the mic's calibration: a free-field mic points at the source; for diffuse measurements a vertical orientation reduces directional bias. Mind that any nearby hard surface within a metre creates a reflection that will comb-filter the high end.

Spatial averaging across the audience

A single point is a single comb-filter fingerprint; move the mic 20 cm and the ripple pattern changes completely, especially above a few hundred Hz. The audience occupies a region, so calibration must represent the region, not one seat. Spatial averaging captures the IR at several positions across the listening area and combines them. The right way to average magnitude responses is power (RMS) averaging:

$\overline{|H(f)|} = \sqrt{\frac{1}{N}\sum_{n=1}^{N} |H_n(f)|^2 } ,$

which represents the average energy a listener experiences and prevents position-specific deep nulls from dominating. A common scheme is a cluster of 5–9 positions: the prime seat plus points roughly ±0.3–0.5 m around it, expanding to cover the front and rear rows for a larger venue. Low frequencies (room modes) vary slowly in space and need wide spacing across the room; high frequencies vary fast and need a tight local cluster. Average modal data from positions spread across the room; average HF data from a tight grid around each seat.

Don't average phase or coherence blindly

Power-averaging magnitude is correct, but you cannot meaningfully power-average phase, and an average that mixes positions with high and low coherence is misleading. Keep individual measurements to diagnose, and use the spatial average only to set broadband targets. For time alignment, measure at the single reference position — delay is a point property.

Tools of the Trade

The mathematics is universal; the software packages differ in workflow, philosophy, and licensing. Knowing which tool fits which task saves hours.

Tool	Core method(s)	Strengths	Typical use	Cost model
REW (Room EQ Wizard)	Log sweep, RTA, dual-channel	Free, superb IR/RT60/waterfall, EQ filter design, sub alignment	Studio/home room measurement, RT60, EQ target design	Free
Smaart (Rational Acoustics)	Dual-channel FFT, transfer function	Live, continuous, multi-mic, the live-sound standard for alignment	Touring/install system tuning, time-align, in-show monitoring	Commercial
ARTA	Sweep, MLS, dual-channel, STEPS	Lab-grade IR + distortion (THD vs level), loudspeaker analysis	Loudspeaker/driver measurement, distortion, impedance	Low-cost license
Open Sound Meter	Dual-channel FFT transfer function	Free, cross-platform, real-time, good for alignment	Budget live tuning, education, multi-platform	Free / open source

A practical division of labour: use REW for the deep reference IR, RT60, clarity metrics, waterfall, and to design EQ filters with its target curve and auto-EQ; use Smaart or Open Sound Meter to tune live — time-align subs to tops with music playing and watch coherence in real time; use ARTA when you need loudspeaker distortion-vs-level or impedance data the others do not provide. They all rest on the impulse response and transfer-function mathematics in this chapter, so the reading skill transfers directly between them.

Trust the physics, not the brand

Every one of these tools computes $H(f) = Y(f)/X(f)$ or deconvolves a sweep. If two tools disagree, the difference is in windowing, smoothing, averaging, or calibration — not in reality. When a measurement surprises you, check the gate length, the smoothing setting, the reference delay, and the mic cal file before believing the result.

A Step-by-Step Calibration Session

Here is an end-to-end session for a multichannel monitoring room, integrating everything above. Treat it as a checklist; the [installation walkthrough in the Workflows part] follows the same arc at venue scale.

1. Prepare and verify the chain

Confirm the speaker layout geometry and that all channels are wired to the correct outputs with correct polarity. Set the room quiet (HVAC ideally off during sweeps). Calibrate the measurement mic with the acoustic calibrator: play 94 dB at 1 kHz, set the software's input gain so it reads 94.0 dB. Load the mic's individual calibration file. This single step underpins every SPL number you will read.

2. Establish the reference position and capture baselines

Place the mic at the reference seat, ear height. Capture a clean log sweep per channel with the room static. Save these as the "before" baseline — you will compare against them to prove the calibration worked. Inspect each IR: find the direct arrival, scan the ETC for early reflections, glance at the raw magnitude and coherence.

3. Treat gross acoustic problems first

If the ETC shows a strong early reflection (say $-8$ dB at 5 ms from the desk), fix it physically — reposition the mic/speaker, add absorption, tilt the desk. No EQ removes a comb filter caused by a reflection, because the cancellation is position- and frequency-dependent. Re-measure after each physical change.

4. Time-align

Using the dual-FFT delay finder or the IR onset, read each channel's propagation delay. Align the subwoofer(s) to the mains at the crossover by adjusting sub delay until the summed response shows maximum addition (and coherence stays high) through the crossover band — the procedure detailed in Time Alignment & Phase and Subwoofers & Bass Management. Worked check: if the sub is 0.5 m closer than the mains, it arrives $0.5/343 \approx 1.5$ ms early — add 1.5 ms of sub delay and confirm the crossover-band notch fills in.

5. Level-match all channels

Run the per-channel SPL calibration from the earlier worked example: $-20$ dBFS pink noise, target 85 dB C slow (or your chosen K-system / BS.1116 reference), trim each channel to within ±0.5 dB, set the LFE per the +10 dB convention. Record every trim value.

6. Equalize to a target

With levels matched, design EQ from the spatially-averaged magnitude. Correct broad tonal trends and tame audible modal peaks with minimum-phase filters; do not fill narrow nulls (they are reflection cancellations that EQ cannot fix and that waste headroom). Apply a gentle target curve appropriate to the room and use case, per Equalization & Room Correction. Only EQ where coherence is high.

7. Re-verify everything and document

Re-measure levels (EQ changed them), re-check time alignment, capture a new "after" sweep per channel. Compare to the baseline: the magnitude should be smoother, the crossover should sum, the inter-channel levels should match. Save the project, the trims, the filter settings, and a photo of the mic position. A calibration you cannot reproduce is a calibration you cannot defend.

Worked example: closing the loop on one channel

Left surround baseline: arrival 11.7 ms (mic 4 m), pink-noise SPL 83.4 dB C, a +5 dB modal peak at 63 Hz (coherence 0.95 there), and a $-10$ dB reflection at 6 ms (path +2 m). Actions: (a) the reflection is physical — add a panel, re-measure, reflection drops to $-20$ dB; (b) align — the mains arrive at 8.7 ms, so LS is 3 ms late; the renderer already accounts for distance, so leave LS and instead confirm the whole bed is consistent; (c) level — raise LS by $+1.6$ dB to hit 85.0; (d) EQ — a $-4$ dB, $Q=4$ filter at 63 Hz tames the mode (coherence high, so it is real and EQ-able). Re-verify: 85.0 dB, response within ±2 dB to 200 Hz, the 63 Hz peak gone. That is one channel; repeat for all.

Common Mistakes, Pitfalls, and Limits

Pitfalls that ruin calibrations

Single-point measurement. One mic position shows one comb-filter fingerprint; EQ-ing its narrow dips "corrects" a problem no one else hears and makes other seats worse. Spatially average.
Ignoring coherence. Adjusting magnitude where $\gamma^2$ is low means you are EQ-ing noise, reflections, or distortion. Always read coherence first.
Wrong reference level or weighting. Mixing up A- and C-weighting, or forgetting the LFE +10 dB, throws the entire balance off. Document the standard you are using.
Gating/windowing errors. A time window (gate) on the IR sets the low-frequency resolution: a window of length $T_w$ cannot resolve below roughly $f_{\min} \approx 1/T_w$ . Gate too short (e.g. 5 ms → no data below 200 Hz) and you miss the bass; gate too long and you include reflections and room decay that the EQ should not chase. Use a long gate for bass, a short gate to see the speaker's own anechoic response.
EQ-ing reflection nulls. Boosting to fill a deep, narrow null caused by a reflection burns headroom and risks the amplifier, and the null moves with mic position anyway. Treat the reflection, do not EQ it.
Forgetting polarity and acoustic vs electrical delay. A reversed-polarity channel shows a deep null at the crossover that no delay fixes; check polarity before chasing phase.
Measuring a nonlinear or time-varying system. Sweeping while someone walks, the AC cycles, or the amp clips violates LTI and corrupts the IR. Quiet, static, and within the system's linear range.

Limits of measurement

Measurement is necessary but not omniscient. A microphone is not a pair of ears: it has no head, no pinnae, no binaural processing, so it cannot directly report perceived spaciousness, image stability, or the way the precedence effect suppresses a reflection. The IR is only valid for the linear, time-invariant part of the system at the point it was captured; the moment the room fills with bodies (changing absorption and temperature) the calibration drifts, which is why dual-FFT verification during a show is valuable. Smoothing and gating are choices, not truths — every displayed curve is a model with assumptions baked in, and different settings tell different stories about the same data. And no measurement resolves the fundamental tension that a small-room low-frequency response is genuinely different at every seat: you can optimize the average, but you cannot make all seats identical with EQ alone — that requires multiple subwoofers and physics, per Subwoofers & Bass Management.

The honest stance is this: measurement removes the guesswork — the inaudible time offsets, the level mismatches, the modal peaks, the reflections — and leaves you a system that is provably coherent, level-matched, and neutral. The final aesthetic judgment, made by ear in the calibrated room, is then a judgment about music, not about whether the left surround is secretly 3 dB hot. That is the whole point: calibration earns you the right to trust your ears.

The one-line summary

Capture a clean impulse response, read its time and frequency stories with coherence as your trust meter, fix reflections physically, then time-align, level-match, and EQ — measuring after every change. Reproducibility and documentation turn a good session into an engineering result.

References

Müller, S., and Massarani, P. (2001). "Transfer-Function Measurement with Sweeps." Journal of the Audio Engineering Society, 49(6), 443–471. The definitive analysis of sweep-based measurement, deconvolution, and distortion separation.
Farina, A. (2000). "Simultaneous Measurement of Impulse Response and Distortion with a Swept-Sine Technique." AES 108th Convention, Preprint 5093. Origin of the exponential sine sweep method.
ISO 3382-1:2009 and ISO 3382-2:2008. Acoustics — Measurement of room acoustic parameters (RT60, EDT, C80, D50, and the engineering-method procedures).
ITU-R BS.1116-3 (2015). Methods for the subjective assessment of small impairments in audio systems, including reference room, geometry, and monitoring-level specifications.
ITU-R BS.775. Multichannel stereophonic sound system with and without accompanying picture (channel layout and level reference for multichannel calibration).
Davis, D., Patronis, E., and Brown, P. (2013). Sound System Engineering, 4th ed. Focal Press. Measurement, gain structure, and acoustic analysis for installed systems.
Toole, F. E. (2017). Sound Reproduction: The Acoustics and Psychoacoustics of Loudspeakers and Rooms, 3rd ed. Routledge. Spatial averaging, smoothing, and what measurements do and do not predict perceptually.
Rational Acoustics, Smaart v9 User Guide, and Room EQ Wizard (REW) Help documentation. Practical dual-FFT transfer-function and impulse-response measurement workflows.

← Back to Systems & Calibration

Why Measure: Making the Invisible Visible​

What ears miss and microphones catch​

Measurement as feedback loop​

The Impulse Response as the Master Measurement​

Why the IR is complete​

Anatomy of a room impulse response​

Measurement Signals and Methods​

Exponential sine sweep (ESS / log sweep) and deconvolution​

Maximum Length Sequences (MLS)​

Dual-channel FFT transfer function (the reference-vs-measured method)​

Frequency Response, Phase, Coherence, and Decay​

Magnitude response and the smoothing question​

Phase and group delay​

Coherence: the trust meter​

Spectrogram, waterfall, and ETC​

Time-Domain Metrics from the Impulse Response​

Arrival time and propagation delay​

Reverberation time and the Schroeder integral​

Clarity, definition, and the early/late split​

Worked example: reading a control-room IR​

Level Calibration​

SPL, weighting, and the reference​

Reference monitoring levels and standards​

Worked example: per-channel SPL calibration of a 5.1 system​

Microphone Choice and Placement​

Choosing the microphone​

Placement and orientation​

Spatial averaging across the audience​

Tools of the Trade​

A Step-by-Step Calibration Session​

1. Prepare and verify the chain​

2. Establish the reference position and capture baselines​

3. Treat gross acoustic problems first​

4. Time-align​

5. Level-match all channels​

6. Equalize to a target​

7. Re-verify everything and document​

Worked example: closing the loop on one channel​

Common Mistakes, Pitfalls, and Limits​

Pitfalls that ruin calibrations​

Limits of measurement​

References​