Education & Research

Teaching spatial audio and doing research with it are two sides of the same coin: both demand that you make perception audible, repeatable, and honest. A classroom that cannot reliably reproduce a phantom image is not teaching — it is asserting. A study whose playback level drifts by a few decibels between sessions is not measuring preference — it is measuring loudness. This chapter is the practitioner's guide to building a lab that does both jobs: a flexible teaching and demo rig, a catalogue of phenomena you can show on demand, the research uses that rig supports, the open formats that make your work shareable, the measurements you will take, and the discipline of running a listening test that survives scrutiny.

This is the applied capstone of the guide's pedagogical arc. Everything earlier — perception, techniques, the room, recording, and systems — converges here, because education and research are precisely the activities of demonstrating and interrogating those topics under controlled conditions.

Teaching spatial audio: the conceptual progression

Why perception comes first

The single most important pedagogical decision is ordering. Students arrive believing that spatial audio is a speaker count: "5.1," "Atmos," "more channels." If you start from formats, you reinforce that error. The fix is to start where the guide starts — with the listener. Spatial hearing is a set of cues the brain computes: interaural time differences (ITD) below roughly $1.5$ kHz, interaural level differences (ILD) above it, spectral cues from the pinna for elevation and front/back, and the precedence (Haas) effect that resolves reflections. None of these require loudspeakers; they are facts about a head. Teach those first and every technique afterward becomes "a way to deliver the cue," not "a box with channels." This mirrors /guide/fundamentals/psychoacopsychoacoustics deliberately.

The progression that works, tested across many cohorts, is:

Perception — how we localize, why phantom images form, what ITD/ILD/spectral cues are, the cone of confusion, the precedence effect, distance and reverberation cues.
Techniques — stereo and amplitude panning as the minimal case, then surround, object-based, Ambisonics, binaural, transaural, and WFS as families that each solve the delivery problem differently.
Rooms and systems — how the room corrupts and enriches the signal, and how systems (layout, time alignment, EQ, bass management, calibration) tame it.

Misconceptions to dismantle

A teacher's real work is demolition. The recurring misconceptions, and the one-line corrections that stick:

"More speakers means more immersion." No — a well-decorrelated stereo pair can out-envelop a badly fed 9.1. Immersion is about cue delivery and decorrelation, not channel count.
"Surround panning moves a sound to a point in the room." Outside the sweet spot, pairwise amplitude panning collapses to the nearest speaker (precedence). The "point" is a sweet-spot illusion.
"Binaural only works on headphones." It is the most accurate technique for one listener and the basis of head-tracked VR; over speakers it becomes transaural with crosstalk cancellation.
"Atmos is a speaker format." It is an object-based scene description rendered to whatever layout exists; the bed is channels, the objects are metadata.
"The room is a problem to remove." Some of the room (lateral reflections, late diffuse energy) is what creates envelopment. You shape it, you do not delete it.

Teach the cue, then the container

When a student asks "how do I pan to the back-left?", answer with the cue first: "What ITD, ILD, and reflection pattern would your ears need to believe a source is back-left?" Then show which technique produces it. Format-first teaching produces technicians; cue-first teaching produces engineers.

The teaching and demo system

Design goal: one rig, many techniques

A teaching lab must demonstrate stereo, surround, Ambisonics, and binaural without rewiring between examples. The design that achieves this is a horizontal ring (or small dome) driven through an Ambisonic bus, with a parallel binaural path on headphones. Ambisonics is the keystone because it is layout-independent: you author once in the B-format domain and decode to whatever speaker arrangement is in the room — 8, 12, or 16 nodes — see /guide/techniques/ambisonics. Change rooms, change the decoder, keep the content. That property is worth more in a teaching context than any single rendering quality, because labs get reconfigured constantly.

A practical baseline: a ring of $8$ identical full-range monitors on a circle of radius $r \approx 1.8$ m at ear height, plus optional $4$ height speakers for a first-order or third-order dome, one or two subwoofers with bass management, and a pair of well-matched open-back headphones for individual binaural study. Eight speakers reproduce first-order Ambisonics (FOA) cleanly and support reasonable third-order (HOA) decoding on the horizontal ring; a $16$ -node dome lets you reach third order in 3D, where the spatial resolution becomes genuinely convincing.

Why both Ambisonics and binaural

The two paths teach complementary truths. Ambisonics over the ring is shared, sweet-spot-dependent, room-coupled — the whole class hears it, but only the center seat hears it correctly, and the room is in the loop. Binaural over headphones is individual, sweet-spot-free, room-free — perfect for one listener, ideal for HRTF study and head-tracking, but it bypasses the room entirely and exposes the non-individualized-HRTF problem (front/back confusion, weak externalization). Teaching both, side by side, is how students internalize the central trade-off of the whole field.

Equipment list

Subsystem	Recommended baseline	Why / notes
Loudspeakers	$8$ matched active monitors (ring) + $4$ height	Identical drivers; matching matters more than absolute quality
Layout	Ring $r \approx 1.8$ m, ear height $\approx 1.2$ m	See /guide/systems/speaker-layouts-and-topologies
Subwoofer(s)	$1$ – $2$ with managed crossover $\approx 80$ Hz	Bass management
Interface	$16$ + channel, low-latency ( $\le 64$ -sample buffer)	Round-trip $< 10$ ms for interactive demos
Decoder / renderer	AllRAD or dual-band (max- $r_E$ ) Ambisonic decoder	Layout-independent authoring
Binaural path	Open-back headphones + SOFA-capable renderer	Individual study, head-tracking
Head tracker	IMU on headphone band (USB/BLE)	For externalization and dynamic-cue demos
Measurement	Calibrated mic (Class 1 or 2), SPL meter	/guide/systems/measurement-and-calibration
Software	DAW + Ambisonic plugin suite, REW/ARTA, test framework	Auralization, IR capture, listening tests

Layout-independence is the lab's superpower

Because the content lives in B-format, the same lesson plays in a $8$ -node teaching room, a $16$ -node research dome, and a student's headphones (binaural decode). You author the demonstration once and it survives every room you ever move it to. This is the practical payoff of Ambisonics for education.

Demonstrable phenomena

The lab earns its keep by making perception audible on command. Below are the canonical demonstrations, each tied to /guide/fundamentals/psychoacoustics, with the procedure that reliably produces the effect.

Phantom imaging and the panning law

Play a mono source (pink noise, then a voice) on a stereo pair subtending $\pm 30°$ . Sweep an amplitude pan from left to right and have students mark where the image sits. Then switch from a linear law to a constant-power ( $-3$ dB at center) law and show that the center image now stays at constant loudness. The teaching point: the phantom center exists only because both ears receive both speakers with a small ITD/ILD; move off-axis and it collapses. Quantify it — the tangent law predicts image angle $\theta$ from gains $g_L, g_R$ :

$\frac{\tan\theta}{\tan\theta_0} = \frac{g_L - g_R}{g_L + g_R}$

where $\theta_0$ is the half-base angle. Have students verify the prediction against where they actually hear it. This is the foundation of amplitude panning.

The precedence (Haas) effect

Feed identical signal to two speakers, left and right, and introduce a delay $\Delta t$ on one. At $\Delta t = 0$ the image is centered. As $\Delta t$ grows to a few milliseconds, the image jumps fully to the earlier speaker even though both are equally loud — localization is dominated by the first wavefront. The image holds on the lead up to roughly $\Delta t \approx 5$ – $30$ ms (the echo threshold, program-dependent), beyond which a discrete echo appears. Then show the trade: you can pull the image back toward the lagging speaker by raising its level — the Haas trade-off, roughly $10$ dB of extra level buys back about $\Delta t \approx 10$ ms of delay. This single demo explains delay-tower design in live sound and why time alignment matters.

The precedence demo only works if levels are matched

If your two speakers differ by even $2$ – $3$ dB before you start, students will attribute the image shift to level, not delay, and learn the wrong lesson. Calibrate both channels to within $\pm 0.5$ dB at the listening position before the demonstration. An uncalibrated rig teaches confounds.

Decorrelation and envelopment

Play the same pink noise correlated (mono) across the ring: the image is a tight, internalized blob. Now pass each channel through a different short all-pass/decorrelation filter so the inter-channel coherence drops. The sound suddenly surrounds the listener — width and envelopment appear with no change in level or spectrum. This is the cleanest demonstration of why diffuse, decorrelated energy creates envelopment and why a stereo recording with low inter-channel coherence already feels spatial — link /guide/techniques/stereo-already-spatial. Measure it: compute the inter-aural cross-correlation coefficient (IACC); envelopment rises as IACC falls.

HRTF, externalization, and head tracking

On headphones, play a static binaural rendering using a generic HRTF. Many students report the source inside or just outside the head, with front/back confusion. Now enable head tracking so the scene counter-rotates as they turn: externalization snaps on and front/back resolves, because the brain gets the dynamic ITD/ILD changes it expects from a real source. Finally, swap in the student's individualized HRTF (or a better-matched one from a SOFA database) and elevation cues sharpen. Three knobs — dynamic cues, individualization, and reverberation — each visibly improve binaural realism. This is the most memorable single demo in the course.

Phenomenon	Stimulus	Manipulation	Perceptual result
Phantom image	Mono voice/noise	Pan law, base angle	Image position / collapse
Precedence	Dual-speaker identical	Delay $0$ – $30$ ms, level trade	Image jumps to lead; echo at threshold
Decorrelation	Pink noise on ring	Inter-channel coherence	Envelopment without level change
Externalization	Binaural on headphones	Head-track on/off, HRTF swap	In-head vs. out-of-head, front/back
Distance	Dry vs. reverberant source	Direct/reverberant ratio	Near vs. far

The distance demo deserves its own line: hold a source's level constant and vary only its direct-to-reverberant ratio and air absorption, and students hear it recede. Distance is a reverberation-and-spectrum cue, not a loudness cue — a fact that surprises nearly everyone.

Research uses

The same rig is a research instrument. Four families of work dominate academic and industrial spatial-audio research, and the teaching lab supports all of them.

Architectural auralization

Auralization is the audible equivalent of architectural visualization: you take a room's geometry, simulate or measure its impulse response, and convolve dry source material with it to hear the space before it is built or to study one that exists. The workflow couples a geometric-acoustics or wave-based model (or a measured Ambisonic room impulse response) with convolution, then decodes the result over the ring or to binaural. Researchers compare predicted metrics — reverberation time $T_{30}$ , clarity $C_{80}$ , early-decay time, lateral fraction — against perceptual ratings of the auralized result. The lab's Ambisonic core is ideal here because a B-format room IR captures the full spatial early-reflection structure and decodes to any playback layout. This connects directly to reverberation and the measurement chapter.

Psychoacoustic listening tests

The lab is most often used to run formal listening tests — the topic of its own section below. The dominant methodology for small impairments is ITU-R BS.1116 (double-blind, triple-stimulus, hidden reference); for intermediate and larger impairments, MUSHRA (ITU-R BS.1534) with its hidden reference and anchors. Spatial-audio variants add localization-accuracy tasks, envelopment and source-width ratings, and preference comparisons across rendering techniques. The research question is almost always "does manipulation X produce a perceptible or preferable difference Y," and the test design exists to isolate X from every confound.

VR/AR and head-tracked binaural

Head-tracked binaural is the engine of immersive VR/AR audio, and the lab studies it directly: latency of the head-to-audio loop, the audibility of HRTF interpolation artifacts, the role of reverberation in plausibility, and the interaction between audio and visual cues. The critical metric is motion-to-sound latency: the total time from head movement to the corresponding change at the ears. Research and platform guidance converge on a budget of roughly

$t_{\text{m2s}} \lesssim 20\text{–}50\ \text{ms}$

above which the scene feels "laggy" and externalization degrades. Measuring and minimizing this latency — tracker rate, rendering buffer, convolution block size — is a standard lab exercise that ties back to the networking and integration discussion of latency budgets.

Source separation and upmixing

A vibrant research area treats the inverse problem: given stereo or mono content, recover or synthesize a spatial scene. This spans blind source separation, primary/ambient decomposition, and perceptually-motivated upmixing to surround and immersive layouts. Because most real-world content is and will remain stereo, doing this well is one of the field's highest-leverage problems — see the applied discussion in /guide/techniques/stereo-already-spatial and the worked upmixing case study in the HSR stereo upmixing post. The lab evaluates these algorithms with the same listening-test machinery: does the upmix improve envelopment without damaging timbre or stability?

Reproducibility and interchange

Why open formats are non-negotiable in research

A result that cannot be reproduced is not a result. In spatial audio, reproducibility is endangered by two specific things: proprietary, layout-locked content and undocumented HRTFs. Solve both with open formats, covered in /guide/fundamentals/formats.

Ambisonics (B-format) is the interchange ideal for scene content. Because it is layout-independent, a stimulus authored and archived in B-format can be reproduced in any lab with any speaker array, or decoded to binaural for headphone replication. Two labs with different speaker rings can run the same study on the same files — impossible with channel-locked surround masters. Use the AmbiX (ACN/SN3D) convention and document the order explicitly, because the rival FuMa convention orders and normalizes components differently and silently breaks decoders.

SOFA (Spatially Oriented Format for Acoustics, standardized as AES69) is the interchange standard for HRTFs and spatial impulse responses. Before SOFA, every lab stored head-related and room impulse responses in incompatible ad-hoc formats, so a binaural study could not be replicated without the original author's exact files and code. SOFA stores the measured responses with full geometric metadata (source/receiver positions, sampling, coordinate system), so an HRTF set is portable and a study citing "SOFA file X from database Y" is genuinely reproducible.

BW64/ADM (Broadcast Wave 64-bit with the Audio Definition Model, ITU-R BS.2076/BS.2088) carries object-based and channel-based scenes with their spatial metadata inside the file. For research that involves object-based content or that must interoperate with broadcast and post-production pipelines, ADM is the archival container — the scene description travels with the audio.

Format	Carries	Use in research	Standard
Ambisonics (AmbiX)	Layout-independent scene	Shareable stimuli, any array	— (ACN/SN3D convention)
SOFA	HRTFs, spatial IRs + geometry	Reproducible binaural studies	AES69
BW64/ADM	Object/channel scene + metadata	Object-based, broadcast interop	ITU-R BS.2076 / BS.2088
WAV/BWF	PCM + minimal metadata	Raw stimuli, anchors	EBU Tech 3285

Archive the recipe, not just the audio

A reproducible spatial study deposits: the stimuli (B-format / BW64), the HRTFs (SOFA), the decoder configuration and speaker coordinates, the calibration level in dB SPL, the room IR of the test room, and the analysis scripts. With those six items another lab can rebuild your experiment. Missing any one of them, they cannot — and "we used our usual setup" is not a method.

Measurement in the lab

Measurement underpins both teaching and research; without it your calibration drifts and your auralizations are fiction. The core measurements, tying to /guide/systems/measurement-and-calibration:

Impulse responses

The IR is the master key. Capture it with an exponential sine sweep (ESS) deconvolution — superior to MLS for nonlinearity rejection and signal-to-noise — from which you derive every room metric. From a single IR you compute reverberation time $T_{30}$ via backward (Schroeder) integration, clarity

$C_{80} = 10\log_{10}\frac{\int_0^{80\,\text{ms}} h^2(t)\,dt}{\int_{80\,\text{ms}}^{\infty} h^2(t)\,dt}\ \text{dB}$

early-decay time, and the direct-to-reverberant ratio. Teach students to read the IR before any correction: the gross room problems (a slap at $15$ ms, a flutter, a modal ring) are visible in the IR and audible in the auralization long before they appear in a smoothed magnitude curve.

Room metrics and the speaker array

For the teaching ring, measure each speaker's distance, level, and arrival time at the center seat and align them. The per-channel delay to equalize arrival is

$\Delta t_i = \frac{d_{\max} - d_i}{c}, \qquad c \approx 343\ \text{m/s}$

so a speaker $0.6$ m closer than the farthest gets about $\Delta t = 0.6/343 \approx 1.75$ ms of added delay. Level-match all channels to within $\pm 0.5$ dB with pink noise and a calibrated mic, set the reference level (commonly $79$ – $85$ dB SPL per channel for the room's purpose), and document it. This is exactly the time-alignment-and-phase and calibration workflow applied at lab scale.

HRTF measurement

Measuring individual HRTFs is the lab's most specialized task. The classic method places a subject (or a dummy head such as a KEMAR/Neumann KU 100) at the center of an arc or sphere of loudspeakers, plays an ESS from each direction, and records at microphones in the ear canals (blocked-meatus). The result is a set of head-related impulse responses indexed by azimuth and elevation. Free-field equalization removes the loudspeaker and mic response. Practical realities: subject stillness over a long measurement, dense angular sampling (hundreds to thousands of directions for research-grade sets), and storing the result as SOFA so it is reusable. For teaching, you do not need to measure every student — a curated SOFA database plus one live demonstration conveys the concept. For research on individualization, you do. Either way, an Ambisonic room IR captured with a B-format microphone gives you the spatial room response that pairs with the HRTF to produce a fully externalized binaural auralization.

The IR is the most reusable thing you will measure

One good impulse response — of a room, a speaker, or a head — feeds auralization, calibration, metric computation, and convolution reverb. Invest in capturing clean IRs (high SNR, swept sine, documented geometry) and archive them as SOFA. They outlive every project they were measured for.

Running listening tests

A listening test is an experiment on human perception, and it fails in the same ways every experiment fails: uncontrolled variables, biased presentation, and underpowered statistics. The discipline below is drawn from ITU-R BS.1116 and Bech & Zacharov's Perceptual Audio Evaluation.

Controls and blinding

The non-negotiables:

Double-blind. Neither subject nor experimenter knows which condition is playing. Single-blind leaks experimenter expectation through tone of voice and timing.
Hidden reference. In BS.1116, every trial presents the known reference (A) and two stimuli (B, C), one of which is the hidden reference and one the impairment; the subject grades both against A. This catches subjects who cannot actually hear the impairment — they grade the hidden reference as impaired too.
Randomization. Randomize trial order, condition-to-button mapping, and stimulus order. Fixed orders create learning and fatigue confounds.
Anchors (in MUSHRA). Include a low-pass-filtered anchor (e.g., $3.5$ kHz) and sometimes a mid anchor so absolute scale use is comparable across subjects.

Level calibration — the confound that ruins everything

Loudness is the most powerful and most insidious confound in audio testing: louder almost always sounds "better" and "wider." Every condition must be loudness-matched, not just peak-normalized. Calibrate playback to a fixed reference SPL with a calibrated meter, and loudness-match stimuli using ITU-R BS.1770 (the algorithm behind EBU R128) so that program loudness is equal — for spatial content, integrated loudness across all reproduced channels. A typical reference is to align all stimuli to a common integrated loudness (e.g., $-23$ LUFS for broadcast-aligned work) and to set the room's monitoring gain so that level reads a fixed dB SPL at the listening position. If two conditions differ by even $1$ dB, you are measuring loudness preference dressed up as spatial preference.

Uncalibrated level invalidates the entire study

The most common fatal flaw in amateur and even published spatial-audio tests is unmatched loudness. A renderer that is $1.5$ dB hotter will "win" on preference, externalization, and width — for reasons that have nothing to do with spatialization. Loudness-match with BS.1770, verify with a meter, and report the matching method. No reviewer should accept a spatial preference result without it.

Statistics basics

You do not need to be a statistician, but you must respect a few fundamentals:

Power and sample size. Small panels detect only large effects. Estimate the number of subjects and repetitions needed for the effect size you expect before running; an underpowered null result proves nothing.
Repetitions per subject. Repeat each condition (e.g., $3$ times) to estimate within-subject reliability and to screen unreliable listeners.
Right test for the design. Within-subject (repeated-measures) designs are the norm because each listener is their own control; use repeated-measures ANOVA or its non-parametric analogues, not tests that assume independent samples.
Confidence intervals, not just $p$ -values. Report effect sizes and intervals. A "significant" $0.1$ -point difference on a $5$ -point scale may be perceptually irrelevant.
Multiple comparisons. If you test many conditions, correct for it (e.g., Bonferroni/Holm) or your "significant" finding is likely noise.

Common confounds

Beyond loudness: visual bias (subjects who can see the speakers favor the obvious one — curtain the array), order and learning effects (counterbalance), fatigue (cap sessions at roughly $20$ – $30$ minutes with breaks), training mismatch (expert vs. naive listeners give different data — define your population), and the sweet-spot problem (only the center seat hears the intended spatial image, so test one listener at a time or document seat position).

Worked example A: standing up a teaching lab

A step-by-step build of an $8$ -speaker ring teaching room, end to end.

Step 1 — Geometry. Mark a circle of radius $r = 1.8$ m, place $8$ identical monitors at $45°$ intervals starting at $\pm 22.5°$ off front-center (so the front pair frames the screen), tweeter at ear height $\approx 1.2$ m. Aim each at the center seat.

Step 2 — Wiring and routing. Patch through a $16$ -channel interface. Insert an Ambisonic decoder (AllRAD or dual-band max- $r_E$ ) configured for the exact $8$ speaker directions. Route the binaural path to headphones via a SOFA renderer with head-tracking.

Step 3 — Time alignment. Capture an IR from the center seat for each speaker. Measure arrival times; apply $\Delta t_i = (d_{\max}-d_i)/c$ per channel so all wavefronts arrive together. Verify the corrected IRs overlay within $\pm 0.2$ ms — see /guide/systems/time-alignment-and-phase.

Step 4 — Level calibration. Play band-limited pink noise per channel; set each to equal SPL ( $\pm 0.5$ dB) at the seat with a calibrated mic. Set the monitor reference so a $-20$ dBFS RMS signal reads, say, $79$ dB SPL per channel. Document the gain.

Step 5 — Bass management. Set a $80$ Hz crossover, route the lows of all channels to the sub, align the sub's phase/delay to the mains at crossover, and check for a cancellation null — /guide/systems/subwoofers-and-bass-management.

Step 6 — Validation. Run the four canonical demos (phantom, precedence, decorrelation, externalization). If the phantom center is stable, the precedence image jumps cleanly with delay, decorrelation produces envelopment, and head-tracking externalizes — the lab is calibrated and ready. If the phantom center pulls left or right, recheck level and time alignment; that asymmetry is the lab telling you the calibration is not done.

Step 7 — Archive. Save the decoder config, speaker coordinates, per-channel delays and gains, the reference SPL, and the room IRs. This is the lab's reproducibility record; every later study cites it.

Worked example B: a small listening-test study

A complete BS.1116-style study comparing two Ambisonic decoders (basic vs. max- $r_E$ ) for localization stability.

Hypothesis. The max- $r_E$ decoder yields more stable, better-externalized images than the basic decoder for off-center as well as central listeners.

Step 1 — Stimuli. Author $6$ scenes in third-order B-format (AmbiX) — a voice at $0°$ , percussion at $\pm 90°$ , an ensemble spread across the front, etc. Decode each through both decoders, rendering the same B-format source so the only variable is the decoder.

Step 2 — Loudness match. Measure integrated loudness of each decoded stimulus (BS.1770 across all channels) and trim to a common target. Verify with a meter at the seat: all stimuli within $\pm 0.3$ dB. Record the method.

Step 3 — Design. BS.1116 triple-stimulus, hidden reference. Reference A is a high-resolution rendering (e.g., dense HOA or measured), and the two decoders are the conditions, one presented as the hidden reference where appropriate; subjects grade the impairment on the continuous $5.0$ -down-to- $1.0$ scale. $16$ trained listeners, $3$ repetitions each, randomized order, double-blind, curtained array.

Step 4 — Run. Sessions capped at $25$ minutes with breaks. Calibrated monitoring level fixed. Each subject also tested at the center seat and one off-center seat (documented), since decoder differences are largest off-axis.

Step 5 — Screen and analyze. Remove listeners whose hidden-reference grades reveal they could not hear the impairment (post-screening per BS.1116). Run repeated-measures ANOVA with decoder and seat as factors; report effect sizes and $95\%$ confidence intervals, not just $p$ . If the max- $r_E$ decoder scores higher off-axis with a non-overlapping interval, the hypothesis is supported.

Step 6 — Report and deposit. Publish the B-format stimuli, decoder configs, speaker coordinates, calibration level, room IR, and analysis scripts so the study is reproducible. State the listener population (trained), the loudness-matching method, and the seat positions explicitly.

Common mistakes and limits

Mistakes

Uncalibrated or unmatched levels. Covered above and worth repeating: the number-one invalidator. Match loudness with BS.1770; verify with a meter.
Biased tests. Single-blind procedures, experimenter cues, fixed orders, visible speakers, and leading instructions all import the result you hoped to find. Double-blind, randomize, curtain, and script the instructions verbatim.
Non-reproducible setups. "Our usual rig" is not a method. Without archived decoder configs, speaker coordinates, SOFA HRTFs, calibration level, and scripts, neither you nor anyone else can rerun the study in a year. Layout-locked content (channel-based masters tied to one room) is the worst offender; B-format and BW64/ADM fix it.
Underpowered panels. Four colleagues "preferring" condition B proves nothing. Estimate sample size for your effect size in advance.
Confusing demonstration with proof. A vivid classroom demo shows a phenomenon exists; it does not measure its magnitude. Keep the two activities — teaching and testing — methodologically distinct.
Ignoring the sweet spot. One calibrated seat is correct; the rest of the room is not. Test one listener at a time and document the seat, or you are averaging incompatible spatial experiences.
Treating generic HRTFs as ground truth. Non-individualized binaural has systematic front/back and elevation errors; conclusions drawn on one generic HRTF may not generalize. Test across HRTFs or individualize.

Limits

The teaching lab and small research rig have honest boundaries. An $8$ -node ring reproduces first-order Ambisonics faithfully and approximates higher orders; it cannot match a dense $50$ -speaker research array for spatial resolution, and its sweet spot is small. Binaural over headphones is individual and powerful but bypasses the room, so it cannot study loudspeaker–room interaction. Auralization is only as good as the underlying IR or geometric model — a coarse model produces a plausible but incorrect space, and students must be told which is which. And listening tests, however rigorous, measure perception under controlled stimuli; they do not directly predict preference for full-length music or film, where attention, narrative, and fatigue dominate. State these limits in every report. The credibility of education and research work rests not on claiming more than the rig can do, but on knowing exactly where its honesty ends.

Putting it together

A semester's arc shows how the pieces interlock. You open by teaching perception — ITD, ILD, spectral cues, precedence — using nothing but the headphone binaural path and the stereo pair, because those make the cues audible without the confound of a whole array. You then introduce techniques on the Ambisonic ring, authoring once in B-format and decoding to the room, so students grasp that Ambisonics buys layout-independence and that binaural buys individual accuracy at the cost of room realism. Midway, you calibrate the room together — IR capture, time alignment, level matching, bass management — so the lab itself becomes a worked example of systems practice and the four phenomena demos all land cleanly. Finally, students design and run a small BS.1116 study, loudness-matched with BS.1770, double-blind, archived as B-format plus SOFA plus scripts — and in doing so they practice every discipline the field demands: perception, technique, room, system, measurement, and method. The recurring truth surfaces throughout: most of the content they will ever test or teach with is stereo, so handling it well — stereo is already spatial — is not a footnote but the through-line. The lab that teaches that, calibrated and reproducible, has done its job.

References

ITU-R BS.1116-3, Methods for the subjective assessment of small impairments in audio systems. International Telecommunication Union, 2015.
ITU-R BS.1534-3 (MUSHRA), Method for the subjective assessment of intermediate quality level of audio systems. ITU, 2015.
ITU-R BS.1770-4, Algorithms to measure audio programme loudness and true-peak audio level. ITU, 2015; and EBU R128, Loudness normalisation and permitted maximum level of audio signals. EBU, 2014.
Bech, S. and Zacharov, N. Perceptual Audio Evaluation — Theory, Method and Application. Wiley, 2006.
Blauert, J. Spatial Hearing: The Psychophysics of Human Sound Localization, revised edition. MIT Press, 1997.
AES69-2022, AES standard for file exchange — Spatial acoustic data file format (SOFA). Audio Engineering Society, 2022.
ITU-R BS.2076 (Audio Definition Model) and BS.2088 (BW64), and ITU-R BS.2051, Advanced sound system for programme production. ITU.
Zotter, F. and Frank, M. Ambisonics: A Practical 3D Audio Theory for Recording, Studio Production, Sound Reinforcement, and Virtual Reality. Springer Open, 2019.

← Back to Workflows

Teaching spatial audio: the conceptual progression​

Why perception comes first​

Misconceptions to dismantle​

The teaching and demo system​

Design goal: one rig, many techniques​

Why both Ambisonics and binaural​

Equipment list​

Demonstrable phenomena​

Phantom imaging and the panning law​

The precedence (Haas) effect​

Decorrelation and envelopment​

HRTF, externalization, and head tracking​

Research uses​

Architectural auralization​

Psychoacoustic listening tests​

VR/AR and head-tracked binaural​

Source separation and upmixing​

Reproducibility and interchange​

Why open formats are non-negotiable in research​

Measurement in the lab​

Impulse responses​

Room metrics and the speaker array​

HRTF measurement​

Running listening tests​

Controls and blinding​

Level calibration — the confound that ruins everything​

Statistics basics​

Common confounds​

Worked example A: standing up a teaching lab​

Worked example B: a small listening-test study​

Common mistakes and limits​

Mistakes​

Limits​

Putting it together​

References​