Introduction to Spatial Audio

Spatial audio is sound that carries a sense of place. When you close your eyes in a forest, you do not hear an undifferentiated wall of noise: a bird is up and to your left, a stream runs somewhere ahead and below, footsteps approach from behind. Your ears and brain reconstruct a three-dimensional scene from nothing but tiny pressure changes arriving at two eardrums. Spatial audio is the craft and science of capturing, building, and reproducing that scene through loudspeakers or headphones, so that a listener perceives sound as coming from somewhere rather than simply being present.

This page is the front door to the whole guide. It assumes no prior knowledge. By the end you will know what we mean by "spatial," why it is worth the trouble, how the field grew from a single channel to fully immersive systems, the two physical paths sound can take to your ears, the three ways engineers represent a spatial scene, and the single unifying idea — encode, then decode — that ties every technique together. Later pages go deep; this one gives you the map.

What "spatial" actually means

To talk about where a sound is, we need a way to name directions and distances. The most natural coordinate system for hearing places the listener at the centre and describes every source by three quantities.

Azimuth is the horizontal angle: how far left or right the source sits, measured around you like a compass. Directly in front is $0°$ , to your right is $+90°$ , behind is $180°$ , to your left is $-90°$ (or equivalently $270°$ ). Azimuth is the dimension stereo and surround systems handle best, because human hearing is sharpest in the horizontal plane.
Elevation is the vertical angle: how far above or below ear level the source sits. The horizon is $0°$ , straight overhead is $+90°$ , straight below is $-90°$ . Adding convincing elevation is one of the defining features of modern immersive audio.
Distance is how far away the source is, in metres. Distance is encoded by loudness, by how much reverberation accompanies the direct sound, and by subtle changes in tone — far-off sounds lose their highs to the air.

Together, azimuth, elevation, and distance form a spherical coordinate system centred on the listener's head. If you prefer rectangular coordinates $(x, y, z)$ — right, front, up — the two are related by

x = r\cos\phi\sin\theta, \qquad y = r\cos\phi\cos\theta, \qquad z = r\sin\phi,

where $r$ is distance, $\theta$ is azimuth, and $\phi$ is elevation. You will rarely need this formula by hand, but it is worth seeing once: it makes explicit that direction (the two angles) and distance (the radius) are independent, and that a spatial format must somehow carry all three. Most consumer systems reproduce azimuth well, elevation partially, and distance only by implication.

Rule of thumb

Keeping azimuth, elevation, and distance apart in your mind is the single most useful habit for the rest of the guide.

We unpack how the ear actually recovers each of them in Spatial Psychoacoustics.

Why spatial audio matters

It is fair to ask whether all of this is more than a novelty. Mono recordings carried the twentieth century's music perfectly well. Three reasons explain why space is worth the engineering effort, and each shows up in everyday listening.

Realism and immersion. The world is spatial, so a spatial reproduction simply sounds more true. A thunderstorm that rolls overhead, a film car that screams past from front-left to rear-right, a concert hall whose walls you can almost feel — these land because they match how we hear the real world. Immersion is not just spectacle: a sense of being enveloped by a space, surrounded by its reverberation, is what separates "listening to a recording of a room" from "being in the room." We treat envelopment as a topic in its own right in Direct, Diffuse and Envelopment.

Clarity through separation. This is the reason engineers care about space even when realism is not the goal. When two sounds occupy the same place and the same frequency range, the louder one masks the quieter — it becomes harder or impossible to hear. Pulling sounds apart in space relaxes that competition. The classic demonstration is the cocktail party effect: in a crowded room you can follow one conversation among many, largely because the voices come from different directions and your brain uses that spatial difference to separate them. A dialogue track anchored at the centre, music spread to the sides, and effects placed overhead are easier to follow individually than the same elements stacked in one spot. Spatialisation is, among other things, a tool for intelligibility.

Scale. Some experiences are simply about bigness. A stadium show, an outdoor festival, a planetarium, a theme-park ride — these use many loudspeakers over a large area not only to be loud but to make sound feel vast and to give thousands of people a coherent experience. Scale is where the room and the playback system stop being neutral and become part of the instrument, a theme that runs through Part III, The Sound Field and the Room, and the later installation material.

Key takeaway

A good way to hold these three in mind: realism is about matching the world, clarity is about organising a mix, and scale is about filling a space. Most real projects want some blend of all three.

From one channel to a whole space

The history of audio reproduction is essentially the story of adding spatial dimensions one at a time. Following that progression is the fastest way to understand what each format does.

Mono — one channel. A single loudspeaker reproduces one stream of audio. Everything comes from one point. Mono can be beautifully clear and is still the right choice for some broadcast and public-address situations, but it carries no spatial information at all: the location of every instrument is the location of the speaker.

Stereo — two channels and the phantom image. Put two loudspeakers in front of the listener, feed them two related but distinct signals, and something remarkable happens. A sound sent equally to both speakers is heard not at either speaker but between them, floating in the middle. This is the phantom image — a source that exists nowhere physically yet is heard at a definite place. By adjusting the level (and sometimes the timing) sent to each speaker, an engineer can slide that phantom anywhere along the line between the two. Stereo thus turns two channels into a continuous horizontal stage. It is the foundation of nearly everything that follows, and it is the subject of Stereophony and the Phantom Image.

Surround — sound all around. Add speakers to the sides and behind the listener and the stage wraps into a full circle. A "5.1" system uses left, centre, right, two surround channels, and a low-frequency effects channel; "7.1" adds two more. Surround lets ambience, effects, and the occasional musical element come from any horizontal direction. The centre channel earns special mention: by giving dialogue its own dedicated speaker, surround anchors voices firmly to the screen no matter where the listener sits.

Immersive — height and objects. The newest step adds the dimension stereo and classic surround both lack: elevation. Ceiling or upward-firing speakers let sound live above the listener, completing the sphere. Hand in hand with height came a second, deeper change in how scenes are built. Instead of mixing every sound permanently into fixed channels, modern immersive systems can treat individual sounds as objects — each carrying its own position data, to be placed by the playback system onto whatever speakers are actually present. A helicopter can be told to circle overhead, and a 7-speaker living room and a 64-speaker cinema will each render that instruction as well as their hardware allows. Object-based audio is important enough to have its own page, Object-Based Audio.

The arc is consistent: mono gives a point, stereo a line, surround a circle, immersive a full sphere — and objects free a sound from any fixed channel at all. Each step adds spatial information without discarding what came before.

A fuller history

Spatial audio is older than most people assume, and tracing its milestones explains why today's formats look the way they do.

1881 — Ader's Théâtrophone. At the Paris Electrical Exhibition, Clément Ader placed a row of telephone microphones across the front of the Opéra stage and connected pairs of them to listeners' two ears over telephone lines. Listeners reported that performers seemed to move across an audible stage. This is the first documented demonstration of binaural transmission — two channels, one per ear — and it established the founding insight of the whole field: deliver the right signal to each ear and the brain reconstructs space. We return to that two-ears principle in Binaural Audio.

1931 — Blumlein's stereo patent. The British engineer Alan Blumlein filed a sweeping patent that laid out, with startling completeness, the principles of two-channel stereophony: how to capture a sound stage with coincident microphones, how level and phase differences between two channels create a stable image, and how to cut it all into a record groove. Blumlein understood that loudspeaker stereo works on a different principle from Ader's earphones — it relies on both ears hearing both speakers — and he got the mathematics essentially right. Almost every stereo recording since rests on this 1931 document.

1940 — Fantasound and Disney's Fantasia. Disney, with RCA, built a custom multi-channel system to present the animated film Fantasia. Multiple optical soundtracks drove speakers placed around the auditorium, and engineers could pan sounds between them in real time — the first commercial surround sound, decades ahead of its time. Fantasound was expensive and few theatres installed it, but it proved that an audience would respond to sound that moved through the room, not just across the screen.

1970s — Ambisonics. Michael Gerzon and colleagues in Britain developed a mathematically principled alternative to channel-counting. Rather than assigning sounds to specific speakers, Ambisonics describes the entire sound field at a point as a set of spherical-harmonic components, which can then be decoded to any speaker layout — including ones with height. It was arguably too far ahead of its market in the analogue era, but its scene-based philosophy underpins much of today's virtual-reality and 360° audio. It has a dedicated page, Ambisonics.

1976 — Dolby Stereo. Dolby's optical system encoded four channels (left, centre, right, surround) into the two-channel optical track that 35 mm film already carried, using a clever matrix that folded the extra channels in and unfolded them on playback. Suddenly any cinema could be upgraded without rewiring its projectors, and surround sound went mainstream. Star Wars (1977) and the films that followed made cinematic surround an audience expectation rather than a curiosity. Matrix encoding is covered in Surround and Matrix Systems.

1992 — Dolby Digital and discrete surround. With Batman Returns, Dolby introduced a fully discrete 5.1 digital format: five full-range channels plus a dedicated low-frequency channel, each carried independently rather than matrixed together. Discrete channels meant cleaner separation and reliable steering, and 5.1 became the backbone of DVD, digital television, and home cinema for two decades.

2012 onward — the Atmos era. Dolby Atmos, introduced with the film Brave, married two ideas that had been developing separately: height channels for true overhead sound, and object-based mixing in which sounds carry positional metadata and are rendered to whatever speakers a given room has. Competing and complementary systems (DTS:X, MPEG-H, Auro-3D) followed, and the approach spread from cinemas to soundbars, headphones, and music streaming. The defining trait of this era is that the mix and the playback layout are no longer the same thing — a single immersive master adapts itself to a phone, a living room, or a dub stage.

Read end to end, the history reveals a pendulum. Early systems tied sound to specific physical channels; Ambisonics and object-based audio broke that tie by describing the scene and letting the playback system work out the speakers. Today's tools let you choose freely between these philosophies, which is exactly why the rest of this guide is organised around techniques rather than products.

Two paths to the ears

However a spatial scene is built, it ultimately reaches a listener by one of two physical routes, and the difference between them shapes everything.

Loudspeakers. Sound radiates into a room from two or more speakers, mixes in the air, and reaches both of the listener's ears. Each ear hears every speaker, slightly delayed and coloured by the head — this is called crosstalk, and stereo and surround are designed around it. Loudspeaker reproduction is naturally shared (a roomful of people hears it at once), it engages the body with real acoustic energy, and it lets the room's own acoustics contribute to envelopment. Its weakness is that the result depends on speaker placement, room treatment, and where the listener sits; there is usually one "sweet spot" where the image is best.

Headphones. Each earpiece feeds one ear directly, with no crosstalk and no room. This is precise and private, and it makes the listener's head the entire acoustic stage. But it introduces a problem: with no head and outer ears in the signal path, naively panned sound tends to collapse inside the head rather than out in the world.

Common pitfall

On headphones, with no head and outer ears in the signal path, naively panned sound tends to collapse inside the head rather than out in the world. The solution is binaural audio, which deliberately bakes the filtering effects of a head and ears into the signal so that headphones can place sounds convincingly outside the listener — overhead, behind, at a distance.

Because headphones are everywhere, binaural rendering has become the main way immersive audio reaches a mass audience. It is detailed in Binaural Audio; the related trick of recreating headphone-like control over loudspeakers by cancelling crosstalk is covered in Transaural Reproduction.

A single immersive master is increasingly expected to serve both paths — decoded to a speaker array in a cinema, and binaurally rendered to headphones for everyone watching at home. Keeping the two delivery routes distinct in your thinking explains many otherwise puzzling design choices later in the guide.

The three representations

There are three fundamentally different ways to store and transmit a spatial scene. Almost every format in existence is one of these, or a hybrid. The full treatment is in Spatial Audio Formats; here is the preview.

Channel-based. The scene is stored as one audio signal per loudspeaker. Stereo, 5.1, and 7.1 are channel-based: "this is what the left speaker plays, this is the right," and so on. It is simple, predictable, and as old as stereo itself — but it assumes the playback system has exactly the speakers the mix was made for.

Watch out

Play a 5.1 mix on a stereo system and something must downmix it, and detail is lost. Channel-based formats assume the playback system has exactly the speakers the mix was made for.

Object-based. The scene is stored as individual sounds plus metadata describing where each one should be — its azimuth, elevation, distance, size. There are no fixed channels; at playback time a renderer reads the metadata and computes, in real time, which speakers should produce each sound and how loudly, given the speakers actually present. This is what lets one Atmos master play correctly on a phone and in a cinema. The cost is that playback now requires computation, not just routing.

Scene-based. The scene is stored as a description of the whole sound field at a point in space, independent of any sources or speakers. Ambisonics is the canonical example: a compact set of components captures sound arriving from all directions, and a decoder reconstructs that field on whatever array is available. Scene-based formats are elegant for rotation (turn your head in VR and the whole field rotates by a simple matrix) and for capture (one special microphone records the entire field), at the cost of spatial sharpness that grows only as you add more components.

A useful one-line summary: channel-based says what each speaker plays, object-based says where each sound is, and scene-based says what the field looks like everywhere. None is universally best; the right choice depends on whether you are recording reality, building a mix, or targeting unknown playback systems.

The unifying idea: encode, then decode

Underneath all this variety lies one pattern that, once seen, makes the whole field click. Every spatial-audio technique can be read as two stages:

Encode. Take the spatial scene — sources at given positions, or a captured field — and represent it in some intermediate form: stereo channels, a matrix, a set of Ambisonic components, a list of objects with metadata.
Decode. Take that representation and turn it into actual signals for the actual output — loudspeaker feeds, or a binaural pair for headphones — using whatever the playback situation provides.

Stereo encodes a position as a level difference between two channels and decodes it through two speakers into a phantom image. Ambisonics encodes a field as spherical-harmonic components and decodes them to a speaker array. Object audio encodes positions as metadata and decodes them by panning to present speakers. Binaural encodes direction as ear filters and decodes — trivially — straight to headphones.

The big idea

The genius of separating encode from decode is flexibility: one encoded master can be decoded many ways, for many rooms and devices, including ones that did not exist when the master was made.

This encode/decode lens is how Part II, Techniques, is organised. Each technique is presented as a particular choice of how to encode a scene and how to decode it again, and once you recognise the pattern, even unfamiliar formats become easy to place. The techniques themselves begin at Spatial Audio Techniques.

How to use this guide

The guide is written to be read in order, like a course, with each part assuming the one before. The level rises steadily — from this newcomer-friendly introduction to the working detail a practising engineer needs.

Part I — Fundamentals (you are here) builds the foundation. After this introduction, Spatial Psychoacoustics explains how the ear and brain actually locate sound — the interaural cues, the role of the outer ear, the limits of human hearing — because every technique is ultimately an exploitation of those mechanisms. Spatial Audio Formats then makes the three representations precise, and the Glossary collects the vocabulary you will meet throughout.
Part II — Techniques works through the methods themselves: stereo and amplitude panning, surround and matrix systems, object-based audio, Ambisonics, binaural and transaural reproduction, wave field synthesis, and a closing reflection on how stereo was already spatial all along.
Part III — The Sound Field and the Room turns from signals to physics: how sound behaves over distance and through air, how reverberation shapes our sense of a space, and how the balance of direct, diffuse and enveloping energy creates immersion.
Later parts — recording, calibration, installation, and production workflows — apply all of this to real rooms, real microphones, and real sessions. They are referred to by name as they become relevant.

You do not need mathematics to follow the narrative; the words always carry the meaning. But where an equation makes a relationship exact, the guide uses one, and the chapter pages back every mechanism with a worked numeric example. If a symbol is unfamiliar, the surrounding text explains it, and the Glossary is always one click away.

One last orientation before we begin. Everything ahead is, at bottom, about a single trick of perception: with only two ears, humans reconstruct a whole world of directions and distances. Every microphone technique, every speaker array, every line of rendering code exists to feed those two ears — directly, or through the air of a room — the signals that will make a scene appear. Understanding how the ears do it is therefore the natural next step.

References

Rumsey, F. (2001). Spatial Audio. Focal Press. A clear, practical overview of multichannel and spatial reproduction that maps closely onto the structure of this guide.
Blauert, J. (1997). Spatial Hearing: The Psychophysics of Human Sound Localization (revised edition). MIT Press. The standard reference on how listeners locate sound; foundational for the psychoacoustics that follows.
Roginska, A., & Geluso, P. (Eds.). (2017). Immersive Sound: The Art and Science of Binaural and Multi-Channel Audio. Routledge. A modern survey covering binaural, Ambisonic, and object-based immersive audio.
Gerzon, M. A. (1973). "Periphony: With-Height Sound Reproduction." Journal of the Audio Engineering Society, 21(1), 2–10. The founding paper of Ambisonics.
Blumlein, A. D. (1931). "Improvements in and relating to Sound-transmission, Sound-recording and Sound-reproducing Systems." British Patent 394,325. The original stereophony patent.
Holman, T. (2008). Surround Sound: Up and Running (2nd edition). Focal Press. A practitioner's account of channel-based surround formats and their history.

→ Continue: Spatial Psychoacoustics

What "spatial" actually means​

Why spatial audio matters​

From one channel to a whole space​

A fuller history​

Two paths to the ears​

The three representations​

The unifying idea: encode, then decode​

How to use this guide​

References​