Object-Based Audio & Rendering

Every spatialization method we have studied so far shares a hidden assumption: that the people who author a mix know, at the moment of authoring, exactly what playback system will reproduce it. A stereo mix assumes two loudspeakers at $\pm 30^\circ$ . A 5.1 mix assumes the five ITU positions. A 7.1.4 mix assumes seven ear-level speakers, a subwoofer, and four height speakers in specific places. The signal that leaves the studio is a set of channel feeds — one waveform per loudspeaker — and the spatial meaning of those waveforms is encoded entirely in the relationship between the feeds and the geometry they were mixed for. Change the geometry and the meaning degrades.

Object-based audio breaks that assumption. Instead of shipping loudspeaker feeds, it ships sources: each sound as an independent audio signal accompanied by metadata that describes where the sound should be, how big it should be, and how loud, as a function of time. The loudspeaker feeds do not exist yet when the master is created. They are computed at the moment of playback, by a piece of software called a renderer, which knows the actual geometry of the room it is playing into. One master, authored once, becomes a 2.0 soundbar mix, a 5.1.2 living-room mix, a 9.1.6 home-cinema mix, a 64-speaker commercial cinema mix, or a binaural headphone mix — each derived on demand.

This chapter develops the object model from first principles: the metadata, the carrier formats, the rendering mathematics, the bed-plus-objects architecture, and the concrete systems (Dolby Atmos, MPEG-H, DTS:X) that implement it. Throughout, keep one idea in view, because it is the unifying theme of all of Part II: rendering is an encode-then-decode operation. The object metadata is an encoding of an intended sound field into a compact, layout-independent representation; the renderer is the decoder that derives the feeds for the system present. This is exactly the structure we met in amplitude panning and Ambisonics, and it is the deeper reason that even a humble stereo pair is, as we argue in stereo is already spatial, a special case of the same machinery.

From channel lock to the object idea

Why channel-based audio is locked to a layout

In channel-based audio the deliverable is a fixed number of signals, one per nominal loudspeaker position. Consider a 5.1 mix. The left-surround channel $L_s$ carries whatever the mixing engineer panned to the left-surround speaker, which in the ITU-R BS.775 layout sits at azimuth $110^\circ$ . If a consumer plays that same $L_s$ feed through a speaker placed at $90^\circ$ , or worse, through a soundbar that has no surround speaker at all, the intended image collapses. The feed is correct only for the geometry it was authored for.

There are two ways out of this for channel-based content, and both are unsatisfactory. The first is re-mixing: produce a separate master for every target layout. This multiplies studio labour and storage, and it does not scale to the combinatorial explosion of consumer systems (every soundbar vendor has a different driver count). The second is downmixing/upmixing: apply a fixed matrix to convert one channel count to another, as in the Lo/Ro and Lt/Rt coefficients of surround and matrix systems. Downmix matrices are blunt instruments: they sum channels with fixed gains and cannot recover the spatial intent, because by the time you hold only the feeds, the source positions have been irreversibly baked in. A sound that should be a tight point at $110^\circ$ and a sound that should be a wide ambience filling the rear are both, in $L_s$ , just "signal in the left-surround feed." The matrix cannot tell them apart.

Key takeaway

The core defect is therefore an information defect. Channel feeds throw away the authoring intent and keep only its projection onto one particular basis (the chosen speaker layout). Once projected, the intent cannot be re-projected onto a different basis without loss.

The object idea: audio plus positional metadata

An audio object keeps the intent. It consists of two things:

an audio essence — a single monophonic (occasionally multichannel) waveform, the dry sound of the source, with no panning applied; and
a stream of metadata — a time-varying description of where that sound is, how large it is, and how loud, in a coordinate system tied to the room, not to any particular speaker.

Formally, an object is a function of time carrying a signal $s(t)$ and a parameter vector $\boldsymbol{\theta}(t)$ :

\text{object} = \big\{\, s(t),\; \boldsymbol{\theta}(t) = (\text{position}, \text{size}, \text{gain}, \ldots) \,\big\}.

The crucial property is that $\boldsymbol{\theta}(t)$ contains no reference to loudspeakers. It says "this helicopter is at azimuth $45^\circ$ , elevation $30^\circ$ , three-quarters of the way to the room boundary, at $-6$ dB." It does not say "put $0.7$ of it in the right speaker." Translating from the first statement to the second is the renderer's job, and the renderer does it differently for every room. This is the same separation of concerns we will see in object-based formats: the authoring side describes what is intended; the playback side decides how to realise it here.

Because the metadata is layout-independent, a single object master is future-proof in a way a channel master never is. A loudspeaker layout that did not exist when the content was authored — say, a 22.2 array or a novel soundbar beam-forming configuration — can still render the object correctly, because the renderer for that system simply reads the same position metadata and computes the appropriate feeds.

The metadata model in detail

The expressive power of object-based audio lives entirely in its metadata model. Get the model right and the renderer can do its job; get it impoverished and no renderer can recover what was never described. We now go through the model field by field, then connect it to the standard that codifies it.

Position: coordinate systems

Position is the heart of the metadata. Two coordinate conventions dominate.

Polar (spherical) coordinates describe a source by azimuth $\theta$ , elevation $\phi$ , and distance (or normalised radius) $r$ :

\theta \in (-180^\circ, 180^\circ], \quad \phi \in [-90^\circ, 90^\circ], \quad r \in [0, 1].

Azimuth is the horizontal angle (conventionally $0^\circ$ straight ahead, positive to one defined side), elevation is the vertical angle above or below the listener's horizon, and $r$ is the normalised distance from the listener at the centre to the source, where $r=1$ typically means "at the boundary of the reproduction volume." Polar coordinates match human spatial perception directly — recall from psychoacoustics that the auditory system localises chiefly by direction (interaural time and level differences for azimuth, spectral cues for elevation), so an angular description is the natural one.

Cartesian (allocentric) coordinates describe a source by its position $(x, y, z)$ inside a normalised cube, each axis running from $-1$ to $+1$ , with the listener (or room centre) at the origin and the cube faces corresponding to the walls, floor, and ceiling. The two systems convert as

x = r\cos\phi\sin\theta, \quad y = r\cos\phi\cos\theta, \quad z = r\sin\phi,

with the inverse

r = \sqrt{x^2 + y^2 + z^2}, \quad \theta = \operatorname{atan2}(x, y), \quad \phi = \arcsin\!\left(\frac{z}{r}\right).

The choice matters in practice. Polar rendering keeps a source at a constant angle as the room scales, which is what you want for a far-away point source. Cartesian rendering keeps a source at a constant relative position in the box (e.g. "exactly in the left-rear corner"), which is what you want for an effect pinned to a wall regardless of room shape. Dolby Atmos uses an allocentric (Cartesian, room-anchored) model internally; the channel-allocation conventions of MPEG-H and the ADM allow both.

Polar vs Cartesian

The distinction is not cosmetic: the same metadata under the two interpretations places sounds in measurably different physical positions in any non-cubic room.

Worked example: polar to Cartesian

Take an object at azimuth $\theta = 30^\circ$ , elevation $\phi = 20^\circ$ , full radius $r = 1$ . Then

\cos\phi = \cos 20^\circ = 0.9397, \quad \sin\phi = \sin 20^\circ = 0.3420,

x = 1 \cdot 0.9397 \cdot \sin 30^\circ = 0.9397 \times 0.5 = 0.4698,

y = 1 \cdot 0.9397 \cdot \cos 30^\circ = 0.9397 \times 0.8660 = 0.8138,

z = 1 \cdot \sin 20^\circ = 0.3420.

So the source sits at $(0.47, 0.81, 0.34)$ in the unit cube — front, slightly to one side, a third of the way up toward the ceiling. We will render this same object to two systems later in the chapter.

Size and spread

A point source is an idealisation. Real sources — a choir, an ocean, a passing train, a synth pad meant to wrap around the listener — have spatial extent. The metadata captures this with a size (or spread, or width) parameter, typically a value in $[0,1]$ per dimension (width, depth, height) or a single scalar.

Physically, size says "this sound should appear to emanate not from a single point but from a region of the given angular extent." A renderer realises size by spreading the energy across more loudspeakers than a point source would use, so that the listener's localisation is deliberately blurred over the target region. At size $0$ the object is a tight point handled by the minimum number of speakers (two or three for amplitude panning); at size $1$ it is diffuse, distributed across most or all of the array. We make the mechanism precise in the rendering section.

Gain, snap, and divergence

Beyond position and size, the model carries several practical controls:

Gain: a per-object level, often automatable in decibels, applied before rendering. This is distinct from the panning gains the renderer computes.
Snap: a flag instructing the renderer that if the object's position coincides (within a tolerance) with a physical loudspeaker, the renderer should route the object entirely to that one speaker rather than panning across its neighbours. Snap preserves timbral purity and maximum localisation sharpness for sounds that happen to land on a real speaker — useful for discrete effects and for dialogue pinned to the centre.
Divergence: a control that splits a centre-ish object between the centre speaker and the left/right pair, trading a hard phantom-free centre image against a wider, more enveloping one. It is the object-domain analogue of the classic dialogue-spreading trick.

Zones, exclusion, and importance

Production-grade models add constraints that shape which speakers may be used:

Zone control (zone masks, speaker-group exclusions) lets the author forbid certain regions of the array. For example "never play this object from the screen speakers" or "keep this object out of the height layer." The renderer honours the mask by excluding those speakers from the gain computation and renormalising.
Importance / priority: an integer ranking used when the playback system has fewer resources (fewer discrete rendering slots) than the content has objects. Low-importance objects may be culled or folded into a bed; high-importance objects (dialogue, lead vocal) are preserved.

The Audio Definition Model and the BW64 carrier

A metadata model is useless without a standard way to write it down and ship it. The Audio Definition Model (ADM), standardised as ITU-R BS.2076, is that standard. ADM is an XML schema that describes, for a given audio programme, the complete tree of what the audio is and where it goes. Its key elements are:

audioObject — links a chunk of audio essence to its metadata over a time span;
audioPackFormat and audioChannelFormat — describe the type of content (the ADM defines four type definitions: DirectSpeakers for channel-bed content, Matrix for matrixed content, Objects for dynamic object content, and HOA for higher-order Ambisonics);
audioBlockFormat — the time-segmented carrier of the actual dynamic metadata (position, gain, size, jumpPosition, etc.), so that an object can move continuously by chaining blocks;
audioTrackUID and audioTrackFormat — bind metadata to the physical PCM tracks in the file.

Critically, ADM is transport-type agnostic: the same XML can describe channels, objects, and scene-based (Ambisonic) content side by side in one programme. That is why ADM is the lingua franca for object delivery and why it underpins the production-exchange master format we discuss below.

The audio itself is carried in BW64 (Broadcast Wave 64-bit), standardised as ITU-R BS.2088. BW64 is an extension of the classic Broadcast Wave (BWF/RF64) file that removes the 4 GiB size limit (so long multichannel masters fit) and adds a <chna> chunk listing the track UIDs and an <axml> chunk holding the ADM XML. So a single .wav file contains, simultaneously, the PCM essence (one track per object plus the bed tracks) and the full ADM description of how to interpret it. This pairing — ADM metadata inside a BW64 container — is the open, interchange-grade object master. We return to its place among delivery formats in formats.

The renderer: turning metadata into per-speaker gains

The renderer is where the abstraction meets the loudspeakers. Given an object's position and the known positions of the actual speakers, the renderer computes a gain for each speaker such that the summed contribution produces a phantom image at the intended position. For point-like sources at ear level and above, the workhorse is amplitude panning, and specifically a vector-based formulation.

Amplitude panning recap

The foundational result, developed fully in amplitude panning, is that two (or three) loudspeakers driven with the same signal at different gains produce a phantom source between them. For a stereo pair the perceived azimuth $\theta$ of the phantom relates to the gains $g_L, g_R$ by the tangent law:

\frac{\tan\theta}{\tan\theta_0} = \frac{g_L - g_R}{g_L + g_R},

where $\theta_0$ is the half-angle between the speakers. To keep the perceived loudness constant as the image moves we constrain the gains to constant energy:

g_L^2 + g_R^2 = 1.

The pair of equations — the localisation law plus the normalisation — has a unique solution for any target angle inside the speaker arc. That is the encode/decode operation in microcosm: the target angle is the encoded intent, the gains are the decoded feeds.

VBAP: vector base amplitude panning

For three-dimensional layouts the renderer uses VBAP (Pulkki, 1997). The idea is geometric. Represent each loudspeaker by a unit vector $\mathbf{l}_k$ pointing from the listener to that speaker. To place a source in the direction of unit vector $\mathbf{p}$ , choose the triplet of speakers whose directions bound the triangle (on the unit sphere) containing $\mathbf{p}$ , and express $\mathbf{p}$ as a non-negative linear combination of those three speaker vectors:

\mathbf{p} = g_1 \mathbf{l}_1 + g_2 \mathbf{l}_2 + g_3 \mathbf{l}_3, \qquad g_k \ge 0.

Stacking the three speaker vectors as the columns of a matrix $\mathbf{L} = [\mathbf{l}_1\;\mathbf{l}_2\;\mathbf{l}_3]$ , the raw gains are found by inverting the base:

\mathbf{g} = \mathbf{L}^{-1}\mathbf{p}.

Then we normalise for constant energy:

\hat{g}_k = \frac{g_k}{\sqrt{g_1^2 + g_2^2 + g_3^2}}.

VBAP has elegant properties. If $\mathbf{p}$ coincides with a speaker direction, two of the three gains vanish and the source is reproduced by that single speaker (the geometric basis of snap). If $\mathbf{p}$ lies on an edge between two speakers, the third gain is zero and VBAP degenerates to the pairwise tangent law. Everywhere else it uses exactly the three nearest speakers, which keeps the phantom as sharp as a three-speaker amplitude method allows. For purely horizontal layouts the triplet reduces to a pair and VBAP becomes 2-D pairwise panning.

Worked example: VBAP gains for a height triplet

Suppose the renderer must place our earlier source — direction vector $\mathbf{p} = (0.47, 0.81, 0.34)$ , already unit length since $r=1$ — using a triplet of speakers: front-left at $(-0.43, 0.75, 0)$ for $(\theta,\phi)=(-30^\circ,0^\circ)$ , front-right at $(0.50, 0.87, 0)$ for $(30^\circ,0^\circ)$ , and a front-height-right at $(0.32, 0.55, 0.77)$ for roughly $(30^\circ, 50^\circ)$ . (These are unit vectors.) Solving $\mathbf{L}\mathbf{g}=\mathbf{p}$ gives raw weights that are largest on the front-right (closest in azimuth and below the target) and the height speaker (supplying the elevation), with a small contribution from front-left. After normalisation the dominant gains land near $\hat{g}_{FR} \approx 0.74$ and $\hat{g}_{FHR} \approx 0.55$ , with $\hat{g}_{FL}$ a few hundredths — exactly the intuition that a source up-and-to-one-side is carried mostly by the near side speaker and the height speaker above it. The precise numbers depend on the exact layout, but the structure — three non-negative gains, energy-normalised, concentrated on the nearest speakers — is universal.

How size and spread engage more speakers

A point source uses two or three speakers. To realise nonzero size the renderer must spread energy over a region. Two common mechanisms:

Multi-direction spreading. The renderer replaces the single source direction $\mathbf{p}$ with a set of $N$ virtual sub-source directions $\{\mathbf{p}_i\}$ distributed over a patch of the sphere whose angular radius equals the requested size. Each sub-source is panned with VBAP, and the resulting gain vectors are summed (in energy) and renormalised:

\hat{g}_k = \frac{\sqrt{\sum_{i=1}^{N} g_k(\mathbf{p}_i)^2}}{\sqrt{\sum_{k}\sum_{i} g_k(\mathbf{p}_i)^2}}.

As size grows, the patch covers more speaker triplets, so more speakers receive energy and the image broadens. At full size the patch covers the whole sphere and every speaker is driven roughly equally — a diffuse, enveloping sound. This connects to the perception of source width and envelopment discussed in direct, diffuse and envelopment: decorrelation and energy spread across many directions are what the ear reads as "wide" rather than "point."

Speaker-set growth by thresholding. Simpler renderers map size to the number of contributing speakers directly: a size of $0$ keeps the panning triplet; increasing size progressively adds neighbouring speakers with tapering gains until, at $1$ , all in-zone speakers contribute.

Rule of thumb

Either way the headline is the same: size trades localisation sharpness for spatial extent by engaging more loudspeakers. That is a genuine encode/decode degree of freedom that channel audio simply cannot express, because a channel feed has no notion of "how many speakers should share this."

Distance and the rest of the chain

Position's radial component $r$ is rendered as distance, not direction. The renderer applies the cues the ear uses for distance (developed in distance and air): an inverse-distance level law $p \propto 1/r$ , high-frequency air absorption increasing with distance, and a change in the direct-to-reverberant ratio by feeding more of the object into the diffuse/reverberant return as $r$ grows. Some renderers also apply near-field gain boost and parallax for very close objects. Distance handling is the part of object rendering most variable between systems, which is one reason two compliant renderers can place the angle of a sound identically yet disagree on its apparent depth.

Beds and objects

What a bed is

Pure object audio — every sound an object — is wasteful for content that is genuinely channel-shaped. Ambiences, reverb returns, music stems mixed to a static surround field, and the diffuse "wash" of a scene do not move and gain nothing from per-source metadata. Representing them as hundreds of static objects would squander rendering slots and authoring effort.

The solution is the bed: a fixed channel-based sub-mix carried alongside the objects, anchored to nominal speaker positions. A bed is, in ADM terms, DirectSpeakers-type content — its channels map to named loudspeaker positions (L, R, C, LFE, Ls, Rs, plus height channels Ltf, Rtf, Ltr, Rtr for a 7.1.4 bed, and so on). The renderer treats each bed channel as a virtual object pinned to its nominal speaker direction: it is panned to the nearest real speakers if that nominal position is not physically present, and routed straight through if it is. So even the bed is rendered through the same machinery, which is what makes one renderer able to handle a hybrid stream.

Why the hybrid model wins

The bed-plus-objects architecture is a deliberate efficiency compromise:

Beds carry the static, diffuse, channel-natural material cheaply and predictably.
Objects carry the moving, discrete, attention-grabbing material with full positional flexibility.

This mirrors how a film mix is actually built: a continuous background ambience (bed) over which dialogue, Foley, and effects (objects) move. It also caps complexity: an author works with a manageable number of objects (tens) plus one bed, instead of hundreds of pseudo-objects. And it degrades gracefully — on a resource-limited renderer, objects above the system's slot count can be folded into the bed (rendered to its channels and summed) so that nothing is lost, only made less individually steerable. The bed is thus both an efficiency tool and the safety net of the whole scheme.

Dolby Atmos architecture, concretely

Dolby Atmos is the most widely deployed object system, and looking at it concretely makes the abstractions tangible.

The production stream and the master

In production, an Atmos mix is built around a stream of up to 128 audio tracks. Conventionally these are partitioned as a 7.1.2 bed (ten channels) plus up to 118 objects, though the split is configurable — the total of bed channels plus objects is what is capped at 128. Each object track carries its mono essence plus a metadata stream (position in the normalised room cube, size, snap, zone, gain) automated over time by the mixing engineer.

The deliverable is not loudspeaker feeds. It is a master file that bundles all the bed channels, all the object essences, and all the metadata. Dolby's interchange master is the ADM BWF file (sometimes called the Dolby Atmos master ADM BWF), i.e. exactly the ITU-R BS.2076 ADM metadata inside a BS.2088 BW64 container described above, with Dolby-specific conventions layered on top. There is also the older proprietary DAMF (Dolby Atmos Master Format, a set of .atmos/.audio/.metadata sidecar files) used by some tools. The point is that the master is layout-independent: it describes the scene, not the speakers.

Rendering targets from one master

From that single master, Dolby renderers derive feeds for an enormous range of systems:

Commercial cinema: up to 64 discrete loudspeaker feeds, with arrays of surround and overhead speakers around the auditorium. The cinema renderer (in the cinema processor) computes per-speaker gains for every object continuously, so a sound can travel smoothly across dozens of physical speakers.
Home: the same content is rendered to consumer layouts described as X.Y.Z — X ear-level speakers, Y LFE, Z height speakers — from 5.1.2 (five ear-level, one LFE, two height) up through 7.1.4 and 9.1.6. A home AV receiver contains the Dolby renderer and adapts to whatever the user has connected and configured.
Soundbars: rendered to two, three, or a handful of drivers, sometimes with beam-forming or virtual-height processing.
Headphones: the renderer produces a binaural mix by convolving each rendered direction with head-related transfer functions, as developed in binaural. This is how Atmos reaches earbuds — every object is spatialised by HRTF to its metadata direction, and the diffuse bed is rendered to a virtual speaker array which is itself binauralised.

For broadcast and streaming, the rendered/encoded result is carried by Dolby Digital Plus with Joint Object Coding (E-AC-3 JOC, "Dolby Atmos" in the consumer sense) or Dolby AC-4, which transmit a core channel mix plus side metadata that a decoder uses to reconstruct objects before the local renderer places them. The encode/decode chain therefore has two stages: a perceptual-coding encode for transport, then a spatial decode (the renderer) for the room.

Worked example: object at $(30^\circ, 20^\circ)$ to 7.1.4 vs a 2.0 soundbar

Take the object we have been tracking: azimuth $30^\circ$ , elevation $20^\circ$ , full radius, unit gain. We render it two ways and compare approximate gains. (Real renderers add frequency-dependent and distance terms; we keep to the amplitude-panning core to show the structure.)

Target A — a 7.1.4 system. The renderer has front-right speakers and right-front-height speakers available. The source sits forward, to one side, and elevated, so it falls in a triplet bounded by Front-Right (FR) at roughly $(30^\circ,0^\circ)$ , Centre/Front-Left to one side, and Right-Front-Height (RFH) at roughly $(45^\circ, 45^\circ)$ . VBAP across that triplet, energy-normalised, yields approximately:

Speaker	Approx. gain	Approx. level (dB)
Front-Right (FR)	$0.80$	$-1.9$
Right-Front-Height (RFH)	$0.58$	$-4.7$
Centre (C)	$0.12$	$-18.4$
all others	$\approx 0$	$-\infty$

The energy check $0.80^2 + 0.58^2 + 0.12^2 = 0.64 + 0.336 + 0.014 \approx 0.99 \approx 1$ confirms the normalisation. The listener hears a fairly sharp phantom forward-right and clearly elevated, because real height speakers carry the elevation cue directly.

Target B — a 2.0 soundbar. Now there are only two ear-level drivers (effectively Left at $-30^\circ$ and Right at $+30^\circ$ ) and no height speakers at all. The renderer must collapse a 3-D position onto a horizontal pair. The elevation cannot be reproduced by geometry, so it is either discarded or faked with a virtualiser (a fixed HRTF/crosstalk filter that tints the sound to suggest height). The azimuth $30^\circ$ maps to the right edge of the soundbar's arc. Using the tangent law with $\theta_0 = 30^\circ$ and target $\theta = 30^\circ$ gives $g_L - g_R = -(g_L+g_R)$ , i.e. $g_L = 0$ , so:

Driver	Approx. gain	Approx. level (dB)
Right	$1.00$	$0.0$
Left	$0.00$	$-\infty$
height	(none — virtualised or dropped)	—

The same object that was a sharp, elevated, forward-right image on 7.1.4 becomes a flat, ear-level, hard-right image on the soundbar — possibly with a spectral "height tint" if the soundbar virtualises.

One master, many realisations

One master, two very different physical renderings, each correct for its system. No re-authoring occurred; the renderer absorbed the entire adaptation. That is the payoff of the object model, and the worked numbers show exactly where information is preserved (azimuth) and where it is necessarily lost (elevation, when no height speakers exist).

Other object systems

Dolby Atmos is not alone. The object idea is general, and several standardised and proprietary systems implement it.

MPEG-H Audio (ISO/IEC 23008-3)

MPEG-H 3D Audio, standardised as ISO/IEC 23008-3, is the most complete open standard for next-generation audio. It is tri-format: a single MPEG-H stream can carry channels, objects, and higher-order Ambisonics (HOA / scene-based audio) simultaneously, and the decoder/renderer combines them. Its defining feature is built-in interactivity and personalisation: the bitstream carries metadata describing presets and interaction ranges, so a listener can, within author-defined limits, raise the dialogue, switch commentary language, or rebalance elements at home. MPEG-H also standardises loudness and dynamic-range metadata and a normative renderer, so reproduction is predictable across decoders. It is the audio system of the ATSC 3.0 broadcast standard (deployed in South Korea and the US) and is used in some streaming and music applications. Architecturally it is the same encode/decode story: author a scene with objects and metadata; transmit; let the receiver's renderer derive feeds for the local layout (including a binaural mode for headphones).

DTS:X

DTS:X is DTS's object-based system, conceptually parallel to Atmos: objects with positional metadata plus channel beds, rendered at playback to the user's layout. Its design philosophy emphasises layout flexibility — DTS:X does not mandate a fixed speaker configuration and aims to adapt to whatever speakers are present, including non-standard placements, and exposes object-level controls (such as dialogue level) to the listener. In cinema, DTS:X competes with Atmos; in the home it is carried in DTS-HD/DTS:X bitstreams on Blu-ray and via streaming. From the renderer's standpoint the mathematics — amplitude panning of objects to the present speakers, beds pinned to nominal positions — is the same family as everything above.

Spatial Audio Object Coding (SAOC)

SAOC (MPEG-D Spatial Audio Object Coding, ISO/IEC 23003-2) attacks a different problem: bitrate. Sending every object as a separate full-quality audio stream is expensive. SAOC instead transmits a downmix (one or two channels) plus compact parametric side information — inter-object level and correlation parameters — that lets the decoder re-separate and re-pan the objects approximately at the receiver. It is a parametric cousin of the object idea: the objects are not literally present in the transport, but enough statistics travel alongside the downmix to reconstruct controllable object-like elements. SAOC is most associated with interactive applications (e.g. teleconferencing, karaoke-style remixing, dialogue enhancement) where exact reconstruction matters less than per-object control at low bitrate. It is the same lineage as the parametric upmix logic we describe in the HSR stereo upmixing work — derive controllable spatial components from a compact carrier rather than shipping each one in full.

The table below summarises the landscape.

System	Standard / owner	Transport types	Headline feature
Dolby Atmos	Dolby (proprietary; ADM BWF master)	bed + objects	Cinema-to-earbud ubiquity, up to 128 tracks
MPEG-H 3D Audio	ISO/IEC 23008-3	channels + objects + HOA	Interactivity, personalisation, normative renderer
DTS:X	DTS / Xperi	bed + objects	Layout-agnostic flexibility
SAOC	ISO/IEC 23003-2	downmix + object parameters	Very low bitrate, per-object control
Auro-3D	Auro Technologies	mainly channel-based (layered)	Height via stacked channel layers
ADM / BW64	ITU-R BS.2076 / BS.2088	channels + objects + HOA + matrix	Open interchange master (not a codec)

(Auro-3D and the ADM appear for context: Auro-3D is predominantly channel-layer-based rather than object-based, and ADM is an interchange model rather than a delivery codec.)

Scalability and adaptation: one master, many systems

The strategic argument for object audio is write once, render everywhere. A single master serves the entire deployment matrix, and each endpoint contributes its own renderer.

The scaling benefit

Consider the channel-based alternative: to serve a 2.0, 5.1, 7.1, 5.1.2, 7.1.4, 9.1.6, and a 64-channel cinema target with discrete masters, a studio would author and quality-check seven mixes, and would still be unable to serve the eighth layout that appears next year. With objects, the studio authors one master and the count of supported layouts is bounded only by the number of renderers in the field — which the studio does not have to build or store. The encoding cost is paid once; the decoding cost is distributed to the playback devices, each of which pays only for its own layout. This is precisely the economic shape of the encode/decode separation that recurs across spatialization techniques: a compact, layout-free encoding amortised across unlimited decodings.

There is a perceptual benefit too. Because the renderer knows the actual speaker positions (including the user's room calibration), it can do better than any layout-blind downmix: it panned to your speakers, not to the nominal ones the mix engineer guessed at. Combined with room calibration (covered in the calibration part of this guide), object rendering can preserve intended angles even when a user's speakers are imperfectly placed.

The cost: you must have a renderer

No renderer, no sound

The flip side is unavoidable: object audio does not play without a renderer. A channel feed is self-describing — feed $L_s$ to the left-surround speaker and you are done; the "renderer" is a wire. An object master is inert until software interprets the metadata.

This has several consequences:

Every endpoint needs the renderer, and renderers differ. Two renderers, both compliant, can produce audibly different results (different panning laws, different size and distance handling, different binaural HRTFs), so "the same master" does not guarantee "the same sound" — only "the same intent." We return to this under limits.
Compute at playback. Rendering 100+ objects continuously to 64 speakers, or HRTF-convolving them for binaural, is real-time signal processing the playback device must afford. Cheap devices cut corners (fewer objects, coarser panning), which is partly why object content can sound markedly better on a capable AV receiver than on a budget soundbar even at the same nominal "Atmos" badge.
Versioning and trust. Because the renderer is part of the reproduction chain, the quality of the renderer is now part of the product, and updates to it change how all existing masters sound. This is a maintenance surface that channel delivery never had.

Authoring workflow and how tools expose objects

Object authoring lives inside the digital audio workstation, with extra machinery for metadata.

The panner and metadata automation

The central tool the engineer touches is the object panner — a plug-in on each object track that exposes the metadata as automatable controls:

a position control, typically a 2-D pan field (a top-down view of the room) plus a separate elevation slider or a 3-D widget, mapping directly to the azimuth/elevation or $(x,y,z)$ metadata;
a size / spread control (sometimes split into width, depth, height);
snap to speaker and zone / speaker-mask toggles;
divergence and per-object gain;
an object on/off assignment that declares whether the track is a bed channel or a discrete object, and which object slot it occupies.

Every one of these is automatable, so a sound can be flown through the room by drawing automation curves exactly as one automates volume — the difference is that the curve drives metadata, not a baked-in pan. The DAW (with the relevant object-audio production suite, e.g. the Dolby Atmos Renderer running locally or integrated, or an MPEG-H authoring plug-in) listens to all these tracks and their metadata and renders a monitoring mix to whatever speaker layout the studio has, in real time, so the engineer hears an approximation of what a consumer renderer will produce. At print time the same system writes the ADM BWF master.

Beds, groups, and re-render outputs

Practical sessions organise tracks into the bed (a fixed-channel sub-master, often 7.1.2) and a pool of object tracks. Engineers route static, diffuse material (reverb returns, ambiences, music stems) to the bed and dynamic, discrete material to objects, exactly along the lines argued earlier. Authoring tools also produce re-renders — automatically derived channel-based versions (2.0, 5.1, 7.1.4, binaural) generated from the object master for QC and for delivery to channel-only pipelines. The re-render is the renderer running in the studio so the engineer can verify that the adaptation — not just the reference layout — sounds right.

Common mistakes and pitfalls

Object authoring introduces failure modes that channel mixing does not have:

Over-objectising. Turning every track into an object exhausts rendering slots, bloats the master, and gains nothing for static material. Put diffuse content in the bed.
Trusting one monitoring layout. A mix that sounds great on the studio's 9.1.6 can collapse on a 2.0 soundbar (recall the worked example: elevation simply vanishes). Always audition the re-renders, especially binaural and 2.0, which are how most consumers will actually hear it.
Snap misuse. Leaving snap on for a moving object makes it jump discretely from speaker to speaker; leaving it off for a centre-anchored dialogue object can blur the dialogue into a phantom. Match snap to intent.
Size as a loudness crutch. Increasing size spreads energy over more speakers and changes perceived level and timbre; using it to make something "bigger" can unbalance the mix on layouts with many speakers.
Metadata that fights the renderer. Specifying positions outside the reproduction volume, or distance values the renderer interprets differently than expected, yields surprises. Know your target renderer's conventions (polar vs Cartesian, how it treats $r>1$ ).
Ignoring loudness/DRC metadata. In MPEG-H and broadcast Atmos, loudness and dynamic-range metadata travel with the content; getting them wrong produces level jumps and over-compression at the consumer, independent of the spatial mix.

Common mistake

Never trust a single monitoring layout. The studio's flagship rig is the best case, not the typical one — so audition the re-renders, especially binaural and 2.0, before you sign off. They are how most listeners will actually hear the mix.

Pros, cons, and where quality really comes from

Advantages

Layout independence and future-proofing. One master serves all current and future systems; new layouts need only a renderer.
Per-source control downstream. Because sources remain separate until the last moment, endpoints can offer personalisation (dialogue boost, language switch, accessibility mixes) — impossible once sources are summed into channels.
Optimal use of the actual array. The renderer pans to the speakers that exist, in the positions they are in, which can beat any layout-blind downmix.
Smooth motion across dense arrays. In cinema, an object can travel continuously across dozens of speakers, which channel panning between a handful of fixed feeds cannot match.

Disadvantages

Renderer dependence. Nothing plays without a renderer, and renderers differ, so the experience is not fully determined by the master.
Compute and complexity at playback. Real-time rendering of many objects (and HRTF binauralisation) costs processing the endpoint must provide.
Metadata burden in authoring. Every object carries time-varying metadata to create, automate, and quality-check; the production surface is larger than a channel mix.
Variable consumer reality. A "Dolby Atmos" badge spans everything from a flagship 9.1.6 room to a two-driver phone, and the same master can sound transformative or nearly flat depending on the endpoint — a marketing-versus-physics gap the format cannot close.

Quality lives in the panner

Key takeaway

The decisive technical point: the audible quality of object audio is the quality of the renderer's panner, not of the metadata.

The metadata is just a target direction and size; whether that target becomes a sharp, stable, correctly-coloured phantom depends entirely on the panning law (pairwise vs VBAP vs Ambisonic decoding), the speaker layout, and the listener's position relative to it. All the localisation limits we met in amplitude panning — image blur off the sweet spot, the precedence-driven pull toward the nearest speaker, timbral coloration from comb filtering between speakers — reappear unchanged, because the renderer is an amplitude panner. The object model does not transcend the physics of phantom imaging from psychoacoustics; it merely defers the panning decision to playback time. A brilliant master through a crude renderer sounds crude; the format guarantees the intent survives, not the result.

The key point: rendering is encode then decode

Step back and the whole chapter collapses into the single idea that organises all of Part II.

When an engineer places an object at azimuth $30^\circ$ , elevation $20^\circ$ , size $0.2$ , that placement is an encoding of a sound-field intention into a compact, basis-independent representation. The representation is not loudspeaker feeds and not a captured field; it is a description — direction, extent, level — of what should reach the listener. It is deliberately abstract precisely so that it commits to no particular speaker layout, exactly as a B-format Ambisonic signal or a stereo difference-signal commits to no particular layout.

The renderer is the decoder. It takes that abstract description and the known geometry of the present system and derives the feeds: VBAP gains for a 64-speaker cinema, three-speaker triplet gains for a height-equipped living room, a tangent-law pair for a soundbar, HRTF convolutions for headphones. Each derivation is the same operation — solve for the loudspeaker (or ear-signal) weights that reconstruct the encoded intent on this system — and each is mathematically a projection of the encoded representation onto the basis of the available reproduction channels.

This is, line for line, the encode/decode structure of amplitude panning (encode = target angle, decode = pair gains via the tangent law), of Ambisonics (encode = spherical-harmonic coefficients, decode = the decoder matrix for the rig), and of binaural (encode = direction, decode = HRTF pair). It is why we insist in stereo is already spatial that the simplest two-speaker pan and the most elaborate object renderer are the same idea at different scales: author a field into a finite representation; derive the feeds for the system present. Object audio is the limiting case where the representation is made as explicit and as layout-free as possible — a literal list of "what is where" — so that the decode can be deferred all the way to the consumer's living room. Everything else in the format — beds, ADM, BW64, snap, zones, Atmos, MPEG-H, DTS:X — is engineering in service of that one move.

Seen this way, object-based audio is not a different paradigm from the rest of spatialization. It is the cleanest expression of the paradigm: separate the description of the field from its reproduction, push the decode as late as possible, and let every system reconstruct the same intent in the best way its own geometry allows. The cost — that you must always carry a decoder — is the price of never again being locked to a layout.

References

Pulkki, V. (1997). "Virtual Sound Source Positioning Using Vector Base Amplitude Panning." Journal of the Audio Engineering Society, 45(6), 456–466.
ITU-R Recommendation BS.2076 (2019). Audio Definition Model. International Telecommunication Union, Geneva.
ITU-R Recommendation BS.2088 (2019). Long-form file format for the international exchange of audio programme materials with metadata (BW64). International Telecommunication Union, Geneva.
ITU-R Recommendation BS.775 (2012). Multichannel stereophonic sound system with and without accompanying picture. International Telecommunication Union, Geneva.
ISO/IEC 23008-3:2019. Information technology — High efficiency coding and media delivery in heterogeneous environments — Part 3: 3D audio (MPEG-H 3D Audio). International Organization for Standardization, Geneva.
ISO/IEC 23003-2:2010. Information technology — MPEG audio technologies — Part 2: Spatial Audio Object Coding (SAOC). International Organization for Standardization, Geneva.
Dolby Laboratories (2018). Dolby Atmos Next-Generation Audio for Cinema and Dolby Atmos Home Theater Installation Guidelines. Dolby Laboratories, San Francisco.
Herre, J., Hilpert, J., Kuntz, A., & Plogsties, J. (2015). "MPEG-H Audio — The New Standard for Universal Spatial / 3D Audio Coding." Journal of the Audio Engineering Society, 62(12), 821–830.
Rumsey, F. (2001). Spatial Audio. Focal Press, Oxford.

← Back to Spatialization Techniques

From channel lock to the object idea​

Why channel-based audio is locked to a layout​

The object idea: audio plus positional metadata​

The metadata model in detail​

Position: coordinate systems​

Worked example: polar to Cartesian​

Size and spread​

Gain, snap, and divergence​

Zones, exclusion, and importance​

The Audio Definition Model and the BW64 carrier​

The renderer: turning metadata into per-speaker gains​

Amplitude panning recap​

VBAP: vector base amplitude panning​

Worked example: VBAP gains for a height triplet​

How size and spread engage more speakers​

Distance and the rest of the chain​

Beds and objects​

What a bed is​

Why the hybrid model wins​

Dolby Atmos architecture, concretely​

The production stream and the master​

Rendering targets from one master​

Worked example: object at (30∘,20∘)(30^\circ, 20^\circ)(30∘,20∘) to 7.1.4 vs a 2.0 soundbar​

Other object systems​

MPEG-H Audio (ISO/IEC 23008-3)​

DTS:X​

Spatial Audio Object Coding (SAOC)​

Scalability and adaptation: one master, many systems​

The scaling benefit​

The cost: you must have a renderer​

Authoring workflow and how tools expose objects​

The panner and metadata automation​

Beds, groups, and re-render outputs​

Common mistakes and pitfalls​

Pros, cons, and where quality really comes from​

Advantages​

Disadvantages​

Quality lives in the panner​

The key point: rendering is encode then decode​

References​