Post-Production: Film, TV & Streaming

Post-production is where spatial audio becomes a product. A film, an episodic series, or a streaming title arrives in the mix room as a pile of recorded and edited stems, and it leaves as a small set of files that must play correctly on a cinema processor, a soundbar, a phone with earbuds, and a legacy stereo television — all from the same creative intent. This chapter walks the practitioner through that journey end to end: the immersive landscape and its formats, the calibrated room that makes mixing trustworthy, the spatial strategy for placing dialogue, music and effects, the renderer that maps your mix to every target, the deliverables you actually ship, the loudness rules that govern delivery, and the quality-control pass that keeps a show from being rejected.

Everything here builds on earlier parts. The object-based machinery comes from object-based audio and the format taxonomy from formats. The room is governed by measurement and calibration and equalization and room correction. The creative use of space draws on reverberation and direct, diffuse and envelopment. And because most of the world still listens in stereo or 5.1, the recurring truth that stereo is already spatial governs the fold-down and downmix work that occupies more of your time than you would expect.

The immersive post landscape

For most of film history, the deliverable was a channel format: a left-centre-right-surround mix baked to specific loudspeakers, then matrixed down (see surround and matrix) for the home. The immersive era inverts this. Modern post is dominated by Dolby Atmos, which represents a mix not as a fixed set of channels but as a bed plus objects: a static channel bed (typically $7.1.2$: seven horizontal channels, one LFE, two height channels) carrying anything that is naturally diffuse or omnipresent, plus up to $118$ dynamic objects (the format carries at most $128$ simultaneous inputs, ten of which a $7.1.2$ bed occupies), each a mono audio stream with time-varying positional metadata $(x, y, z)$. This is the object-based paradigm of object-based audio, applied at scale.
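To make the "mono stream plus time-varying position" idea concrete, here is a minimal sketch of an object as a data structure. The class name, the keyframe layout and the normalised $-1 \ldots 1$ coordinate convention are all illustrative assumptions, not the ADM or Dolby data model, and the linear interpolation is only one plausible way to move between keyframes.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class AudioObject:
    """Hypothetical stand-in for one object: a name plus position keyframes."""
    name: str
    # Keyframes as (time_seconds, (x, y, z)), sorted by time.
    keyframes: list = field(default_factory=list)

    def position_at(self, t: float) -> tuple:
        """Linearly interpolate the position at time t, holding the end values."""
        times = [k[0] for k in self.keyframes]
        i = bisect_right(times, t)
        if i == 0:
            return self.keyframes[0][1]
        if i == len(self.keyframes):
            return self.keyframes[-1][1]
        (t0, p0), (t1, p1) = self.keyframes[i - 1], self.keyframes[i]
        w = (t - t0) / (t1 - t0)
        return tuple(a + w * (b - a) for a, b in zip(p0, p1))


# A fly-over: rear-left at floor level to front-right at full height in 4 s.
heli = AudioObject("helicopter", [(0.0, (-1.0, -1.0, 0.0)), (4.0, (1.0, 1.0, 1.0))])
midpoint = heli.position_at(2.0)
```

The audio itself is deliberately left out; the point is only that the position is data travelling alongside the stream, which is what lets a renderer re-interpret it for any layout.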

The decisive consequence: the mix is renderer-agnostic. You author once against a virtual room, and a renderer in the playback device decides how to reproduce each object on the loudspeakers actually present — whether that is a $9.1.6$ dub stage, a $5.1.4$ home theatre, a $2.0$ soundbar, or a pair of earbuds running binaural. The mix engineer's job shifts from "make it sound right on these speakers" to "make it sound right on every speaker, and verify that."

The three delivery contexts

The same Atmos master feeds three distinct ecosystems, and the practitioner must hold all three in mind from the first fader move:

Cinema (theatrical). Dolby Atmos for cinema uses up to $64$ speaker feeds rendered from the object master by the cinema processor. Reference monitoring is loud — around $85\ \text{dB}$ SPL per main channel (pink noise at $-20\ \text{dBFS}$ RMS) — and the room follows a defined high-frequency roll-off (the X-curve).
Home / streaming. The same content is encoded as Dolby Digital Plus with Joint Object Coding (E-AC-3 JOC) or AC-4 IMS for streaming, carrying the bed and a compressed object representation that the consumer device's renderer unpacks. Reference level is lower, around $79\ \text{dB}$ SPL per channel for a near-field room.
TV broadcast. Increasingly Atmos-capable but bound by broadcast loudness law (ATSC A/85 in North America, EBU R128 in Europe), and still requiring a robust $5.1$ and stereo fold-down for the large installed base of legacy receivers.

Key takeaway

In immersive post you are not making "a mix." You are authoring a spatial intent that a renderer will realise on hardware you have never seen. The whole workflow — calibration, monitoring, QC — exists to make that intent survive translation.

The mix room

A reference environment, not a nice-sounding room

A mix is only as trustworthy as the room it was made in. Immersive post rooms are built to a reference specification so that decisions made in Los Angeles, London and Tokyo translate. The governing standard for critical listening rooms is ITU-R BS.1116, which constrains room dimensions and modal distribution, reverberation time, background noise (no higher than about NR 15, which is very low ambient noise), and early-reflection control. Layout geometry (speaker angles and elevations for $5.1$, $7.1$ and immersive configurations) comes from ITU-R BS.2051 and is the practical subject of speaker layouts and topologies.

The goal is a room whose own acoustic signature is as neutral and as quiet as practical, so that what you hear is the mix and not the room. Early reflections are absorbed or diffused (see direct, diffuse and envelopment); a typical critical room targets a broadband reverberation time on the order of $T_{60} \approx 0.2\text{–}0.4\ \text{s}$ scaled to volume, deliberately drier than a domestic living room so that the mix's reverb — not the room's — defines the space.

Reference levels and the X-curve

Calibration sets the absolute relationship between the digital signal and acoustic level, so that a fader at unity means the same loudness everywhere. The procedure, detailed in measurement and calibration:

Feed band-limited pink noise at $-20\ \text{dBFS}$ RMS to one channel at a time.
Measure with a calibrated SPL meter (C-weighted, slow) at the mix position.
Trim each channel's gain to the target.

For theatrical mixing each screen channel is set to $85\ \text{dB}$ SPL (C, slow) for $-20\ \text{dBFS}$ pink noise. The LFE is set $+10\ \text{dB}$ hotter in-band to give the low-frequency channel its characteristic headroom. Surround channels are the exception in conventional $5.1$/$7.1$ cinema practice: each surround array is trimmed a few dB lower (typically $82\ \text{dB}$ SPL) because several arrays sum at the listening position.

Theatrical rooms also impose the X-curve (defined in SMPTE ST 202 / ISO 2969): a deliberate high-frequency roll-off of the in-room response, flat to about $2\ \text{kHz}$, then falling roughly $3\ \text{dB/octave}$ up to $10\ \text{kHz}$ and more steeply above that, intended to match the average large-room response of real cinemas so the mix translates to theatres. This is target-curve room correction, the cinema-specific cousin of the practices in equalization and room correction. Critically, the X-curve is a cinema convention: home/near-field Atmos rooms do not use it; they aim for a flatter, gently tilted in-room target.
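As a rough illustration of what such a target looks like numerically, the sketch below evaluates an approximate X-curve at a given frequency. It assumes the commonly published shape (flat to $2\ \text{kHz}$, $-3\ \text{dB/octave}$ to $10\ \text{kHz}$, $-6\ \text{dB/octave}$ above) and ignores the low-frequency roll-off and the room-size adjustments the standard also describes; the function name is hypothetical.

```python
import math


def x_curve_db(freq_hz: float) -> float:
    """Approximate X-curve target in dB relative to the flat midband.

    Assumed shape: flat up to 2 kHz, -3 dB/octave from 2 kHz to 10 kHz,
    -6 dB/octave above 10 kHz. The low-frequency roll-off is omitted.
    """
    if freq_hz <= 2000.0:
        return 0.0
    if freq_hz <= 10000.0:
        return -3.0 * math.log2(freq_hz / 2000.0)
    level_at_10k = -3.0 * math.log2(10000.0 / 2000.0)
    return level_at_10k - 6.0 * math.log2(freq_hz / 10000.0)


# Target attenuation at a few spot frequencies.
targets = {f: round(x_curve_db(f), 1) for f in (1000, 4000, 8000, 16000)}
```

A near-field room would simply not apply this function; its target stays close to flat across the same band.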

For near-field / home monitoring (and most TV and streaming work), each channel is typically calibrated to $79\ \text{dB}$ SPL for $-20\ \text{dBFS}$ pink noise, reflecting the shorter listening distance and lower playback levels of domestic rooms. Some facilities use $76\ \text{dB}$ for very near-field setups.

Context	Per-channel ref level	Test signal	Room target	LFE
Theatrical (dub stage)	$85\ \text{dB}$ SPL (C, slow)	$-20\ \text{dBFS}$ pink, band-limited	X-curve (ST 202)	$+10\ \text{dB}$ in-band
Home / streaming near-field	$79\ \text{dB}$ SPL	$-20\ \text{dBFS}$ pink	Flat / gentle tilt	$+10\ \text{dB}$ in-band
Broadcast TV post	$79\ \text{dB}$ SPL	$-20\ \text{dBFS}$ pink	Flat / gentle tilt	$+10\ \text{dB}$ in-band
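The trim step of the calibration procedure is simple arithmetic: target minus measured. The sketch below assumes hypothetical meter readings and channel names, and covers only the full-range channels; the LFE's $+10\ \text{dB}$ in-band offset needs a band-level measurement and is left out.

```python
# Reference targets from the table above, in dB SPL (C-weighted, slow).
TARGET_SPL_DB = {"theatrical": 85.0, "near_field": 79.0}


def channel_trims(measured_spl_db: dict, context: str) -> dict:
    """Return the gain trim in dB that brings each channel to the target."""
    target = TARGET_SPL_DB[context]
    return {ch: round(target - spl, 2) for ch, spl in measured_spl_db.items()}


# Hypothetical readings at the mix position for -20 dBFS pink noise.
measured = {"L": 79.6, "C": 78.9, "R": 79.0, "Ls": 80.2, "Rs": 77.5}
trims = channel_trims(measured, "near_field")
```

A positive trim means the channel was quiet and needs gain; after applying the trims, each channel should be re-measured rather than assumed correct.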

Per-channel calibration and time alignment

Level matching is necessary but not sufficient. Every loudspeaker in an immersive array must also be time-aligned and phase-coherent, or phantom images between speakers smear and overhead pans collapse. Using the methods of time alignment and phase, each speaker's distance to the mix position is measured and delays trimmed so that an impulse arrives simultaneously from all of them. A $30\ \text{cm}$ difference in path length is

$$t = \frac{d}{c} = \frac{0.30\ \text{m}}{343\ \text{m/s}} \approx 0.87\ \text{ms},$$

enough to pull a phantom image audibly off-centre and to comb-filter correlated content. Bass management routes everything below the crossover (commonly $80\text{–}120\ \text{Hz}$ ) to the subwoofer(s); see subwoofers and bass management. In an Atmos room the bed LFE and the bass-managed low end of full-range objects must be reconciled so that you are not double-counting low frequency.
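The same $t = d/c$ relation gives the delay each speaker needs. A minimal sketch, assuming hypothetical speaker names and distances and a fixed speed of sound of $343\ \text{m/s}$: the farthest speaker gets no delay, and every closer speaker is delayed by its path-length shortfall.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed room-temperature value


def alignment_delays_ms(distances_m: dict) -> dict:
    """Delay per speaker, in ms, so all arrivals coincide with the farthest one."""
    farthest = max(distances_m.values())
    return {
        name: round((farthest - d) / SPEED_OF_SOUND_M_S * 1000.0, 3)
        for name, d in distances_m.items()
    }


# Hypothetical distances from each speaker to the mix position, in metres.
delays = alignment_delays_ms({"L": 2.10, "C": 2.00, "R": 2.10, "Ltf": 2.40})
```

Measured impulse responses are the better source for these numbers in practice, since tape-measure distances ignore any latency inside the speaker or its processing.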

Warning

If your room is not calibrated to a reference level and target curve, every spatial and tonal decision you make is a guess that will not survive the renderer or the consumer's living room. A $+3\ \text{dB}$ surround imbalance you can't hear becomes an obvious error on a properly aligned home system — and a rejection from streaming QC.

The spatial mixing strategy

Dialogue is anchored

The single most important rule in narrative post: dialogue must be intelligible and stable. In the immersive model, on-screen dialogue is typically placed as an object anchored to the centre, or as a tightly-controlled object that tracks a character's on-screen position only when the picture motivates it. The centre channel exists precisely so that dialogue does not depend on a phantom image — every seat in a cinema, on- or off-axis, hears speech from a real loudspeaker behind the screen.

Two practical patterns:

Anchored centre. Most dialogue lives as a centre-positioned object or in the bed's centre channel, EQ'd and de-essed, gently compressed, and sitting at a consistent loudness that will satisfy dialogue-gated loudness measurement (below).
Spatial dialogue (sparingly). A character walking across frame can have their voice as a moving object, but only when it stays intelligible and the picture justifies it. Off-screen and crowd voices spread more freely.

Music and ambience spread; effects become objects

The reason to mix in immersive at all is envelopment — wrapping the listener in a believable acoustic world (see direct, diffuse and envelopment). The craft is matching each element's spatial behaviour to its dramatic role:

Music is usually spread across the bed and front-wide objects, sometimes with reverb returns placed in the surrounds and heights to lift it off the screen. Score reverb in the height layer creates a sense of overhead acoustic space without distracting from dialogue.
Ambience (room tone, weather, traffic, crowd) is the diffuse fabric — distributed across surrounds and heights as bed content or wide objects so it surrounds without localising.
Hard effects (a passing car, an arrow, a helicopter) are objects with trajectories. A helicopter that flies from rear-left up and over to front-right is exactly the case object metadata was designed for, and exactly the case that channel-based mixing could never do convincingly.

Using the room and reverb

Spatial depth in a mix comes from controlled use of reverberation, drawing on reverberation and distance and air. Three levers create depth:

Direct-to-reverberant ratio. A near source is dry and loud; a distant source is quieter with more reverb. Pulling an object back in the scene means lowering its level and increasing its send to a reverb whose returns are placed in the surround/height field.
Pre-delay and early reflections. Short pre-delay glues a source to a small space; longer pre-delay implies a large hall. Early-reflection patterns sell the size of the room the action is in.
Air absorption and HF roll-off. Distant sources lose high frequency to air absorption; a gentle low-pass on a far object reinforces distance, mirroring the physics in distance and air.
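The first lever can be put in numbers. The sketch below assumes a point source in free field (direct level falls about $6\ \text{dB}$ per doubling of distance) and a reverberant level that stays constant with distance, which is a deliberate simplification; the function names and the $-18\ \text{dB}$ reverb level are illustrative only.

```python
import math


def direct_gain_db(distance_m: float, ref_distance_m: float = 1.0) -> float:
    """Direct-sound level relative to the reference distance (inverse-distance law)."""
    return -20.0 * math.log10(distance_m / ref_distance_m)


def direct_to_reverb_db(distance_m: float, reverb_level_db: float = -18.0,
                        ref_distance_m: float = 1.0) -> float:
    """Direct-to-reverberant ratio, assuming the reverb level does not change."""
    return direct_gain_db(distance_m, ref_distance_m) - reverb_level_db


# Pulling a source back from 1 m to 8 m under these assumptions.
ratios = {d: round(direct_to_reverb_db(d), 1) for d in (1.0, 2.0, 4.0, 8.0)}
```

Under these assumptions the ratio drops from $18\ \text{dB}$ at $1\ \text{m}$ to roughly $0\ \text{dB}$ at $8\ \text{m}$, which is the "quieter with more reverb" behaviour described above; real mixes set these values by ear.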

Reverb returns are themselves often placed as bed content or wide objects so that the space is enveloping even when the dry source is a focused object. This is where the immersive format earns its keep: the dry helicopter is a moving point; the reverberant tail of the canyon it flies through wraps the entire room.

Rule of thumb

Mix the dry element as an object for localisation, and place its reverb return in the bed or wide objects for envelopment. Localisation and envelopment are different perceptual jobs — give each its own spatial home.

The renderer and monitoring

How the Atmos renderer maps the mix

At the heart of an Atmos session is the renderer (the Dolby Atmos Renderer software, or a renderer integrated into the DAW). It receives the bed channels and every object's audio-plus-metadata and, in real time, computes loudspeaker feeds for the monitoring layout you have told it you have — $7.1.4$ , $9.1.6$ , and so on. Objects are placed by amplitude panning between the nearest speakers (the same vector-base amplitude logic discussed in amplitude panning), with the renderer handling the math so that a position $(x,y,z)$ produces a stable image regardless of how many speakers the room has.
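The core idea of amplitude panning between two adjacent speakers can be sketched with a constant-power law. This is a textbook two-speaker simplification, not the Atmos renderer's actual algorithm, and the function name is hypothetical.

```python
import math


def constant_power_pair(pan: float) -> tuple:
    """Gains for two adjacent speakers; pan=0 is the first, pan=1 the second.

    The squared gains always sum to 1, so perceived power stays constant
    as the source moves between the speakers.
    """
    theta = pan * math.pi / 2.0
    return math.cos(theta), math.sin(theta)


# A source exactly halfway between the two speakers.
g_first, g_second = constant_power_pair(0.5)
```

At the midpoint both gains are about $0.707$ ($-3\ \text{dB}$); a renderer extends the same principle to the three or more speakers surrounding an object's position.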

The renderer also performs the re-render to every standard layout: from one authoring session it can output $7.1.2$ , $5.1.4$ , $5.1$ , and $2.0$ , plus a binaural stream — continuously, while you mix. This is the practical realisation of the format-agnostic promise: you are always one click from hearing what a smaller room will get.

Monitoring the binaural render

A large and growing share of streaming Atmos is consumed on headphones, where the renderer's binaural output applies HRTF-based processing (see binaural) to fold the entire 3D scene into two ears. The Atmos renderer lets you assign each object and bed channel a binaural "distance" mode (near / mid / far / off) that trades intimacy against spaciousness and externalisation.

You must monitor this. A mix that is gorgeous on a $9.1.6$ stage can sound smeared, phasey or fatiguing on headphones if height objects pile up or if dialogue gets pushed too far in the binaural distance model. Practical checks:

Audition the binaural render on the reference earbuds/headphones your audience actually uses (the renderer's default HRTF approximates the consumer chain).
Confirm dialogue stays centred and externalised, not "inside the head."
Verify that big overhead moves still read as overhead and not just as a tonal shift.

Checking the downmixes live

The renderer's simultaneous 5.1 and 2.0 downmix outputs let you catch fold-down problems while you can still fix them. Monitor the $5.1$ re-render and the $2.0$ (Lo/Ro and Lt/Rt) fold-downs periodically, listening specifically for content that disappears or doubles when channels sum — the subject of the next sections.

Renderer output	Use	Watch for
$7.1.4$ / $9.1.6$ beds+objects	Primary monitoring	Object localisation, height balance
$5.1.4$ / $5.1.2$ re-render	Home theatre check	Height fold to fewer overheads
$5.1$ downmix	Broadcast/legacy	Surround sum, LFE management
$2.0$ (Lt/Rt, Lo/Ro)	Stereo / mobile	Phase cancellation, dialogue level
Binaural	Headphone streaming	Externalisation, dialogue centring, fatigue

Deliverables

The ADM BWF master is the source of truth

The primary immersive deliverable is the ADM BWF: a Broadcast Wave file (.wav) whose audio essence carries all bed and object streams, and whose Audio Definition Model (ADM) metadata chunk (ITU-R BS.2076) describes the bed configuration and every object's positional automation over time. This single file is the master — losslessly re-renderable to any target — and is what you archive and hand to the encoder. In Dolby workflows it is sometimes wrapped/validated as the Dolby Atmos master file set (the .atmos family: .atmos, .atmos.audio, .atmos.metadata), which the encoder turns into the distribution codecs (DD+ JOC, AC-4).

From that master you generate (or the renderer generates) the channel-based deliverables that legacy chains need.

A typical deliverables table

A streaming title might require the following package. Exact specs vary per distributor, so always work to the platform's current delivery spec — but the shape is consistent:

Deliverable	Format	Loudness target	True-peak max	Notes
Immersive master	ADM BWF (bed + objects)	Integrated per spec (often $-27$ to $-24$ LKFS dialogue-gated)	$-1\ \text{dBTP}$	Source of all renders
$5.1$ printmaster	6-ch WAV / stems	$-24\ \text{LKFS}$ (A/85) or $-23\ \text{LUFS}$ (R128)	$-2\ \text{dBTP}$ (broadcast)	Conform to picture
$2.0$ stereo (Lt/Rt)	Stereo WAV	Same as 5.1	$-1$ to $-2\ \text{dBTP}$	Matrix-encoded surround
$2.0$ stereo (Lo/Ro)	Stereo WAV	Same	$-1\ \text{dBTP}$	Simple fold-down
Binaural	Stereo WAV (renderer)	Per platform	$-1\ \text{dBTP}$	Headphone deliverable
M&E / stems	DME splits	Per spec	$-1\ \text{dBTP}$	Dialogue, Music, Effects for dub/localisation

The DME stems (Dialogue, Music, Effects) deserve emphasis: foreign-language dubbing replaces only the dialogue stem, so the M&E (music-and-effects) must be a complete, fully-mixed bed without dialogue. In immersive delivery this means delivering the bed+objects split by D, M and E so a localisation house can re-author with new dialogue objects.

Downmix safety and fold-down

Every channel-based deliverable is produced by fold-down: summing the immersive scene into fewer channels. The danger is phase cancellation and level build-up. Two classic failure modes:

Surround-to-stereo collapse. A sound panned hard to a surround is encoded anti-phase between Lt and Rt by the matrix, so it can vanish in a mono fold-down (the matrix behaviour covered in surround and matrix). Test by summing Lt+Rt to mono and listening for disappearing effects.
Dialogue level shift. Centre content sums into both L and R of a stereo fold-down at a defined attenuation (commonly $-3\ \text{dB}$ ); get the coefficient wrong and dialogue is too loud or too soft relative to the immersive master.

The renderer applies standard downmix coefficients, but you remain responsible for auditioning the result. A height object folding down to the horizontal plane, a wide ambience folding to stereo — each can shift balance in ways the meter does not show.
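Both failure modes show up in a few lines of arithmetic. The sketch below uses commonly quoted $-3\ \text{dB}$ coefficients and a simplified Lt/Rt matrix that ignores the 90-degree phase shifts a real encoder applies and drops the LFE; these are assumptions for illustration, not the renderer's exact coefficients. Inputs are single sample values.

```python
import math

G = 1.0 / math.sqrt(2.0)  # -3 dB


def lo_ro(L, R, C, Ls, Rs):
    """Simple stereo fold-down: centre and surrounds added at -3 dB."""
    return L + G * C + G * Ls, R + G * C + G * Rs


def lt_rt(L, R, C, Ls, Rs):
    """Simplified matrix fold-down: summed surround goes anti-phase into Lt/Rt."""
    S = G * (Ls + Rs)
    return L + G * C - S, R + G * C + S


# A sound that exists only in the left surround.
lt, rt = lt_rt(0.0, 0.0, 0.0, 1.0, 0.0)
lo, ro = lo_ro(0.0, 0.0, 0.0, 1.0, 0.0)
mono_from_ltrt = lt + rt   # surround cancels completely
mono_from_loro = lo + ro   # surround survives

# Centre-only dialogue lands in each stereo channel about 3 dB down.
dialogue_level_db = 20.0 * math.log10(lo_ro(0.0, 0.0, 1.0, 0.0, 0.0)[0])
```

The surround-only sound sums to exactly zero in the Lt+Rt mono check but not in Lo/Ro, which is why the two stereo deliverables have to be auditioned separately.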

Danger

Never assume the automatic downmix is safe. Among the most common rejections in streaming QC is a $5.1$ or stereo fold-down where an effect cancels or dialogue jumps level. Audition every fold-down to mono and stereo before delivery: the few minutes it takes are cheaper than a re-delivery and a missed air date.

Loudness and true-peak

Why loudness rules exist, and what they measure

Channel counts and panning are creative; loudness is law. Broadcasters and streamers enforce loudness so that programmes and ads match, and so the viewer never reaches for the remote. All modern loudness measurement descends from ITU-R BS.1770, which defines:

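K-weighting: a filter approximating perceived loudness, with a high-frequency shelf that emphasises the presence region and a high-pass that rolls off the low end.
Gating: loudness is measured in overlapping $400\ \text{ms}$ blocks; an absolute gate discards blocks below $-70\ \text{LUFS}$, and a relative gate then discards blocks more than $10\ \text{LU}$ below the loudness of the blocks that survived the absolute gate, so silences and quiet passages do not drag the measured loudness down.
Integrated loudness in LUFS/LKFS (two names for the same unit; a change of $1\ \text{LU}$ equals a change of $1\ \text{dB}$).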
True-peak measurement via oversampling, in $\text{dBTP}$ , catching inter-sample peaks that a sample-peak meter misses.
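The two-stage gate is easier to follow in code. This sketch takes per-block loudness values as input (assumed to be already K-weighted and channel-summed over $400\ \text{ms}$ blocks) and applies the absolute gate, then the relative gate set $10\ \text{LU}$ below the loudness of the blocks that passed the absolute gate, as BS.1770 describes. The filtering and block segmentation are omitted, and the function name is hypothetical.

```python
import math


def gated_integrated_lufs(block_lufs: list) -> float:
    """Integrated loudness from per-block loudness values, with both gates."""

    def energy(loudness: float) -> float:
        # Inverse of L = -0.691 + 10*log10(z)
        return 10.0 ** ((loudness + 0.691) / 10.0)

    def mean_loudness(blocks: list) -> float:
        mean_energy = sum(energy(b) for b in blocks) / len(blocks)
        return -0.691 + 10.0 * math.log10(mean_energy)

    # Stage 1: absolute gate at -70 LUFS.
    above_abs = [b for b in block_lufs if b > -70.0]
    if not above_abs:
        return float("-inf")

    # Stage 2: relative gate, 10 LU below the absolute-gated loudness.
    threshold = mean_loudness(above_abs) - 10.0
    kept = [b for b in above_abs if b > threshold]
    return mean_loudness(kept)


# Two loud blocks and one quiet passage: the quiet block is gated out.
result = gated_integrated_lufs([-20.0, -20.0, -50.0])
```

Without gating, the quiet block would pull the average down by almost $2\ \text{LU}$; with it, the reading stays at the level of the programme's audible content.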

The two regimes

Regime	Standard	Integrated target	Tolerance	True-peak max	Gating
Europe / EBU	EBU R128 (uses BS.1770)	$-23\ \text{LUFS}$	$\pm 0.5\ \text{(or }\pm1\text{) LU}$	$-1\ \text{dBTP}$	Program-gated
North America / ATSC	ATSC A/85	$-24\ \text{LKFS}$	$\pm 2\ \text{LU}$	$-2\ \text{dBTP}$ (typical)	Dialogue-gated
Streaming (typical)	Platform-specific	$\sim -27$ to $-24\ \text{LKFS}$ (dialogue-gated, Atmos)	per platform	$-1\ \text{dBTP}$	Dialogue-gated

Two subtleties matter enormously in practice:

Dialogue gating. ATSC A/85 (via the "dialnorm" concept and the Dolby Dialogue Intelligence engine) measures loudness primarily over dialogue, on the principle that viewers judge programme loudness by speech. EBU R128 historically gates on the whole programme. So the same mix can read differently under the two standards, and a dialogue-anchored mixing strategy is not just good craft — it is what makes dialogue-gated loudness stable.
Atmos loudness. Immersive loudness is measured on a defined downmix (the renderer's $5.1$ or stereo re-render), not by summing all immersive channels, because more channels would otherwise inflate the reading. Always meter the deliverable's specified render, with the platform's specified gating.

A worked loudness example

Suppose you have finished a $7.1.4$ Atmos mix of a streaming drama scene and must hit a streaming spec of $-24\ \text{LKFS}$ integrated, dialogue-gated, true-peak $\le -1\ \text{dBTP}$ .

Meter the correct render. Route the renderer's $5.1$ (or stereo, per spec) re-render into a BS.1770 meter with dialogue gating enabled.
Read the integrated value. The meter reports $-26.8\ \text{LKFS}$ integrated over the programme. The error is $-26.8 - (-24) = -2.8$, i.e. the programme is $2.8\ \text{LU}$ below target.
Apply a static offset. Raise the master output by $+2.8\ \text{dB}$ . Because LUFS scales $1:1$ with gain, the new integrated reads $-24.0\ \text{LKFS}$ . Re-measure to confirm; never trust a single-pass estimate on dynamic material.
Check true-peak after the gain. The $+2.8\ \text{dB}$ lift pushes the true-peak meter to $-0.4\ \text{dBTP}$ — over the $-1\ \text{dBTP}$ limit by $0.6\ \text{dB}$ .
Resolve the conflict. You cannot simply turn down (that breaks loudness). Insert a true-peak limiter with ceiling $-1.0\ \text{dBTP}$ and minimal gain reduction, or back off the offending peaks. After limiting, re-meter both: integrated $-24.0\ \text{LKFS}$ , true-peak $-1.0\ \text{dBTP}$ .
Verify dialogue consistency. Spot-check that the dialogue-gated reading is dominated by speech and not by a loud effects passage, which would mean your dialogue is too quiet relative to effects — a creative fix (raise dialogue, lower effects), not a mastering fix.
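The arithmetic in the steps above can be sketched as a small planning function. The function name and return keys are hypothetical, and the pre-gain true-peak of $-3.2\ \text{dBTP}$ is inferred from the example ($-0.4\ \text{dBTP}$ after a $+2.8\ \text{dB}$ lift) rather than stated in it.

```python
def loudness_plan(measured_lkfs: float, measured_tp_dbtp: float,
                  target_lkfs: float = -24.0,
                  tp_ceiling_dbtp: float = -1.0) -> dict:
    """Static gain needed to hit target, and whether a limiter is then required."""
    gain_db = round(target_lkfs - measured_lkfs, 2)
    tp_after = round(measured_tp_dbtp + gain_db, 2)
    overshoot = round(max(0.0, tp_after - tp_ceiling_dbtp), 2)
    return {
        "gain_db": gain_db,
        "true_peak_after_db": tp_after,
        "limiter_needed": overshoot > 0.0,
        "overshoot_db": overshoot,
    }


# The worked example: -26.8 LKFS integrated, -3.2 dBTP before any gain.
plan = loudness_plan(-26.8, -3.2)
```

The plan only predicts what the static offset will do; the re-measurement after gain and limiting is still what decides compliance.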

Tip

Set loudness with a static gain offset, then control true-peak with a transparent limiter at the very end of the chain. Loudness is a long-term average and responds to a single trim; true-peak is a short-term event and needs a limiter. Confusing the two — riding a limiter to hit loudness — squashes your dynamics and still misses target.

Handling stereo and legacy material

Most of what you mix is not immersive

A recurring reality of post: the picture cut hands you music delivered as stereo stems, library effects in stereo or mono, archival material in $5.1$ at best, and production dialogue in mono. The immersive master must absorb all of it gracefully. This is the practical face of stereo is already spatial — a stereo stem already encodes width and depth via inter-channel level and time differences, and your job is to place that spatial information in the immersive scene without destroying it.

Strategies for placing stereo in the immersive field

Front-wide objects. The cleanest approach: map a stereo stem to two objects at, say, $\pm 30°$ front-wide, preserving its internal imaging while letting it breathe wider than the screen channels. The stereo's phantom centre stays centred; its width stays intact.
Upmix to the bed/heights. When you need a stereo music bed to envelop, an upmixer (decorrelation- and matrix-based) can derive surround and height content from the stereo source. Used judiciously this adds spaciousness; overused it smears transients and collapses the front image. See the DAM treatment in HSR stereo upmixing for multi-speaker systems for the level-controlled approach.
Stereo reverb returns to surrounds/heights. Often the best use of a stereo source's spatial energy is to send a stereo reverb into the surround/height field as envelopment, keeping the dry stereo focused up front.
Mono dialogue to centre object. Production dialogue is mono and belongs as a centre-anchored object; never artificially "stereo-ise" dialogue, which only invites fold-down phase problems.

Preserving downmix safety from the start

The discipline: build immersive from material that already folds down safely. If your stereo placements and upmixes are decorrelated in a way that cancels in mono, that error propagates into every downmix deliverable. Audition stereo placements in mono as you make them, not at the end. A stereo music bed that sounds wide in the room but vanishes in mono fold-down has decorrelation that is too aggressive — pull it back.
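One quick numeric companion to the mono audition is the inter-channel correlation of a stereo placement. The sketch below is a plain normalised correlation over two sample lists, with hypothetical function and variable names; values near $+1$ fold down safely, values near $-1$ will cancel in mono, and it complements listening rather than replacing it.

```python
import math


def correlation(left: list, right: list) -> float:
    """Normalised correlation between two channels, from -1 to +1."""
    num = sum(l * r for l, r in zip(left, right))
    den = math.sqrt(sum(l * l for l in left) * sum(r * r for r in right))
    return num / den if den else 0.0


# A 1 kHz tone at 48 kHz, compared with itself and with its inverse.
tone = [math.sin(2.0 * math.pi * k / 48.0) for k in range(480)]
in_phase = correlation(tone, tone)
inverted = correlation(tone, [-x for x in tone])
```

A wide bed that hovers around zero is usually fine; sustained negative readings are the warning sign that decorrelation has gone too far.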

Note

The immersive deliverable is rarely "immersive material." It is mostly stereo and mono sources skilfully placed in a 3D scene. Mastering the translation of legacy material is the everyday craft of post; the spectacular overhead object is the exception.

QC and conform

Check every layout, not just the one you mixed on

Quality control is the gate before delivery, and the cardinal rule is that you QC every rendered layout the deliverable requires, because each is a different translation that can fail independently. A pass that only checks the $9.1.6$ master is incomplete.

A practical QC checklist per project:

Immersive master integrity. Confirm the ADM BWF opens, the object count and bed config match the session, and the positional metadata plays back identically to the session (re-render and null against the live renderer if possible).
Every channel re-render. Audition $7.1.4 \to 5.1.4 \to 5.1 \to 2.0$ . Listen for: height content folding sensibly, surround balance, dialogue intelligibility holding in each.
Mono fold-down. Sum to mono and listen for cancellation and dialogue level — the single highest-yield QC test.
Binaural. Audition the binaural render on representative headphones for externalisation, dialogue centring, and listening fatigue.
Loudness and true-peak. Meter the specified render with the specified gating; confirm integrated within tolerance and true-peak under ceiling.
Conform to picture. Verify sync, that the audio matches the delivered picture cut (not an earlier one), head/tail handles, and any required sync-pop / 2-pop alignment.
Technical hygiene. Check sample rate ( $48\ \text{kHz}$ standard for film/TV), bit depth ( $24$ -bit), channel order/assignment, file naming to spec, and absence of clicks, dropouts and digital overs.
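The format part of the technical-hygiene check is easy to automate. This sketch uses Python's standard `wave` module and a hypothetical `check_wav` helper; it assumes a plain PCM WAV file, and the module may not open every delivery file (for example RF64/BW64 containers, or extensible-format headers on older Python versions), so treat it as a starting point rather than a validator.

```python
import wave


def check_wav(path: str, expected_rate: int = 48000, expected_bits: int = 24,
              expected_channels=None) -> list:
    """Return a list of human-readable problems; an empty list means it matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        bits = wav.getsampwidth() * 8
        channels = wav.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate {rate} Hz, expected {expected_rate} Hz")
    if bits != expected_bits:
        problems.append(f"bit depth {bits}, expected {expected_bits}")
    if expected_channels is not None and channels != expected_channels:
        problems.append(f"{channels} channels, expected {expected_channels}")
    return problems
```

For example, `check_wav("printmaster_51.wav", expected_channels=6)` (a hypothetical file name) would return an empty list for a compliant six-channel 48 kHz, 24-bit file. Channel order, naming and audible defects still need a human pass.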

Conform: matching the master to the final picture

Conform is the often-painful step of reconciling the audio master with the final picture edit. Picture editorial frequently changes after the mix begins; the audio must be conformed to the locked cut (moving, trimming and re-syncing scenes) so the delivered audio matches the delivered video frame-for-frame. Every conform invalidates prior loudness and QC, so the loudness metering and full layout check that count are the ones run after the final conform; any earlier pass must be repeated.

Warning

Loudness and QC are valid only for the exact file you measured. If picture changes and you conform, your earlier loudness pass is void. Re-meter and re-QC the final conformed render — a surprising number of rejections are mixes that were compliant before a late picture change and never re-checked.

Putting it together: a short film, end to end

Consider a $9$ -minute short film delivered to a streaming platform requiring: an ADM BWF Atmos master, a $5.1$ printmaster, a stereo (Lt/Rt) fold-down, and a binaural headphone version, all at $-24\ \text{LKFS}$ dialogue-gated, true-peak $\le -1\ \text{dBTP}$ , $48\ \text{kHz}/24$ -bit.

Step 1 — Room and session setup

The mix is done on a $7.1.4$ near-field stage calibrated to $79\ \text{dB}$ SPL per channel for $-20\ \text{dBFS}$ pink noise, every speaker time-aligned (max path difference under $0.3\ \text{m}$ , i.e. under $\approx 0.9\ \text{ms}$ ), bass-managed at $80\ \text{Hz}$ , flat in-room target (no X-curve — this is home/streaming, not theatrical). The Atmos renderer is set to a $7.1.4$ monitoring layout with the $5.1$ , $2.0$ and binaural re-renders armed.

Step 2 — Spatial mix

Dialogue: mono production dialogue as a centre-anchored object, de-essed and gently compressed, sitting around a consistent short-term loudness so dialogue-gated metering will be stable.
Music: stereo score stems placed as front-wide objects at $\pm 30°$ , with a stereo plate reverb sent to surround/height objects for lift. Auditioned in mono to confirm no cancellation.
Ambience: forest room tone as bed surround/height content, fully diffuse.
Hard effects: a passing vehicle automated as an object travelling rear-left to front-right; its reverb tail placed wide for envelopment, drawing on direct, diffuse and envelopment. Distance handled with level + HF roll-off + reverb send per distance and air.

Step 3 — Render and monitor

While mixing, the engineer periodically switches the renderer monitor to $5.1$ , then $2.0$ , then binaural. Two issues surface: in the $2.0$ fold-down a hard-surround insect effect cancels (decorrelation too aggressive — narrowed and re-checked), and in binaural the score sits "inside the head" (its binaural distance mode changed from near to mid, restoring externalisation). Both fixed before printing.

Step 4 — Loudness pass

After picture lock and final conform, the engineer meters the renderer's $5.1$ re-render with dialogue gating:

Initial integrated reading: $-25.6\ \text{LKFS}$ . Target $-24$ , so apply $+1.6\ \text{dB}$ static offset.
Re-meter: $-24.0\ \text{LKFS}$ . True-peak now reads $-0.7\ \text{dBTP}$ , over the $-1\ \text{dBTP}$ ceiling.
Insert true-peak limiter at $-1.0\ \text{dBTP}$ ceiling. Re-meter: integrated $-24.0\ \text{LKFS}$ , true-peak $-1.0\ \text{dBTP}$ . Compliant.

The same offset/limiter chain is verified on the stereo and binaural renders, since each has its own peak behaviour; the stereo Lt/Rt true-peak reads slightly hotter due to summing and gets its own limiter instance, landing at $-1.0\ \text{dBTP}$ , integrated $-24.0\ \text{LKFS}$ .

Step 5 — Generate and QC deliverables

ADM BWF master exported from the renderer; opened in a validator, object count and bed config confirmed, positional metadata spot-checked against the session.
$5.1$ printmaster and stereo Lt/Rt rendered; both auditioned, then summed to mono — no cancellations, dialogue level holds.
Binaural rendered and auditioned on reference earbuds: dialogue centred and externalised, vehicle move reads as a clear rear-left to front-right arc, no fatigue.
Loudness/true-peak re-metered on every render after conform (per the warning above). All within tolerance.
Conform/sync verified against the delivered picture; sync-pops aligned; head/tail handles correct; file naming and channel order to platform spec; $48\ \text{kHz}/24$ -bit confirmed.

Final package: one ADM BWF master at $-24\ \text{LKFS}$ / $-1\ \text{dBTP}$ , a $5.1$ printmaster, a stereo Lt/Rt, and a binaural stereo, all conformed to the locked picture and QC'd individually. Delivered.

Common mistakes and limits

The recurring failures

Mixing in an uncalibrated room. No reference level, no target curve, no time alignment — the most fundamental and most common error. Every downstream decision inherits the room's lies. Calibrate per measurement and calibration before touching a fader.
Applying the cinema X-curve to home/streaming content (or vice versa). The X-curve is a theatrical convention; using it for a near-field home deliverable makes the mix dull, and a flat near-field mix played theatrically can sound harsh. Match the target curve to the delivery context.
Never checking binaural or downmix. The mix sounds spectacular on the $9.1.6$ stage and falls apart on earbuds or in mono. Monitor the renders throughout, not at the end.
Loudness measured on the wrong render or wrong gating. Metering the immersive sum instead of the specified downmix, or program-gating when the spec is dialogue-gated, yields a "compliant" file that fails platform QC. Meter exactly what the spec says.
True-peak ignored. Hitting integrated loudness by pushing a limiter, or delivering at $0\ \text{dBFS}$ sample-peak that is actually $+0.8\ \text{dBTP}$ inter-sample, causes clipping in the consumer's lossy codec. Always meter true-peak and leave the specified headroom.
Re-QC skipped after conform. A late picture change voids your loudness and QC pass; many rejections are mixes that were compliant before the final conform.
Over-aggressive upmixing of stereo. Decorrelating stereo into the surrounds and heights until transients smear and the front image collapses — and until it cancels in mono. Place stereo conservatively; verify mono.

Limits of the workflow

The bed+objects model is powerful but not magic. Object counts are bounded by the distribution codec (consumer DD+ JOC carries far fewer discrete objects than the dub stage authored, so the encoder clusters objects — a fine spatial trajectory can be coarsened in delivery; check the encoded result, not just the master). The binaural render is an approximation using a generic HRTF; it will never match an individually-measured HRTF, and externalisation on cheap earbuds is imperfect — design for robustness, not for the ideal listener (see the trade-offs in binaural). And every fold-down is lossy: the $2.0$ deliverable cannot reconstruct the immersive scene, so the mix must be creatively complete at every layout, not only at the top.

Finally, the spec is a moving target. Streaming platforms revise their delivery and loudness specifications regularly; the numbers in this chapter are representative and standards-anchored, but always work to the distributor's current delivery document for the specific title. The standards below are the stable foundation under those evolving specs.

End-to-end takeaway

The immersive master is the source of truth, but the deliverable that fails QC is always a translation — a downmix, a binaural render, a loudness reading on the wrong gate. Calibrate the room, mix dialogue-anchored, monitor every render continuously, and re-QC after conform. The format gives you the space; discipline makes it survive the trip to the listener.

References

ITU-R BS.1116-3, Methods for the subjective assessment of small impairments in audio systems (reference listening room specification). International Telecommunication Union, 2015.
ITU-R BS.2051-3, Advanced sound system for programme production (immersive loudspeaker layouts). International Telecommunication Union, 2022.
ITU-R BS.1770-5, Algorithms to measure audio programme loudness and true-peak audio level. International Telecommunication Union, 2023.
ITU-R BS.2076-2, Audio Definition Model (ADM). International Telecommunication Union, 2019.
EBU R128, Loudness normalisation and permitted maximum level of audio signals (and EBU Tech 3341/3342 metering). European Broadcasting Union, 2020.
ATSC A/85, Techniques for Establishing and Maintaining Audio Loudness for Digital Television. Advanced Television Systems Committee, 2013.
Dolby Laboratories, Dolby Atmos Renderer Guide and Dolby Atmos Music / Home / Cinema Production Guidelines. Dolby, current editions.
SMPTE ST 202 / ISO 2969, B-chain electroacoustic response of motion-picture theatres (X-curve); and T. Holman, Surround Sound: Up and Running, 2nd ed., Focal Press, 2008.
