Immersive Music Production

Immersive music is the place where every technique in this guide collides with commercial reality. Unlike a film mix, which is built once for a controlled cinema, a music mix has to survive being played in a calibrated 7.1.4 room, on AirPods on a subway, on a soundbar in a living room, and folded down to plain stereo on a kitchen radio, all from a single delivery. The craft is not "putting things behind the listener." It is making a mix that is genuinely better in immersive playback while never being worse in the stereo and binaural forms that the overwhelming majority of listeners will actually hear.

This chapter walks the full workflow: deciding whether to mix native or upmix, building an Atmos Music session of beds and objects, monitoring across speakers and headphones, the aesthetics of spatial placement, the deliverables, and the loudness rules of streaming. It leans heavily on the earlier parts: envelopment from "direct, diffuse and envelopment"; reverb as a spatial tool from "reverberation"; object panning from "object-based audio"; the binaural endpoint from "binaural"; and calibration from "systems". And it returns repeatedly to the most important practical truth in all of audio: most music is, and will remain, stereo, so the way you handle stereo, as discussed in "stereo is already spatial", determines whether your immersive work is an upgrade or a liability.

Why Immersive Music: From Frontal Stage to Envelopment

Stereo music inherited the geometry of the concert hall and the theatre: a frontal stage, performers arrayed across a $60°$ arc in front of the listener, depth implied by reverberation. It is an astonishingly capable format — a well-made stereo mix already encodes width, depth, height cues, and movement, which is exactly why "stereo is already spatial" is a recurring theme of this guide. Immersive music does not throw that away. It extends the soundfield from a frontal plane to a full sphere, adding two things stereo physically cannot deliver: lateral and rear energy that drives listener envelopment, and a height layer that lifts ambience and effects off the horizontal plane.

What envelopment actually adds

The perceptual payoff of immersive music is listener envelopment (LEV), the sense of being surrounded by a diffuse, enveloping sound field. As covered in direct, diffuse and envelopment, envelopment is driven primarily by lateral and rear reflections and reverberation, decorrelated between the ears, arriving after the early-reflection window (roughly later than $80\ \text{ms}$ in a hall, though in a mix you control this directly). Stereo can only deliver this from the front pair, so it leans on inter-channel decorrelation and the listening room's own reflections to fake the sense of space. Immersive playback delivers it physically: reverb tails, room tone, audience, pads and textures can be placed in the surrounds and overheads where they belong, freeing the front stage for the direct sound of the lead elements.

The second gain is separation. A dense stereo mix forces every element to compete for a narrow frontal window, which is why mastering-grade stereo mixes rely so heavily on EQ carving and dynamic control. Spreading elements across a sphere gives each one its own region, so a busy arrangement can breathe — vocals stay intelligible, the low end stays defined, and reverbs no longer smear the transients.

The artistic debate: "in the band" vs "best seat"

There are two coherent aesthetics, and you should pick one consciously per project:

Best seat in the house. Keep the band frontal, as in stereo, and use the surrounds and heights almost entirely for ambience, reverb, and audience. The listener sits in the ideal hall seat. This is conservative, translates beautifully to stereo and binaural, and is the safe default for acoustic, jazz, classical and singer-songwriter material.
In the band / inside the mix. Distribute instruments all around the listener — guitar behind, percussion overhead, backing vocals to the sides. Immersive becomes a creative instrument rather than an enhanced stereo. This is exciting and very effective for electronic, hip-hop and experimental work, but it is the form most likely to collapse badly in the stereo downmix and to feel gimmicky.

Key takeaway

Immersive music is not "stereo plus rear speakers." Its core value is envelopment — diffuse, decorrelated lateral and overhead energy that stereo cannot physically produce. Decide early whether you are building a "best seat" or an "in the band" mix, because that choice governs every panning and reverb decision downstream.

Native Immersive vs. Upmixing the Stereo Catalogue

The single biggest decision is whether you are producing a native immersive mix from multitrack stems, or upmixing an existing stereo master. They are different crafts with different risks.

Most music only exists as stereo

The catalogue is the reason upmixing matters. Decades of recordings exist only as a two-channel master; the multitracks are lost, on degraded tape, or contractually inaccessible. If immersive is to mean anything for back-catalogue and for the long tail of independent artists, it has to work from stereo. This is where the philosophy of stereo is already spatial becomes operational: a stereo file is not "flat" — it is a downmix of a spatial scene, and a good upmix decodes the spatial information that is already there rather than inventing new placement.

Decode, do not fabricate

The honest principle of upmixing: extract what the stereo already encodes and redistribute it, rather than synthesising spatial content that was never recorded. The classic technique is primary–ambient decomposition — separating the signal into a primary (direct, coherent, panned) component and an ambient (diffuse, decorrelated) component, then routing the primary to the front and the ambient to the surrounds and heights. The foundational reference is Avendano and Jot's work on frequency-domain inter-channel coherence: bins where the two channels are highly correlated are treated as primary and kept frontal; bins where they are decorrelated are treated as ambience and steered into the surround field. This is decoding, not fabrication — the reverb tail that was always in the stereo mix simply ends up where it perceptually belongs.
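As a rough illustration of the coherence idea, the sketch below computes one broadband inter-channel correlation value per frame and uses it as a primary/ambient weight. It is not Avendano and Jot's algorithm, which works per frequency bin on time-smoothed spectra; the function names, the frame length, and the simple weighting are assumptions made for illustration only.

```python
import math

def frame_correlation(left, right):
    """Normalised inter-channel correlation of one frame, in [-1, 1]."""
    cross = sum(l * r for l, r in zip(left, right))
    energy = math.sqrt(sum(l * l for l in left) * sum(r * r for r in right))
    return cross / energy if energy > 0.0 else 0.0

def split_primary_ambient(left, right, frame=256):
    """Crude per-frame split: correlated content -> primary, the rest -> ambient.

    Broadband and time-domain on purpose; a real extractor would do this
    per frequency bin. Returns two lists of (left, right) sample pairs.
    """
    primary, ambient = [], []
    for start in range(0, min(len(left), len(right)), frame):
        l = left[start:start + frame]
        r = right[start:start + frame]
        # Weight of 1 means fully coherent, 0 means decorrelated or anti-phase.
        w = min(1.0, max(0.0, frame_correlation(l, r)))
        for a, b in zip(l, r):
            primary.append((w * a, w * b))              # stays frontal
            ambient.append(((1 - w) * a, (1 - w) * b))  # candidate for surrounds/heights
    return primary, ambient
```

Because the two parts are complementary weights of the same samples, they sum back to the input sample for sample, which is the property a decoding upmix needs: nothing is added that the stereo master did not contain.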

DAM Audio's HSR (High Space Resolution) approach, described at HSR research and in the companion article HSR stereo upmixing for multi-speaker systems, extends this idea to arbitrary multi-speaker and immersive targets: rather than a fixed two-to-five matrix, it analyses the stereo field at high spatial resolution and remaps coherent and diffuse components onto whatever layout is available, preserving the original frontal imaging while populating the surround and height layers with the decorrelated energy the recording already contains. The point of contact with this chapter is the same discipline: respect the original artistic balance; the upmix should be recognisably the same record, only larger.

When to choose which

Situation	Approach	Why
Multitracks available, new release	Native immersive	Full control of placement, reverb, height layer
Multitracks available, legacy hit	Native re-mix	Re-creates the record spatially; risk of changing the "sound" fans know
Stereo master only, large catalogue	Upmix (primary–ambient / HSR)	Only viable path; faithful if it decodes rather than invents
Stereo master, artist wants "definitive" 3D	Hybrid: upmix + targeted object extraction	Decode ambience, then lift identifiable stems if separable
Live two-track recording	Upmix to "best seat"	Audience and hall reverb steer naturally to surrounds

Upmixing pitfall

The fastest way to make an upmix sound wrong is to harden the centre and over-steer the sides, pulling correlated bass and lead vocal into the surrounds. A faithful upmix keeps the primary component frontal and mono-compatible; only the genuinely decorrelated ambience moves. If your upmix's stereo downmix no longer matches the original master, you have fabricated, not decoded.
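One way to put a number on "no longer matches the original master" is a null test: subtract the original from the upmix's stereo downmix and measure what is left. The sketch below assumes both signals are time-aligned, equal-length lists of float samples for one channel; the function names are hypothetical, and what counts as an acceptable null depth is left to the engineer's judgement.

```python
import math

def rms_db(samples):
    """RMS level in dB relative to full scale (1.0); -inf for silence."""
    if not samples:
        return float("-inf")
    mean_sq = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(mean_sq) if mean_sq > 0.0 else float("-inf")

def null_depth_db(original, downmix):
    """Level of (downmix - original) relative to the original, in dB.

    More negative means a closer match; -inf means a perfect null.
    Assumes the original is not silent.
    """
    residual = [d - o for o, d in zip(original, downmix)]
    return rms_db(residual) - rms_db(original)
```

For example, a downmix that is simply 10% quieter than the original leaves a residual 20 dB below it, so a shallow null can come from a plain level offset as well as from re-steered content. Level-match before reading the result.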

The Atmos Music Workflow: Beds, Objects, and the Renderer

Dolby Atmos is the dominant commercial immersive-music format, so the practical workflow is worth knowing in detail. It is an object-based system layered on top of a channel bed.

Bed vs. objects

An Atmos Music session is built from two kinds of signal:

The bed — a static channel-based submix, almost always 7.1.2 (a $7.1$ horizontal layout plus two overheads). The bed is where you put things that should not move and that benefit from being a fixed, predictable layer: the stereo-correlated core of the mix, reverb returns, room tone, sustained pads. Think of the bed as your "stereo-compatible foundation."
Objects — up to $118$ dynamic objects, each a mono (or grouped) signal with positional metadata $(x, y, z)$ that the renderer turns into speaker feeds. Lead vocal, kick, snare, a lead synth, anything you want to place precisely or move, becomes an object.

The total is the familiar $128$ channels: $10$ for the 7.1.2 bed plus $118$ objects. You do not have to use all of them; many excellent mixes use a bed plus a dozen objects.
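The channel arithmetic is simple enough to sketch as a budget check when planning a session. The helper names below are hypothetical and the code only reproduces the counting described above (one bed, 128 inputs in total); it does not model any renderer behaviour.

```python
def bed_channel_count(layout):
    """Count channels in a layout string such as '7.1.2' (7 + 1 + 2 = 10)."""
    return sum(int(part) for part in layout.split("."))

def object_budget(layout="7.1.2", total_inputs=128):
    """Objects left over once a single bed of the given layout is allocated."""
    return total_inputs - bed_channel_count(layout)

def fits(layout, objects_used, total_inputs=128):
    """True if the planned object count fits alongside the bed."""
    return 0 <= objects_used <= object_budget(layout, total_inputs)
```

With the default 7.1.2 bed, `object_budget()` gives 118, and a plan with a dozen objects fits with plenty of room to spare.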

Panning instruments and reverbs

Objects are panned with three coordinates plus a size parameter. Size is the immersive-mix equivalent of stereo width: a size of $0$ is a point source; increasing size spreads the object across more speakers, trading precise localisation for a larger, more diffuse image. This is how you make a pad feel "everywhere" without comb filtering — you increase its size rather than duplicating it to multiple discrete positions.

Reverb handling is where immersive music earns its keep. The standard move is to put the dry source as an object (frontal, localised) and route its reverb send to the bed surrounds and overheads, so the direct sound localises in front while the tail wraps around — exactly the direct/diffuse separation that defines envelopment in direct, diffuse and envelopment. Use a true multichannel reverb, or several decorrelated stereo reverbs, so the tails arriving from different directions are not identical copies (more on decorrelation below).

The renderer and the binaural deliverable

The Dolby Atmos Renderer (or the built-in renderer in a DAW such as Logic, or the Dolby plug-in suite in Pro Tools/Nuendo) takes the bed and objects and renders them three ways simultaneously:

To your speaker layout for monitoring (7.1.4, 5.1.4, 9.1.6, etc.).
To a binaural stereo stream for headphone monitoring, applying HRTFs as covered in binaural.
To the stereo and 5.1 downmixes used for delivery and compatibility checking.

The binaural render is not an afterthought — it is what most listeners will actually hear, because the dominant consumption of Atmos Music is on phones and headphones via streaming. Each object carries a binaural render mode — Off, Near, Mid, or Far — that controls how much synthetic distance and reverb the binaural renderer applies. Setting the lead vocal to Near keeps it intimate and dry in headphones; setting a reverb-heavy pad to Far pushes it into the synthetic room. You must audition and set these per element, because a mix that is gorgeous on speakers can sound distant, washy, or weirdly localised in binaural if the render modes are left at defaults.

Render mode as a mix decision

Treat the per-object binaural render mode as a first-class mixing parameter, not metadata you set once. Lead vocal and kick almost always want Near; ambience, audience and long reverbs want Mid or Far. Audition the binaural render as often as you audition the speakers — for streaming, it is the master.
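A session note of starting render modes can be kept as a small lookup, sketched below. This is a planning aid only, not a Dolby API: the role names, the choice of Far (rather than Mid) for diffuse material, and the Mid fallback for unlisted roles are all assumptions, and every value is a starting point to audition rather than a rule.

```python
RENDER_MODES = ("Off", "Near", "Mid", "Far")

# Starting points drawn from the guidance above; audition and override per mix.
STARTING_MODE_BY_ROLE = {
    "lead_vocal": "Near",
    "kick": "Near",
    "ambience": "Far",
    "audience": "Far",
    "long_reverb": "Far",
}

def starting_mode(role, fallback="Mid"):
    """Suggested first render mode for a role; the fallback is an arbitrary assumption."""
    mode = STARTING_MODE_BY_ROLE.get(role, fallback)
    if mode not in RENDER_MODES:
        raise ValueError(f"unknown render mode: {mode}")
    return mode

def plan(session_roles):
    """Map {object name: role} to {object name: starting render mode}."""
    return {name: starting_mode(role) for name, role in session_roles.items()}
```

Calling `plan({"Vox": "lead_vocal", "Gtr": "guitar"})` gives the vocal Near and leaves the guitar at the fallback, which flags it as an element whose mode still needs a deliberate decision.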

Monitoring: Speakers vs. Headphones, and the Translation Problem

You cannot mix what you cannot trust, and immersive monitoring has two faces that disagree with each other.

The calibrated speaker room

The reference monitoring environment for Atmos Music is a 7.1.4 room: seven horizontal speakers (L, R, C, plus two side and two rear surrounds), one or more subs, and four overhead/height speakers. Building and trusting that room is pure systems work:

Layout and angles per speaker layouts and topologies and ITU-R BS.2051 — get the surround and height angles right, because object metadata assumes a standard geometry.
Time alignment per time alignment and phase: every speaker must be time-aligned to the listening position so a centred phantom is coherent and an overhead pan does not smear.
Level calibration: each full-range channel is set to the same reference SPL — the standard is $85\ \text{dB}$ SPL (C-weighted, slow) per channel from $-20\ \text{dBFS}$ pink noise in a large room, or commonly $79$ – $82\ \text{dB}$ for nearfield music rooms. The LFE is calibrated $+10\ \text{dB}$ in band. This is the measurement and calibration procedure, and it is non-negotiable: if your channels are not level-matched, every pan you make is a lie.
Bass management per subwoofers and bass management: small height speakers cannot reproduce low frequencies, so their bass is redirected to the sub. Get the crossover and sub alignment right or the low end will shift in space as you pan.
Room correction per equalization and room correction: correct the steady-state response modestly, especially channel-to-channel matching, but do not over-EQ the reflections that the room needs for natural envelopment.
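The time-alignment and level-matching steps in the list above reduce to two small calculations, sketched here. The speed of sound value, the function names, and the example speaker labels are assumptions; the measured distances and SPL readings would come from your own calibration session.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at about 20 degrees C

def alignment_delays_ms(distances_m):
    """Delay for each speaker so every arrival coincides with the farthest one."""
    farthest = max(distances_m.values())
    return {name: (farthest - d) / SPEED_OF_SOUND_M_S * 1000.0
            for name, d in distances_m.items()}

def level_trims_db(measured_spl_db, reference_spl_db=85.0):
    """Gain change per channel that brings its measured SPL to the reference."""
    return {name: reference_spl_db - spl
            for name, spl in measured_spl_db.items()}
```

A height speaker that sits 0.343 m closer than the farthest speaker needs about 1 ms of delay, and a channel that reads 83.5 dB against an 85 dB reference needs +1.5 dB of trim. For a nearfield room, pass the lower reference level you have chosen instead of the default.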

Headphones and the translation problem

Most of your audience is on headphones, hearing the binaural render, not your speakers. The two endpoints disagree because:

Binaural relies on a generic HRTF that is not the listener's own, causing front/back confusion, in-head localisation and timbral colouration that your speaker mix never shows.
Headphones give perfect channel isolation and no room; speakers give crosstalk and a real room that adds its own envelopment. A mix that needs the room's help to feel wide will sound narrow and dry in headphones.
Bass perception differs enormously: low end that you feel physically from the sub in the room is reduced to the excursion of a small earbud or headphone driver.

The practical workflow is to mix on the calibrated speakers, but check the binaural render constantly, and ideally A/B on consumer earbuds (AirPods are the de-facto reference because of their market share). Decisions that look great on speakers — a hard rear-panned guitar, a very wet overhead reverb — frequently need taming for binaural. This is the immersive-era version of the classic "check it on the car stereo."

Monitoring path	What it reveals	What it hides
Calibrated 7.1.4 speakers	True spatial geometry, envelopment, low end	Generic-HRTF artefacts, headphone narrowing
Binaural render on headphones	What most listeners hear; front/back confusion, dryness	Real-room envelopment, accurate sub
Consumer earbuds (AirPods)	Real-world translation, bass loss, loudness feel	Precise imaging, reference accuracy
Stereo downmix on nearfields	Mono/stereo compatibility, the "still must be right" master	Anything immersive

The room can lie to you

If you mix wide and reverberant on speakers because it "fills the room," that fill is partly your room's own reflections — which the headphone listener does not get. Always confirm width and depth on the binaural render before committing. A mix that depends on your room to sound spacious will sound thin to the millions on earbuds.

Spatial Mixing Aesthetics

Once the room and the session are trustworthy, the actual art begins. The recurring principle: use the extra space to separate and envelop, not to distract.

Placing the lead vocal

The lead vocal is the anchor and almost always belongs front and centre, as an object placed at or near the centre speaker. Resist the temptation to spread it for "width" — a wide, decorrelated vocal smears the most important element in the mix and collapses unpredictably in stereo and binaural. Add space around the vocal with reverb in the surrounds and heights, keeping the dry voice tight and frontal (binaural mode Near). If you want the vocal to feel larger, raise its object size slightly rather than duplicating it to multiple positions.

Drums and the low end

Keep kick and bass low, centred and frontal. Low frequencies localise poorly and are routed through bass management anyway, so spreading them buys nothing and risks phase problems across the sub crossover. Snare and the core kit usually stay frontal too. The kit is where height can be tasteful: overhead/room mics lifted into the height layer give the drums a believable ceiling, and percussion or hi-hat accents can move to the sides for energy. The principle is the same as in surround recording: the direct kit stays put while the room of the kit goes up and around.

Ambience, pads and the surround field

The surrounds and heights are home for everything diffuse: pads, atmospheres, audience, room tone, backing-vocal washes and reverb tails. This is where you build envelopment, and where decorrelation matters most.

Width and decorrelation without comb filtering

To make something feel wide and enveloping you need the signals arriving from different directions to be decorrelated — if they are identical copies at slightly different delays, you get comb filtering and a hollow, phasey timbre, and the image collapses on downmix. The tools:

Decorrelation by all-pass or short, frequency-dependent delay, not a single fixed delay. Summing a signal with a copy delayed by a fixed $t_d$ puts the first comb notch at $f = 1/(2\,t_d)$, with further notches at odd multiples of that frequency; a $0.5\ \text{ms}$ delay notches near $1\ \text{kHz}$, right in the vocal range. Use a proper decorrelator (randomised phase across frequency) instead.
Genuinely different reverbs for different directions rather than one stereo reverb panned wide.
Object size rather than multi-position duplication for spreading a single source.

The mono-compatibility test is the referee: sum to mono and listen. If a "wide" element thins out or disappears, it is comb filtering, not width.
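The notch arithmetic can be checked with a few lines. The sketch below assumes the simplest case, a signal summed with one equal-level copy delayed by a fixed time, for which the magnitude response is $2\,|\cos(\pi f t_d)|$; the function names are illustrative.

```python
import math

def comb_notches_hz(delay_s, max_hz=20000.0):
    """Notch frequencies for a signal summed with an equal-level delayed copy."""
    notches = []
    k = 0
    while True:
        f = (2 * k + 1) / (2.0 * delay_s)  # odd multiples of 1 / (2 * t_d)
        if f > max_hz:
            break
        notches.append(f)
        k += 1
    return notches

def comb_gain_db(freq_hz, delay_s):
    """Gain of the summed pair relative to a single copy, in dB."""
    mag = 2.0 * abs(math.cos(math.pi * freq_hz * delay_s))
    return 20.0 * math.log10(mag) if mag > 0.0 else float("-inf")
```

For a $0.5\ \text{ms}$ delay the first notch lands at about $1\ \text{kHz}$ and repeats every $2\ \text{kHz}$ above that, while the frequencies in between are boosted by about 6 dB. That alternating pattern is the hollow timbre the mono test exposes.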

Decorrelation discipline

Width that comes from copying a signal to several speakers with small delays is comb filtering in disguise. It sounds impressive on the immersive system and falls apart the moment it is summed for the stereo or mono downmix. Decorrelate with randomised-phase processors, use distinct reverbs, and check every wide element in mono.
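The mono check can also be expressed as a number: how much level a two-channel element loses when summed. The sketch below is a minimal illustration assuming two equal-length lists of float samples; the function name is hypothetical.

```python
import math

def mono_sum_change_db(left, right):
    """Level of the mono sum (L + R) / 2 relative to the average channel power, in dB.

    About 0 dB for identical channels, about -3 dB for uncorrelated
    equal-level channels, and strongly negative when content cancels.
    """
    mono_power = sum(((l + r) / 2.0) ** 2 for l, r in zip(left, right))
    stereo_power = sum(l * l + r * r for l, r in zip(left, right)) / 2.0
    if mono_power == 0.0 or stereo_power == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mono_power / stereo_power)
```

A properly decorrelated wide element should land near -3 dB. A result well below that suggests the width comes from cancellation between near-identical copies, which is the comb-filtering case described above.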

Reverb as a spatial instrument

In stereo, reverb is mostly a depth and glue tool. In immersive it becomes a placement instrument: where the tail comes from is now a creative choice. Drawing on reverberation, think in the same direct/early/late structure:

Direct sound as a frontal object, dry and localised.
Early reflections lightly into the front-wide and side speakers to set apparent source width and a sense of a real space.
Late, diffuse tail spread across surrounds and heights, decorrelated, to build envelopment.

Keeping these three layers in different parts of the sphere is precisely how a real hall works, and it is far more convincing than a single wet reverb smeared everywhere. A long, decorrelated tail in the heights is one of the most reliably beautiful effects in immersive music.

Deliverables and Distribution

Immersive music delivery is a small family of files, and the unglamorous truth is that the stereo downmix still has to be right, because it is what most plays will use.

The ADM master

The primary immersive deliverable is the ADM BWF — a Broadcast Wave (.wav) file carrying the Audio Definition Model metadata that describes the bed and every object's position over time. This single file is the immersive master; from it, platforms and the Dolby encoder derive the playback streams. It is rendered ("printed") from your session at the end of the mix and must match what you heard.

Binaural and the all-important stereo downmix

From the ADM, you (and the platform) derive:

The binaural stream — the headphone master, governed by your per-object render modes.
The 5.1 downmix for surround systems and some broadcast.
The stereo (2.0) downmix — the legacy master that the vast majority of plays still use.

The stereo downmix is generated by the renderer's fold-down, and you are responsible for checking it. A mix that is gorgeous in 7.1.4 but has a phasey, vocal-buried, or level-shifted stereo fold-down is a failed deliverable, because more listeners will hear the stereo than the immersive. This is the whole reason the chapter keeps returning to stereo is already spatial: the stereo version is not a throwaway, it is a co-equal master.

Deliverable	Format	Audience / use	Must check
ADM BWF master	`.wav` + ADM metadata	The immersive master; platform ingest	Object positions, bed content, peak/true-peak
Binaural render	Stereo `.wav` / encoded	Headphone listeners (the majority)	Render modes, dryness, front/back, loudness
5.1 downmix	6-channel `.wav`	Surround systems, broadcast	Fold-down balance, LFE content
Stereo (2.0) downmix	Stereo `.wav`	Most plays; legacy, radio, previews	Mono compatibility, vocal level, loudness $\approx -14$ LUFS

Streaming platform requirements

Apple Music, Tidal and Amazon Music HD distribute Atmos Music. They ingest the ADM master and handle the encoding to Dolby's streaming bitstreams; binaural rendering then happens on the playback device. Practical constraints to respect: deliver at the platform's sample rate (commonly $48\ \text{kHz}$), keep true-peak under the platform limit (typically $-1\ \text{dBTP}$), and supply the matching stereo master. Each platform publishes an ingest spec sheet. Read it, because details like channel order, metadata fields and loudness targets vary and a rejected delivery costs days.
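A basic format check on the stereo reference file can be scripted before upload. The sketch below uses Python's standard `wave` module, which reads plain PCM WAV headers only; it does not parse ADM metadata and is not intended for the multichannel ADM BWF master itself. The function name and the default expectations (48 kHz, two channels) are assumptions to be replaced with the values from the platform's spec sheet.

```python
import io
import wave

def check_wav_format(fileobj, expected_rate=48000, expected_channels=2):
    """Return a list of mismatches between a PCM WAV header and the expected format.

    `fileobj` may be a path or a binary file-like object. An empty list
    means the header matches.
    """
    problems = []
    with wave.open(fileobj, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        if rate != expected_rate:
            problems.append(f"sample rate {rate} Hz, expected {expected_rate} Hz")
        if channels != expected_channels:
            problems.append(f"{channels} channels, expected {expected_channels}")
    return problems
```

This only catches header-level mistakes such as a 44.1 kHz bounce. True-peak and loudness still have to be read from a proper meter.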

Loudness for Music Streaming

Loudness is where immersive music meets a hard, quantitative constraint, and it is frequently misunderstood.

The normalization targets

Music streaming services apply loudness normalization so tracks play back at a consistent level. Drawing on the measurement methodology of ITU-R BS.1770 and the broadcast practice of EBU R128 (which standardises $-23\ \text{LUFS}$ for broadcast), the music platforms chose a louder reference suited to casual listening. They normalise to roughly:

Spotify: $-14\ \text{LUFS}$ (integrated).
Apple Music: $-16\ \text{LUFS}$.
Tidal / Amazon / YouTube: in the $-14$ to $-15\ \text{LUFS}$ region.

The number that matters most in practice is the $-14\ \text{LUFS}$ ballpark. If your integrated loudness is louder than the target, the platform turns you down; if quieter, many platforms leave you alone (or, with positive-gain settings, turn you up). Either way, your hyper-compressed loud master gains no perceived-loudness advantage — it just loses dynamic range for nothing.
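The normalisation behaviour described above comes down to one subtraction, sketched below. This is a simplified model with hypothetical names: real services may apply further rules, and the `allow_boost` flag stands in for the positive-gain settings mentioned above.

```python
def normalization_gain_db(measured_lufs, target_lufs=-14.0, allow_boost=False):
    """Playback gain a loudness-normalising service would apply, in dB (simplified)."""
    gain = target_lufs - measured_lufs
    if gain > 0.0 and not allow_boost:
        return 0.0  # quieter than target and the service does not turn tracks up
    return gain
```

A master at $-8\ \text{LUFS}$ is turned down by 6 dB to reach a $-14\ \text{LUFS}$ target, so it plays back no louder than a dynamic master delivered at the target, only with less dynamic range.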

How loudness interacts with the immersive mix

Two wrinkles are specific to immersive:

The Atmos master and the stereo master are measured and normalised separately. Apple Music normalises the Atmos stream's loudness independently from the stereo version. If your immersive mix is much louder or quieter than your stereo master, listeners toggling between them hear a jarring level jump, and the "is Atmos better?" judgement gets contaminated by a simple loudness difference — the classic confound where louder is naively preferred. Aim to loudness-match your immersive and stereo masters so the comparison is fair.
Dolby's own integrated-loudness guidance for Atmos Music targets around $-18\ \text{LUFS}$, read from the renderer's loudness meter (which, per Dolby's documentation, measures a 5.1 re-render rather than the full 7.1.4 speaker feed), with the stereo downmix landing near streaming targets. Because the immersive mix has more channels and more diffuse energy, its measured loudness and its subjective loudness can diverge from the stereo fold-down; trust the meter on the binaural/stereo render that the listener actually gets.

The discipline mirrors broadcast practice: mix to a loudness target, control true-peak, and let the platform normalise. Do not master immersive music to be "loud." There is no loudness war to win when everything is normalised to the same target.

Loudness rule of thumb

Target roughly $-14\ \text{LUFS}$ integrated on the deliverable the listener hears, keep true-peak $\leq -1\ \text{dBTP}$, and loudness-match your immersive master to your stereo master so toggling between them is a fair A/B. Chasing loudness past the normalization target only sacrifices dynamics.
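The rule of thumb can be turned into a pre-delivery check on numbers already read from a BS.1770-style meter. The sketch below does not measure loudness itself; the function name and the 0.5 LU matching tolerance are assumptions chosen for illustration.

```python
def check_masters(immersive_lufs, stereo_lufs, true_peak_dbtp,
                  max_true_peak=-1.0, match_tolerance_lu=0.5):
    """Return a list of issues for a pair of masters, given metered values."""
    issues = []
    if true_peak_dbtp > max_true_peak:
        issues.append(
            f"true peak {true_peak_dbtp:.1f} dBTP exceeds {max_true_peak:.1f} dBTP")
    difference = abs(immersive_lufs - stereo_lufs)
    if difference > match_tolerance_lu:
        issues.append(
            f"masters differ by {difference:.1f} LU (tolerance {match_tolerance_lu:.1f} LU)")
    return issues
```

An empty list means the true-peak ceiling and the loudness match both hold under the stated assumptions. It says nothing about whether the mix sounds right, which still needs the A/B listening described above.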

Putting It Together: Two End-to-End Procedures

Worked example A — native immersive mix of a stereo-conceived song

A four-minute pop song: lead vocal, doubled backing vocals, kick, snare, hats, bass, two electric guitars, a pad, and a piano. Multitracks available. Target: Apple Music Atmos plus a matched stereo master.

Step 1 — Room and session setup. Confirm the 7.1.4 room is calibrated: levels matched to reference per measurement and calibration, speakers time-aligned per time alignment and phase, bass management set. Open the Atmos renderer; create a 7.1.2 bed and route stems.

Step 2 — Build the frontal foundation. Place lead vocal as an object at centre, size $0$ , binaural mode Near. Kick and bass as objects low and centred (or in the bed), frontal. Snare centre-front. This is the "best seat" core and the basis of a clean stereo fold-down.

Step 3 — Open the front stage. Pan the two guitars as objects to left-wide and right-wide, roughly $\pm 60°$, each slightly different in size so they are not mirror twins. Piano spread modestly across the front with a small size value. Backing vocals as two objects pulled to the sides ($\pm 90°$ to $\pm 110°$), decorrelated from each other.

Step 4 — Build envelopment. Route the pad to the bed surrounds and overheads, or keep it as an object with increased size, so it occupies the surround and height layers diffusely (size is an object parameter, so a bed channel has none to raise). Set up two distinct reverbs: a short plate for the drums fed into the front-wides, and a long hall for the vocal fed into the surrounds and heights, decorrelated. The dry vocal stays frontal and its tail wraps the listener, which is the direct/diffuse split of direct, diffuse and envelopment.

Step 5 — Height. Lift drum room/overhead mics into the height layer for a ceiling on the kit. Add a touch of the long reverb's tail overhead. Keep direct, transient sources out of the heights — overheads are for ambience, not lead elements.

Step 6 — Check binaural. Switch to the binaural render on headphones. The wide backing vocals may sound diffuse and odd — pull them in or reduce size. The overhead reverb may be too wet — trim it. Set render modes: vocal/kick Near, guitars/piano Mid, pad and reverbs Far. Iterate between speakers and binaural until both hold up.

Step 7 — Check the stereo fold-down. Render the stereo downmix. Confirm the vocal sits at the right level, the low end is intact, and nothing combs out in mono. Adjust object sizes/levels until the fold-down is a master you would ship on its own.

Step 8 — Loudness and print. Read integrated loudness; aim near $-14\ \text{LUFS}$ on the binaural/stereo render, true-peak $\leq -1\ \text{dBTP}$ . Loudness-match the stereo master to the immersive one. Print the ADM BWF, render binaural and stereo references, and deliver per the platform spec.

Worked example B — faithful upmix of a legacy stereo track

A 1970s soul recording: stereo master only, no multitracks. Target: an immersive version that is recognisably the same record, plus the untouched stereo master.

Step 1 — Analyse the stereo. Listen and inspect correlation. Identify what is primary (lead vocal and bass, strongly centre-correlated) and what is ambient (string reverb, room, hand-claps spread wide, decorrelated).

Step 2 — Primary–ambient decomposition. Apply a primary–ambient extractor (Avendano–Jot-style inter-channel coherence) or the HSR analysis from HSR research and HSR upmixing. The coherent primary is preserved frontal; the decorrelated ambient is the material that can move.

Step 3 — Route, do not invent. Keep the primary (vocal, centred instruments, bass) in the front bed and centre object exactly as the original placed it. Steer the ambient component into the surrounds and heights, building envelopment from the reverb and room that were always in the recording. Do not synthesise new placements — no fabricating a drum kit around the listener.

Step 4 — Restraint check. Compare your immersive mix's stereo downmix against the original master. They should be nearly identical in balance. If the vocal moved, the bass thinned, or the energy shifted to the sides, you over-steered — back off until the fold-down matches.

Step 5 — Binaural and delivery. Set conservative render modes (vocal Near, ambience Far), confirm the binaural render still sounds like the record, loudness-match to the original stereo master, print the ADM, and deliver. The result should feel like the listener stepped inside the original mix's room — not like a new arrangement.

Common Mistakes and Limits

Common mistakes

Gimmicky panning. Flying objects around the room for novelty. It impresses once, fatigues immediately, and almost always wrecks the stereo fold-down. Movement should serve the music, not the demo.
A broken stereo downmix. Mixing only on the immersive system and never checking the fold-down. Since most plays are stereo, a phasey or vocal-buried downmix means most listeners get a worse version than the original stereo would have been.
Over-decorrelation and comb filtering. Faking width with small delays and duplicated sources. It sounds wide on the array and hollow/cancelled in mono. Decorrelate properly and always check mono.
Ignoring the binaural render. Leaving render modes at defaults. The headphone listener — the majority — then hears a distant, washy, or mislocalised mix that you never auditioned. Binaural is the streaming master; treat it as such.
Putting bass and lead vocal in motion or in the surrounds. Low frequencies localise poorly and route through the sub anyway; a wandering vocal destroys the anchor. Keep them centred and frontal.
Loudness mismatch between masters. Letting the immersive and stereo versions differ in loudness, so toggling produces a level jump and a biased "Atmos sounds better" impression.
Over-correcting the monitoring room. Aggressive EQ that kills the reflections the room needs for natural envelopment, per equalization and room correction. Correct gently, match channels, leave the room its life.
Trusting the room's spaciousness. Building width that depends on your room's reflections, which the earbud listener does not get.

Limits

The HRTF problem is unsolved at scale. Binaural uses a generic HRTF, so a fraction of listeners get front/back confusion and timbral colouration no mix can fully fix. Personalised HRTFs help but are not yet ubiquitous — this is the standing limit of binaural.
Upmixing cannot create information that was never recorded. It decodes; it does not regenerate lost multitracks. A mono recording upmixes to very little, and an aggressively limited stereo master yields limited ambient material.
The format is commercially gated. Atmos Music's reach depends on a handful of platforms and on consumer hardware; the "true" experience needs either a calibrated room few own or headphones with their inherent compromises.
No universal "right" mix. Because speaker and binaural endpoints disagree, every immersive master is a negotiated compromise across playback systems — the same translation problem that has always defined music production, now with more axes.

The one habit that matters most

Build and trust your monitoring first, then mix on speakers but audition the binaural render and the stereo fold-down constantly, loudness-matched. If a decision improves the immersive version while keeping the binaural and stereo masters at least as good as a great stereo mix would be, keep it. If it only helps on the big rig, it is probably a gimmick.

References

Dolby Laboratories. Dolby Atmos Music Production Suite and Dolby Atmos Renderer Guide. Dolby, current editions — bed/object structure, binaural render modes, ADM delivery, loudness guidance.
Avendano, C., and Jot, J.-M. "A Frequency-Domain Approach to Multichannel Upmix." Journal of the Audio Engineering Society, vol. 52, no. 7/8, 2004 — primary–ambient decomposition for faithful upmixing.
ITU-R BS.1770. Algorithms to measure audio programme loudness and true-peak audio level. International Telecommunication Union — the LUFS / true-peak measurement basis for all streaming loudness.
EBU R128. Loudness normalisation and permitted maximum level of audio signals. European Broadcasting Union — the $-23\ \text{LUFS}$ broadcast reference and loudness-normalisation methodology underpinning streaming targets.
ITU-R BS.2051. Advanced sound system for programme production. International Telecommunication Union — channel and speaker layouts (including 7.1.2 / 7.1.4 geometry) used by immersive monitoring.
ITU-R BS.1116. Methods for the subjective assessment of small impairments in audio systems. International Telecommunication Union — rigorous listening-test methodology, including loudness-matched comparison.
Rumsey, F. Spatial Audio. Focal Press, 2001 — spatial attributes, envelopment, source width, and multichannel mixing fundamentals.
DAM Audio. HSR: High Space Resolution Stereo Upmixing. dam-audio.com/research/hsr-high-space-resolution — coherence-based remapping of stereo to multi-speaker and immersive targets.
