
First-Order Ambisonics (FOA) Encoder

Updated 11 November 2025
  • A First-Order Ambisonics (FOA) encoder represents a 3D sound field using four B-format channels that capture the omnidirectional pressure and three directional responses.
  • It employs microphone-array regularization, sparse MDCT-based upmixing, and neural encoder techniques to enhance spatial resolution and suppress noise.
  • Recent advancements integrate FOA encoders into multimodal pipelines, enabling efficient compression, robust spatial reasoning, and improved sound localization in diverse applications.

First-Order Ambisonics (FOA) encoding provides a mathematically rigorous and physically interpretable framework for capturing, representing, analyzing, and generating 3D spatial audio. FOA encodes the sound field using the four real spherical harmonics of order ≤1—the so-called B-format channels (W, X, Y, Z)—thus capturing both omnidirectional pressure and the three orthogonal directional (figure-of-eight) responses. Recent research advances span signal-theoretic encoders for microphone arrays, sparse plane-wave expansions for upmixing and direction-of-arrival estimation, learned spatial compression, and deep integration into modern multimodal and spatial reasoning architectures.

1. Mathematical Foundations and FOA Channel Definitions

FOA encoding represents the 3D sound field $p(\theta, \phi, t)$ on the unit sphere using projections onto the lowest-order real spherical harmonics $Y_\ell^m(\theta, \phi)$:

\begin{align*}
W(t) &= \iint_{S^2} p(\theta, \phi, t)\,Y_0^{0}(\theta, \phi)\,d\Omega \\
X(t) &= \iint_{S^2} p(\theta, \phi, t)\,Y_1^{+1}(\theta, \phi)\,d\Omega \\
Y(t) &= \iint_{S^2} p(\theta, \phi, t)\,Y_1^{-1}(\theta, \phi)\,d\Omega \\
Z(t) &= \iint_{S^2} p(\theta, \phi, t)\,Y_1^{0}(\theta, \phi)\,d\Omega
\end{align*}

Practical (B-format) implementations define:

\begin{align*}
W &= \frac{p}{\sqrt{2}} \\
X &= p\,\cos\theta\,\cos\phi \\
Y &= p\,\sin\theta\,\cos\phi \\
Z &= p\,\sin\phi
\end{align*}

A plane wave incident from azimuth $\theta$ and elevation $\phi$ thus produces FOA channels with explicit directional content, mapping sound source position to spatial audio cues.
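To make the mapping concrete, here is a minimal NumPy sketch of plane-wave (point-source) FOA encoding using the B-format gains above; the function name and signature are illustrative, not taken from any cited implementation.

```python
import numpy as np

def encode_foa_plane_wave(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal into B-format (W, X, Y, Z) for a plane wave
    arriving from the given azimuth/elevation, using the gains above."""
    theta = np.deg2rad(azimuth_deg)   # azimuth
    phi = np.deg2rad(elevation_deg)   # elevation
    w = mono / np.sqrt(2.0)           # omnidirectional channel
    x = mono * np.cos(theta) * np.cos(phi)
    y = mono * np.sin(theta) * np.cos(phi)
    z = mono * np.sin(phi)
    return np.stack([w, x, y, z])     # shape (4, num_samples)

# Example: a 440 Hz tone placed 45 degrees to the left, 10 degrees up.
fs = 48000
t = np.arange(fs) / fs
foa = encode_foa_plane_wave(np.sin(2 * np.pi * 440 * t), 45.0, 10.0)
```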

2. Microphone Array-Based FOA Encoding and Regularization

For physical measurement, a spherical microphone array samples the sound field at $Q$ positions $\{\Omega_q\}$:

\mathbf{p}(k, r) = \mathbf{Y}(\Omega)\,\mathbf{B}(kr)\,\mathbf{a}(k) + \mathbf{n}(k)

Here,

  • $\mathbf{Y}(\Omega)$: $Q\times 4$ real spherical harmonic matrix with entries $Y_n^m(\Omega_q)$ for $n=0,1$.
  • $\mathbf{B}(kr)=\mathrm{diag}\{R_0(kr), R_1(kr), R_1(kr), R_1(kr)\}$: Frequency-dependent radial functions for a rigid sphere.
  • $\mathbf{a}(k)$: Vector of FOA coefficients.
  • $\mathbf{n}(k)$: Measurement noise.

The least-squares estimate:

\hat{\mathbf{a}}(k) = (\mathbf{Y}\mathbf{B})^\dagger\,\mathbf{p}(k,r)

poses numerical challenges at low frequency due to the vanishing of $R_n(kr)$ for $n\ge 1$, which amplifies noise. Regularization schemes include:

  • Tikhonov (diagonal) regularization:

c_n(k)=\frac{|R_n(kr)|^2}{|R_n(kr)|^2+\lambda^2},\qquad \hat{\mathbf{a}}_R = \mathbf{C}\,(\mathbf{Y}\mathbf{B})^\dagger\,\mathbf{p}

  • Spectral floor: replace $1/R_n$ with $1/(R_n+\varepsilon)$ so that the inverse is lower-bounded.

Selecting $\lambda$ or $\varepsilon$ appropriately trades off distortion of the encoded Ambisonics coefficients against suppression of noise, as quantified by metrics such as:

\mathrm{DIST} = \frac{\|(\mathbf{C}-\mathbf{I})\,\mathbf{a}\|^2}{\|\mathbf{a}\|^2}
\qquad
G_{\mathrm{noise}} = \frac{E\!\left[\|\mathbf{C}(\mathbf{Y}\mathbf{B})^\dagger \mathbf{n}\|^2\right]}{E\!\left[\|\mathbf{n}\|^2\right]}
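As a concrete illustration, the following is a per-frequency-bin NumPy sketch of the Tikhonov-weighted encoder described above; the interface and variable names are assumptions, not code from the cited work.

```python
import numpy as np

def encode_foa_regularized(p, Y, R, lam):
    """Regularized FOA encoding at a single frequency bin.

    p   : (Q,) complex microphone pressures
    Y   : (Q, 4) real spherical-harmonic matrix for the array directions
    R   : (4,) radial functions [R0, R1, R1, R1] at this kr
    lam : Tikhonov regularization parameter lambda
    """
    a_ls = np.linalg.pinv(Y @ np.diag(R)) @ p         # least-squares estimate (Y B)^+ p
    c = np.abs(R) ** 2 / (np.abs(R) ** 2 + lam ** 2)  # per-channel weights c_n
    return c * a_ls                                   # regularized estimate C (Y B)^+ p
```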

Experiments indicate deep networks operating on Ambisonic signals must be informed of (or trained across) the regularization to maintain localization and detection accuracy, especially in real measured data with low SNR (Shaybet et al., 2024).

3. Sparse Plane-Wave Expansion and Upmixing via MDCT

FOA upmixers based on real-valued sparse MDCT decompositions enable high-precision spatial resolution, surpassing classical methods such as HARPEX or DirAC in sharpness of plane-wave localization (Likhachov et al., 2023).

Key elements:

  • Signal chain: Compute windowed MDCTs on each FOA channel using overlapping frames. Assemble the spectra $\mathbf{X}(n,k)\in\mathbb{R}^4$ per time–frequency tile.
  • Sparse decomposition: Express the MDCT tile as a sparse sum over a dictionary $D$ of plane-wave atoms parameterized by direction and MDCT frequency:

\min_{A} \|X - DA\|_2^2 + \lambda\|A\|_1

Each atom: $d_i = [1/\sqrt{2},\ \cos\theta_i\cos\phi_i,\ \sin\theta_i\cos\phi_i,\ \sin\phi_i]^\top \otimes b_i$, where $b_i$ is a real MDCT basis vector.

  • Solver: Iterative gradient descent with soft-thresholding drives the coefficient vector $A$ to be sparse (see the ISTA-style sketch at the end of this section). Multi-layer MDCTs (sizes 32, 128, 256, 1024, 2048) enable joint time/frequency precision.
  • Plane-wave synthesis and DoA estimation: Each active atom yields FOA gains following the real SH basis. DoA per tile is computed in closed form—a significant advantage over complex-FFT parametric methods.

\begin{align*}
a &= \sqrt{X^2 + Y^2 + Z^2} \\
(\tilde x, \tilde y, \tilde z) &= (X/a,\; Y/a,\; Z/a) \\
\theta &= \operatorname{atan2}(\tilde y, \tilde x), \qquad \phi = \arcsin(\tilde z)
\end{align*}
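In code, the per-tile closed-form direction estimate reduces to a few lines; this is a NumPy sketch with illustrative names, applied per active atom after the sparse decomposition.

```python
import numpy as np

def doa_from_foa_tile(X, Y, Z, eps=1e-12):
    """Closed-form azimuth/elevation (radians) from the directional FOA
    components of one time-frequency tile, per the equations above."""
    a = np.sqrt(X**2 + Y**2 + Z**2) + eps          # directional magnitude
    x_t, y_t, z_t = X / a, Y / a, Z / a            # unit direction vector
    azimuth = np.arctan2(y_t, x_t)
    elevation = np.arcsin(np.clip(z_t, -1.0, 1.0))
    return azimuth, elevation
```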

Subjective and objective results show sharper, more stable spatial images than both linear MDCT and traditional parametric upmixers, especially in scenes containing transients and multiple sources.
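The solver step referenced above can be sketched as plain ISTA (iterative soft-thresholding) on the per-tile objective; this is a generic sketch under the stated formulation, not the cited upmixer's exact solver.

```python
import numpy as np

def ista_sparse_decomposition(X, D, lam, n_iter=200):
    """Approximately solve min_A ||X - D A||_2^2 + lam * ||A||_1.

    X : (M,) stacked MDCT tile over the four FOA channels
    D : (M, P) dictionary of plane-wave atoms
    Returns a sparse coefficient vector A of length P.
    """
    A = np.zeros(D.shape[1])
    L = 2.0 * np.linalg.norm(D, 2) ** 2           # Lipschitz constant of the gradient
    step = 1.0 / L
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ A - X)            # gradient of the quadratic term
        v = A - step * grad                       # gradient step
        A = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)  # soft threshold
    return A
```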

4. Neural and Learned FOA Encoders: Compression and Representation

Neural approaches provide discrete compressed representations of FOA (spatial) audio, while explicitly preserving crucial spatial cues:

  • FOA Tokenizer (Sudarsanam et al., 25 Oct 2025): Extends the WavTokenizer architecture to 4-channel FOA. Uses an initial $1\times 1$ convolution ($4 \to 512$ channels) followed by a series of stride-2 1D convolutions (kernel 7, GroupNorm+SiLU, residual connections) for $320\times$ total downsampling, and outputs a latent sequence of shape $T/320 \times 512$.
    • VQ layer: Single codebook ($V=4096$, $D=512$), trained with EMA updates and dead-code reactivation.
    • Losses: The key component is the spatial consistency loss, which enforces cosine similarity between the intensity-vector directions of input and reconstruction, with masking for energetic/direct bins (a sketch of this loss appears at the end of this section):

\mathcal{L}_{sc} = \frac{1}{T K}\sum_{t=1}^{T}\sum_{k=1}^{K} w_{t,k}\,\bigl(1 - s_{t,k}\bigr)

Bitrate: 0.9 kbps (75 tokens/s × 12 bits/token).

  • Effectiveness: On simulated, anechoic, and real-RIR datasets, angular errors for FOA-VQGAN at 0.9 kbps attain 13.76° (in-domain), 3.96° (anechoic), and 25.83° (real-RIR MEIR). The codec supports downstream tasks (SELD), demonstrating angular preservation superior to Opus at much lower bitrates.

This class of encoder yields compressed FOA token streams suitable for both transmission and downstream machine tasks, with spatial information retained in the latent space.
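The spatial consistency loss referenced above can be approximated from acoustic intensity vectors computed on the FOA STFT. The following NumPy sketch uses a simple energy mask as a stand-in for the energetic/direct-bin weighting; it is illustrative, not the cited tokenizer's exact loss.

```python
import numpy as np

def spatial_consistency_loss(foa_ref, foa_rec, eps=1e-8):
    """foa_ref, foa_rec: complex STFTs of shape (4, T, K), channels (W, X, Y, Z)."""
    def intensity(foa):
        # Active intensity direction per bin: Re{ conj(W) * [X, Y, Z] }
        return np.real(np.conj(foa[0])[None] * foa[1:4])    # (3, T, K)

    i_ref, i_rec = intensity(foa_ref), intensity(foa_rec)
    s = (i_ref * i_rec).sum(axis=0) / (
        np.linalg.norm(i_ref, axis=0) * np.linalg.norm(i_rec, axis=0) + eps)
    energy = np.abs(foa_ref[0]) ** 2
    w = (energy > np.median(energy)).astype(float)           # stand-in bin mask w_{t,k}
    return float(np.mean(w * (1.0 - s)))                     # (1/TK) * sum w (1 - s)
```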

5. FOA Encoders in Machine Learning and Spatial Understanding Pipelines

  • Spatial Audio QA and LALMs: The SPUR framework (Sakshi et al., 10 Nov 2025) introduces an FOA encoder mapping B-format signals to spatially structured features for audio-LLMs.
    • SSCV feature extraction: Band-limited STFTs, mel-weighted per-band covariances, one-pole temporal smoothing, and real-valued vectorization project FOA into 16-dimensional, rotation-aware scene vectors per time–frequency tile (a banded-covariance sketch follows this list).
    • Network: Stacked 3D Conv layers over time, frequency, and channel, patch-tokenization, and transformer encoding yield geometry-aware scene tokens. A 2-layer MLP adapts these tokens for integration into pre-trained LALMs (via LoRA adapters).
    • Empirical findings: Removing banded covariance or normalization reduces spatial QA or localization performance by 1–2 points on the SPUR-Set, underscoring the role of these FOA-derived features.
  • Contrastive AV Pretraining for SELD: Systems such as (Fujita et al., 2024) ingest FOA mel spectrograms and intensity features, then use a ResNet–Conformer backbone and projection head to produce per-direction latent embeddings. Joint audio–visual contrastive objectives at both DOA-wise and recording-wise levels enable self-supervised learning of spatial structure, reducing SELD error by 1.5 points on real-world VR content.
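A rough sketch of the banded spatial-covariance idea behind such features is given below; the actual SPUR feature (mel weighting, one-pole smoothing, 16-dimensional projection) differs, so this should be read as illustrative only.

```python
import numpy as np

def banded_foa_covariance_features(foa_stft, band_edges):
    """foa_stft: complex STFT of shape (4, T, K); band_edges: list of (lo, hi) bins.
    Returns per-frame, per-band real covariance features of shape (T, n_bands, 10)."""
    iu = np.triu_indices(4)                     # upper triangle of the 4x4 covariance
    feats = []
    for lo, hi in band_edges:
        band = foa_stft[:, :, lo:hi]            # (4, T, K_band)
        cov = np.einsum('ctk,dtk->tcd', band, np.conj(band)).real
        feats.append(cov[:, iu[0], iu[1]])      # (T, 10) per band
    return np.stack(feats, axis=1)
```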

6. End-to-End FOA Generation and Spatial Audio Synthesis

Recent models synthesize or translate to FOA directly from semantic or visual context by using FOA encoders as part of multi-modal generative frameworks.

  • Diff-SAGe (Kushwaha et al., 2024): Represents FOA as an $8\times T \times F$ tensor (real/imaginary parts of the STFT over four channels), conditions a flow-matching Transformer on semantic and spatial cues, and reconstructs FOA using ISTFT. The approach preserves inter-channel phase, enabling conditioning on content class and DoA for generation tasks.
  • OmniAudio (Liu et al., 21 Apr 2025): Defines an FOA encoder-decoder as a spatial VAE within a pipeline mapping features from a dual-branch video transformer (panoramic + FoV) into FOA spatial latents. Losses are multi-resolution STFT (plus KL/adversarial terms), with flow-matching in latent space ensuring accurate alignment of output (W,X,Y,Z) channels to visually inferred source directions.
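As a small illustration of that tensor layout (shapes assumed; not Diff-SAGe code), packing and unpacking the complex FOA STFT looks like this:

```python
import numpy as np

def foa_stft_to_real_tensor(foa_stft):
    """Pack a complex FOA STFT of shape (4, T, F) into an (8, T, F) real tensor
    holding the real and imaginary parts of W, X, Y, Z (phase is preserved)."""
    return np.concatenate([foa_stft.real, foa_stft.imag], axis=0)

def real_tensor_to_foa_stft(tensor):
    """Invert the packing back to a complex (4, T, F) FOA STFT before ISTFT."""
    return tensor[:4] + 1j * tensor[4:]
```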

7. Implementation and Computational Considerations

  • Computational Load: Classical FOA MDCT upmixers require multiple MDCTs per channel per block ($O(L\,M\log M)$), repeated over several resolutions, plus $N$ iterations of sparse decomposition per tile; real-time implementations may require solver acceleration (OMP, FISTA) or precomputed dictionaries (Likhachov et al., 2023).
  • Neural Codecs: Strided-convolutional encoders (tokenizers) efficiently process FOA sequences but require careful VQ codebook sizing, code reactivation, and feature normalization to maintain spatial fidelity.
  • Integration with DNNs and LLMs: FOA feature normalization, preservation of channel symmetries, banded covariance, and transformation invariances are critical for robustness and spatial-consistency in downstream models (Sakshi et al., 10 Nov 2025). Training over regularization parameter grids, augmenting data with spatial transformations, and exposing all pertinent signal features as model input are recommended.
  • Latency and Windowing: The largest MDCT/STFT window sets the latency, typically 40–50 ms for a 2048-sample window at 48 kHz, with additional buffering for overlap processing.
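As a sanity check on that figure, the window duration alone is $2048 / 48\,000 \approx 42.7$ ms, which already falls within the quoted range; overlap-add buffering and any look-ahead add further delay on top.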

First-Order Ambisonics (FOA) encoding thus remains the central bridge between physical field capture, compact scene representation, upmixing, and advanced spatial audio processing for machine understanding and audio-visual generation across research and industry. High-fidelity regularized encoders, robust learned compressive models, and transformer-based spatial reasoning systems mark the continued convergence of classical spatial acoustics and modern audio machine learning.
