ADCRN: A Lightweight Neural Network for Binaural Reproduction on Head-Mounted Devices

Tianyou Li*, Xiaobin Rong*, Junjie Shi†, Huiyuan Sun†, Xiaohuai Le†, Chuanzeng Huang†, Jing Lu*
* Key Laboratory of Modern Acoustics, Nanjing University · † ByteDance

ABSTRACT

Binaural reproduction (BR) on head-mounted devices (HMDs) converts microphone-array recordings into binaural signals. Conventional binaural signal matching (BSM) derives BR filters from predefined source distributions and therefore degrades when the actual sound field deviates from those assumptions. Recent DNN-based BR methods improve content accuracy and spatial fidelity, but they often require large parameter counts and heavy computation, and their effectiveness on HMDs remains under-explored. This demo presents a lightweight Attention-enhanced Dual-path Convolutional Recurrent Network (ADCRN) that combines a dual-path recurrent bottleneck for temporal–spectral modeling with efficient channel attention for cross-channel recalibration. On simulated datasets generated from AR-glasses microphone arrays, ADCRN achieves state-of-the-art objective and subjective performance with fewer parameters and lower computational complexity than competing DNN-based methods.

1. Microphone Array & Network Structure

AR-glasses microphone array from the EasyCom Dataset
Microphone layout
Mic ID   X (front, mm)   Y (left, mm)   Z (up, mm)
0             41              82            15
1            100              -1            19
2             81             -77            18
3             10             -83            15
ADCRN architecture overview (model diagram)
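To make the two building blocks named in the abstract concrete, the following is a minimal PyTorch sketch of an efficient-channel-attention gate for cross-channel recalibration and a dual-path recurrent block for temporal–spectral modeling, operating on a [batch, channels, frames, frequency-bands] feature map. The layer sizes, hidden dimensions, and exact wiring are illustrative assumptions and do not reproduce the actual ADCRN implementation.

```python
# Illustrative sketch only: ECA-style channel attention + dual-path recurrent block.
import torch
import torch.nn as nn


class EfficientChannelAttention(nn.Module):
    """ECA-style gate: global pooling, 1-D conv across channels, sigmoid scaling."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, C, T, F]
        w = x.mean(dim=(2, 3))                     # [B, C] global average pooling
        w = self.conv(w.unsqueeze(1)).squeeze(1)   # local cross-channel interaction
        w = torch.sigmoid(w)[:, :, None, None]     # [B, C, 1, 1] channel weights
        return x * w                               # recalibrated feature map


class DualPathRecurrentBlock(nn.Module):
    """Intra-frame RNN along frequency, then inter-frame RNN along time."""

    def __init__(self, channels: int, hidden: int = 64):
        super().__init__()
        self.intra_rnn = nn.GRU(channels, hidden, batch_first=True, bidirectional=True)
        self.intra_fc = nn.Linear(2 * hidden, channels)
        self.inter_rnn = nn.GRU(channels, hidden, batch_first=True)  # causal along time
        self.inter_fc = nn.Linear(hidden, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, f = x.shape
        # Intra path: model the spectral pattern within each frame.
        intra = x.permute(0, 2, 3, 1).reshape(b * t, f, c)
        intra = self.intra_fc(self.intra_rnn(intra)[0])
        intra = intra.reshape(b, t, f, c).permute(0, 3, 1, 2)
        x = x + intra                               # residual connection
        # Inter path: model the temporal evolution of each frequency band.
        inter = x.permute(0, 3, 2, 1).reshape(b * f, t, c)
        inter = self.inter_fc(self.inter_rnn(inter)[0])
        inter = inter.reshape(b, f, t, c).permute(0, 3, 2, 1)
        return x + inter


if __name__ == "__main__":
    feat = torch.randn(1, 32, 100, 65)  # [batch, channels, frames, freq bands]
    feat = EfficientChannelAttention()(feat)
    feat = DualPathRecurrentBlock(channels=32)(feat)
    print(feat.shape)  # torch.Size([1, 32, 100, 65])
```

In a full model, blocks like these would typically sit between a convolutional encoder and decoder; the channel-attention gate mixes information across channels at low cost, while the dual-path block covers the spectral and temporal axes with small recurrent states.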

2. Test samples for MUSHRA under different RT60 conditions

Note. For accurate binaural playback, please use wired stereo headphones connected directly to the device (disable any spatialization, EQ, or audio enhancements). Listen carefully for perceived source direction and spatial realism.
RT60 conditions: 0.1 s, 0.4 s, 0.65 s, 0.9 s, 1.2 s
Methods: Reference, ADCRN (Proposed), MDFNet, LS, MagLS, iMagLS
For each RT60 condition, one binaural audio sample is provided per method.

3. Subjective evaluation: setup and results

MUSHRA setup
We conduct a MUSHRA test under five reverberation conditions with RT60 values of 0.1 s, 0.4 s, 0.65 s, 0.9 s, and 1.2 s. In each condition, every trial presents stimuli rendered from the same utterance: a hidden reference, a degraded anchor obtained by applying a 3.5 kHz low-pass filter to the reference, and the outputs of five methods, namely the proposed ADCRN, MDFNet, and the three BSM baselines (LS, MagLS, and iMagLS). The presentation order is randomized within each trial. Listeners rate each stimulus on a scale from 0 to 100, with higher scores indicating closer agreement with the reference in terms of both audio quality and perceived source localization. A total of 25 subjects participate in the listening test.
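As a side note on the anchor condition, the snippet below sketches one way to generate the 3.5 kHz low-pass anchor from a reference binaural file. The eighth-order zero-phase Butterworth filter and the file names are assumptions for illustration; the MUSHRA methodology only prescribes the 3.5 kHz bandwidth limitation.

```python
# Minimal sketch: 3.5 kHz low-pass anchor for a MUSHRA test.
import soundfile as sf
from scipy.signal import butter, sosfiltfilt


def make_lowpass_anchor(reference_wav: str, anchor_wav: str, cutoff_hz: float = 3500.0) -> None:
    x, fs = sf.read(reference_wav)                          # x: [samples, 2] binaural reference
    sos = butter(8, cutoff_hz, btype="low", fs=fs, output="sos")
    anchor = sosfiltfilt(sos, x, axis=0)                    # zero-phase filtering, per channel
    sf.write(anchor_wav, anchor, fs)


# Hypothetical file names, shown only to illustrate the call.
make_lowpass_anchor("reference_rt60_0p4s.wav", "anchor_rt60_0p4s.wav")
```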
MUSHRA results
MUSHRA results plot
Conclusion: The subjective evaluation further confirms the advantage of the proposed ADCRN in perceptual quality and spatial realism. ADCRN consistently receives ratings approaching those of the hidden reference, with scores concentrated near the upper bound (above 90), indicating strong listener preference and high spatial fidelity. Compared with the BSM-based baselines (LS, MagLS, iMagLS), ADCRN delivers substantial perceptual improvements, and it also outperforms MDFNet despite having significantly lower model complexity. These results confirm that ADCRN-rendered binaural signals not only match the reference in objective quality but also provide an immersive and perceptually faithful spatial audio experience.
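For readers who want to reproduce the style of analysis behind such a plot, the sketch below shows a common way to aggregate raw MUSHRA ratings into per-condition, per-stimulus means with 95% confidence intervals across the 25 listeners. The array layout and the dummy data are hypothetical; this is not the authors' analysis script.

```python
# Sketch: aggregate MUSHRA ratings (0-100) into means and 95% confidence intervals.
import numpy as np
from scipy import stats


def mushra_summary(ratings: np.ndarray):
    """ratings: [listeners, conditions, stimuli] -> (mean, ci_half_width)."""
    n = ratings.shape[0]
    mean = ratings.mean(axis=0)                             # [conditions, stimuli]
    sem = ratings.std(axis=0, ddof=1) / np.sqrt(n)          # standard error of the mean
    ci_half_width = stats.t.ppf(0.975, df=n - 1) * sem      # 95% CI via Student's t
    return mean, ci_half_width


# Dummy ratings: 25 listeners, 5 RT60 conditions, 7 stimuli
# (reference, anchor, ADCRN, MDFNet, LS, MagLS, iMagLS).
ratings = np.random.default_rng(0).uniform(0, 100, size=(25, 5, 7))
mean, ci = mushra_summary(ratings)
print(mean.shape, ci.shape)  # (5, 7) (5, 7)
```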