Speech Intelligibility In Reverberation Based On Audio-Visual Scenes Recordings Reproduced In A 3D Virtual Environment

Author(s)

Angela Guastamacchia, Fabrizio Riente, Louena Shtrepi, Giuseppina Emma Puglisi, Franco Pellerey, Arianna Astolfi

Affiliation

Politecnico di Torino, Department of Energy, Corso Duca degli Abruzzi

Publication date

2024

Abstract

Audio-visual scenes were collected in a medium-sized reverberant conference hall through in-field third-order ambisonics impulse response recordings and 360-degree stereoscopic videos. The visual scenes included cues of the room and the location of the sound sources, without lip-sync-related cues. Speech intelligibility tests based on seven audio-visual scenes were administered inside an immersive virtual 3D environment reproduced through a spherical 16-speaker array synchronized with a head-mounted display. Forty normal-hearing subjects took part in tests of the intelligibility of a talker positioned in front of the listener and amplified by two symmetrical lateral loudspeakers, examining the effects of (i) different listener-to-talker distances, (ii) one-talker noise at various azimuth angles around the listener, (iii) high reverberation with –5 dB signal-to-noise ratio, (iv) self-motion, and (v) visual cues. Tests were conducted in four configurations: audio-visual and audio-only, each with self-motion and in the static condition. The static audio-only tests scored the highest speech intelligibility, followed by a tie between the audio-visual tests with self-motion and in the static condition. Speech intelligibility decreased as the target-to-listener distance increased in all the noisy scenes. Additionally, with the talker at approximately 8 m from the listener, speech intelligibility was higher when the noise azimuth was 120° than when it was 180° or 0°. The advantage of spatially separating the noise signal in reverberation is evident in the audio-visual test with self-motion. This suggests a spatial release from masking in the presence of reverberation and interfering one-talker noise, and within a more ecological scene.

Full paper

https://www.sciencedirect.com/science/article/pii/S0360132324003962

Keywords

Speech intelligibility; Self-motion; Visual cues; Audio-visual recordings; 3D virtual environment; Spatial release from masking