Studies of perceived audio-visual spatial coherence have commonly employed continuous judgment scales. This method requires listeners to detect and quantify their perception of a given feature, a difficult task, particularly for untrained listeners. An alternative is to quantify the percept with a simple forced-choice test and subsequent modeling of the psychometric function. An experiment was performed to validate this alternative method for the perception of azimuthal audio-visual spatial coherence. In addition, information on participant training and localization ability was gathered. The results are consistent with previous research and show that the proposed methodology is suitable for this kind of test. The main differences between participants stem from the presence or absence of musical training.
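The threshold in such a forced-choice paradigm is typically obtained by fitting a psychometric function to the responses. As a minimal sketch, assuming Python with SciPy and entirely hypothetical response data (the variable names and values are illustrative, not from the study), a logistic function can be fit to the proportion of "coherent" judgments per offset angle:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical data: azimuthal audio-visual offsets (degrees) and the
# proportion of "spatially coherent" responses at each offset.
offsets = np.array([0, 4, 8, 12, 16, 20, 24, 28])
p_coherent = np.array([0.98, 0.95, 0.88, 0.70, 0.45, 0.22, 0.10, 0.04])

def psychometric(x, threshold, slope):
    """Descending logistic: p = 1 / (1 + exp(slope * (x - threshold)))."""
    return 1.0 / (1.0 + np.exp(slope * (x - threshold)))

# Fit threshold (50% point) and slope to the observed proportions.
params, _ = curve_fit(psychometric, offsets, p_coherent, p0=[15.0, 0.3])
threshold, slope = params
print(f"50% coherence threshold: {threshold:.1f} deg (slope {slope:.2f})")
```

The 50% point of the fitted function then serves as the estimated maximum acceptable offset for that participant or group.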
The ventriloquism effect describes the phenomenon of audio and visual signals with common features, such as a voice and a talking face, merging perceptually into one percept even if they are spatially misaligned. The boundaries of this fusion of spatially misaligned stimuli are of interest for the design of multimedia products, to ensure a perceptually satisfactory result. They have mainly been studied using continuous judgment scales and forced-choice measurement methods, and the results vary greatly between studies. The current experiment evaluates audio-visual fusion using reaction time (RT) measurements as an indirect method intended to overcome this large variance. A two-alternative forced-choice (2AFC) word recognition test was designed and tested with noise and multi-talker speech background distractors. Visual signals were presented centrally, and audio signals were presented at audio-visual offsets between 0° and 31° in azimuth. RT data were analyzed separately for the underlying Simon effect and for attentional effects. For the attentional effects, three models were identified, but no single model could explain the observed RTs for all participants, so the data were grouped and analyzed accordingly. The results show significant differences in RTs from 5° to 10° onwards for the Simon effect. The attentional effect varied at the same audio-visual offsets for two of the three participant groups. In contrast to prior research, these results suggest that, even for speech signals, small audio-visual offsets influence spatial integration subconsciously.
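At its core, a Simon-effect analysis compares RTs between trials where the task-irrelevant auditory offset side matches the response side (congruent) and trials where it does not (incongruent). A hedged sketch with fabricated example data, assuming Python with SciPy (the numbers below are illustrative, not results from the study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical RTs in milliseconds at one audio-visual offset:
# congruent trials (audio offset toward the response side) vs.
# incongruent trials (audio offset away from the response side).
rt_congruent = rng.normal(520, 40, size=200)
rt_incongruent = rng.normal(545, 40, size=200)  # Simon effect: slower

# Simon effect size = mean RT difference (incongruent - congruent).
simon_effect = rt_incongruent.mean() - rt_congruent.mean()

# Independent-samples t-test on whether the RT difference is significant.
t_stat, p_value = stats.ttest_ind(rt_incongruent, rt_congruent)
print(f"Simon effect: {simon_effect:.1f} ms, p = {p_value:.4f}")
```

Repeating this comparison per offset angle shows the smallest offset at which the congruency of the auditory stimulus measurably affects RTs.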
In media reproduction, there are many situations in which audio and visual signals, coming
from the same object, are presented with a spatial offset. When the offset is small enough the
spatial conflict is usually resolved by the brain, merging the different information into one
unified object; this is the so-called ventriloquism effect. With respect to evolving immersive
technologies such as virtual and augmented reality, it is important to define the maximum
acceptable offset angle for creating a convincing environment. However, in the literature on
the ventriloquism effect, values for this maximum acceptable offset angle vary greatly.
Therefore, a series of experiments was devised to investigate the influencing factors
leading to this great variation. First, the influence of participants' background and sensory
training in hearing and vision was assessed. In a second step, the influence of stimulus
properties such as semantic category was examined. In both cases, a forced-choice
yes/no experiment was conducted to evaluate participants' thresholds of perceived spatial
coherence. The third set of experiments strove to evaluate the ventriloquism effect indirectly
using reaction time measurements, to circumvent the observed influencing factors.
The results show that auditory sensory training greatly influences the measured offset
angles, with a nearly doubled acceptable offset angle for untrained participants (19°) compared
to musically trained ones (10°). The measured offset is further dependent on signal properties
linked to localisation precision with variations in the range of ±2°. Both findings can be
explained within the current model of bimodal spatial integration. In contrast to these results,
the reaction time measurements reveal that offsets of 5° and less can influence
human bimodal integration independently of sensory training.
The divergent results are discussed in terms of the brain's two-stream processing of
semantic and spatial information, in order to derive recommendations for media reproduction
that take into account the different use cases of various devices and reproduction methods.