Audio–Visual Fusion for Emotion Recognition in the Valence–Arousal Space Using Joint Cross-Attention
An efficient model-level fusion approach for continuous affect recognition from audiovisual signals
Modeling emotions with a discrete categorical representation has two main restrictions:
(1) A set of emotion categories must be defined before classification, yet most established sets of discrete categories cannot cover the entire emotional space.
(2) Due to cultural differences and other factors, the same category may be expressed differently across individuals, leading to ambiguity in classification. Moreover, there is no agreement on the number of categorical emotions to be analysed.
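These restrictions motivate the dimensional view adopted here, in which affect is a point in a continuous valence–arousal plane rather than one of a fixed set of labels. As a minimal sketch of the contrast (the class names, the particular category set, and the [-1, 1] value range are illustrative assumptions, not taken from the original):

```python
from dataclasses import dataclass
from enum import Enum, auto


class BasicEmotion(Enum):
    """Discrete representation: a fixed label set chosen before classification."""
    ANGER = auto()
    DISGUST = auto()
    FEAR = auto()
    HAPPINESS = auto()
    SADNESS = auto()
    SURPRISE = auto()
    # Any affective state outside these categories cannot be expressed.


@dataclass(frozen=True)
class ValenceArousal:
    """Dimensional representation: a point in the continuous valence-arousal plane.

    valence: how negative/positive the state is, assumed here to lie in [-1, 1].
    arousal: how passive/active the state is, assumed here to lie in [-1, 1].
    """
    valence: float
    arousal: float

    def __post_init__(self) -> None:
        # Validate the assumed range without mutating the frozen instance.
        if not (-1.0 <= self.valence <= 1.0 and -1.0 <= self.arousal <= 1.0):
            raise ValueError("valence and arousal must lie in [-1, 1]")


# A mildly pleasant, fairly calm state has no obvious discrete label,
# but it is an ordinary point in the valence-arousal space.
state = ValenceArousal(valence=0.3, arousal=-0.4)
```

Because the dimensional target is a pair of real values rather than a class index, continuous affect recognition becomes a regression problem over time, which is the setting the fusion model in this work addresses.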