Multimodal fusion for emotion recognition refers to the family of machine learning approaches that integrate information from multiple modalities in order to predict an outcome measure. This outcome is usually either a discrete class (e.g., happy vs. sad) or a continuous value (e.g., the level of arousal/valence). Several literature reviews survey existing approaches to multimodal emotion recognition.
There are three key aspects to any multimodal fusion approach. The paper "Leveraging Recent Advances in Deep Learning for Audio-Visual Emotion Recognition" frames them as:
(i) which features to extract,
(ii) how to fuse the features,
(iii) how to capture the temporal dynamics.
The paper "Deep Auto-Encoders With Sequential Learning for Multimodal Dimensional Emotion Recognition" poses similar questions:
(i) how to simultaneously learn compact yet representative features from multimodal data,
(ii) how to effectively capture complementary features from multimodal streams,
(iii) how to perform all of these tasks in an end-to-end manner.
Several handcrafted features have been designed for audio-visual emotion recognition (AVER). On the visual side, these low-level descriptors mainly concern geometric features such as facial landmarks.
Meanwhile, commonly used audio signal features include spectral, cepstral, prosodic, and voice quality features. Recently, deep neural network-based features have become more popular for AVER. These deep learning-based approaches fall into two main categories. In the first, several handcrafted features are extracted from the video and audio signals and then fed to the deep neural network. In the second, raw visual and audio signals are fed directly to the deep network. Deep convolutional neural networks (CNNs) have been observed to outperform other AVER methods.
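As a minimal illustration of the first category, the sketch below extracts handcrafted cepstral features (MFCCs) from an audio signal before any learning takes place. The synthetic tone and the librosa-based pipeline are assumptions for demonstration, not the setup of any particular paper.

```python
import numpy as np
import librosa

# Synthetic 1-second audio signal standing in for a real utterance (assumption).
sr = 16000                                   # sample rate in Hz
t = np.linspace(0, 1.0, sr, endpoint=False)
audio = 0.5 * np.sin(2 * np.pi * 220.0 * t)  # 220 Hz tone as placeholder speech

# Handcrafted cepstral descriptors: 13 MFCCs per analysis frame.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)   # shape: (13, n_frames)

# A common choice is to summarise frame-level features into a fixed-length
# clip-level descriptor (here: per-coefficient mean and standard deviation)
# before passing it to a classifier or a deep network.
audio_feature = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # shape: (26,)
print(audio_feature.shape)
```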
An important consideration in multimodal emotion recognition concerns the way in which the audio and visual features are fused together.
Four types of strategy are reported in the literature:
Feature-level fusion, also called early fusion, concerns approaches in which features are integrated immediately after extraction, typically via simple concatenation into a single high-dimensional feature vector.
This is the most common strategy for multimodal emotion recognition.
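A minimal sketch of early fusion, assuming the audio and visual features have already been extracted as fixed-length vectors (the dimensionalities below are arbitrary placeholders):

```python
import numpy as np

# Hypothetical per-clip feature vectors from each modality (assumed shapes).
audio_feature = np.random.rand(26)    # e.g., MFCC statistics
visual_feature = np.random.rand(136)  # e.g., 68 facial landmarks (x, y)

# Early fusion: concatenate into one high-dimensional vector,
# which is then fed to a single downstream classifier or regressor.
fused = np.concatenate([audio_feature, visual_feature])  # shape: (162,)
print(fused.shape)
```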
Decision-level fusion, or late fusion, concerns approaches that perform fusion after an independent prediction is made by a separate model for each modality. In the audio-visual case, this typically means taking the predictions from an audio-only model and the predictions from a visual-only model, and applying an algebraic combination rule over the predicted class scores, such as 'min', 'sum', and so on. Score-level fusion is a subfamily of decision-level fusion that employs an equally weighted summation of the individual unimodal predictions.
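The sketch below illustrates the 'sum' and 'min' combination rules on per-class probability vectors from unimodal audio and visual models; the class set and scores are made up for illustration.

```python
import numpy as np

classes = ["happy", "sad", "angry"]

# Hypothetical class-probability outputs of the two unimodal models.
p_audio = np.array([0.6, 0.3, 0.1])
p_visual = np.array([0.4, 0.2, 0.4])

# 'sum' rule (equally weighted score-level fusion).
p_sum = (p_audio + p_visual) / 2.0
# 'min' rule: keep the least confident score per class.
p_min = np.minimum(p_audio, p_visual)

print("sum rule:", classes[int(np.argmax(p_sum))])
print("min rule:", classes[int(np.argmax(p_min))])
```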
Hybrid fusion combines the output of an early-fusion model with the individual classification scores of each modality.
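Continuing the running example, a hybrid scheme might average the score of an early-fusion classifier with the unimodal scores; the equal weighting below is an assumption for illustration.

```python
import numpy as np

# Hypothetical class-probability outputs (same class order as above).
p_audio = np.array([0.6, 0.3, 0.1])    # audio-only model
p_visual = np.array([0.4, 0.2, 0.4])   # visual-only model
p_early = np.array([0.5, 0.1, 0.4])    # model trained on concatenated features

# Hybrid fusion: combine the early-fusion output with the unimodal scores.
p_hybrid = (p_early + p_audio + p_visual) / 3.0
print(p_hybrid)
```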
Model-level fusion aims to learn a joint representation of the multiple input modalities by first concatenating the input feature representations, and then passing these through a model that computes a learned, internal representation prior to making its prediction.
In this family of approaches, multiple kernel learning and graphical models have been studied, in addition to neural network-based approaches.
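A minimal neural sketch of model-level fusion, assuming pre-extracted audio and visual feature vectors and a discrete emotion label set; the PyTorch module, layer sizes, and names are illustrative choices, not an architecture from the surveyed papers.

```python
import torch
import torch.nn as nn

class ModelLevelFusion(nn.Module):
    """Learns a joint representation from concatenated modality features."""

    def __init__(self, audio_dim=26, visual_dim=136, hidden_dim=64, num_classes=3):
        super().__init__()
        # The concatenated features pass through a learned internal representation...
        self.joint = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, hidden_dim),
            nn.ReLU(),
        )
        # ...before the prediction is made.
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio_feat, visual_feat):
        fused = torch.cat([audio_feat, visual_feat], dim=-1)
        joint_repr = self.joint(fused)
        return self.classifier(joint_repr)

# Toy forward pass with a batch of 4 clips.
model = ModelLevelFusion()
logits = model(torch.randn(4, 26), torch.randn(4, 136))
print(logits.shape)  # torch.Size([4, 3])
```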
Audio-visual data represent a dynamic set of signals across both spatial and temporal dimensions.
The literature identifies three distinct ways in which deep learning is typically used to model these signals:
Spatial feature representations: these concern learning features from individual images or very short image sequences, or from short periods of audio.
Temporal feature representations: here, sequences of audio or image inputs serve as the model's input. It has been demonstrated that deep neural networks, and especially recurrent neural networks, are capable of capturing the temporal dynamics of such sequences (a sketch combining the temporal and joint views follows this list).
Joint feature representations: in these approaches, features extracted from the unimodal pipelines are combined. Once features have been extracted from multiple modalities at multiple time points, they are fused using one of the modality-fusion strategies described above.
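A minimal sketch combining the temporal and joint views: each modality's frame-level features are summarised by a recurrent encoder, and the resulting per-modality representations are fused into a joint representation before prediction. The PyTorch encoders, feature dimensions, and sequence lengths are assumptions for illustration, not a published architecture.

```python
import torch
import torch.nn as nn

class TemporalJointFusion(nn.Module):
    """Per-modality LSTM encoders followed by a fused joint representation."""

    def __init__(self, audio_dim=26, visual_dim=136, hidden_dim=32, num_classes=3):
        super().__init__()
        # Temporal feature representations: one recurrent encoder per modality.
        self.audio_rnn = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        self.visual_rnn = nn.LSTM(visual_dim, hidden_dim, batch_first=True)
        # Joint feature representation: fuse the final hidden states.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, audio_seq, visual_seq):
        _, (h_audio, _) = self.audio_rnn(audio_seq)    # h_audio: (1, batch, hidden)
        _, (h_visual, _) = self.visual_rnn(visual_seq)
        joint = torch.cat([h_audio[-1], h_visual[-1]], dim=-1)
        return self.classifier(joint)

# Toy batch: 4 clips, 50 audio frames and 30 video frames each.
model = TemporalJointFusion()
logits = model(torch.randn(4, 50, 26), torch.randn(4, 30, 136))
print(logits.shape)  # torch.Size([4, 3])
```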
Sources: the related work section of "Leveraging Recent Advances in Deep Learning for Audio-Visual Emotion Recognition" and the abstract of "Deep Auto-Encoders With Sequential Learning for Multimodal Dimensional Emotion Recognition".