Multimodal fusion for emotion recognition refers to the family of machine learning approaches that integrate information from multiple modalities in order to predict an outcome measure. This outcome is usually either a discrete class (e.g., happy vs. sad) or a continuous value (e.g., the level of arousal/valence). Several literature reviews survey existing approaches to multimodal emotion recognition.
There are three key aspects to any multimodal fusion approach. The paper "Leveraging Recent Advances in Deep Learning for Audio-Visual Emotion Recognition" frames them as:
(i) which features to extract,
(ii) how to fuse the features,
(iii) how to capture the temporal dynamics.
The paper "Deep Auto-Encoders With Sequential Learning for Multimodal Dimensional Emotion Recognition" poses similar questions:
(i) how to simultaneously learn compact yet representative features from multimodal data,
(ii) how to effectively capture complementary features from multimodal streams,
(iii) how to perform all of these tasks in an end-to-end manner.
Several handcrafted features have been designed for audio-visual emotion recognition (AVER). On the visual side, these low-level descriptors mainly concern geometric features such as facial landmarks.
Meanwhile, commonly used audio signal features include spectral, cepstral, prosodic, and voice quality features. Recently, deep neural network-based features have become more popular for AVER. These deep learning-based approaches fall into two main categories. In the first, several handcrafted features are extracted from the video and audio signals and then fed to the deep neural network. In the second, raw visual and audio signals are fed directly to the deep network. Deep convolutional neural networks (CNNs) have been observed to outperform other AVER methods.
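As a minimal illustration of the first category, the sketch below extracts handcrafted cepstral features (MFCCs) from an audio signal before any learning takes place. The synthetic tone and the librosa-based pipeline are assumptions for demonstration, not the setup of any particular paper.

```python
import numpy as np
import librosa

# Synthetic 1-second audio signal standing in for a real utterance (assumption).
sr = 16000                                   # sample rate in Hz
t = np.linspace(0, 1.0, sr, endpoint=False)
audio = 0.5 * np.sin(2 * np.pi * 220.0 * t)  # 220 Hz tone as placeholder speech

# Handcrafted cepstral descriptors: 13 MFCCs per analysis frame.
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)   # shape: (13, n_frames)

# A common choice is to summarise frame-level features into a fixed-length
# clip-level descriptor (here: per-coefficient mean and standard deviation)
# before passing it to a classifier or a deep network.
audio_feature = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # shape: (26,)
print(audio_feature.shape)
```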
An important consideration in multimodal emotion recognition concerns the way in which the audio and visual features are fused together.
Four types of strategy are reported in the literature:
Feature-level fusion, also called early fusion, concerns approaches in which features are integrated immediately after extraction, typically via simple concatenation into a single high-dimensional feature vector.
This is the most common strategy for multimodal emotion recognition.
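A minimal sketch of early fusion, assuming the audio and visual features have already been extracted as fixed-length vectors (the dimensionalities below are arbitrary placeholders):

```python
import numpy as np

# Hypothetical per-clip feature vectors from each modality (assumed shapes).
audio_feature = np.random.rand(26)    # e.g., MFCC statistics
visual_feature = np.random.rand(136)  # e.g., 68 facial landmarks (x, y)

# Early fusion: concatenate into one high-dimensional vector,
# which is then fed to a single downstream classifier or regressor.
fused = np.concatenate([audio_feature, visual_feature])  # shape: (162,)
print(fused.shape)
```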
Decision-level fusion, or late fusion, concerns approaches that perform fusion after an independent prediction is made by a separate model for each modality. In the audio-visual case, this typically means taking the predictions from an audio-only model and the predictions from a visual-only model, and applying an algebraic combination rule over the predicted class scores, such as 'min', 'sum', and so on. Score-level fusion is a subfamily of decision-level fusion that employs an equally weighted summation of the individual unimodal predictions.
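The sketch below illustrates the 'sum' and 'min' combination rules on per-class probability vectors from unimodal audio and visual models; the class set and scores are made up for illustration.

```python
import numpy as np

classes = ["happy", "sad", "angry"]

# Hypothetical class-probability outputs of the two unimodal models.
p_audio = np.array([0.6, 0.3, 0.1])
p_visual = np.array([0.4, 0.2, 0.4])

# 'sum' rule (equally weighted score-level fusion).
p_sum = (p_audio + p_visual) / 2.0
# 'min' rule: keep the least confident score per class.
p_min = np.minimum(p_audio, p_visual)

print("sum rule:", classes[int(np.argmax(p_sum))])
print("min rule:", classes[int(np.argmax(p_min))])
```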
Hybrid fusion combines the output of an early-fusion model with the individual classification scores of each modality.
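Continuing the running example, a hybrid scheme might average the score of an early-fusion classifier with the unimodal scores; the equal weighting below is an assumption for illustration.

```python
import numpy as np

# Hypothetical class-probability outputs (same class order as above).
p_audio = np.array([0.6, 0.3, 0.1])    # audio-only model
p_visual = np.array([0.4, 0.2, 0.4])   # visual-only model
p_early = np.array([0.5, 0.1, 0.4])    # model trained on concatenated features

# Hybrid fusion: combine the early-fusion output with the unimodal scores.
p_hybrid = (p_early + p_audio + p_visual) / 3.0
print(p_hybrid)
```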
Model-level fusion aims to learn a joint representation of the multiple input modalities by first concatenating the input feature representations, and then passing these through a model that computes a learned, internal representation prior to making its prediction.
In this family of approaches, multiple kernel learning and graphical models have been studied, in addition to neural network-based approaches.
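A minimal neural sketch of model-level fusion, assuming pre-extracted audio and visual feature vectors and a discrete emotion label set; the PyTorch module, layer sizes, and names are illustrative choices, not an architecture from the surveyed papers.

```python
import torch
import torch.nn as nn

class ModelLevelFusion(nn.Module):
    """Learns a joint representation from concatenated modality features."""

    def __init__(self, audio_dim=26, visual_dim=136, hidden_dim=64, num_classes=3):
        super().__init__()
        # The concatenated features pass through a learned internal representation...
        self.joint = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, hidden_dim),
            nn.ReLU(),
        )
        # ...before the prediction is made.
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio_feat, visual_feat):
        fused = torch.cat([audio_feat, visual_feat], dim=-1)
        joint_repr = self.joint(fused)
        return self.classifier(joint_repr)

# Toy forward pass with a batch of 4 clips.
model = ModelLevelFusion()
logits = model(torch.randn(4, 26), torch.randn(4, 136))
print(logits.shape)  # torch.Size([4, 3])
```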
Audio-visual data represent a dynamic set of signals across both spatial and temporal dimensions.
The literature identifies three distinct ways in which deep learning is typically used to model these signals:
Spatial feature representations: these concern learning features from individual images or very short image sequences, or from short periods of audio.
Temporal feature representations: here, sequences of audio or image inputs serve as the model's input. It has been demonstrated that deep neural networks, and especially recurrent neural networks, are capable of capturing the temporal dynamics of such sequences (a sketch combining the temporal and joint views follows this list).
Joint feature representations: in these approaches, features extracted from the unimodal pipelines are combined. Once features have been extracted from multiple modalities at multiple time points, they are fused using one of the modality-fusion strategies described above.
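A minimal sketch combining the temporal and joint views: each modality's frame-level features are summarised by a recurrent encoder, and the resulting per-modality representations are fused into a joint representation before prediction. The PyTorch encoders, feature dimensions, and sequence lengths are assumptions for illustration, not a published architecture.

```python
import torch
import torch.nn as nn

class TemporalJointFusion(nn.Module):
    """Per-modality LSTM encoders followed by a fused joint representation."""

    def __init__(self, audio_dim=26, visual_dim=136, hidden_dim=32, num_classes=3):
        super().__init__()
        # Temporal feature representations: one recurrent encoder per modality.
        self.audio_rnn = nn.LSTM(audio_dim, hidden_dim, batch_first=True)
        self.visual_rnn = nn.LSTM(visual_dim, hidden_dim, batch_first=True)
        # Joint feature representation: fuse the final hidden states.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, audio_seq, visual_seq):
        _, (h_audio, _) = self.audio_rnn(audio_seq)    # h_audio: (1, batch, hidden)
        _, (h_visual, _) = self.visual_rnn(visual_seq)
        joint = torch.cat([h_audio[-1], h_visual[-1]], dim=-1)
        return self.classifier(joint)

# Toy batch: 4 clips, 50 audio frames and 30 video frames each.
model = TemporalJointFusion()
logits = model(torch.randn(4, 50, 26), torch.randn(4, 30, 136))
print(logits.shape)  # torch.Size([4, 3])
```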
Sources: the related work section of "Leveraging Recent Advances in Deep Learning for Audio-Visual Emotion Recognition" and the abstract of "Deep Auto-Encoders With Sequential Learning for Multimodal Dimensional Emotion Recognition".