In paper: MULTIMODAL TRANSFORMER FUSION FOR CONTINUOUS EMOTION RECOGNITION
Traditional feature-level fusion directly feeds the concatenated features into a classifier or uses a shallow fusion model [5], but it has difficulty learning the mutual relationships among different modalities.
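To make this concrete, below is a minimal sketch of feature-level fusion by concatenation, not the specific model of [5]: per-modality feature vectors are stacked into one flat vector and passed to a shallow classifier. The modality names, feature dimensions, and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConcatFusionClassifier(nn.Module):
    """Feature-level fusion: concatenate modality features, then classify."""

    def __init__(self, audio_dim=88, visual_dim=136, num_classes=4):
        super().__init__()
        # A single hidden layer stands in for the "shallow" fusion model.
        self.classifier = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, audio_feat, visual_feat):
        # Concatenation treats the modalities as one flat feature vector,
        # leaving any cross-modal relationship for the classifier to discover.
        fused = torch.cat([audio_feat, visual_feat], dim=-1)
        return self.classifier(fused)

# Usage on a dummy batch of 8 samples.
model = ConcatFusionClassifier()
logits = model(torch.randn(8, 88), torch.randn(8, 136))
print(logits.shape)  # torch.Size([8, 4])
```

Because the fusion happens only through concatenation, any interaction between modalities must be recovered implicitly by the classifier, which is the weakness the text notes.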
An alternative feature-level fusion strategy is multimodal representation learning. The main approach is to learn joint representations through a shared hidden layer connected to the inputs of multiple modalities.
Such models are usually built on deep learning frameworks such as deep autoencoders and DNNs [6]. Kim et al. [7] proposed four Deep Belief Network (DBN) architectures to capture complex non-linear correlations among multimodal features for emotion recognition. A sketch of the shared-hidden-layer idea follows.
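The sketch below illustrates the shared-hidden-layer approach in general terms; it is not the DBN architecture of [7] or the autoencoder of [6]. Each modality is first encoded separately, and both encodings feed one shared layer that produces the joint representation. All layer sizes and names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class JointRepresentationModel(nn.Module):
    """Joint multimodal representation via a shared hidden layer."""

    def __init__(self, audio_dim=88, visual_dim=136, hidden_dim=32, num_classes=4):
        super().__init__()
        # Modality-specific encoders.
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, 64), nn.ReLU())
        self.visual_enc = nn.Sequential(nn.Linear(visual_dim, 64), nn.ReLU())
        # Shared hidden layer connected to both modality encodings; its
        # activations serve as the learned joint representation.
        self.shared = nn.Sequential(nn.Linear(64 + 64, hidden_dim), nn.ReLU())
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio_feat, visual_feat):
        encodings = torch.cat(
            [self.audio_enc(audio_feat), self.visual_enc(visual_feat)], dim=-1
        )
        joint = self.shared(encodings)  # joint multimodal representation
        return self.head(joint)

model = JointRepresentationModel()
logits = model(torch.randn(8, 88), torch.randn(8, 136))
print(logits.shape)  # torch.Size([8, 4])
```

Unlike plain concatenation, the shared layer is trained jointly with both encoders, so cross-modal correlations can be captured in the hidden representation itself.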