2025 |
Gan, Chenquan; Zhou, Daitao; Wang, Kexin; Zhu, Qingyi; Jain, Deepak Kumar; Štruc, Vitomir Optimizing ambiguous speech emotion recognition through spatial–temporal parallel network with label correction strategy Journal Article In: Computer Vision and Image Understanding, vol. 260, no. 104483, pp. 1–14, 2025. Abstract | Link | BibTeX | Tags: deep learning, emotion recognition, speech, speech processing, speech technologies @article{CVIU_2025b, Speech emotion recognition is of great significance for improving the human–computer interaction experience. However, traditional methods based on hard labels have difficulty dealing with the ambiguity of emotional expression. Existing studies alleviate this problem by redefining labels, but they still rely on the subjective judgement of annotators and fail to fully consider truly ambiguous speech samples that lack a dominant label. To address the insufficient expressiveness of emotional labels and the neglect of ambiguous samples without a dominant label, we propose a label correction strategy that uses a model trained with exact-sample knowledge to correct inappropriate labels of ambiguous speech samples, integrating model training with emotion cognition. The strategy is implemented on a spatial–temporal parallel network that adopts temporal pyramid pooling (TPP) to process the variable-length features of speech and improve the efficiency of speech emotion recognition. Experiments show that ambiguous speech with corrected labels further improves speech emotion recognition performance. |
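The temporal pyramid pooling mentioned in the abstract maps a variable-length sequence of frame-level features to a fixed-length vector by pooling at several temporal scales. A minimal illustrative sketch (not the paper's implementation; function name and pyramid levels are assumptions):

```python
import numpy as np

def temporal_pyramid_pool(features, levels=(1, 2, 4)):
    """Mean-pool a (T x D) feature sequence at multiple temporal scales
    and concatenate the pooled segments into one fixed-length vector."""
    T, D = features.shape
    pooled = []
    for n_segments in levels:
        # split the time axis into n_segments roughly equal chunks
        bounds = np.linspace(0, T, n_segments + 1).astype(int)
        for i in range(n_segments):
            # guard against empty chunks when T < n_segments
            seg = features[bounds[i]:max(bounds[i + 1], bounds[i] + 1)]
            pooled.append(seg.mean(axis=0))
    # output length is D * sum(levels), independent of T
    return np.concatenate(pooled)

# two utterances of different length map to vectors of the same size
short = temporal_pyramid_pool(np.random.randn(37, 8))
long_ = temporal_pyramid_pool(np.random.randn(512, 8))
assert short.shape == long_.shape == (8 * (1 + 2 + 4),)
```

The fixed-length output is what lets a downstream classifier handle utterances of arbitrary duration.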
2010 |
Gajšek, Rok; Štruc, Vitomir; Mihelič, France Multi-modal Emotion Recognition using Canonical Correlations and Acoustic Features Proceedings Article In: Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 4133-4136, IAPR, Istanbul, Turkey, 2010. Abstract | Link | BibTeX | Tags: acoustic features, canonical correlations, emotion recognition, facial expression recognition, multi modality, speech processing, speech technologies @inproceedings{ICPR_Gajsek_2010, Information about the psycho-physical state of the subject is becoming a valuable addition to modern audio and video recognition systems. As well as enabling a better user experience, it can also improve the recognition accuracy of the base system. In this article, we present our approach to a multi-modal (audio-video) emotion recognition system. For the audio sub-system, a feature set comprising prosodic, spectral and cepstral features is selected, and a support vector classifier is used to produce scores for each emotional category. For the video sub-system, a novel approach is presented which does not rely on the tracking of specific facial landmarks and thus eliminates the problems usually caused when the tracking algorithm fails to detect the correct area. The system is evaluated on the eNTERFACE database and the recognition accuracy of our audio-video fusion is compared to published results in the literature. |
Gajšek, Rok; Štruc, Vitomir; Mihelič, France Multi-modal Emotion Recognition based on the Decoupling of Emotion and Speaker Information Proceedings Article In: Proceedings of Text, Speech and Dialogue (TSD), pp. 275-282, Springer-Verlag, Berlin, Heidelberg, 2010. Abstract | Link | BibTeX | Tags: emotion recognition, facial expression recognition, multi modality, speech processing, speech technologies, spontaneous emotions, video processing @inproceedings{TSD_Emo_Gajsek, The standard features used in emotion recognition carry, besides the emotion-related information, also cues about the speaker. This is expected, since the nature of emotionally colored speech is similar to the variations in the speech signal caused by different speakers. Therefore, we present a gradient-descent-derived transformation for decoupling the emotion and speaker information contained in the acoustic features. The Interspeech '09 Emotion Challenge feature set is used as the baseline for the audio part. A similar procedure is employed on the video signal, where the nuisance attribute projection (NAP) is used to derive the transformation matrix, which contains information about the emotional state of the speaker. Ultimately, different NAP transformation matrices are compared using canonical correlations. The audio and video sub-systems are combined at the matching-score level using different fusion techniques. The presented system is assessed on the publicly available eNTERFACE'05 database, where significant improvements in recognition performance are observed compared to the state-of-the-art baseline. |
2009 |
Gajšek, Rok; Štruc, Vitomir; Dobrišek, Simon; Žibert, Janez; Mihelič, France; Pavešić, Nikola Combining audio and video for detection of spontaneous emotions Proceedings Article In: Biometric ID management and multimodal communication, pp. 114-121, Springer-Verlag, Berlin, Heidelberg, 2009. Abstract | Link | BibTeX | Tags: emotion recognition, facial expression recognition, performance evaluation, speech processing, speech technologies @inproceedings{BioID_Multi2009b, The paper presents our initial attempts at building an audio-video emotion recognition system. Both the audio and video sub-systems are discussed, and a description of the database of spontaneous emotions is given. The task of labelling the recordings from the database according to different emotions is discussed, and the measured agreement between multiple annotators is presented. Instead of focusing on prosody in audio emotion recognition, we evaluate the possibility of using linear transformations (CMLLR) as features. The classification results from the audio and video sub-systems are combined using sum-rule fusion, and the increase in recognition results when using both modalities is presented. |
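The sum-rule fusion used above to combine the two modalities is a simple score-level scheme: per-class scores from each sub-system are normalized and added, and the fused decision is the arg-max class. A minimal sketch under assumed min-max normalization (the scores and helper names below are illustrative, not from the paper):

```python
import numpy as np

def sum_rule_fusion(audio_scores, video_scores):
    """Fuse per-class scores from two modalities with the sum rule."""
    def normalize(s):
        # min-max normalize scores to [0, 1] so modalities are comparable
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-12)
    fused = normalize(audio_scores) + normalize(video_scores)
    # the fused decision is the class with the highest summed score
    return int(np.argmax(fused)), fused

# e.g. three emotion classes; the modalities disagree, fusion decides
label, fused = sum_rule_fusion([0.2, 0.7, 0.1], [0.6, 0.5, 0.4])
```

Because the rule only needs matching scores, not features, it works even when the two sub-systems use entirely different classifiers.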