2025 |
Gan, Chenquan; Zhou, Daitao; Wang, Kexin; Zhu, Qingyi; Jain, Deepak Kumar; Štruc, Vitomir Optimizing ambiguous speech emotion recognition through spatial–temporal parallel network with label correction strategy Journal Article In: Computer Vision and Image Understanding, vol. 260, no. 104483, pp. 1–14, 2025. Abstract | Link | BibTeX | Tags: deep learning, emotion recognition, speech, speech processing, speech technologies @article{CVIU_2025b, Speech emotion recognition is of great significance for improving the human–computer interaction experience. However, traditional methods based on hard labels have difficulty dealing with the ambiguity of emotional expression. Existing studies alleviate this problem by redefining labels, but they still rely on the annotators' subjective emotional judgments and fail to fully account for truly ambiguous speech samples that lack a dominant label. To address the insufficient expressiveness of emotional labels and the neglect of ambiguous samples without a dominant label, we propose a label correction strategy that uses a model trained with exact-sample knowledge to revise inappropriate labels of ambiguous speech samples, integrating model training with emotion cognition. The strategy is implemented on a spatial–temporal parallel network, which adopts temporal pyramid pooling (TPP) to process variable-length speech features and improve the efficiency of speech emotion recognition. Experiments show that ambiguous speech, once its labels are corrected, further improves speech emotion recognition performance. |
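The temporal pyramid pooling (TPP) mentioned in the abstract maps a variable-length feature sequence to a fixed-size vector. A minimal NumPy sketch of the general idea (the pyramid levels and the max-pooling operator here are illustrative assumptions, not details taken from the paper):

```python
import numpy as np

def temporal_pyramid_pooling(features, levels=(1, 2, 4), pool=np.max):
    """Pool a variable-length feature sequence of shape (T, D) into a
    fixed-size vector.

    At each pyramid level k the time axis is split into k roughly equal
    segments; each segment is pooled over time and all results are
    concatenated, giving a vector of length D * sum(levels) regardless
    of the sequence length T.
    """
    pooled = []
    for k in levels:
        # np.array_split handles T not divisible by k
        for segment in np.array_split(features, k, axis=0):
            pooled.append(pool(segment, axis=0))
    return np.concatenate(pooled)
```

Because the output size depends only on the feature dimension and the pyramid levels, utterances of different durations can be fed to the same fixed-size classifier head.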
2009 |
Gajšek, Rok; Štruc, Vitomir; Dobrišek, Simon; Mihelič, France Emotion recognition using linear transformations in combination with video Proceedings Article In: Speech and intelligence: proceedings of Interspeech 2009, pp. 1967-1970, Brighton, UK, 2009. Abstract | Link | BibTeX | Tags: emotion recognition, facial expression recognition, interspeech, speech, speech technologies, spontaneous emotions @inproceedings{InterSp2009, The paper discusses the use of linear transformations of Hidden Markov Models, normally employed for speaker and environment adaptation, as a way of extracting the emotional components from speech. A constrained version of the Maximum Likelihood Linear Regression (CMLLR) transformation is used as a feature for classifying a normal or aroused emotional state. We present a procedure for incrementally building a set of speaker-independent acoustic models that are used to estimate the CMLLR transformations for emotion classification. An audio-video database of spontaneous emotions (AvID) is briefly presented, since it forms the basis for the evaluation of the proposed method. Emotion classification using the video part of the database is also described, and the added value of combining the visual information with the audio features is shown. |
Gajšek, Rok; Štruc, Vitomir; Mihelič, France; Podlesek, Anja; Komidar, Luka; Sočan, Gregor; Bajec, Boštjan Multi-modal emotional database: AvID Journal Article In: Informatica (Ljubljana), vol. 33, no. 1, pp. 101-106, 2009. Abstract | Link | BibTeX | Tags: avid, database, dataset, emotion recognition, facial expression recognition, speech, speech technologies, spontaneous emotions @article{Inform-Gajsek_2009, This paper presents our work on recording a multi-modal database containing emotional audio and video recordings. In designing the recording strategies, special attention was paid to gathering data involving spontaneous emotions and thus obtaining more realistic training and testing conditions for experiments. With specially planned scenarios, including playing computer games and taking an adaptive intelligence test, different levels of arousal were induced. This will enable us both to detect different emotional states and to experiment with speaker identification/verification of the people involved. So far, the multi-modal database has been recorded and a basic evaluation of the data has been carried out. |
Gajšek, Rok; Štruc, Vitomir; Vesnicer, Boštjan; Podlesek, Anja; Komidar, Luka; Mihelič, France Analysis and assessment of AvID: multi-modal emotional database Proceedings Article In: Text, speech and dialogue / 12th International Conference, pp. 266-273, Springer-Verlag, Berlin, Heidelberg, 2009. Abstract | Link | BibTeX | Tags: avid database, database, emotion recognition, multimodal database, speech, speech technologies @inproceedings{TSD2009, The paper deals with the recording and evaluation of a multi-modal (audio/video) database of spontaneous emotions. First, the motivation for this work is given and the different recording strategies used are described. Special attention is given to the process of evaluating the emotional database. Different kappa statistics normally used to measure agreement between annotators are discussed. Following the problems of standard kappa coefficients when applied to emotional database assessment, a new time-weighted free-marginal kappa is presented. It differs from the other kappa statistics in that it weights each utterance's agreement score by the duration of the utterance. The new method is evaluated, and its superiority over the standard kappa when dealing with a database of spontaneous emotions is demonstrated. |
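The duration-weighting idea behind the time-weighted free-marginal kappa can be sketched as follows. In a free-marginal kappa the chance-agreement term is simply 1/C for C categories; here the observed agreement of each utterance is additionally weighted by its duration. This is a minimal illustration of that weighting scheme, not the paper's exact formulation:

```python
from itertools import combinations

def time_weighted_free_kappa(labels, durations, n_categories):
    """Duration-weighted free-marginal kappa (sketch).

    labels:       one tuple of annotator labels per utterance
    durations:    utterance durations (the weights), same length as labels
    n_categories: number of emotion categories C; chance agreement is 1/C
    """
    total = sum(durations)
    p_obs = 0.0
    for annot, dur in zip(labels, durations):
        # fraction of agreeing annotator pairs for this utterance
        pairs = list(combinations(annot, 2))
        agree = sum(a == b for a, b in pairs) / len(pairs)
        # longer utterances contribute more to observed agreement
        p_obs += (dur / total) * agree
    p_exp = 1.0 / n_categories
    return (p_obs - p_exp) / (1.0 - p_exp)
```

With perfect agreement on every utterance the statistic is 1 regardless of the weights; disagreement on long utterances pulls it down more than disagreement on short ones.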