2026
Gan, Chenquan; Zhou, Daitao; Zhu, Qingyi; Wang, Xibin; Jain, Deepak Kumar; Štruc, Vitomir: Improving Emotion Recognition from Ambiguous Speech via Spatio-Temporal Spectrum Analysis and Real-Time Soft-Label Correction. Journal Article. In: IEEE Transactions on Affective Computing, pp. 1-16, 2026. Tags: deep learning, emotion recognition, speech, speech processing.
Abstract: Speech is a fundamental medium for conveying human emotions, and speech-based emotion recognition (SER) systems have therefore become pivotal in advancing human-computer interaction (HCI) across a range of applications. While significant progress has been made in SER in recent years, existing solutions still face several key challenges: they (i) rely excessively on subjectively annotated (discrete) labels during training, (ii) often overlook the label ambiguity of speech samples that express more than one class of emotion, and (iii) underutilize unlabeled or ambiguous speech, for which typically only a label distribution (so-called soft labels) is available. To address these issues, we propose a novel SER model that explicitly handles ambiguous speech samples and overcomes the shortcomings outlined above. Central to our approach is a novel real-time soft-label correction strategy designed to refine the annotations assigned to ambiguous speech. The proposed model leverages both (explicitly) labeled and ambiguous samples and applies the dynamic soft-label correction strategy alongside an enhanced inter-class difference loss function to iteratively optimize the label distributions during training. We theoretically demonstrate that our method can approximate the true emotional distribution of speech even in the presence of label noise, suggesting that ambiguous speech samples without explicit emotion labels still contribute toward more effective emotion recognition. Furthermore, we integrate the representational power of convolutional neural networks (CNNs) with the contextual modeling capabilities of Wav2Vec 2.0 to enable comprehensive extraction of spatio-temporal speech features. Experimental results on the IEMOCAP multi-label dataset confirm the effectiveness of our approach, achieving state-of-the-art performance with significant improvements in weighted accuracy (WA) and unweighted accuracy (UA) over competing methods.
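The abstract does not spell out the correction rule, but iterative soft-label refinement of this kind can be illustrated as a blend between the current label distribution and the model's softmax prediction. The PyTorch sketch below is a minimal, hypothetical illustration: the function name, the retention factor alpha, and the exponential-moving-average form are assumptions, not the authors' exact method.

```python
import torch
import torch.nn.functional as F

def correct_soft_labels(soft_labels, logits, alpha=0.9):
    """Refine soft labels using the model's current prediction.

    soft_labels: (batch, num_classes) current label distributions
    logits:      (batch, num_classes) raw model outputs
    alpha:       retention factor (hypothetical; the paper's schedule
                 for this trade-off is not given in the abstract)
    """
    with torch.no_grad():
        pred = F.softmax(logits, dim=-1)
        # Moving-average blend toward the model's belief; renormalize
        # so each row remains a valid probability distribution.
        updated = alpha * soft_labels + (1.0 - alpha) * pred
        return updated / updated.sum(dim=-1, keepdim=True)
```

Applied once per training step to the ambiguous samples only, an update of this shape lets the label distributions drift toward the model's consensus while the retained fraction alpha damps the effect of label noise.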
2025
Gan, Chenquan; Zhou, Daitao; Wang, Kexin; Zhu, Qingyi; Jain, Deepak Kumar; Štruc, Vitomir: Optimizing ambiguous speech emotion recognition through spatial–temporal parallel network with label correction strategy. Journal Article. In: Computer Vision and Image Understanding, vol. 260, no. 104483, pp. 1–14, 2025. Tags: deep learning, emotion recognition, speech, speech processing, speech technologies.
Abstract: Speech emotion recognition is of great significance for improving the human–computer interaction experience. Traditional methods based on hard labels, however, have difficulty dealing with the ambiguity of emotional expression. Existing studies alleviate this problem by redefining labels, but they still rely on the annotators' subjective emotional judgments and fail to fully account for truly ambiguous speech samples that lack a dominant label. To address the insufficient expressiveness of emotional labels and the neglect of ambiguous samples without dominant labels, we propose a label correction strategy that uses a model with knowledge of exactly labeled samples to revise inappropriate labels of ambiguous speech, integrating model training with emotion cognition. The strategy is implemented on a spatial–temporal parallel network that adopts temporal pyramid pooling (TPP) to handle the variable-length features of speech and improve the efficiency of emotion recognition. Experiments show that ambiguous speech, once its labels are corrected, further improves speech emotion recognition performance.
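Temporal pyramid pooling itself is a standard construction for mapping variable-length sequences to fixed-size vectors. The PyTorch sketch below assumes max pooling and a (1, 2, 4) pyramid, neither of which is specified in the abstract.

```python
import torch

def temporal_pyramid_pool(x, levels=(1, 2, 4)):
    """Pool a variable-length feature sequence to a fixed-size vector.

    x: (frames, channels) features for one utterance; `frames` varies.
    levels: temporal segments per pyramid level (assumed configuration).
    Returns a (channels * sum(levels),) vector.
    """
    frames, _ = x.shape
    pooled = []
    for n_segments in levels:
        # Split the time axis into roughly equal segments and max-pool
        # each one, so the output size no longer depends on utterance
        # length; the guard keeps every segment at least one frame wide.
        bounds = torch.linspace(0, frames, n_segments + 1).long()
        for i in range(n_segments):
            lo = bounds[i].item()
            hi = max(bounds[i + 1].item(), lo + 1)
            pooled.append(x[lo:hi].max(dim=0).values)
    return torch.cat(pooled)
```

For example, with 512-channel frame features and levels (1, 2, 4), every utterance maps to a 512 × 7 = 3584-dimensional vector regardless of its duration.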
2009
Gajšek, Rok; Štruc, Vitomir; Dobrišek, Simon; Mihelič, France: Emotion recognition using linear transformations in combination with video. Proceedings Article. In: Speech and intelligence: proceedings of Interspeech 2009, pp. 1967-1970, Brighton, UK, 2009. Tags: emotion recognition, facial expression recognition, interspeech, speech, speech technologies, spontaneous emotions.
Abstract: The paper discusses the use of linear transformations of Hidden Markov Models, normally employed for speaker and environment adaptation, as a way of extracting the emotional components from speech. A constrained version of the Maximum Likelihood Linear Regression (CMLLR) transformation is used as a feature for classifying normal versus aroused emotional states. We present a procedure for incrementally building a set of speaker-independent acoustic models that are used to estimate the CMLLR transformations for emotion classification. An audio-video database of spontaneous emotions (AvID) is briefly presented, since it forms the basis for the evaluation of the proposed method. Emotion classification using the video part of the database is also described, and the added value of combining the visual information with the audio features is shown.
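The key idea is that the adaptation transform itself, rather than the adapted acoustic model, serves as the feature. A minimal, hypothetical sketch of that final classification step is given below, assuming the CMLLR matrices have already been estimated per recording with an ASR toolkit (e.g., HTK or Kaldi); the SVM classifier and cross-validation setup are illustrative choices, not details from the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def classify_emotional_state(cmllr_transforms, labels):
    """Classify normal (0) vs. aroused (1) speech from CMLLR transforms.

    cmllr_transforms: list of (d, d+1) CMLLR matrices (rotation + bias),
                      one per recording, estimated externally.
    labels:           per-recording binary emotional-state labels.
    """
    # Flatten each transform matrix into a single feature vector.
    X = np.stack([W.ravel() for W in cmllr_transforms])
    y = np.asarray(labels)
    clf = SVC(kernel="linear")
    # Report 5-fold cross-validated classification accuracy.
    return cross_val_score(clf, X, y, cv=5).mean()
```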
Gajšek, Rok; Štruc, Vitomir; Mihelič, France; Podlesek, Anja; Komidar, Luka; Sočan, Gregor; Bajec, Boštjan: Multi-modal emotional database: AvID. Journal Article. In: Informatica (Ljubljana), vol. 33, no. 1, pp. 101-106, 2009. Tags: avid, database, dataset, emotion recognition, facial expression recognition, speech, speech technologies, spontaneous emotions.
Abstract: This paper presents our work on recording a multi-modal database containing emotional audio and video recordings. In designing the recording strategies, special attention was paid to gathering data involving spontaneous emotions and thereby obtaining more realistic training and testing conditions for experiments. With specially planned scenarios, including playing computer games and conducting an adaptive intelligence test, different levels of arousal were induced. This will enable us both to detect different emotional states and to experiment with speaker identification/verification of the people involved. So far, the multi-modal database has been recorded and a basic evaluation of the data has been carried out.
Gajšek, Rok; Štruc, Vitomir; Vesnicer, Boštjan; Podlesek, Anja; Komidar, Luka; Mihelič, France: Analysis and assessment of AvID: multi-modal emotional database. Proceedings Article. In: Text, speech and dialogue / 12th International Conference, pp. 266-273, Springer-Verlag, Berlin, Heidelberg, 2009. Tags: avid database, database, emotion recognition, multimodal database, speech, speech technologies.
Abstract: The paper deals with the recording and evaluation of a multi-modal (audio/video) database of spontaneous emotions. First, the motivation for this work is given and the different recording strategies used are described. Special attention is given to the process of evaluating the emotional database. Different kappa statistics normally used for measuring inter-annotator agreement are discussed. Following the problems of standard kappa coefficients when used for emotional database assessment, a new time-weighted free-marginal kappa is presented. It differs from the other kappa statistics in that it weights each utterance's agreement score by the duration of the utterance. The new method is evaluated, and its superiority over the standard kappa when dealing with a database of spontaneous emotions is demonstrated.
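The abstract gives the idea but not the formula. The sketch below assumes the standard free-marginal (Randolph) kappa, κ = (P_o − P_e)/(1 − P_e) with chance agreement P_e = 1/k for k categories, and replaces the plain mean over utterances in P_o with a duration-weighted mean; treat this exact weighting as an assumption rather than the paper's definition.

```python
import numpy as np

def time_weighted_free_marginal_kappa(agreements, durations, n_categories):
    """Duration-weighted free-marginal kappa (assumed form).

    agreements:   per-utterance proportion of annotator pairs that agree
    durations:    per-utterance length in seconds, used as the weight
    n_categories: number of emotion categories annotators chose from
    """
    a = np.asarray(agreements, dtype=float)
    w = np.asarray(durations, dtype=float)
    # Observed agreement: duration-weighted mean instead of a plain mean,
    # so long utterances count for more than short ones.
    p_o = np.average(a, weights=w)
    # Free-marginal chance agreement assumes uniform category use.
    p_e = 1.0 / n_categories
    return (p_o - p_e) / (1.0 - p_e)
```

With equal durations this reduces to the ordinary free-marginal kappa, which is the consistency property one would expect from the weighted variant.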