Publications – Laboratory for Machine Intelligence

Gan, Chenquan; Zhou, Daitao; Zhu, Qingyi; Wang, Xibin; Jain, Deepak Kumar; Štruc, Vitomir

Improving Emotion Recognition from Ambiguous Speech via Spatio-Temporal Spectrum Analysis and Real-Time Soft-Label Correction Journal Article

In: IEEE Transactions on Affective Computing, pp. 1-16, 2026.

Abstract | Links | BibTeX | Tags: deep learning, emotion recognition, speech, speech processing

@article{TAC_2026,

title = {Improving Emotion Recognition from Ambiguous Speech via Spatio-Temporal Spectrum Analysis and Real-Time Soft-Label Correction},

author = {Chenquan Gan and Daitao Zhou and Qingyi Zhu and Xibin Wang and Deepak Kumar Jain and Vitomir Štruc},

url = {https://lmi.fe.uni-lj.si/wp-content/uploads/2025/12/Manuscript_clean.pdf},

year  = {2026},

date = {2026-03-01},

urldate = {2026-03-01},

journal = {IEEE Transactions on Affective Computing},

pages = {1-16},

abstract = {Speech represents a fundamental medium for conveying human emotions and, as a result, speech-based emotion recognition (SER) systems have become pivotal in advancing human-computer interaction (HCI) across a range of applications. While significant progress has been made in speech emotion recognition over recent years, existing solutions still face several key challenges, in that they: (i)  rely excessively  on subjectively annotated (discrete) labels during training, (ii)  often overlook the label ambiguity of speech samples that express more than one class of emotions, and (iii)  underutilize unlabeled or ambiguous speech, for which typically a label distribution (or so-called soft labels) is available. To address these issues, we propose in this paper a novel SER model that explicitly handles ambiguous  speech samples and overcomes the shortcomings outlined above. Central to our approach is a novel real-time soft-label correction strategy designed to refine the annotations assigned to ambiguous speech. The proposed model leverages both, (explicitly) labeled as well as ambiguous samples and applies the dynamic soft-label correction strategy alongside an enhanced inter-class difference loss function to iteratively optimize the label distributions during training. We theoretically demonstrate that our method is capable of approximating the true emotional distribution of speech even in the presence of label noise, suggesting that utilizing ambiguous speech samples without explicit emotion labels still contributes toward more effective emotion recognition. Furthermore, we integrate the representational power of convolutional neural networks (CNNs) with the contextual modeling capabilities of Wav2Vec 2.0 to enable a comprehensive extraction of spatio-temporal speech features. Experimental results on the IEMOCAP multi-label dataset confirm the effectiveness of our approach, achieving state-of-the-art performance with significant improvements in weighted accuracy (WA) and unweighted accuracy (UA) over competing methods.},

keywords = {deep learning, emotion recognition, speech, speech processing},

pubstate = {published},

tppubtype = {article}

}

Close

Speech represents a fundamental medium for conveying human emotions and, as a result, speech-based emotion recognition (SER) systems have become pivotal in advancing human-computer interaction (HCI) across a range of applications. While significant progress has been made in speech emotion recognition over recent years, existing solutions still face several key challenges, in that they: (i) rely excessively on subjectively annotated (discrete) labels during training, (ii) often overlook the label ambiguity of speech samples that express more than one class of emotions, and (iii) underutilize unlabeled or ambiguous speech, for which typically a label distribution (or so-called soft labels) is available. To address these issues, we propose in this paper a novel SER model that explicitly handles ambiguous speech samples and overcomes the shortcomings outlined above. Central to our approach is a novel real-time soft-label correction strategy designed to refine the annotations assigned to ambiguous speech. The proposed model leverages both, (explicitly) labeled as well as ambiguous samples and applies the dynamic soft-label correction strategy alongside an enhanced inter-class difference loss function to iteratively optimize the label distributions during training. We theoretically demonstrate that our method is capable of approximating the true emotional distribution of speech even in the presence of label noise, suggesting that utilizing ambiguous speech samples without explicit emotion labels still contributes toward more effective emotion recognition. Furthermore, we integrate the representational power of convolutional neural networks (CNNs) with the contextual modeling capabilities of Wav2Vec 2.0 to enable a comprehensive extraction of spatio-temporal speech features. Experimental results on the IEMOCAP multi-label dataset confirm the effectiveness of our approach, achieving state-of-the-art performance with significant improvements in weighted accuracy (WA) and unweighted accuracy (UA) over competing methods.

Close

Gan, Chenquan; Zhou, Daitao; Wang, Kexin; Zhu, Qingyi; Jain, Deepak Kumar; Štruc, Vitomir

Optimizing ambiguous speech emotion recognition through spatial–temporal parallel network with label correction strategy Journal Article

In: Computer Vision and Image Understanding, vol. 260, no. 104483, pp. 1–14, 2025.

Abstract | Links | BibTeX | Tags: deep learning, emotion recognition, speech, speech processing, speech technologies

@article{CVIU_2025b,

title = {Optimizing ambiguous speech emotion recognition through spatial–temporal parallel network with label correction strategy},

author = {Chenquan Gan and Daitao Zhou and Kexin Wang and Qingyi Zhu and Deepak Kumar Jain and Vitomir Štruc},

url = {https://www.sciencedirect.com/science/article/pii/S1077314225002061?dgcid=coauthor

https://lmi.fe.uni-lj.si/wp-content/uploads/2025/09/CVIU.pdf},

doi = {https://doi.org/10.1016/j.cviu.2025.104483},

year  = {2025},

date = {2025-10-01},

urldate = {2025-10-01},

journal = {Computer Vision and Image Understanding},

volume = {260},

number = { 104483},

pages = {1--14},

abstract = {Speech emotion recognition is of great significance for improving the human–computer interaction experience. However, traditional methods based on hard labels have difficulty dealing with the ambiguity of emotional expression. Existing studies alleviate this problem by redefining labels, but still rely on the subjective emotional expression of annotators and fail to consider the truly ambiguous speech samples without dominant labels fully. To solve the problems of insufficient expression of emotional labels and ignoring ambiguous undominantly labeled speech samples, we propose a label correction strategy that uses a model with exact sample knowledge to modify inappropriate labels for ambiguous speech samples, integrating model training with emotion cognition, and considering the ambiguity without dominant label samples. It is implemented on a spatial–temporal parallel network, which adopts a temporal pyramid pooling (TPP) to process the variable-length features of speech to improve the recognition efficiency of speech emotion. Through experiments, it has been shown that ambiguous speech after label correction has a more promoting effect on the recognition performance of speech emotions.},

keywords = {deep learning, emotion recognition, speech, speech processing, speech technologies},

pubstate = {published},

tppubtype = {article}

}

Close

Gajšek, Rok; Štruc, Vitomir; Mihelič, France

Multi-modal Emotion Recognition using Canonical Correlations and Acustic Features Proceedings Article

In: Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 4133-4136, IAPR Istanbul, Turkey, 2010.

Abstract | Links | BibTeX | Tags: acustic features, canonical correlations, emotion recognition, facial expression recognition, multi modality, speech processing, speech technologies

Gajšek, Rok; Štruc, Vitomir; Mihelič, France

Multi-modal Emotion Recognition based on the Decoupling of Emotion and Speaker Information Proceedings Article

In: Proceedings of Text, Speech and Dialogue (TSD), pp. 275-282, Springer-Verlag, Berlin, Heidelberg, 2010.

Abstract | Links | BibTeX | Tags: emotion recognition, facial expression recognition, multi modality, speech processing, speech technologies, spontaneous emotions, video processing

Gajšek, Rok; Štruc, Vitomir; Dobrišek, Simon; Žibert, Janez; Mihelič, France; Pavešić, Nikola

Combining audio and video for detection of spontaneous emotions Proceedings Article

In: Biometric ID management and multimodal communication, pp. 114-121, Springer-Verlag, Berlin, Heidelberg, 2009.

Abstract | Links | BibTeX | Tags: emotion recognition, facial expression recognition, performance evaluation, speech processing, speech technologies

Gan, Chenquan; Zhou, Daitao; Wang, Kexin; Zhu, Qingyi; Jain, Deepak Kumar; Štruc, Vitomir

Optimizing ambiguous speech emotion recognition through spatial–temporal parallel network with label correction strategy Journal Article

In: Computer Vision and Image Understanding, vol. 260, no. 104483, pp. 1–14, 0000.

Abstract | Links | BibTeX | Tags: deep learning, emotion recognition, speech, speech processing, speech technologies