2024
|
Gan, Chenquan; Zheng, Jiahao; Zhu, Qingyi; Jain, Deepak Kumar; Štruc, Vitomir A graph neural network with context filtering and feature correction for conversational emotion recognition Journal Article In: Information Sciences, vol. 658, no. 120017, pp. 1-21, 2024. @article{InformSciences2024,
title = {A graph neural network with context filtering and feature correction for conversational emotion recognition},
author = {Chenquan Gan and Jiahao Zheng and Qingyi Zhu and Deepak Kumar Jain and Vitomir Štruc},
url = {https://www.sciencedirect.com/science/article/pii/S002002552301602X?via%3Dihub
https://lmi.fe.uni-lj.si/wp-content/uploads/2023/12/InformationSciences.pdf},
doi = {10.1016/j.ins.2023.120017},
year = {2024},
date = {2024-02-01},
journal = {Information Sciences},
volume = {658},
number = {120017},
pages = {1-21},
abstract = {Conversational emotion recognition represents an important machine-learning problem with a wide variety of deployment possibilities. The key challenge in this area is how to properly capture the key conversational aspects that facilitate reliable emotion recognition, including utterance semantics, temporal order, informative contextual cues, speaker interactions as well as other relevant factors. In this paper, we present a novel Graph Neural Network approach for conversational emotion recognition at the utterance level. Our method addresses the outlined challenges and represents conversations in the form of graph structures that naturally encode temporal order, speaker dependencies, and even long-distance context. To efficiently capture the semantic content of the conversations, we leverage the zero-shot feature-extraction capabilities of pre-trained large-scale language models and then integrate two key contributions into the graph neural network to ensure competitive recognition results. The first is a novel context filter that establishes meaningful utterance dependencies for the graph construction procedure and removes low-relevance and uninformative utterances from being used as a source of contextual information for the recognition task. The second contribution is a feature-correction procedure that adjusts the information content in the generated feature representations through a gating mechanism to improve their discriminative power and reduce emotion-prediction errors. We conduct extensive experiments on four commonly used conversational datasets, i.e., IEMOCAP, MELD, Dailydialog, and EmoryNLP, to demonstrate the capabilities of the developed graph neural network with context filtering and error-correction capabilities. The results of the experiments point to highly promising performance, especially when compared to state-of-the-art competitors from the literature.},
keywords = {context filtering, conversations, dialogue, emotion recognition, graph neural network, sentiment analysis},
pubstate = {published},
tppubtype = {article}
}
Conversational emotion recognition represents an important machine-learning problem with a wide variety of deployment possibilities. The key challenge in this area is how to properly capture the key conversational aspects that facilitate reliable emotion recognition, including utterance semantics, temporal order, informative contextual cues, speaker interactions as well as other relevant factors. In this paper, we present a novel Graph Neural Network approach for conversational emotion recognition at the utterance level. Our method addresses the outlined challenges and represents conversations in the form of graph structures that naturally encode temporal order, speaker dependencies, and even long-distance context. To efficiently capture the semantic content of the conversations, we leverage the zero-shot feature-extraction capabilities of pre-trained large-scale language models and then integrate two key contributions into the graph neural network to ensure competitive recognition results. The first is a novel context filter that establishes meaningful utterance dependencies for the graph construction procedure and removes low-relevance and uninformative utterances from being used as a source of contextual information for the recognition task. The second contribution is a feature-correction procedure that adjusts the information content in the generated feature representations through a gating mechanism to improve their discriminative power and reduce emotion-prediction errors. We conduct extensive experiments on four commonly used conversational datasets, i.e., IEMOCAP, MELD, Dailydialog, and EmoryNLP, to demonstrate the capabilities of the developed graph neural network with context filtering and error-correction capabilities. The results of the experiments point to highly promising performance, especially when compared to state-of-the-art competitors from the literature. |
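The abstract above describes two concrete mechanisms: a context filter that decides which past utterances may contribute edges during graph construction, and a gating-based correction applied to the node features. No code accompanies the entry here, so the sketch below is only a rough, assumed illustration of those two ideas (cosine-similarity thresholding as the filter, a sigmoid gate blending original and aggregated features); the function names, window size and threshold are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def build_filtered_graph(feats, window=4, tau=0.2):
    """Connect each utterance to past utterances in a context window,
    but drop low-relevance neighbours (the 'context filter' idea)."""
    n = len(feats)
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(max(0, i - window), i):
            if cosine(feats[i], feats[j]) > tau:   # keep only informative context
                adj[i, j] = adj[j, i] = 1.0
    return adj

def gated_correction(feats, adj, W):
    """One round of neighbourhood aggregation followed by a sigmoid gate
    that blends the original and aggregated features ('feature correction')."""
    deg = adj.sum(1, keepdims=True) + 1e-8
    agg = adj @ feats / deg                        # mean over retained neighbours
    gate = 1.0 / (1.0 + np.exp(-(feats @ W)))      # per-dimension gate in (0, 1)
    return gate * feats + (1.0 - gate) * agg

# toy dialogue: 6 utterances with 16-d (e.g. language-model) embeddings
feats = rng.normal(size=(6, 16))
adj = build_filtered_graph(feats)
corrected = gated_correction(feats, adj, W=rng.normal(size=(16, 16)) * 0.1)
print(adj.sum(), corrected.shape)
```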
2022
|
Gan, Chenquan; Yang, Yucheng; Zhu, Qingyi; Jain, Deepak Kumar; Štruc, Vitomir DHF-Net: A hierarchical feature interactive fusion network for dialogue emotion recognition Journal Article In: Expert Systems with Applications, vol. 210, 2022. @article{TextEmotionESWA,
title = {DHF-Net: A hierarchical feature interactive fusion network for dialogue emotion recognition},
author = {Chenquan Gan and Yucheng Yang and Qingyi Zhu and Deepak Kumar Jain and Vitomir Štruc},
url = {https://www.sciencedirect.com/science/article/pii/S0957417422016025?via%3Dihub},
doi = {10.1016/j.eswa.2022.118525},
year = {2022},
date = {2022-12-30},
urldate = {2022-08-01},
journal = {Expert Systems with Applications},
volume = {210},
abstract = {To balance the trade-off between contextual information and fine-grained information when identifying specific emotions during a dialogue, and to combine the interaction of hierarchical feature-related information, this paper proposes a hierarchical feature interactive fusion network (named DHF-Net), which not only retains the integrity of the context sequence information but also extracts more fine-grained information. To obtain deep semantic information, DHF-Net processes the tasks of recognizing dialogue emotion and dialogue act/intent separately, and then learns the cross-impact of the two tasks through collaborative attention. Also, a bidirectional gated recurrent unit (Bi-GRU) connected with a hybrid convolutional neural network (CNN) group is designed, by which the sequence information is smoothly passed to the multi-level local information layers for feature extraction. Experimental results show that, on two open session datasets, the performance of DHF-Net is improved by 1.8% and 1.2%, respectively.},
keywords = {attention, CNN, deep learning, dialogue, emotion recognition, fusion, fusion network, nlp, semantics, text, text processing},
pubstate = {published},
tppubtype = {article}
}
To balance the trade-off between contextual information and fine-grained information when identifying specific emotions during a dialogue, and to combine the interaction of hierarchical feature-related information, this paper proposes a hierarchical feature interactive fusion network (named DHF-Net), which not only retains the integrity of the context sequence information but also extracts more fine-grained information. To obtain deep semantic information, DHF-Net processes the tasks of recognizing dialogue emotion and dialogue act/intent separately, and then learns the cross-impact of the two tasks through collaborative attention. Also, a bidirectional gated recurrent unit (Bi-GRU) connected with a hybrid convolutional neural network (CNN) group is designed, by which the sequence information is smoothly passed to the multi-level local information layers for feature extraction. Experimental results show that, on two open session datasets, the performance of DHF-Net is improved by 1.8% and 1.2%, respectively. |
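As a rough illustration of the architecture sketched in the abstract (a Bi-GRU that preserves the context sequence, followed by a group of convolutions that extract finer-grained local features), a minimal PyTorch sketch is given below. Layer sizes, kernel widths and the pooling scheme are assumptions for illustration and do not reproduce the published DHF-Net, which additionally couples the emotion and dialogue-act tasks through collaborative attention.

```python
import torch
import torch.nn as nn

class BiGRUMultiScaleCNN(nn.Module):
    """Illustrative sketch: a Bi-GRU keeps the full context sequence, and a
    group of 1-D convolutions with different kernel sizes extracts
    finer-grained local features from the Bi-GRU outputs."""
    def __init__(self, in_dim=300, hid=128, n_classes=7):
        super().__init__()
        self.gru = nn.GRU(in_dim, hid, batch_first=True, bidirectional=True)
        # "hybrid CNN group": parallel convolutions over the GRU output sequence
        self.convs = nn.ModuleList(
            [nn.Conv1d(2 * hid, 64, kernel_size=k, padding=k // 2) for k in (1, 3, 5)]
        )
        self.cls = nn.Linear(3 * 64, n_classes)

    def forward(self, x):                      # x: (batch, seq_len, in_dim)
        seq, _ = self.gru(x)                   # (batch, seq_len, 2*hid)
        seq = seq.transpose(1, 2)              # Conv1d expects (batch, channels, seq_len)
        pooled = [conv(seq).max(dim=2).values for conv in self.convs]
        return self.cls(torch.cat(pooled, dim=1))

model = BiGRUMultiScaleCNN()
logits = model(torch.randn(2, 20, 300))        # two toy dialogues of 20 utterances
print(logits.shape)                            # torch.Size([2, 7])
```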
2013
|
Dobrišek, Simon; Gajšek, Rok; Mihelič, France; Pavešić, Nikola; Štruc, Vitomir Towards efficient multi-modal emotion recognition Journal Article In: International Journal of Advanced Robotic Systems, vol. 10, no. 53, 2013. @article{dobrivsek2013towards,
title = {Towards efficient multi-modal emotion recognition},
author = {Simon Dobrišek and Rok Gajšek and France Mihelič and Nikola Pavešić and Vitomir Štruc},
url = {https://lmi.fe.uni-lj.si/en/towardsefficientmulti-modalemotionrecognition/},
doi = {10.5772/54002},
year = {2013},
date = {2013-01-01},
urldate = {2013-01-01},
journal = {International Journal of Advanced Robotic Systems},
volume = {10},
number = {53},
abstract = {The paper presents a multi-modal emotion recognition system exploiting audio and video (i.e., facial expression) information. The system first processes both sources of information individually to produce corresponding matching scores and then combines the computed matching scores to obtain a classification decision. For the video part of the system, a novel approach to emotion recognition, relying on image-set matching, is developed. The proposed approach avoids the need for detecting and tracking specific facial landmarks throughout the given video sequence, which represents a common source of error in video-based emotion recognition systems, and, therefore, adds robustness to the video processing chain. The audio part of the system, on the other hand, relies on utterance-specific Gaussian Mixture Models (GMMs) adapted from a Universal Background Model (UBM) via maximum a posteriori probability (MAP) estimation. It improves upon the standard UBM-MAP procedure by exploiting gender information when building the utterance-specific GMMs, thus ensuring enhanced emotion recognition performance. Both the uni-modal parts and the combined system are assessed on the challenging multi-modal eNTERFACE'05 corpus with highly encouraging results. The developed system represents a feasible solution to emotion recognition that can easily be integrated into various systems, such as humanoid robots, smart surveillance systems and the like.},
keywords = {avid database, emotion recognition, facial expression recognition, multi modality, speech technologies},
pubstate = {published},
tppubtype = {article}
}
The paper presents a multi-modal emotion recognition system exploiting audio and video (i.e., facial expression) information. The system first processes both sources of information individually to produce corresponding matching scores and then combines the computed matching scores to obtain a classification decision. For the video part of the system, a novel approach to emotion recognition, relying on image-set matching, is developed. The proposed approach avoids the need for detecting and tracking specific facial landmarks throughout the given video sequence, which represents a common source of error in video-based emotion recognition systems, and, therefore, adds robustness to the video processing chain. The audio part of the system, on the other hand, relies on utterance-specific Gaussian Mixture Models (GMMs) adapted from a Universal Background Model (UBM) via maximum a posteriori probability (MAP) estimation. It improves upon the standard UBM-MAP procedure by exploiting gender information when building the utterance-specific GMMs, thus ensuring enhanced emotion recognition performance. Both the uni-modal parts and the combined system are assessed on the challenging multi-modal eNTERFACE'05 corpus with highly encouraging results. The developed system represents a feasible solution to emotion recognition that can easily be integrated into various systems, such as humanoid robots, smart surveillance systems and the like. |
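The audio part described above follows the classic UBM-MAP recipe: a Universal Background Model is trained on background speech and its means are then MAP-adapted towards each utterance. A minimal sketch of that adaptation step, using scikit-learn's GaussianMixture, is shown below; the relevance factor, the diagonal covariances and the synthetic data are assumptions, and the gender-dependent UBMs used in the paper are omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def map_adapt_means(ubm, X, relevance=16.0):
    """Classic MAP adaptation of UBM means towards utterance data X
    (only the means are adapted, as is common in UBM-MAP systems)."""
    post = ubm.predict_proba(X)                   # (n_frames, n_components)
    n_k = post.sum(axis=0)                        # soft counts per component
    first = post.T @ X                            # first-order statistics
    alpha = (n_k / (n_k + relevance))[:, None]
    x_bar = first / np.maximum(n_k, 1e-8)[:, None]
    return alpha * x_bar + (1.0 - alpha) * ubm.means_

# toy data: "background" frames and one emotional utterance (13-d features)
bg = rng.normal(size=(2000, 13))
utt = rng.normal(loc=0.3, size=(200, 13))

# in the paper the UBM would be gender-dependent; here a single UBM is used
ubm = GaussianMixture(n_components=8, covariance_type='diag', random_state=0).fit(bg)
adapted_means = map_adapt_means(ubm, utt)
print(adapted_means.shape)                        # (8, 13)
```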
2010
|
Gajšek, Rok; Štruc, Vitomir; Mihelič, France Multi-modal Emotion Recognition using Canonical Correlations and Acoustic Features Proceedings Article In: Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 4133-4136, IAPR, Istanbul, Turkey, 2010. @inproceedings{ICPR_Gajsek_2010,
title = {Multi-modal Emotion Recognition using Canonical Correlations and Acoustic Features},
author = {Rok Gajšek and Vitomir Štruc and France Mihelič},
url = {https://lmi.fe.uni-lj.si/en/multi-modalemotionrecognitionusingcanonicalcorrelationsandacusticfeatures/},
year = {2010},
date = {2010-01-01},
urldate = {2010-01-01},
booktitle = {Proceedings of the International Conference on Pattern Recognition (ICPR)},
pages = {4133-4136},
address = {Istanbul, Turkey},
organization = {IAPR},
abstract = {Information about the psycho-physical state of the subject is becoming a valuable addition to modern audio or video recognition systems. As well as enabling a better user experience, it can also improve the recognition accuracy of the base system. In this article, we present our approach to a multi-modal (audio-video) emotion recognition system. For the audio sub-system, a feature set comprising prosodic, spectral and cepstral features is selected, and a support vector classifier is used to produce scores for each emotional category. For the video sub-system, a novel approach is presented that does not rely on tracking specific facial landmarks and thus eliminates the problems usually caused when the tracking algorithm fails to detect the correct area. The system is evaluated on the eNTERFACE database and the recognition accuracy of our audio-video fusion is compared to published results in the literature.},
keywords = {acustic features, canonical correlations, emotion recognition, facial expression recognition, multi modality, speech processing, speech technologies},
pubstate = {published},
tppubtype = {inproceedings}
}
Information about the psycho-physical state of the subject is becoming a valuable addition to modern audio or video recognition systems. As well as enabling a better user experience, it can also improve the recognition accuracy of the base system. In this article, we present our approach to a multi-modal (audio-video) emotion recognition system. For the audio sub-system, a feature set comprising prosodic, spectral and cepstral features is selected, and a support vector classifier is used to produce scores for each emotional category. For the video sub-system, a novel approach is presented that does not rely on tracking specific facial landmarks and thus eliminates the problems usually caused when the tracking algorithm fails to detect the correct area. The system is evaluated on the eNTERFACE database and the recognition accuracy of our audio-video fusion is compared to published results in the literature. |
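The video sub-system above compares whole image sets rather than tracked landmarks, and canonical correlations are a standard way to score the similarity of two such sets: they are the singular values of U_a^T U_b, where U_a and U_b are orthonormal bases of the subspaces spanned by the two sets. The sketch below illustrates that computation on synthetic data; the subspace dimension and the mean-centring are assumptions, not the exact procedure from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def subspace_basis(image_set, dim=5):
    """Orthonormal basis of the linear subspace spanned by a set of
    vectorised face images (rows), via SVD of the mean-centred data."""
    X = image_set - image_set.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:dim].T                            # (n_pixels, dim)

def canonical_correlations(set_a, set_b, dim=5):
    """Canonical correlations between two image sets = singular values of
    U_a^T U_b, where U_a and U_b are orthonormal subspace bases."""
    Ua, Ub = subspace_basis(set_a, dim), subspace_basis(set_b, dim)
    return np.linalg.svd(Ua.T @ Ub, compute_uv=False)

# toy "image sets": 30 frames of 400-pixel faces for two video sequences
seq1 = rng.normal(size=(30, 400))
seq2 = seq1 + 0.1 * rng.normal(size=(30, 400))   # similar sequence
similarity = canonical_correlations(seq1, seq2).mean()
print(round(similarity, 3))
```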
Gajšek, Rok; Štruc, Vitomir; Mihelič, France Multi-modal Emotion Recognition based on the Decoupling of Emotion and Speaker Information Proceedings Article In: Proceedings of Text, Speech and Dialogue (TSD), pp. 275-282, Springer-Verlag, Berlin, Heidelberg, 2010. @inproceedings{TSD_Emo_Gajsek,
title = {Multi-modal Emotion Recognition based on the Decoupling of Emotion and Speaker Information},
author = {Rok Gajšek and Vitomir Štruc and France Mihelič},
url = {https://lmi.fe.uni-lj.si/en/multi-modalemotionrecognitionbasedonthedecouplingofemotionandspeakerinformation/},
year = {2010},
date = {2010-01-01},
urldate = {2010-01-01},
booktitle = {Proceedings of Text, Speech and Dialogue (TSD)},
volume = {6231/2010},
pages = {275-282},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
series = {Lecture Notes on Computer Science},
abstract = {The standard features used in emotion recognition carry, besides the emotion-related information, also cues about the speaker. This is expected, since the nature of emotionally colored speech is similar to the variations in the speech signal caused by different speakers. Therefore, we present a gradient-descent-derived transformation for decoupling the emotion and speaker information contained in the acoustic features. The Interspeech ’09 Emotion Challenge feature set is used as the baseline for the audio part. A similar procedure is employed on the video signal, where nuisance attribute projection (NAP) is used to derive the transformation matrix, which contains information about the emotional state of the speaker. Ultimately, different NAP transformation matrices are compared using canonical correlations. The audio and video sub-systems are combined at the matching-score level using different fusion techniques. The presented system is assessed on the publicly available eNTERFACE’05 database, where significant improvements in recognition performance are observed when compared to the state-of-the-art baseline.},
keywords = {emotion recognition, facial expression recognition, multi modality, speech processing, speech technologies, spontaneous emotions, video processing},
pubstate = {published},
tppubtype = {inproceedings}
}
The standard features used in emotion recognition carry, besides the emotion-related information, also cues about the speaker. This is expected, since the nature of emotionally colored speech is similar to the variations in the speech signal caused by different speakers. Therefore, we present a gradient-descent-derived transformation for decoupling the emotion and speaker information contained in the acoustic features. The Interspeech ’09 Emotion Challenge feature set is used as the baseline for the audio part. A similar procedure is employed on the video signal, where nuisance attribute projection (NAP) is used to derive the transformation matrix, which contains information about the emotional state of the speaker. Ultimately, different NAP transformation matrices are compared using canonical correlations. The audio and video sub-systems are combined at the matching-score level using different fusion techniques. The presented system is assessed on the publicly available eNTERFACE’05 database, where significant improvements in recognition performance are observed when compared to the state-of-the-art baseline. |
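Nuisance attribute projection (NAP), mentioned above for the video signal, removes the dominant directions of unwanted (here speaker-related) variability by projecting features onto the orthogonal complement of those directions. The sketch below shows one common way to estimate that projection from within-speaker deviations; the number of nuisance directions and the scatter estimate are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def nap_projection(features, speaker_ids, n_nuisance=2):
    """Nuisance attribute projection: remove the dominant directions of
    within-speaker variability so that speaker cues are suppressed and
    emotion-related variation is (hopefully) preserved."""
    X = np.asarray(features, dtype=float)
    ids = np.asarray(speaker_ids)
    # within-speaker scatter: deviations of each vector from its speaker mean
    devs = []
    for spk in np.unique(ids):
        Xs = X[ids == spk]
        devs.append(Xs - Xs.mean(axis=0))
    D = np.vstack(devs)
    # top right-singular vectors of the nuisance deviations
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    V = Vt[:n_nuisance].T                        # (dim, n_nuisance)
    P = np.eye(X.shape[1]) - V @ V.T             # projection that removes them
    return X @ P, P

# toy data: 40 feature vectors (dim 20) from 4 speakers
feats = rng.normal(size=(40, 20))
speakers = np.repeat(np.arange(4), 10)
cleaned, P = nap_projection(feats, speakers, n_nuisance=2)
print(cleaned.shape, np.allclose(P, P.T))
```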
2009
|
Gajšek, Rok; Štruc, Vitomir; Dobrišek, Simon; Mihelič, France Emotion recognition using linear transformations in combination with video Proceedings Article In: Speech and intelligence: proceedings of Interspeech 2009, pp. 1967-1970, Brighton, UK, 2009. @inproceedings{InterSp2009,
title = {Emotion recognition using linear transformations in combination with video},
author = {Rok Gajšek and Vitomir Štruc and Simon Dobrišek and France Mihelič},
url = {https://lmi.fe.uni-lj.si/en/emotionrecognitionusinglineartransformationsincombinationwithvideo/},
year = {2009},
date = {2009-09-01},
urldate = {2009-09-01},
booktitle = {Speech and intelligence: proceedings of Interspeech 2009},
pages = {1967-1970},
address = {Brighton, UK},
abstract = {The paper discusses the use of linear transformations of Hidden Markov Models, normally employed for speaker and environment adaptation, as a way of extracting the emotional components from speech. A constrained version of the Maximum Likelihood Linear Regression (CMLLR) transformation is used as a feature for the classification of a normal or aroused emotional state. We present a procedure for incrementally building a set of speaker-independent acoustic models that are used to estimate the CMLLR transformations for emotion classification. An audio-video database of spontaneous emotions (AvID) is briefly presented, since it forms the basis for the evaluation of the proposed method. Emotion classification using the video part of the database is also described, and the added value of combining the visual information with the audio features is shown.},
keywords = {emotion recognition, facial expression recognition, interspeech, speech, speech technologies, spontaneous emotions},
pubstate = {published},
tppubtype = {inproceedings}
}
The paper discusses the use of linear transformations of Hidden Markov Models, normally employed for speaker and environment adaptation, as a way of extracting the emotional components from speech. A constrained version of the Maximum Likelihood Linear Regression (CMLLR) transformation is used as a feature for the classification of a normal or aroused emotional state. We present a procedure for incrementally building a set of speaker-independent acoustic models that are used to estimate the CMLLR transformations for emotion classification. An audio-video database of spontaneous emotions (AvID) is briefly presented, since it forms the basis for the evaluation of the proposed method. Emotion classification using the video part of the database is also described, and the added value of combining the visual information with the audio features is shown. |
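A CMLLR transform is an affine map of the observation vectors, o_hat = A o + b, estimated per recording against a set of acoustic models; the idea above is to use the estimated transform parameters themselves as a fixed-length feature for emotion classification. The toy sketch below only illustrates that last step (vectorising (A, b) and feeding it to an SVM) on synthetic transforms; in practice A and b would be estimated with an HMM toolkit, and all dimensions and data here are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def cmllr_to_feature(A, b):
    """Flatten an estimated CMLLR affine transform (o_hat = A @ o + b)
    into a fixed-length feature vector."""
    return np.concatenate([A.ravel(), b.ravel()])

# synthetic, purely illustrative "transforms" for two emotional states
dim = 13
def fake_transform(shift):
    A = np.eye(dim) + 0.01 * rng.normal(size=(dim, dim))
    b = shift + 0.01 * rng.normal(size=dim)
    return cmllr_to_feature(A, b)

X = np.stack([fake_transform(0.0) for _ in range(20)] +
             [fake_transform(0.5) for _ in range(20)])
y = np.array([0] * 20 + [1] * 20)               # normal vs. aroused

clf = SVC(kernel='linear').fit(X, y)
print(clf.score(X, y))                           # training accuracy on toy data
```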
Gajšek, Rok; Štruc, Vitomir; Mihelič, France; Podlesek, Anja; Komidar, Luka; Sočan, Gregor; Bajec, Boštjan Multi-modal emotional database: AvID Journal Article In: Informatica (Ljubljana), vol. 33, no. 1, pp. 101-106, 2009. @article{Inform-Gajsek_2009,
title = {Multi-modal emotional database: AvID},
author = {Rok Gajšek and Vitomir Štruc and France Mihelič and Anja Podlesek and Luka Komidar and Gregor Sočan and Boštjan Bajec},
url = {https://lmi.fe.uni-lj.si/en/multi-modalemotionaldatabaseavid/},
year = {2009},
date = {2009-01-01},
urldate = {2009-01-01},
journal = {Informatica (Ljubljana)},
volume = {33},
number = {1},
pages = {101-106},
abstract = {This paper presents our work on recording a multi-modal database containing emotional audio and video recordings. In designing the recording strategies, special attention was paid to gathering data involving spontaneous emotions and therefore obtaining more realistic training and testing conditions for experiments. With specially planned scenarios, including playing computer games and conducting an adaptive intelligence test, different levels of arousal were induced. This will enable us both to detect different emotional states and to experiment with speaker identification/verification of the people involved in the communication. So far, the multi-modal database has been recorded and a basic evaluation of the data has been carried out.},
keywords = {avid, database, dataset, emotion recognition, facial expression recognition, speech, speech technologies, spontaneous emotions},
pubstate = {published},
tppubtype = {article}
}
This paper presents our work on recording a multi-modal database containing emotional audio and video recordings. In designing the recording strategies, special attention was paid to gathering data involving spontaneous emotions and therefore obtaining more realistic training and testing conditions for experiments. With specially planned scenarios, including playing computer games and conducting an adaptive intelligence test, different levels of arousal were induced. This will enable us both to detect different emotional states and to experiment with speaker identification/verification of the people involved in the communication. So far, the multi-modal database has been recorded and a basic evaluation of the data has been carried out. |
Gajšek, Rok; Štruc, Vitomir; Dobrišek, Simon; Žibert, Janez; Mihelič, France; Pavešić, Nikola Combining audio and video for detection of spontaneous emotions Proceedings Article In: Biometric ID management and multimodal communication, pp. 114-121, Springer-Verlag, Berlin, Heidelberg, 2009. @inproceedings{BioID_Multi2009b,
title = {Combining audio and video for detection of spontaneous emotions},
author = {Rok Gajšek and Vitomir Štruc and Simon Dobrišek and Janez Žibert and France Mihelič and Nikola Pavešić},
url = {https://lmi.fe.uni-lj.si/en/combiningaudioandvideofordetectionofspontaneousemotions/},
year = {2009},
date = {2009-01-01},
urldate = {2009-01-01},
booktitle = {Biometric ID management and multimodal communication},
volume = {5707},
pages = {114-121},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
series = {Lecture Notes on Computer Science},
abstract = {The paper presents our initial attempts at building an audio-video emotion recognition system. Both the audio and video sub-systems are discussed, and a description of the database of spontaneous emotions is given. The task of labelling the recordings from the database according to different emotions is discussed, and the measured agreement between multiple annotators is presented. Instead of focusing on prosody in audio emotion recognition, we evaluate the possibility of using linear transformations (CMLLR) as features. The classification results from the audio and video sub-systems are combined using sum-rule fusion, and the improvement in recognition results when using both modalities is presented.},
keywords = {emotion recognition, facial expression recognition, performance evaluation, speech processing, speech technologies},
pubstate = {published},
tppubtype = {inproceedings}
}
The paper presents our initial attempts at building an audio-video emotion recognition system. Both the audio and video sub-systems are discussed, and a description of the database of spontaneous emotions is given. The task of labelling the recordings from the database according to different emotions is discussed, and the measured agreement between multiple annotators is presented. Instead of focusing on prosody in audio emotion recognition, we evaluate the possibility of using linear transformations (CMLLR) as features. The classification results from the audio and video sub-systems are combined using sum-rule fusion, and the improvement in recognition results when using both modalities is presented. |
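The sum-rule fusion mentioned above simply averages the per-class scores of the two modalities after bringing them onto a comparable scale. A minimal sketch follows; min-max normalisation is an assumption, since the abstract does not state which score normalisation was used.

```python
import numpy as np

def minmax(scores):
    """Map per-class scores to [0, 1] so the two modalities are comparable."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-8)

def sum_rule_fusion(audio_scores, video_scores):
    """Sum-rule fusion: normalise each modality's scores and average them."""
    return (minmax(audio_scores) + minmax(video_scores)) / 2.0

emotions = ['anger', 'joy', 'sadness', 'surprise']
audio = [0.1, 2.3, 0.4, 1.1]       # e.g. classifier scores from the audio sub-system
video = [0.7, 0.6, 0.1, 0.2]       # e.g. matching scores from the video sub-system
fused = sum_rule_fusion(audio, video)
print(emotions[int(np.argmax(fused))])
```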
Gajšek, Rok; Štruc, Vitomir; Vesnicer, Boštjan; Podlesek, Anja; Komidar, Luka; Mihelič, France Analysis and assessment of AvID: multi-modal emotional database Proceedings Article In: Text, speech and dialogue / 12th International Conference, pp. 266-273, Springer-Verlag, Berlin, Heidelberg, 2009. @inproceedings{TSD2009,
title = {Analysis and assessment of AvID: multi-modal emotional database},
author = {Rok Gajšek and Vitomir Štruc and Boštjan Vesnicer and Anja Podlesek and Luka Komidar and France Mihelič},
url = {https://lmi.fe.uni-lj.si/en/analysisandassessmentofavidmulti-modalemotionaldatabase/},
year = {2009},
date = {2009-01-01},
urldate = {2009-01-01},
booktitle = {Text, speech and dialogue / 12th International Conference},
volume = {5729},
pages = {266-273},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
series = {Lecture Notes on Computer Science},
abstract = {The paper deals with the recording and evaluation of a multi-modal (audio/video) database of spontaneous emotions. Firstly, the motivation for this work is given and the different recording strategies used are described. Special attention is given to the process of evaluating the emotional database. Different kappa statistics normally used for measuring the agreement between annotators are discussed. Owing to the problems of standard kappa coefficients when used for emotional database assessment, a new time-weighted free-marginal kappa is presented. It differs from the other kappa statistics in that it weights each utterance's agreement score by the duration of the utterance. The new method is evaluated and its superiority over the standard kappa, when dealing with a database of spontaneous emotions, is demonstrated.},
keywords = {avid database, database, emotion recognition, multimodal database, speech, speech technologies},
pubstate = {published},
tppubtype = {inproceedings}
}
The paper deals with the recording and evaluation of a multi-modal (audio/video) database of spontaneous emotions. Firstly, the motivation for this work is given and the different recording strategies used are described. Special attention is given to the process of evaluating the emotional database. Different kappa statistics normally used for measuring the agreement between annotators are discussed. Owing to the problems of standard kappa coefficients when used for emotional database assessment, a new time-weighted free-marginal kappa is presented. It differs from the other kappa statistics in that it weights each utterance's agreement score by the duration of the utterance. The new method is evaluated and its superiority over the standard kappa, when dealing with a database of spontaneous emotions, is demonstrated. |
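Randolph's free-marginal kappa compares the observed agreement P_o with the chance agreement 1/k for k categories, kappa = (P_o - 1/k) / (1 - 1/k); the time-weighted variant described above weights each utterance's agreement by its duration when forming P_o. The sketch below follows that reading; the exact weighting scheme in the paper may differ, so treat this as an assumed illustration.

```python
import numpy as np

def time_weighted_free_marginal_kappa(labels, durations, n_categories):
    """Free-marginal multi-rater kappa in which each utterance's agreement
    is weighted by its duration (an assumption about the exact scheme).

    labels:    (n_utterances, n_annotators) integer emotion labels
    durations: (n_utterances,) utterance durations in seconds
    """
    labels = np.asarray(labels)
    durations = np.asarray(durations, dtype=float)
    n_items, n_raters = labels.shape

    # per-utterance observed agreement (fraction of agreeing rater pairs)
    agree = np.empty(n_items)
    for i in range(n_items):
        counts = np.bincount(labels[i], minlength=n_categories)
        agree[i] = (counts * (counts - 1)).sum() / (n_raters * (n_raters - 1))

    p_obs = np.average(agree, weights=durations)   # time-weighted agreement
    p_exp = 1.0 / n_categories                     # free-marginal chance agreement
    return (p_obs - p_exp) / (1.0 - p_exp)

# toy example: 4 utterances, 3 annotators, 3 emotion categories
labels = [[0, 0, 0], [1, 1, 2], [2, 2, 2], [0, 1, 2]]
durations = [4.0, 1.5, 3.0, 0.5]                   # longer utterances count more
print(round(time_weighted_free_marginal_kappa(labels, durations, 3), 3))
```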
2008
|
Gajšek, Rok; Podlesek, Anja; Komidar, Luka; Sočan, Gregor; Bajec, Boštjan; Štruc, Vitomir; Bucik, Valentin; Mihelič, France AvID: audio-video emotional database Proceedings Article In: Proceedings of the 11th International Multi-conference Information Society (IS'08), pp. 70-74, Ljubljana, Slovenia, 2008. @inproceedings{JJ2008,
title = {AvID: audio-video emotional database},
author = {Rok Gajšek and Anja Podlesek and Luka Komidar and Gregor Sočan and Boštjan Bajec and Vitomir Štruc and Valentin Bucik and France Mihelič},
year = {2008},
date = {2008-01-01},
booktitle = {Proceedings of the 11th International Multi-conference Information Society (IS'08)},
volume = {C},
pages = {70-74},
address = {Ljubljana, Slovenia},
keywords = {database, dataset, emotion recognition, facial expression recognition, multimodal database, speech technology, spontaneous emotions},
pubstate = {published},
tppubtype = {inproceedings}
}
|