Publications – Laboratory for Machine Intelligence

Gan, Chenquan; Zhou, Daitao; Wang, Kexin; Zhu, Qingyi; Jain, Deepak Kumar; Štruc, Vitomir

Optimizing ambiguous speech emotion recognition through spatial–temporal parallel network with label correction strategy Journal Article

In: Computer Vision and Image Understanding, vol. 260, no. 104483, pp. 1–14, 2025.

Tags: deep learning, emotion recognition, speech, speech processing, speech technologies

@article{CVIU_2025b,

title = {Optimizing ambiguous speech emotion recognition through spatial–temporal parallel network with label correction strategy},

author = {Chenquan Gan and Daitao Zhou and Kexin Wang and Qingyi Zhu and Deepak Kumar Jain and Vitomir Štruc},

url = {https://www.sciencedirect.com/science/article/pii/S1077314225002061?dgcid=coauthor

https://lmi.fe.uni-lj.si/wp-content/uploads/2025/09/CVIU.pdf},

doi = {10.1016/j.cviu.2025.104483},

year  = {2025},

date = {2025-10-01},

urldate = {2025-10-01},

journal = {Computer Vision and Image Understanding},

volume = {260},

number = {104483},

pages = {1--14},

abstract = {Speech emotion recognition is of great significance for improving the human–computer interaction experience. However, traditional methods based on hard labels have difficulty dealing with the ambiguity of emotional expression. Existing studies alleviate this problem by redefining labels, but still rely on the subjective emotional expression of annotators and fail to consider the truly ambiguous speech samples without dominant labels fully. To solve the problems of insufficient expression of emotional labels and ignoring ambiguous undominantly labeled speech samples, we propose a label correction strategy that uses a model with exact sample knowledge to modify inappropriate labels for ambiguous speech samples, integrating model training with emotion cognition, and considering the ambiguity without dominant label samples. It is implemented on a spatial–temporal parallel network, which adopts a temporal pyramid pooling (TPP) to process the variable-length features of speech to improve the recognition efficiency of speech emotion. Through experiments, it has been shown that ambiguous speech after label correction has a more promoting effect on the recognition performance of speech emotions.},

keywords = {deep learning, emotion recognition, speech, speech processing, speech technologies},

pubstate = {published},

tppubtype = {article}

}

Dobrišek, Simon; Čefarin, David; Štruc, Vitomir; Mihelič, France

Assessment of the Google Speech Application Programming Interface for Automatic Slovenian Speech Recognition Proceedings Article

In: Jezikovne Tehnologije in Digitalna Humanistika, 2016.

Tags: Google, performance evaluation, speech API, speech technologies

Justin, Tadej; Štruc, Vitomir; Dobrišek, Simon; Vesnicer, Boštjan; Ipšić, Ivo; Mihelič, France

Speaker de-identification using diphone recognition and speech synthesis Proceedings Article

In: 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (IEEE FG): DeID 2015, pp. 1–7, IEEE 2015.

Tags: DEID, FG, speech deidentification, speech recognition, speech synthesis, speech technologies

@inproceedings{justin2015speaker,

title = {Speaker de-identification using diphone recognition and speech synthesis},

author = {Tadej Justin and Vitomir Štruc and Simon Dobrišek and Boštjan Vesnicer and Ivo Ipšić and France Mihelič},

url = {https://lmi.fe.uni-lj.si/en/speakerde-identificationusingdiphonerecognitionandspeechsynthesis/},

year  = {2015},

date = {2015-01-01},

urldate = {2015-01-01},

booktitle = {11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (IEEE FG): DeID 2015},

volume = {4},

pages = {1--7},

organization = {IEEE},

abstract = {The paper addresses the problem of speaker (or voice) de-identification by presenting a novel approach for concealing the identity of speakers in their speech. The proposed technique first recognizes the input speech with a diphone recognition system and then transforms the obtained phonetic transcription into the speech of another speaker with a speech synthesis system. Due to the fact that a Diphone RecOgnition step and a sPeech SYnthesis step are used during the deidentification, we refer to the developed technique as DROPSY. With this approach the acoustical models of the recognition and synthesis modules are completely independent from each other, which ensures the highest level of input speaker deidentification. The proposed DROPSY-based de-identification approach is language dependent, text independent and capable of running in real-time due to the relatively simple computing methods used. When designing speaker de-identification technology two requirements are typically imposed on the deidentification techniques: i) it should not be possible to establish the identity of the speakers based on the de-identified speech, and ii) the processed speech should still sound natural and be intelligible. This paper, therefore, implements the proposed DROPSY-based approach with two different speech synthesis techniques (i.e., with the HMM-based and the diphone TDPSOLA-based technique). The obtained de-identified speech is evaluated for intelligibility and evaluated in speaker verification experiments with a state-of-the-art (i-vector/PLDA) speaker recognition system. The comparison of both speech synthesis modules integrated in the proposed method reveals that both can efficiently de-identify the input speakers while still producing intelligible speech.},

keywords = {DEID, FG, speech deidentification, speech recognition, speech synthesis, speech technologies},

pubstate = {published},

tppubtype = {inproceedings}

}

Justin, Tadej; Štruc, Vitomir; Žibert, Janez; Mihelič, France

Development and Evaluation of the Emotional Slovenian Speech Database-EmoLUKS Proceedings Article

In: Proceedings of the International Conference on Text, Speech, and Dialogue (TSD), pp. 351–359, Springer 2015.

Tags: annotated data, dataset, dataset of emotional speech, EmoLUKS, emotional speech synthesis, speech synthesis, speech technologies, transcriptions

Vesnicer, Boštjan; Žganec-Gros, Jerneja; Dobrišek, Simon; Štruc, Vitomir

Incorporating Duration Information into I-Vector-Based Speaker-Recognition Systems Proceedings Article

In: Proceedings of Odyssey: The Speaker and Language Recognition Workshop, pp. 241–248, 2014.

Tags: acustic features, biometrics, duration, duration modeling, i-vector, i-vector challenge, Odyssey, performance evaluation, speaker recognition, speech technologies

@inproceedings{vesnicer2014incorporating,

title = {Incorporating Duration Information into I-Vector-Based Speaker-Recognition Systems},

author = {Boštjan Vesnicer and Jerneja Žganec-Gros and Simon Dobrišek and Vitomir Štruc},

url = {https://lmi.fe.uni-lj.si/en/incorporatingdurationinformationintoi-vector-basedspeaker-recognitionsystems/},

year  = {2014},

date = {2014-01-01},

urldate = {2014-01-01},

booktitle = {Proceedings of Odyssey: The Speaker and Language Recognition Workshop},

pages = {241--248},

abstract = {Most of the existing literature on i-vector-based speaker recognition focuses on recognition problems, where i-vectors are extracted from speech recordings of sufficient length. The majority of modeling/recognition techniques therefore simply ignores the fact that the i-vectors are most likely estimated unreliably when short recordings are used for their computation. Only recently were a number of solutions proposed in the literature to address the problem of duration variability, all treating the i-vector as a random variable whose posterior distribution can be parameterized by the posterior mean and the posterior covariance. In this setting the covariance matrix serves as a measure of uncertainty that is related to the length of the available recording. In contrast to these solutions, we address the problem of duration variability through weighted statistics. We demonstrate in the paper how established feature transformation techniques regularly used in the area of speaker recognition, such as PCA or WCCN, can be modified to take duration into account. We evaluate our weighting scheme in the scope of the i-vector challenge organized as part of the Odyssey, Speaker and Language Recognition Workshop 2014 and achieve a minimal DCF of 0.280, which at the time of writing puts our approach in third place among all the participating institutions.},

keywords = {acustic features, biometrics, duration, duration modeling, i-vector, i-vector challenge, Odyssey, performance evaluation, speaker recognition, speech technologies},

pubstate = {published},

tppubtype = {inproceedings}

}

Dobrišek, Simon; Gajšek, Rok; Mihelič, France; Pavešić, Nikola; Štruc, Vitomir

Towards efficient multi-modal emotion recognition Journal Article

In: International Journal of Advanced Robotic Systems, vol. 10, no. 53, 2013.

Tags: avid database, emotion recognition, facial expression recognition, multi modality, speech technologies

@article{dobrivsek2013towards,

title = {Towards efficient multi-modal emotion recognition},

author = {Simon Dobrišek and Rok Gajšek and France Mihelič and Nikola Pavešić and Vitomir Štruc},

url = {https://lmi.fe.uni-lj.si/en/towardsefficientmulti-modalemotionrecognition/},

doi = {10.5772/54002},

year  = {2013},

date = {2013-01-01},

urldate = {2013-01-01},

journal = {International Journal of Advanced Robotic Systems},

volume = {10},

number = {53},

abstract = {The paper presents a multi-modal emotion recognition system exploiting audio and video (i.e., facial expression) information. The system first processes both sources of information individually to produce corresponding matching scores and then combines the computed matching scores to obtain a classification decision. For the video part of the system, a novel approach to emotion recognition, relying on image-set matching, is developed. The proposed approach avoids the need for detecting and tracking specific facial landmarks throughout the given video sequence, which represents a common source of error in video-based emotion recognition systems, and, therefore, adds robustness to the video processing chain. The audio part of the system, on the other hand, relies on utterance-specific Gaussian Mixture Models (GMMs) adapted from a Universal Background Model (UBM) via the maximum a posteriori probability (MAP) estimation. It improves upon the standard UBM-MAP procedure by exploiting gender information when building the utterance-specific GMMs, thus ensuring enhanced emotion recognition performance. Both the uni-modal parts as well as the combined system are assessed on the challenging multi-modal eNTERFACE'05 corpus with highly encouraging results. The developed system represents a feasible solution to emotion recognition that can easily be integrated into various systems, such as humanoid robots, smart surveillance systems and alike.},

keywords = {avid database, emotion recognition, facial expression recognition, multi modality, speech technologies},

pubstate = {published},

tppubtype = {article}

}

Gajšek, Rok; Štruc, Vitomir; Mihelič, France

Multi-modal Emotion Recognition using Canonical Correlations and Acoustic Features Proceedings Article

In: Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 4133-4136, IAPR Istanbul, Turkey, 2010.

Tags: acustic features, canonical correlations, emotion recognition, facial expression recognition, multi modality, speech processing, speech technologies

Gajšek, Rok; Štruc, Vitomir; Mihelič, France

Multi-modal Emotion Recognition based on the Decoupling of Emotion and Speaker Information Proceedings Article

In: Proceedings of Text, Speech and Dialogue (TSD), pp. 275-282, Springer-Verlag, Berlin, Heidelberg, 2010.

Tags: emotion recognition, facial expression recognition, multi modality, speech processing, speech technologies, spontaneous emotions, video processing

Gajšek, Rok; Štruc, Vitomir; Dobrišek, Simon; Mihelič, France

Emotion recognition using linear transformations in combination with video Proceedings Article

In: Speech and intelligence: proceedings of Interspeech 2009, pp. 1967-1970, Brighton, UK, 2009.

Tags: emotion recognition, facial expression recognition, interspeech, speech, speech technologies, spontaneous emotions

Gajšek, Rok; Štruc, Vitomir; Mihelič, France; Podlesek, Anja; Komidar, Luka; Sočan, Gregor; Bajec, Boštjan

Multi-modal emotional database: AvID Journal Article

In: Informatica (Ljubljana), vol. 33, no. 1, pp. 101-106, 2009.

Tags: avid, database, dataset, emotion recognition, facial expression recognition, speech, speech technologies, spontaneous emotions

Gajšek, Rok; Štruc, Vitomir; Dobrišek, Simon; Žibert, Janez; Mihelič, France; Pavešić, Nikola

Combining audio and video for detection of spontaneous emotions Proceedings Article

In: Biometric ID management and multimodal communication, pp. 114-121, Springer-Verlag, Berlin, Heidelberg, 2009.

Tags: emotion recognition, facial expression recognition, performance evaluation, speech processing, speech technologies

Gajšek, Rok; Štruc, Vitomir; Vesnicer, Boštjan; Podlesek, Anja; Komidar, Luka; Mihelič, France

Analysis and assessment of AvID: multi-modal emotional database Proceedings Article

In: Text, speech and dialogue / 12th International Conference, pp. 266-273, Springer-Verlag, Berlin, Heidelberg, 2009.

Tags: avid database, database, emotion recognition, multimodal database, speech, speech technologies
