2024
|
Gan, Chenquan; Zheng, Jiahao; Zhu, Qingyi; Jain, Deepak Kumar; Štruc, Vitomir A graph neural network with context filtering and feature correction for conversational emotion recognition Journal Article In: Information Sciences, vol. 658, no. 120017, pp. 1-21, 2024. @article{InformSciences2024,
title = {A graph neural network with context filtering and feature correction for conversational emotion recognition},
author = {Chenquan Gan and Jiahao Zheng and Qingyi Zhu and Deepak Kumar Jain and Vitomir Štruc},
url = {https://www.sciencedirect.com/science/article/pii/S002002552301602X?via%3Dihub
https://lmi.fe.uni-lj.si/wp-content/uploads/2023/12/InformationSciences.pdf},
doi = {10.1016/j.ins.2023.120017},
year = {2024},
date = {2024-02-01},
journal = {Information Sciences},
volume = {658},
number = {120017},
pages = {1-21},
abstract = {Conversational emotion recognition represents an important machine-learning problem with a wide variety of deployment possibilities. The key challenge in this area is how to properly capture the key conversational aspects that facilitate reliable emotion recognition, including utterance semantics, temporal order, informative contextual cues, speaker interactions as well as other relevant factors. In this paper, we present a novel Graph Neural Network approach for conversational emotion recognition at the utterance level. Our method addresses the outlined challenges and represents conversations in the form of graph structures that naturally encode temporal order, speaker dependencies, and even long-distance context. To efficiently capture the semantic content of the conversations, we leverage the zero-shot feature-extraction capabilities of pre-trained large-scale language models and then integrate two key contributions into the graph neural network to ensure competitive recognition results. The first is a novel context filter that establishes meaningful utterance dependencies for the graph construction procedure and removes low-relevance and uninformative utterances from being used as a source of contextual information for the recognition task. The second contribution is a feature-correction procedure that adjusts the information content in the generated feature representations through a gating mechanism to improve their discriminative power and reduce emotion-prediction errors. We conduct extensive experiments on four commonly used conversational datasets, i.e., IEMOCAP, MELD, Dailydialog, and EmoryNLP, to demonstrate the capabilities of the developed graph neural network with context filtering and error-correction capabilities. The results of the experiments point to highly promising performance, especially when compared to state-of-the-art competitors from the literature.},
keywords = {context filtering, conversations, dialogue, emotion recognition, graph neural network, sentiment analysis},
pubstate = {published},
tppubtype = {article}
}
Conversational emotion recognition represents an important machine-learning problem with a wide variety of deployment possibilities. The key challenge in this area is how to properly capture the key conversational aspects that facilitate reliable emotion recognition, including utterance semantics, temporal order, informative contextual cues, speaker interactions as well as other relevant factors. In this paper, we present a novel Graph Neural Network approach for conversational emotion recognition at the utterance level. Our method addresses the outlined challenges and represents conversations in the form of graph structures that naturally encode temporal order, speaker dependencies, and even long-distance context. To efficiently capture the semantic content of the conversations, we leverage the zero-shot feature-extraction capabilities of pre-trained large-scale language models and then integrate two key contributions into the graph neural network to ensure competitive recognition results. The first is a novel context filter that establishes meaningful utterance dependencies for the graph construction procedure and removes low-relevance and uninformative utterances from being used as a source of contextual information for the recognition task. The second contribution is a feature-correction procedure that adjusts the information content in the generated feature representations through a gating mechanism to improve their discriminative power and reduce emotion-prediction errors. We conduct extensive experiments on four commonly used conversational datasets, i.e., IEMOCAP, MELD, Dailydialog, and EmoryNLP, to demonstrate the capabilities of the developed graph neural network with context filtering and error-correction capabilities. The results of the experiments point to highly promising performance, especially when compared to state-of-the-art competitors from the literature. |
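The abstract above describes two concrete mechanisms: a context filter that decides which past utterances may contribute edges during graph construction, and a gating-based correction applied to the node features. No code accompanies the entry here, so the sketch below is only a rough, assumed illustration of those two ideas (cosine-similarity thresholding as the filter, a sigmoid gate blending original and aggregated features); the function names, window size and threshold are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def build_filtered_graph(feats, window=4, tau=0.2):
    """Connect each utterance to past utterances in a context window,
    but drop low-relevance neighbours (the 'context filter' idea)."""
    n = len(feats)
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(max(0, i - window), i):
            if cosine(feats[i], feats[j]) > tau:   # keep only informative context
                adj[i, j] = adj[j, i] = 1.0
    return adj

def gated_correction(feats, adj, W):
    """One round of neighbourhood aggregation followed by a sigmoid gate
    that blends the original and aggregated features ('feature correction')."""
    deg = adj.sum(1, keepdims=True) + 1e-8
    agg = adj @ feats / deg                        # mean over retained neighbours
    gate = 1.0 / (1.0 + np.exp(-(feats @ W)))      # per-dimension gate in (0, 1)
    return gate * feats + (1.0 - gate) * agg

# toy dialogue: 6 utterances with 16-d (e.g. language-model) embeddings
feats = rng.normal(size=(6, 16))
adj = build_filtered_graph(feats)
corrected = gated_correction(feats, adj, W=rng.normal(size=(16, 16)) * 0.1)
print(adj.sum(), corrected.shape)
```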
2022
|
Gan, Chenquan; Yang, Yucheng; Zhu, Qingyi; Jain, Deepak Kumar; Štruc, Vitomir DHF-Net: A hierarchical feature interactive fusion network for dialogue emotion recognition Journal Article In: Expert Systems with Applications, vol. 210, 2022. @article{TextEmotionESWA,
title = {DHF-Net: A hierarchical feature interactive fusion network for dialogue emotion recognition},
author = {Chenquan Gan and Yucheng Yang and Qingyi Zhu and Deepak Kumar Jain and Vitomir Štruc},
url = {https://www.sciencedirect.com/science/article/pii/S0957417422016025?via%3Dihub},
doi = {10.1016/j.eswa.2022.118525},
year = {2022},
date = {2022-12-30},
urldate = {2022-08-01},
journal = {Expert Systems with Applications},
volume = {210},
abstract = {To balance the trade-off between contextual information and fine-grained information when identifying specific emotions during a dialogue, and to combine the interaction of hierarchical feature-related information, this paper proposes a hierarchical feature interactive fusion network (named DHF-Net), which not only retains the integrity of the context sequence information but also extracts more fine-grained information. To obtain deep semantic information, DHF-Net processes the tasks of recognizing dialogue emotion and dialogue act/intent separately, and then learns the cross-impact of the two tasks through collaborative attention. Also, a bidirectional gated recurrent unit (Bi-GRU) connected with a hybrid convolutional neural network (CNN) group is designed, by which the sequence information is smoothly passed to the multi-level local information layers for feature extraction. Experimental results show that, on two open session datasets, the performance of DHF-Net is improved by 1.8% and 1.2%, respectively.},
keywords = {attention, CNN, deep learning, dialogue, emotion recognition, fusion, fusion network, nlp, semantics, text, text processing},
pubstate = {published},
tppubtype = {article}
}
To balance the trade-off between contextual information and fine-grained information when identifying specific emotions during a dialogue, and to combine the interaction of hierarchical feature-related information, this paper proposes a hierarchical feature interactive fusion network (named DHF-Net), which not only retains the integrity of the context sequence information but also extracts more fine-grained information. To obtain deep semantic information, DHF-Net processes the tasks of recognizing dialogue emotion and dialogue act/intent separately, and then learns the cross-impact of the two tasks through collaborative attention. Also, a bidirectional gated recurrent unit (Bi-GRU) connected with a hybrid convolutional neural network (CNN) group is designed, by which the sequence information is smoothly passed to the multi-level local information layers for feature extraction. Experimental results show that, on two open session datasets, the performance of DHF-Net is improved by 1.8% and 1.2%, respectively. |
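As a rough illustration of the architecture sketched in the abstract (a Bi-GRU that preserves the context sequence, followed by a group of convolutions that extract finer-grained local features), a minimal PyTorch sketch is given below. Layer sizes, kernel widths and the pooling scheme are assumptions for illustration and do not reproduce the published DHF-Net, which additionally couples the emotion and dialogue-act tasks through collaborative attention.

```python
import torch
import torch.nn as nn

class BiGRUMultiScaleCNN(nn.Module):
    """Illustrative sketch: a Bi-GRU keeps the full context sequence, and a
    group of 1-D convolutions with different kernel sizes extracts
    finer-grained local features from the Bi-GRU outputs."""
    def __init__(self, in_dim=300, hid=128, n_classes=7):
        super().__init__()
        self.gru = nn.GRU(in_dim, hid, batch_first=True, bidirectional=True)
        # "hybrid CNN group": parallel convolutions over the GRU output sequence
        self.convs = nn.ModuleList(
            [nn.Conv1d(2 * hid, 64, kernel_size=k, padding=k // 2) for k in (1, 3, 5)]
        )
        self.cls = nn.Linear(3 * 64, n_classes)

    def forward(self, x):                      # x: (batch, seq_len, in_dim)
        seq, _ = self.gru(x)                   # (batch, seq_len, 2*hid)
        seq = seq.transpose(1, 2)              # Conv1d expects (batch, channels, seq_len)
        pooled = [conv(seq).max(dim=2).values for conv in self.convs]
        return self.cls(torch.cat(pooled, dim=1))

model = BiGRUMultiScaleCNN()
logits = model(torch.randn(2, 20, 300))        # two toy dialogues of 20 utterances
print(logits.shape)                            # torch.Size([2, 7])
```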
2013
|
Dobrišek, Simon; Gajšek, Rok; Mihelič, France; Pavešić, Nikola; Štruc, Vitomir Towards efficient multi-modal emotion recognition Journal Article In: International Journal of Advanced Robotic Systems, vol. 10, no. 53, 2013. @article{dobrivsek2013towards,
title = {Towards efficient multi-modal emotion recognition},
author = {Simon Dobrišek and Rok Gajšek and France Mihelič and Nikola Pavešić and Vitomir Štruc},
url = {https://lmi.fe.uni-lj.si/en/towardsefficientmulti-modalemotionrecognition/},
doi = {10.5772/54002},
year = {2013},
date = {2013-01-01},
urldate = {2013-01-01},
journal = {International Journal of Advanced Robotic Systems},
volume = {10},
number = {53},
abstract = {The paper presents a multi-modal emotion recognition system exploiting audio and video (i.e., facial expression) information. The system first processes both sources of information individually to produce corresponding matching scores and then combines the computed matching scores to obtain a classification decision. For the video part of the system, a novel approach to emotion recognition, relying on image-set matching, is developed. The proposed approach avoids the need for detecting and tracking specific facial landmarks throughout the given video sequence, which represents a common source of error in video-based emotion recognition systems, and, therefore, adds robustness to the video processing chain. The audio part of the system, on the other hand, relies on utterance-specific Gaussian Mixture Models (GMMs) adapted from a Universal Background Model (UBM) via maximum a posteriori probability (MAP) estimation. It improves upon the standard UBM-MAP procedure by exploiting gender information when building the utterance-specific GMMs, thus ensuring enhanced emotion recognition performance. Both the uni-modal parts and the combined system are assessed on the challenging multi-modal eNTERFACE'05 corpus with highly encouraging results. The developed system represents a feasible solution to emotion recognition that can easily be integrated into various systems, such as humanoid robots, smart surveillance systems and the like.},
keywords = {avid database, emotion recognition, facial expression recognition, multi modality, speech technologies},
pubstate = {published},
tppubtype = {article}
}
The paper presents a multi-modal emotion recognition system exploiting audio and video (i.e., facial expression) information. The system first processes both sources of information individually to produce corresponding matching scores and then combines the computed matching scores to obtain a classification decision. For the video part of the system, a novel approach to emotion recognition, relying on image-set matching, is developed. The proposed approach avoids the need for detecting and tracking specific facial landmarks throughout the given video sequence, which represents a common source of error in video-based emotion recognition systems, and, therefore, adds robustness to the video processing chain. The audio part of the system, on the other hand, relies on utterance-specific Gaussian Mixture Models (GMMs) adapted from a Universal Background Model (UBM) via maximum a posteriori probability (MAP) estimation. It improves upon the standard UBM-MAP procedure by exploiting gender information when building the utterance-specific GMMs, thus ensuring enhanced emotion recognition performance. Both the uni-modal parts and the combined system are assessed on the challenging multi-modal eNTERFACE'05 corpus with highly encouraging results. The developed system represents a feasible solution to emotion recognition that can easily be integrated into various systems, such as humanoid robots, smart surveillance systems and the like. |
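The audio part described above follows the classic UBM-MAP recipe: a Universal Background Model is trained on background speech and its means are then MAP-adapted towards each utterance. A minimal sketch of that adaptation step, using scikit-learn's GaussianMixture, is shown below; the relevance factor, the diagonal covariances and the synthetic data are assumptions, and the gender-dependent UBMs used in the paper are omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def map_adapt_means(ubm, X, relevance=16.0):
    """Classic MAP adaptation of UBM means towards utterance data X
    (only the means are adapted, as is common in UBM-MAP systems)."""
    post = ubm.predict_proba(X)                   # (n_frames, n_components)
    n_k = post.sum(axis=0)                        # soft counts per component
    first = post.T @ X                            # first-order statistics
    alpha = (n_k / (n_k + relevance))[:, None]
    x_bar = first / np.maximum(n_k, 1e-8)[:, None]
    return alpha * x_bar + (1.0 - alpha) * ubm.means_

# toy data: "background" frames and one emotional utterance (13-d features)
bg = rng.normal(size=(2000, 13))
utt = rng.normal(loc=0.3, size=(200, 13))

# in the paper the UBM would be gender-dependent; here a single UBM is used
ubm = GaussianMixture(n_components=8, covariance_type='diag', random_state=0).fit(bg)
adapted_means = map_adapt_means(ubm, utt)
print(adapted_means.shape)                        # (8, 13)
```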
2010
|
Gajšek, Rok; Štruc, Vitomir; Mihelič, France Multi-modal Emotion Recognition using Canonical Correlations and Acoustic Features Proceedings Article In: Proceedings of the International Conference on Pattern Recognition (ICPR), pp. 4133-4136, IAPR, Istanbul, Turkey, 2010. @inproceedings{ICPR_Gajsek_2010,
title = {Multi-modal Emotion Recognition using Canonical Correlations and Acoustic Features},
author = {Rok Gajšek and Vitomir Štruc and France Mihelič},
url = {https://lmi.fe.uni-lj.si/en/multi-modalemotionrecognitionusingcanonicalcorrelationsandacusticfeatures/},
year = {2010},
date = {2010-01-01},
urldate = {2010-01-01},
booktitle = {Proceedings of the International Conference on Pattern Recognition (ICPR)},
pages = {4133-4136},
address = {Istanbul, Turkey},
organization = {IAPR},
abstract = {Information about the psycho-physical state of the subject is becoming a valuable addition to modern audio or video recognition systems. As well as enabling a better user experience, it can also improve the recognition accuracy of the base system. In this article, we present our approach to a multi-modal (audio-video) emotion recognition system. For the audio sub-system, a feature set comprising prosodic, spectral and cepstral features is selected, and a support vector classifier is used to produce scores for each emotional category. For the video sub-system, a novel approach is presented that does not rely on tracking specific facial landmarks and thus eliminates the problems usually caused when the tracking algorithm fails to detect the correct area. The system is evaluated on the eNTERFACE database and the recognition accuracy of our audio-video fusion is compared to published results in the literature.},
keywords = {acustic features, canonical correlations, emotion recognition, facial expression recognition, multi modality, speech processing, speech technologies},
pubstate = {published},
tppubtype = {inproceedings}
}
Information about the psycho-physical state of the subject is becoming a valuable addition to modern audio or video recognition systems. As well as enabling a better user experience, it can also improve the recognition accuracy of the base system. In this article, we present our approach to a multi-modal (audio-video) emotion recognition system. For the audio sub-system, a feature set comprising prosodic, spectral and cepstral features is selected, and a support vector classifier is used to produce scores for each emotional category. For the video sub-system, a novel approach is presented that does not rely on tracking specific facial landmarks and thus eliminates the problems usually caused when the tracking algorithm fails to detect the correct area. The system is evaluated on the eNTERFACE database and the recognition accuracy of our audio-video fusion is compared to published results in the literature. |
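The video sub-system above compares whole image sets rather than tracked landmarks, and canonical correlations are a standard way to score the similarity of two such sets: they are the singular values of U_a^T U_b, where U_a and U_b are orthonormal bases of the subspaces spanned by the two sets. The sketch below illustrates that computation on synthetic data; the subspace dimension and the mean-centring are assumptions, not the exact procedure from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def subspace_basis(image_set, dim=5):
    """Orthonormal basis of the linear subspace spanned by a set of
    vectorised face images (rows), via SVD of the mean-centred data."""
    X = image_set - image_set.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:dim].T                            # (n_pixels, dim)

def canonical_correlations(set_a, set_b, dim=5):
    """Canonical correlations between two image sets = singular values of
    U_a^T U_b, where U_a and U_b are orthonormal subspace bases."""
    Ua, Ub = subspace_basis(set_a, dim), subspace_basis(set_b, dim)
    return np.linalg.svd(Ua.T @ Ub, compute_uv=False)

# toy "image sets": 30 frames of 400-pixel faces for two video sequences
seq1 = rng.normal(size=(30, 400))
seq2 = seq1 + 0.1 * rng.normal(size=(30, 400))   # similar sequence
similarity = canonical_correlations(seq1, seq2).mean()
print(round(similarity, 3))
```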
Gajšek, Rok; Štruc, Vitomir; Mihelič, France Multi-modal Emotion Recognition based on the Decoupling of Emotion and Speaker Information Proceedings Article In: Proceedings of Text, Speech and Dialogue (TSD), pp. 275-282, Springer-Verlag, Berlin, Heidelberg, 2010. @inproceedings{TSD_Emo_Gajsek,
title = {Multi-modal Emotion Recognition based on the Decoupling of Emotion and Speaker Information},
author = {Rok Gajšek and Vitomir Štruc and France Mihelič},
url = {https://lmi.fe.uni-lj.si/en/multi-modalemotionrecognitionbasedonthedecouplingofemotionandspeakerinformation/},
year = {2010},
date = {2010-01-01},
urldate = {2010-01-01},
booktitle = {Proceedings of Text, Speech and Dialogue (TSD)},
volume = {6231/2010},
pages = {275-282},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
series = {Lecture Notes on Computer Science},
abstract = {The standard features used in emotion recognition carry, besides the emotion-related information, also cues about the speaker. This is expected, since the nature of emotionally colored speech is similar to the variations in the speech signal caused by different speakers. Therefore, we present a gradient-descent-derived transformation for decoupling the emotion and speaker information contained in the acoustic features. The Interspeech ’09 Emotion Challenge feature set is used as the baseline for the audio part. A similar procedure is employed on the video signal, where nuisance attribute projection (NAP) is used to derive the transformation matrix, which contains information about the emotional state of the speaker. Ultimately, different NAP transformation matrices are compared using canonical correlations. The audio and video sub-systems are combined at the matching-score level using different fusion techniques. The presented system is assessed on the publicly available eNTERFACE’05 database, where significant improvements in recognition performance are observed when compared to the state-of-the-art baseline.},
keywords = {emotion recognition, facial expression recognition, multi modality, speech processing, speech technologies, spontaneous emotions, video processing},
pubstate = {published},
tppubtype = {inproceedings}
}
The standard features used in emotion recognition carry, besides the emotion-related information, also cues about the speaker. This is expected, since the nature of emotionally colored speech is similar to the variations in the speech signal caused by different speakers. Therefore, we present a gradient-descent-derived transformation for decoupling the emotion and speaker information contained in the acoustic features. The Interspeech ’09 Emotion Challenge feature set is used as the baseline for the audio part. A similar procedure is employed on the video signal, where nuisance attribute projection (NAP) is used to derive the transformation matrix, which contains information about the emotional state of the speaker. Ultimately, different NAP transformation matrices are compared using canonical correlations. The audio and video sub-systems are combined at the matching-score level using different fusion techniques. The presented system is assessed on the publicly available eNTERFACE’05 database, where significant improvements in recognition performance are observed when compared to the state-of-the-art baseline. |
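Nuisance attribute projection (NAP), mentioned above for the video signal, removes the dominant directions of unwanted (here speaker-related) variability by projecting features onto the orthogonal complement of those directions. The sketch below shows one common way to estimate that projection from within-speaker deviations; the number of nuisance directions and the scatter estimate are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def nap_projection(features, speaker_ids, n_nuisance=2):
    """Nuisance attribute projection: remove the dominant directions of
    within-speaker variability so that speaker cues are suppressed and
    emotion-related variation is (hopefully) preserved."""
    X = np.asarray(features, dtype=float)
    ids = np.asarray(speaker_ids)
    # within-speaker scatter: deviations of each vector from its speaker mean
    devs = []
    for spk in np.unique(ids):
        Xs = X[ids == spk]
        devs.append(Xs - Xs.mean(axis=0))
    D = np.vstack(devs)
    # top right-singular vectors of the nuisance deviations
    _, _, Vt = np.linalg.svd(D, full_matrices=False)
    V = Vt[:n_nuisance].T                        # (dim, n_nuisance)
    P = np.eye(X.shape[1]) - V @ V.T             # projection that removes them
    return X @ P, P

# toy data: 40 feature vectors (dim 20) from 4 speakers
feats = rng.normal(size=(40, 20))
speakers = np.repeat(np.arange(4), 10)
cleaned, P = nap_projection(feats, speakers, n_nuisance=2)
print(cleaned.shape, np.allclose(P, P.T))
```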
2009
|
Gajšek, Rok; Štruc, Vitomir; Dobrišek, Simon; Mihelič, France Emotion recognition using linear transformations in combination with video Proceedings Article In: Speech and intelligence: proceedings of Interspeech 2009, pp. 1967-1970, Brighton, UK, 2009. @inproceedings{InterSp2009,
title = {Emotion recognition using linear transformations in combination with video},
author = {Rok Gajšek and Vitomir Štruc and Simon Dobrišek and France Mihelič},
url = {https://lmi.fe.uni-lj.si/en/emotionrecognitionusinglineartransformationsincombinationwithvideo/},
year = {2009},
date = {2009-09-01},
urldate = {2009-09-01},
booktitle = {Speech and intelligence: proceedings of Interspeech 2009},
pages = {1967-1970},
address = {Brighton, UK},
abstract = {The paper discusses the use of linear transformations of Hidden Markov Models, normally employed for speaker and environment adaptation, as a way of extracting the emotional components from speech. A constrained version of the Maximum Likelihood Linear Regression (CMLLR) transformation is used as a feature for the classification of a normal or aroused emotional state. We present a procedure for incrementally building a set of speaker-independent acoustic models that are used to estimate the CMLLR transformations for emotion classification. An audio-video database of spontaneous emotions (AvID) is briefly presented, since it forms the basis for the evaluation of the proposed method. Emotion classification using the video part of the database is also described, and the added value of combining the visual information with the audio features is shown.},
keywords = {emotion recognition, facial expression recognition, interspeech, speech, speech technologies, spontaneous emotions},
pubstate = {published},
tppubtype = {inproceedings}
}
The paper discusses the use of linear transformations of Hidden Markov Models, normally employed for speaker and environment adaptation, as a way of extracting the emotional components from speech. A constrained version of the Maximum Likelihood Linear Regression (CMLLR) transformation is used as a feature for the classification of a normal or aroused emotional state. We present a procedure for incrementally building a set of speaker-independent acoustic models that are used to estimate the CMLLR transformations for emotion classification. An audio-video database of spontaneous emotions (AvID) is briefly presented, since it forms the basis for the evaluation of the proposed method. Emotion classification using the video part of the database is also described, and the added value of combining the visual information with the audio features is shown. |
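A CMLLR transform is an affine map of the observation vectors, o_hat = A o + b, estimated per recording against a set of acoustic models; the idea above is to use the estimated transform parameters themselves as a fixed-length feature for emotion classification. The toy sketch below only illustrates that last step (vectorising (A, b) and feeding it to an SVM) on synthetic transforms; in practice A and b would be estimated with an HMM toolkit, and all dimensions and data here are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def cmllr_to_feature(A, b):
    """Flatten an estimated CMLLR affine transform (o_hat = A @ o + b)
    into a fixed-length feature vector."""
    return np.concatenate([A.ravel(), b.ravel()])

# synthetic, purely illustrative "transforms" for two emotional states
dim = 13
def fake_transform(shift):
    A = np.eye(dim) + 0.01 * rng.normal(size=(dim, dim))
    b = shift + 0.01 * rng.normal(size=dim)
    return cmllr_to_feature(A, b)

X = np.stack([fake_transform(0.0) for _ in range(20)] +
             [fake_transform(0.5) for _ in range(20)])
y = np.array([0] * 20 + [1] * 20)               # normal vs. aroused

clf = SVC(kernel='linear').fit(X, y)
print(clf.score(X, y))                           # training accuracy on toy data
```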
Gajšek, Rok; Štruc, Vitomir; Mihelič, France; Podlesek, Anja; Komidar, Luka; Sočan, Gregor; Bajec, Boštjan Multi-modal emotional database: AvID Journal Article In: Informatica (Ljubljana), vol. 33, no. 1, pp. 101-106, 2009. @article{Inform-Gajsek_2009,
title = {Multi-modal emotional database: AvID},
author = {Rok Gajšek and Vitomir Štruc and France Mihelič and Anja Podlesek and Luka Komidar and Gregor Sočan and Boštjan Bajec},
url = {https://lmi.fe.uni-lj.si/en/multi-modalemotionaldatabaseavid/},
year = {2009},
date = {2009-01-01},
urldate = {2009-01-01},
journal = {Informatica (Ljubljana)},
volume = {33},
number = {1},
pages = {101-106},
abstract = {This paper presents our work on recording a multi-modal database containing emotional audio and video recordings. In designing the recording strategies, special attention was paid to gathering data involving spontaneous emotions and therefore obtaining more realistic training and testing conditions for experiments. With specially planned scenarios, including playing computer games and conducting an adaptive intelligence test, different levels of arousal were induced. This will enable us both to detect different emotional states and to experiment with speaker identification/verification of the people involved in the communication. So far, the multi-modal database has been recorded and a basic evaluation of the data has been carried out.},
keywords = {avid, database, dataset, emotion recognition, facial expression recognition, speech, speech technologies, spontaneous emotions},
pubstate = {published},
tppubtype = {article}
}
This paper presents our work on recording a multi-modal database containing emotional audio and video recordings. In designing the recording strategies, special attention was paid to gathering data involving spontaneous emotions and therefore obtaining more realistic training and testing conditions for experiments. With specially planned scenarios, including playing computer games and conducting an adaptive intelligence test, different levels of arousal were induced. This will enable us both to detect different emotional states and to experiment with speaker identification/verification of the people involved in the communication. So far, the multi-modal database has been recorded and a basic evaluation of the data has been carried out. |
Gajšek, Rok; Štruc, Vitomir; Dobrišek, Simon; Žibert, Janez; Mihelič, France; Pavešić, Nikola Combining audio and video for detection of spontaneous emotions Proceedings Article In: Biometric ID management and multimodal communication, pp. 114-121, Springer-Verlag, Berlin, Heidelberg, 2009. @inproceedings{BioID_Multi2009b,
title = {Combining audio and video for detection of spontaneous emotions},
author = {Rok Gajšek and Vitomir Štruc and Simon Dobrišek and Janez Žibert and France Mihelič and Nikola Pavešić},
url = {https://lmi.fe.uni-lj.si/en/combiningaudioandvideofordetectionofspontaneousemotions/},
year = {2009},
date = {2009-01-01},
urldate = {2009-01-01},
booktitle = {Biometric ID management and multimodal communication},
volume = {5707},
pages = {114-121},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
series = {Lecture Notes on Computer Science},
abstract = {The paper presents our initial attempts at building an audio-video emotion recognition system. Both the audio and video sub-systems are discussed, and a description of the database of spontaneous emotions is given. The task of labelling the recordings from the database according to different emotions is discussed, and the measured agreement between multiple annotators is presented. Instead of focusing on prosody in audio emotion recognition, we evaluate the possibility of using linear transformations (CMLLR) as features. The classification results from the audio and video sub-systems are combined using sum-rule fusion, and the improvement in recognition results when using both modalities is presented.},
keywords = {emotion recognition, facial expression recognition, performance evaluation, speech processing, speech technologies},
pubstate = {published},
tppubtype = {inproceedings}
}
The paper presents our initial attempts at building an audio-video emotion recognition system. Both the audio and video sub-systems are discussed, and a description of the database of spontaneous emotions is given. The task of labelling the recordings from the database according to different emotions is discussed, and the measured agreement between multiple annotators is presented. Instead of focusing on prosody in audio emotion recognition, we evaluate the possibility of using linear transformations (CMLLR) as features. The classification results from the audio and video sub-systems are combined using sum-rule fusion, and the improvement in recognition results when using both modalities is presented. |
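The sum-rule fusion mentioned above simply averages the per-class scores of the two modalities after bringing them onto a comparable scale. A minimal sketch follows; min-max normalisation is an assumption, since the abstract does not state which score normalisation was used.

```python
import numpy as np

def minmax(scores):
    """Map per-class scores to [0, 1] so the two modalities are comparable."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-8)

def sum_rule_fusion(audio_scores, video_scores):
    """Sum-rule fusion: normalise each modality's scores and average them."""
    return (minmax(audio_scores) + minmax(video_scores)) / 2.0

emotions = ['anger', 'joy', 'sadness', 'surprise']
audio = [0.1, 2.3, 0.4, 1.1]       # e.g. classifier scores from the audio sub-system
video = [0.7, 0.6, 0.1, 0.2]       # e.g. matching scores from the video sub-system
fused = sum_rule_fusion(audio, video)
print(emotions[int(np.argmax(fused))])
```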
Gajšek, Rok; Štruc, Vitomir; Vesnicer, Boštjan; Podlesek, Anja; Komidar, Luka; Mihelič, France Analysis and assessment of AvID: multi-modal emotional database Proceedings Article In: Text, speech and dialogue / 12th International Conference, pp. 266-273, Springer-Verlag, Berlin, Heidelberg, 2009. @inproceedings{TSD2009,
title = {Analysis and assessment of AvID: multi-modal emotional database},
author = {Rok Gajšek and Vitomir Štruc and Boštjan Vesnicer and Anja Podlesek and Luka Komidar and France Mihelič},
url = {https://lmi.fe.uni-lj.si/en/analysisandassessmentofavidmulti-modalemotionaldatabase/},
year = {2009},
date = {2009-01-01},
urldate = {2009-01-01},
booktitle = {Text, speech and dialogue / 12th International Conference},
volume = {5729},
pages = {266-273},
publisher = {Springer-Verlag},
address = {Berlin, Heidelberg},
series = {Lecture Notes on Computer Science},
abstract = {The paper deals with the recording and evaluation of a multi-modal (audio/video) database of spontaneous emotions. Firstly, the motivation for this work is given and the different recording strategies used are described. Special attention is given to the process of evaluating the emotional database. Different kappa statistics normally used for measuring the agreement between annotators are discussed. Owing to the problems of standard kappa coefficients when used for emotional database assessment, a new time-weighted free-marginal kappa is presented. It differs from the other kappa statistics in that it weights each utterance's agreement score by the duration of the utterance. The new method is evaluated and its superiority over the standard kappa, when dealing with a database of spontaneous emotions, is demonstrated.},
keywords = {avid database, database, emotion recognition, multimodal database, speech, speech technologies},
pubstate = {published},
tppubtype = {inproceedings}
}
The paper deals with the recording and evaluation of a multi-modal (audio/video) database of spontaneous emotions. Firstly, the motivation for this work is given and the different recording strategies used are described. Special attention is given to the process of evaluating the emotional database. Different kappa statistics normally used for measuring the agreement between annotators are discussed. Owing to the problems of standard kappa coefficients when used for emotional database assessment, a new time-weighted free-marginal kappa is presented. It differs from the other kappa statistics in that it weights each utterance's agreement score by the duration of the utterance. The new method is evaluated and its superiority over the standard kappa, when dealing with a database of spontaneous emotions, is demonstrated. |
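Randolph's free-marginal kappa compares the observed agreement P_o with the chance agreement 1/k for k categories, kappa = (P_o - 1/k) / (1 - 1/k); the time-weighted variant described above weights each utterance's agreement by its duration when forming P_o. The sketch below follows that reading; the exact weighting scheme in the paper may differ, so treat this as an assumed illustration.

```python
import numpy as np

def time_weighted_free_marginal_kappa(labels, durations, n_categories):
    """Free-marginal multi-rater kappa in which each utterance's agreement
    is weighted by its duration (an assumption about the exact scheme).

    labels:    (n_utterances, n_annotators) integer emotion labels
    durations: (n_utterances,) utterance durations in seconds
    """
    labels = np.asarray(labels)
    durations = np.asarray(durations, dtype=float)
    n_items, n_raters = labels.shape

    # per-utterance observed agreement (fraction of agreeing rater pairs)
    agree = np.empty(n_items)
    for i in range(n_items):
        counts = np.bincount(labels[i], minlength=n_categories)
        agree[i] = (counts * (counts - 1)).sum() / (n_raters * (n_raters - 1))

    p_obs = np.average(agree, weights=durations)   # time-weighted agreement
    p_exp = 1.0 / n_categories                     # free-marginal chance agreement
    return (p_obs - p_exp) / (1.0 - p_exp)

# toy example: 4 utterances, 3 annotators, 3 emotion categories
labels = [[0, 0, 0], [1, 1, 2], [2, 2, 2], [0, 1, 2]]
durations = [4.0, 1.5, 3.0, 0.5]                   # longer utterances count more
print(round(time_weighted_free_marginal_kappa(labels, durations, 3), 3))
```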
2008
|
Gajšek, Rok; Podlesek, Anja; Komidar, Luka; Sočan, Gregor; Bajec, Boštjan; Štruc, Vitomir; Bucik, Valentin; Mihelič, France AvID: audio-video emotional database Proceedings Article In: Proceedings of the 11th International Multi-conference Information Society (IS'08), pp. 70-74, Ljubljana, Slovenia, 2008. @inproceedings{JJ2008,
title = {AvID: audio-video emotional database},
author = {Rok Gajšek and Anja Podlesek and Luka Komidar and Gregor Sočan and Boštjan Bajec and Vitomir Štruc and Valentin Bucik and France Mihelič},
year = {2008},
date = {2008-01-01},
booktitle = {Proceedings of the 11th International Multi-conference Information Society (IS'08)},
volume = {C},
pages = {70-74},
address = {Ljubljana, Slovenia},
keywords = {database, dataset, emotion recognition, facial expression recognition, multimodal database, speech technology, spontaneous emotions},
pubstate = {published},
tppubtype = {inproceedings}
}
|