Gan, Chenquan; Yang, Yucheng; Zhu, Qingyi; Jain, Deepak Kumar; Struc, Vitomir
In: Expert Systems with Applications, vol. 210, 2022.
To balance the trade-off between contextual information and fine-grained information when identifying specific emotions during a dialogue, and to combine the interaction of hierarchical feature-related information, this paper proposes a hierarchical feature interactive fusion network (named DHF-Net), which not only retains the integrity of the context sequence information but also extracts more fine-grained information. To obtain deep semantic information, DHF-Net processes the tasks of recognizing dialogue emotion and dialogue act/intent separately, and then learns the cross-impact of the two tasks through collaborative attention. In addition, a bidirectional gated recurrent unit (Bi-GRU) connected hybrid convolutional neural network (CNN) group method is designed, by which the sequence information is smoothly sent to the multi-level local information layers for feature extraction. Experimental results show that, on two open session datasets, the performance of DHF-Net is improved by 1.8% and 1.2%, respectively.
Šircelj, Jaka; Peer, Peter; Solina, Franc; Štruc, Vitomir
In: Proceedings of ERK 2022, pp. 1-4, 2022.
We introduce a new method to reconstruct 3D objects using a set of volumetric primitives, i.e., superquadrics. The method hierarchically decomposes a target 3D object into pairs of superquadrics recovering finer and finer details. While such hierarchical methods have been studied before, we introduce a new way of splitting the object space using only properties of the predicted superquadrics. The method is trained and evaluated on the ShapeNet dataset. The results of our experiments suggest that reasonable reconstructions can be obtained with the proposed approach for a diverse set of objects with complex geometry.
Babnik, Žiga; Štruc, Vitomir
In: Proceedings of ERK 2022, pp. 1-4, 2022.
While recent face recognition (FR) systems achieve excellent results in many deployment scenarios, their performance in challenging real-world settings is still in question. For this reason, face image quality assessment (FIQA) techniques aim to support FR systems by providing them with sample quality information that can be used to reject poor-quality data unsuitable for recognition purposes. Several groups of FIQA methods relying on different concepts have been proposed in the literature, all of which can be used for generating quality scores of facial images that can serve as pseudo ground-truth (quality) labels and be exploited for training (regression-based) quality estimation models. Several FIQA approaches show that a significant amount of sample-quality information can be extracted from mated similarity-score distributions generated with some face matcher. Based on this insight, we propose in this paper a quality label optimization approach, which incorporates sample-quality information from mated-pair similarities into quality predictions of existing off-the-shelf FIQA techniques. We evaluate the proposed approach using three state-of-the-art FIQA methods over three diverse datasets. The results of our experiments show that the proposed optimization procedure heavily depends on the number of executed optimization iterations. At ten iterations, the approach seems to perform the best, consistently outperforming the base quality scores of the three FIQA methods chosen for the experiments.
Tomašević, Darian; Peer, Peter; Solina, Franc; Jaklič, Aleš; Štruc, Vitomir
Reconstructing Superquadrics from Intensity and Color Images Journal Article
In: Sensors, vol. 22, iss. 4, no. 5332, 2022.
The task of reconstructing 3D scenes based on visual data represents a longstanding problem in computer vision. Common reconstruction approaches rely on the use of multiple volumetric primitives to describe complex objects. Superquadrics (a class of volumetric primitives) have shown great promise due to their ability to describe various shapes with only a few parameters. Recent research has shown that deep learning methods can be used to accurately reconstruct random superquadrics from both 3D point cloud data and simple depth images. In this paper, we extended these reconstruction methods to intensity and color images. Specifically, we used a dedicated convolutional neural network (CNN) model to reconstruct a single superquadric from the given input image. We analyzed the results in a qualitative and quantitative manner, by visualizing reconstructed superquadrics as well as observing error and accuracy distributions of predictions. We showed that a CNN model designed around a simple ResNet backbone can be used to accurately reconstruct superquadrics from images containing one object, but only if one of the spatial parameters is fixed or if it can be determined from other image characteristics, e.g., shadows. Furthermore, we experimented with images of increasing complexity, for example, by adding textures, and observed that the results degraded only slightly. In addition, we show that our model outperforms the current state-of-the-art method on the studied task. Our final result is a highly accurate superquadric reconstruction model, which can also reconstruct superquadrics from real images of simple objects, without additional training.
Dvoršak, Grega; Dwivedi, Ankita; Štruc, Vitomir; Peer, Peter; Emeršič, Žiga
In: International Workshop on Biometrics and Forensics (IWBF), pp. 1–6, 2022.
The analysis of kin relations from visual data represents a challenging research problem with important real-world applications. However, research in this area has mostly been limited to the analysis of facial images, despite the potential of other physical (human) characteristics for this task. In this paper, we therefore study the problem of kinship verification from ear images and investigate whether salient appearance characteristics, useful for this task, can be extracted from ear data. To facilitate the study, we introduce a novel dataset, called KinEar, that contains data from 19 families with each family member having from 15 to 31 ear images. Using the KinEar data, we conduct experiments using a Siamese training setup and 5 recent deep learning backbones. The results of our experiments suggest that ear images represent a viable alternative to other modalities for kinship verification, as 4 out of 5 considered models reach a performance of over 60% in terms of the Area Under the Receiver Operating Characteristic curve (ROC-AUC).
Jug, Julijan; Lampe, Ajda; Štruc, Vitomir; Peer, Peter
Body Segmentation Using Multi-task Learning Inproceedings
In: International Conference on Artificial Intelligence in Information and Communication (ICAIIC), IEEE, 2022, ISBN: 978-1-6654-5818-4.
Body segmentation is an important step in many computer vision problems involving human images and one of the key components that affects the performance of all downstream tasks. Several prior works have approached this problem using a multi-task model that exploits correlations between different tasks to improve segmentation performance. Based on the success of such solutions, we present in this paper a novel multi-task model for human segmentation/parsing that involves three tasks, i.e., (i) keypoint-based skeleton estimation, (ii) dense pose prediction, and (iii) human-body segmentation. The main idea behind the proposed Segmentation-Pose-DensePose model (or SPD for short) is to learn a better segmentation model by sharing knowledge across different, yet related tasks. SPD is based on a shared deep neural network backbone that branches off into three task-specific model heads and is learned using a multi-task optimization objective. The performance of the model is analysed through rigorous experiments on the LIP and ATR datasets and in comparison to a recent (state-of-the-art) multi-task body-segmentation model. Comprehensive ablation studies are also presented. Our experimental results show that the proposed multi-task (segmentation) model is highly competitive and that the introduction of additional tasks contributes towards a higher overall segmentation performance.
Grm, Klemen; Štruc, Vitomir
Frequency Band Encoding for Face Super-Resolution Inproceedings
In: Proceedings of ERK 2021, pp. 1-4, 2021.
In this paper, we present a novel method for face super-resolution based on an encoder-decoder architecture. Unlike previous approaches, which focused primarily on directly reconstructing the high-resolution face appearance from low-resolution images, our method relies on a multi-stage approach where we learn a face representation in different frequency bands, followed by decoding the representation into a high-resolution image. Using quantitative experiments, we are able to demonstrate that this approach results in better face image reconstruction, as well as aiding in downstream semantic tasks such as face recognition and face verification.
Bortolato, Blaž; Ivanovska, Marija; Rot, Peter; Križaj, Janez; Terhorst, Philipp; Damer, Naser; Peer, Peter; Štruc, Vitomir
In: Proceedings of FG 2020, IEEE, 2020.
Convolutional Neural Networks (CNNs) are today the de facto standard for extracting compact and discriminative face representations (templates) from images in automatic face recognition systems. Due to the characteristics of CNN models, the generated representations typically encode a multitude of information ranging from identity to soft-biometric attributes, such as age, gender or ethnicity. However, since these representations were computed for the purpose of identity recognition only, the soft-biometric information contained in the templates represents a serious privacy risk. To mitigate this problem, we present in this paper a privacy-enhancing approach capable of suppressing potentially sensitive soft-biometric information in face representations without significantly compromising identity information. Specifically, we introduce a Privacy-Enhancing Face-Representation learning Network (PFRNet) that disentangles identity from attribute information in face representations and consequently allows for the efficient suppression of soft-biometrics in face templates. We demonstrate the feasibility of PFRNet on the problem of gender suppression and show through rigorous experiments on the CelebA, Labeled Faces in the Wild (LFW) and Adience datasets that the proposed disentanglement-based approach is highly effective and improves significantly on the existing state-of-the-art.
Šircelj, Jaka; Oblak, Tim; Grm, Klemen; Petković, Uroš; Jaklič, Aleš; Peer, Peter; Štruc, Vitomir; Solina, Franc
In: 25th Computer Vision Winter Workshop (CVWW 2020), 2020.
In this paper we address the problem of representing 3D visual data with parameterized volumetric shape primitives. Specifically, we present a (two-stage) approach built around convolutional neural networks (CNNs) capable of segmenting complex depth scenes into simpler geometric structures that can be represented with superquadric models. In the first stage, our approach uses a Mask RCNN model to identify superquadric-like structures in depth scenes; in the second stage, it fits superquadric models to the segmented structures using a specially designed CNN regressor. Using our approach we are able to describe complex structures with a small number of interpretable parameters. We evaluated the proposed approach on synthetic as well as real-world depth data and show that our solution not only results in competitive performance in comparison to the state-of-the-art, but is also able to decompose scenes into a number of superquadric models in a fraction of the time required by competing approaches. We make all data and models used in the paper available from https://lmi.fe.uni-lj.si/en/research/resources/sq-seg.
Stepec, Dejan; Emersic, Ziga; Peer, Peter; Struc, Vitomir
Constellation-Based Deep Ear Recognition Incollection
In: Jiang, R.; Li, CT.; Crookes, D.; Meng, W.; Rosenberger, C. (Ed.): Deep Biometrics: Unsupervised and Semi-Supervised Learning, Springer, 2020, ISBN: 978-3-030-32582-4.
This chapter introduces COM-Ear, a deep constellation model for ear recognition. Different from competing solutions, COM-Ear encodes global as well as local characteristics of ear images and generates descriptive ear representations that ensure competitive recognition performance. The model is designed as a dual-path convolutional neural network (CNN), where one path processes the input in a holistic manner, and the second captures local image characteristics from image patches sampled from the input image. A novel pooling operation, called patch-relevant-information pooling, is also proposed and integrated into the COM-Ear model. The pooling operation helps to select features from the input patches that are locally important and to focus the attention of the network on image regions that are descriptive and important for representation purposes. The model is trained in an end-to-end manner using a combined cross-entropy and center loss. Extensive experiments on the recently introduced Extended Annotated Web Ears (AWEx).
Grm, Klemen; Scheirer, Walter J.; Štruc, Vitomir
In: IEEE Transactions on Image Processing, 2020.
In this paper we address the problem of hallucinating high-resolution facial images from low-resolution inputs at high magnification factors. We approach this task with convolutional neural networks (CNNs) and propose a novel (deep) face hallucination model that incorporates identity priors into the learning procedure. The model consists of two main parts: i) a cascaded super-resolution network that upscales the low-resolution facial images, and ii) an ensemble of face recognition models that act as identity priors for the super-resolution network during training. Different from most competing super-resolution techniques that rely on a single model for upscaling (even with large magnification factors), our network uses a cascade of multiple SR models that progressively upscale the low-resolution images using steps of 2×. This characteristic allows us to apply supervision signals (target appearances) at different resolutions and incorporate identity constraints at multiple scales. The proposed C-SRIP model (Cascaded Super Resolution with Identity Priors) is able to upscale (tiny) low-resolution images captured in unconstrained conditions and produce visually convincing results for diverse low-resolution inputs. We rigorously evaluate the proposed model on the Labeled Faces in the Wild (LFW), Helen and CelebA datasets and report superior performance compared to the existing state-of-the-art.
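The progressive 2× upscaling described in this abstract can be illustrated with a small sketch of the resolution schedule. It is not the C-SRIP implementation: the function name `cascade_resolutions`, the restriction to power-of-two magnification factors, and the 24-pixel input size in the usage note are assumptions made only for illustration.

```python
def cascade_resolutions(low_res, magnification):
    """List the image side length after each 2x stage of a cascade.

    Hypothetical helper: assumes the overall magnification factor is a
    power of two, so it can be reached with whole 2x steps.
    """
    if magnification < 1 or magnification & (magnification - 1):
        raise ValueError("magnification must be a power of two")
    sizes = [low_res]
    target = low_res * magnification
    while sizes[-1] < target:
        # Each stage doubles the side length; supervision could be
        # applied at every intermediate size in this list.
        sizes.append(sizes[-1] * 2)
    return sizes
```

For example, `cascade_resolutions(24, 8)` gives the four sizes 24, 48, 96 and 192, i.e. three 2× stages for an 8× magnification.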
Vitek, Matej; Rot, Peter; Struc, Vitomir; Peer, Peter
In: Neural Computing and Applications, pp. 1-15, 2020.
The area of ocular biometrics is among the most popular branches of biometric recognition technology. This area has long been dominated by iris recognition research, while other ocular modalities such as the periocular region or the vasculature of the sclera have received significantly less attention in the literature. Consequently, ocular modalities beyond the iris are not well studied and their characteristics are today still not as well understood. While recent needs for more secure authentication schemes have considerably increased the interest in competing ocular modalities, progress in these areas is still held back by the lack of publicly available datasets that would allow for more targeted research into specific ocular characteristics next to the iris. In this paper, we aim to bridge this gap for the case of sclera biometrics and introduce a novel dataset designed for research into ocular biometrics and most importantly for research into the vasculature of the sclera. Our dataset, called Sclera Blood Vessels, Periocular and Iris (SBVPI), is, to the best of our knowledge, the first publicly available dataset designed specifically with research in sclera biometrics in mind. The dataset contains high-quality RGB ocular images, captured in the visible spectrum, belonging to 55 subjects. Unlike competing datasets, it comes with manual markups of various eye regions, such as the iris, pupil, canthus or eyelashes and a detailed pixel-wise annotation of the complete sclera vasculature for a subset of the images. Additionally, the dataset ships with gender and age labels. The unique characteristics of the dataset allow us to study aspects of sclera biometrics technology that have not been studied before in the literature (e.g. vasculature segmentation techniques) as well as issues that are of key importance for practical recognition systems.
Thus, next to the SBVPI dataset we also present in this paper a comprehensive investigation into sclera biometrics and the main covariates that affect the performance of sclera segmentation and recognition techniques, such as gender, age, gaze direction or image resolution. Our experiments not only demonstrate the usefulness of the newly introduced dataset, but also contribute to a better understanding of sclera biometrics in general.
Rot, Peter; Vitek, Matej; Grm, Klemen; Emeršič, Žiga; Peer, Peter; Štruc, Vitomir
Deep Sclera Segmentation and Recognition Incollection
In: Uhl, Andreas; Busch, Christoph; Marcel, Sebastien; Veldhuis, Rainer (Ed.): Handbook of Vascular Biometrics, pp. 395-432, Springer, 2019, ISBN: 978-3-030-27731-4.
In this chapter, we address the problem of biometric identity recognition from the vasculature of the human sclera. Specifically, we focus on the challenging task of multi-view sclera recognition, where the visible part of the sclera vasculature changes from image to image due to varying gaze (or view) directions. We propose a complete solution for this task built around Convolutional Neural Networks (CNNs) and make several contributions that result in state-of-the-art recognition performance, i.e.: (i) we develop a cascaded CNN assembly that is able to robustly segment the sclera vasculature from the input images regardless of gaze direction, and (ii) we present ScleraNET, a CNN model trained in a multi-task manner (combining losses pertaining to identity and view-direction recognition) that allows for the extraction of discriminative vasculature descriptors that can be used for identity inference. To evaluate the proposed contributions, we also introduce a new dataset of ocular images, called the Sclera Blood Vessels, Periocular and Iris (SBVPI) dataset, which represents one of the few publicly available datasets suitable for research in multi-view sclera segmentation and recognition. The dataset comes with a rich set of annotations, such as a per-pixel markup of various eye parts (including the sclera vasculature), identity, gaze-direction and gender labels. We conduct rigorous experiments on SBVPI with competing techniques from the literature and show that the combination of the proposed segmentation and descriptor-computation models results in highly competitive recognition performance.
Oblak, Tim; Grm, Klemen; Jaklič, Aleš; Peer, Peter; Štruc, Vitomir; Solina, Franc
In: 2019 IEEE International Work Conference on Bioinspired Intelligence (IWOBI), pp. 45-52, IEEE, 2019.
It has been a longstanding goal in computer vision to describe the 3D physical space in terms of parameterized volumetric models that would allow autonomous machines to understand and interact with their surroundings. Such models are typically motivated by human visual perception and aim to represent all elements of the physical world, ranging from individual objects to complex scenes, using a small set of parameters. One of the de facto standards for approaching this problem is the use of superquadrics, volumetric models that define various 3D shape primitives and can be fitted to actual 3D data (either in the form of point clouds or range images). However, existing solutions to superquadric recovery involve costly iterative fitting procedures, which limit the applicability of such techniques in practice. To alleviate this problem, we explore in this paper the possibility of recovering superquadrics from range images without time-consuming iterative parameter estimation techniques by using contemporary deep-learning models, more specifically, convolutional neural networks (CNNs). We pose the superquadric recovery problem as a regression task and develop a CNN regressor that is able to estimate the parameters of a superquadric model from a given range image. We train the regressor on a large set of synthetic range images, each containing a single (unrotated) superquadric shape, and evaluate the learned model in comparative experiments with the current state-of-the-art. Additionally, we also present a qualitative analysis involving a dataset of real-world objects. The results of our experiments show that the proposed regressor not only outperforms the existing state-of-the-art, but also ensures a 270x faster execution time.
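As background for the superquadric entries in this list, the sketch below evaluates the standard inside-outside function of an origin-centred, unrotated superquadric. It only illustrates what the small parameter set describes; it is not the CNN regressor from the paper. The parameter names `a1`, `a2`, `a3` (axis sizes) and `e1`, `e2` (shape exponents) follow common superquadric notation and are assumptions, as is the function name.

```python
def superquadric_inside_outside(x, y, z, a1, a2, a3, e1, e2):
    """Inside-outside function F of an origin-centred, unrotated superquadric.

    F < 1 means the point lies inside, F == 1 on the surface, and
    F > 1 outside. Absolute values keep the fractional powers real
    for negative coordinates.
    """
    xy_term = (abs(x / a1) ** (2.0 / e2) + abs(y / a2) ** (2.0 / e2)) ** (e2 / e1)
    z_term = abs(z / a3) ** (2.0 / e1)
    return xy_term + z_term
```

With `e1 = e2 = 1` the shape is an ellipsoid, so a point such as `(0, 2, 0)` lies on the surface when `a2 = 2`. Smaller exponents push the shape towards a box, which is how a handful of parameters can cover many primitives.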
Lozej, Juš; Meden, Blaž; Struc, Vitomir; Peer, Peter
End-to-end iris segmentation using U-Net Inproceedings
In: 2018 IEEE International Work Conference on Bioinspired Intelligence (IWOBI), pp. 1–6, IEEE, 2018.
Iris segmentation is an important research topic that has received significant attention from the research community over the years. Traditional iris segmentation techniques have typically been focused on hand-crafted procedures that, nonetheless, achieved remarkable segmentation performance even with images captured in difficult settings. With the success of deep-learning models, researchers are increasingly looking towards convolutional neural networks (CNNs) to further improve on the accuracy of existing iris segmentation techniques, and several CNN-based techniques have recently been presented in the literature. In this paper we also consider deep-learning models for iris segmentation and present an iris segmentation approach based on the popular U-Net architecture. Our model is trainable end-to-end and, hence, avoids the need for hand-designing the segmentation procedure. We evaluate the model on the CASIA dataset and report encouraging results in comparison to existing techniques used in this area.
Emeršič, Žiga; Štepec, Dejan; Štruc, Vitomir; Peer, Peter
In: IEEE International Conference on Automatic Face and Gesture Recognition, Workshop on Biometrics in the Wild 2017, 2017.
Identity recognition from ear images is an active field of research within the biometric community. The ability to capture ear images from a distance and in a covert manner makes ear recognition technology an appealing choice for surveillance and security applications as well as related application domains. In contrast to other biometric modalities, where large datasets captured in uncontrolled settings are readily available, datasets of ear images are still limited in size and mostly of laboratory-like quality. As a consequence, ear recognition technology has not yet benefited from advances in deep learning and convolutional neural networks (CNNs) and is still lagging behind other modalities that experienced significant performance gains owing to deep recognition technology. In this paper we address this problem and aim at building a CNN-based ear recognition model. We explore different strategies towards model training with limited amounts of training data and show that by selecting an appropriate model architecture, using aggressive data augmentation and selective learning on existing (pre-trained) models, we are able to learn an effective CNN-based model using a little more than 1300 training images. The result of our work is the first CNN-based approach to ear recognition that is also made publicly available to the research community. With our model we are able to improve on the rank one recognition rate of the previous state-of-the-art by more than 25% on a challenging dataset of ear images captured from the web (a.k.a. in the wild).
Grm, Klemen; Štruc, Vitomir; Artiges, Anais; Caron, Matthieu; Ekenel, Hazim K.
In: IET Biometrics, vol. 7, no. 1, pp. 81–89, 2017.
Convolutional neural network (CNN) based approaches are the state of the art in various computer vision tasks including face recognition. Considerable research effort is currently being directed toward further improving CNNs by focusing on model architectures and training techniques. However, studies systematically exploring the strengths and weaknesses of existing deep models for face recognition are still relatively scarce. In this paper, we try to fill this gap and study the effects of different covariates on the verification performance of four recent CNN models using the Labelled Faces in the Wild dataset. Specifically, we investigate the influence of covariates related to image quality and model characteristics, and analyse their impact on the face verification performance of different deep CNN models. Based on comprehensive and rigorous experimentation, we identify the strengths and weaknesses of the deep learning models, and present key areas for potential future research. Our results indicate that high levels of noise, blur, missing pixels, and brightness have a detrimental effect on the verification performance of all models, whereas the impact of contrast changes and compression artefacts is limited. We find that the descriptor-computation strategy and colour information do not have a significant influence on performance.
Stržinar, Žiga; Grm, Klemen; Štruc, Vitomir
In: Proceedings of the Electrotechnical and Computer Science Conference (ERK), Portorož, Slovenia, 2016.
Learning the similarity between pairs of input images is one of the most popular approaches to recognition in the field of deep learning. In this approach, a deep neural network takes a pair of (face) images as input and outputs a similarity measure between the two input images, which can be used for recognition. The similarity computation can be implemented entirely by the deep network, or the network can be used only to compute a representation of the input image pair, while the mapping from the computed representation to the similarity measure is performed by another, potentially more suitable model. In this paper, we test 5 different models for performing the mapping between the computed representation and the similarity measure, using our own neural network for the experiments. The results of our experiments on the problem of face recognition point to the importance of choosing a suitable model, as the differences in recognition performance from model to model are considerable.
Grm, Klemen; Dobrišek, Simon; Štruc, Vitomir
Deep pair-wise similarity learning for face recognition Inproceedings
In: 4th International Workshop on Biometrics and Forensics (IWBF), pp. 1–6, IEEE, 2016.
Recent advances in deep learning have made it possible to build deep hierarchical models capable of delivering state-of-the-art performance in various vision tasks, such as object recognition, detection or tracking. For recognition tasks the most common approach when using deep models is to learn object representations (or features) directly from raw image-input and then feed the learned features to a suitable classifier. Deep models used in this pipeline are typically heavily parameterized and require enormous amounts of training data to deliver competitive recognition performance. Despite the use of data augmentation techniques, many application domains, predefined experimental protocols or specifics of the recognition problem limit the amount of available training data and make training an effective deep hierarchical model a difficult task. In this paper, we present a novel, deep pair-wise similarity learning (DPSL) strategy for deep models, developed specifically to overcome the problem of insufficient training data, and demonstrate its usage on the task of face recognition. Unlike existing (deep) learning strategies, DPSL operates on image-pairs and tries to learn pair-wise image similarities that can be used for recognition purposes directly instead of feature representations that need to be fed to appropriate classification techniques, as with traditional deep learning pipelines. Since our DPSL strategy assumes an image pair as the input to the learning procedure, the amount of training data available to train deep models is quadratic in the number of available training images, which is of paramount importance for models with a large number of parameters. We demonstrate the efficacy of the proposed learning strategy by developing a deep model for pose-invariant face recognition, called Pose-Invariant Similarity Index (PISI), and presenting comparative experimental results on the FERET and IJB-A datasets.
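The claim that pair-based training data grows quadratically with the number of images can be checked with a short counting sketch. It is not the DPSL training pipeline: the helper names are hypothetical, and counting unordered pairs without self-pairs is an assumption about how pairs are formed.

```python
from itertools import combinations

def count_training_pairs(num_images):
    """Number of unordered image pairs, n * (n - 1) / 2 (assumed pairing rule)."""
    return num_images * (num_images - 1) // 2

def make_pairs(labels):
    """Return (i, j, same_identity) for every unordered pair of image indices."""
    return [
        (i, j, labels[i] == labels[j])
        for i, j in combinations(range(len(labels)), 2)
    ]
```

Under this pairing rule, 1000 training images already yield 499500 pairs, and `make_pairs(["a", "a", "b"])` labels only the first of its three pairs as a matching identity.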
Grm, Klemen; Dobrišek, Simon; Štruc, Vitomir
The pose-invariant similarity index for face recognition Inproceedings
In: Proceedings of the Electrotechnical and Computer Science Conference (ERK), Portorož, Slovenia, 2015.