2025
Pernuš, Martin; Fookes, Clinton; Štruc, Vitomir; Dobrišek, Simon FICE: Text-conditioned fashion-image editing with guided GAN inversion Journal Article In: Pattern Recognition, vol. 158, no. 111022, pp. 1-18, 2025. @article{PR_FICE_2024,
title = {FICE: Text-conditioned fashion-image editing with guided GAN inversion},
author = {Martin Pernuš and Clinton Fookes and Vitomir Štruc and Simon Dobrišek},
url = {https://www.sciencedirect.com/science/article/pii/S0031320324007738
https://lmi.fe.uni-lj.si/wp-content/uploads/2024/09/FICE_main_paper.pdf
https://lmi.fe.uni-lj.si/wp-content/uploads/2024/09/FICE_supplementary.pdf},
doi = {10.1016/j.patcog.2024.111022},
year = {2025},
date = {2025-02-01},
urldate = {2025-02-01},
journal = {Pattern Recognition},
volume = {158},
number = {111022},
pages = {1-18},
abstract = {Fashion-image editing is a challenging computer-vision task where the goal is to incorporate selected apparel into a given input image. Most existing techniques, known as Virtual Try-On methods, deal with this task by first selecting an example image of the desired apparel and then transferring the clothing onto the target person. Conversely, in this paper, we consider editing fashion images with text descriptions. Such an approach has several advantages over example-based virtual try-on techniques: (i) it does not require an image of the target fashion item, and (ii) it allows the expression of a wide variety of visual concepts through the use of natural language. Existing image-editing methods that work with language inputs are heavily constrained by their requirement for training sets with rich attribute annotations, or they are only able to handle simple text descriptions. We address these constraints by proposing a novel text-conditioned editing model called FICE (Fashion Image CLIP Editing) that is capable of handling a wide variety of diverse text descriptions to guide the editing procedure. Specifically, with FICE, we extend the common GAN-inversion process by including semantic, pose-related, and image-level constraints when generating images. We leverage the capabilities of the CLIP model to enforce the text-provided semantics, due to its impressive image–text association capabilities. We furthermore propose a latent-code regularization technique that provides the means to better control the fidelity of the synthesized images. We validate FICE through rigorous experiments on a combination of VITON images and Fashion-Gen text descriptions and in comparison with several state-of-the-art, text-conditioned, image-editing approaches. Experimental results demonstrate that FICE generates very realistic fashion images and leads to better editing than existing, competing approaches. The source code is publicly available from:
https://github.com/MartinPernus/FICE},
keywords = {computer vision for fashion, GAN inversion, generative adversarial networks, generative AI, image editing, text conditioning},
pubstate = {published},
tppubtype = {article}
}
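As a rough illustration of the guided GAN-inversion idea described in the abstract, the sketch below optimizes a latent code under an image-level preservation term, a CLIP-based semantic term, and a latent-code regularizer. It is a minimal sketch, not the released FICE code: the generator, CLIP image encoder, text embedding, preservation mask, and loss weights are hypothetical stand-ins, and the pose-related constraint from the paper is omitted.

```python
# Minimal sketch of CLIP-guided GAN-inversion editing (hypothetical stand-ins,
# not the authors' implementation; the pose constraint is omitted).
import torch
import torch.nn.functional as F

def edit_latent(w_init, generator, clip_image_enc, text_emb,
                source_img, keep_mask, steps=200, lr=0.05,
                lam_rec=1.0, lam_clip=1.0, lam_reg=0.1):
    """Optimize a latent code so the synthesized image matches the text prompt
    (via CLIP) while preserving image content under `keep_mask`."""
    w0 = w_init.detach()
    w = w0.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    text_emb = F.normalize(text_emb, dim=-1)
    for _ in range(steps):
        img = generator(w)                                   # synthesized image
        # image-level constraint: keep non-edited regions close to the input
        rec = F.l1_loss(img * keep_mask, source_img * keep_mask)
        # semantic constraint: CLIP cosine similarity to the text embedding
        img_emb = F.normalize(clip_image_enc(img), dim=-1)
        clip_loss = 1.0 - (img_emb * text_emb).sum(dim=-1).mean()
        # latent-code regularization toward the initial inversion
        reg = (w - w0).pow(2).mean()
        loss = lam_rec * rec + lam_clip * clip_loss + lam_reg * reg
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```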
2023
Pernuš, Martin; Štruc, Vitomir; Dobrišek, Simon MaskFaceGAN: High Resolution Face Editing With Masked GAN Latent Code Optimization Journal Article In: IEEE Transactions on Image Processing, 2023, ISSN: 1941-0042. @article{MaskFaceGAN,
title = {MaskFaceGAN: High Resolution Face Editing With Masked GAN Latent Code Optimization},
author = {Martin Pernuš and Vitomir Štruc and Simon Dobrišek},
url = {https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10299582
https://lmi.fe.uni-lj.si/wp-content/uploads/2023/02/MaskFaceGAN_compressed.pdf
https://arxiv.org/pdf/2103.11135.pdf},
doi = {10.1109/TIP.2023.3326675},
issn = {1941-0042},
year = {2023},
date = {2023-10-27},
urldate = {2023-01-02},
journal = {IEEE Transactions on Image Processing},
abstract = {Face editing represents a popular research topic within the computer vision and image processing communities. While significant progress has been made recently in this area, existing solutions: (i) are still largely focused on low-resolution images, (ii) often generate editing results with visual artefacts, or (iii) lack fine-grained control over the editing procedure and alter multiple (entangled) attributes simultaneously when trying to generate the desired facial semantics. In this paper, we aim to address these issues through a novel editing approach, called MaskFaceGAN, that focuses on local attribute editing. The proposed approach is based on an optimization procedure that directly optimizes the latent code of a pre-trained (state-of-the-art) Generative Adversarial Network (i.e., StyleGAN2) with respect to several constraints that ensure: (i) preservation of relevant image content, (ii) generation of the targeted facial attributes, and (iii) spatially-selective treatment of local image regions. The constraints are enforced with the help of a (differentiable) attribute classifier and face parser that provide the necessary reference information for the optimization procedure. MaskFaceGAN is evaluated in extensive experiments on the FRGC, SiblingsDB-HQf, and XM2VTS datasets and in comparison with several state-of-the-art techniques from the literature. Our experimental results show that the proposed approach is able to edit face images with respect to several local facial attributes with unprecedented image quality and at high resolutions (1024×1024), while exhibiting considerably fewer problems with attribute entanglement than competing solutions. The source code is publicly available from: https://github.com/MartinPernus/MaskFaceGAN.},
keywords = {CNN, computer vision, deep learning, face editing, face image processing, GAN, GAN inversion, generative models, StyleGAN},
pubstate = {published},
tppubtype = {article}
}
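The abstract above describes latent-code optimization under content-preservation, attribute, and spatial-selectivity constraints. The following sketch, assuming hypothetical stand-ins for the StyleGAN2 generator, attribute classifier, and parser-derived region mask (with illustrative loss weights), shows the general shape of such an objective; it is not the released MaskFaceGAN code.

```python
# Minimal sketch of masked latent-code optimization (hypothetical stand-ins,
# not the released MaskFaceGAN implementation).
import torch
import torch.nn.functional as F

def masked_attribute_edit(w_init, generator, attr_classifier, region_mask,
                          source_img, target_attr_idx, steps=300, lr=0.02,
                          lam_attr=1.0, lam_keep=10.0):
    """Drive a target attribute via a classifier while restricting visible
    changes to the parser-provided `region_mask`."""
    w = w_init.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        img = generator(w)
        # attribute constraint: push the target-attribute logit towards "present"
        logit = attr_classifier(img)[:, target_attr_idx]
        attr_loss = F.binary_cross_entropy_with_logits(logit, torch.ones_like(logit))
        # content preservation outside the edited region (spatially selective)
        keep_loss = F.l1_loss(img * (1 - region_mask), source_img * (1 - region_mask))
        loss = lam_attr * attr_loss + lam_keep * keep_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```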
Plesh, Richard; Peer, Peter; Štruc, Vitomir GlassesGAN: Eyewear Personalization using Synthetic Appearance Discovery and Targeted Subspace Modeling Proceedings Article In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. @inproceedings{PleshCVPR2023,
title = {GlassesGAN: Eyewear Personalization using Synthetic Appearance Discovery and Targeted Subspace Modeling},
author = {Richard Plesh and Peter Peer and Vitomir Štruc},
url = {https://arxiv.org/pdf/2210.14145.pdf
https://openaccess.thecvf.com/content/CVPR2023/html/Plesh_GlassesGAN_Eyewear_Personalization_Using_Synthetic_Appearance_Discovery_and_Targeted_Subspace_CVPR_2023_paper.html},
year = {2023},
date = {2023-06-18},
urldate = {2023-06-18},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
abstract = {We present GlassesGAN, a novel image editing framework for the custom design of glasses that sets a new standard in terms of image quality, edit realism, and continuous multi-style edit capability. To facilitate the editing process with GlassesGAN, we propose a Targeted Subspace Modelling (TSM) procedure that, based on a novel mechanism for (synthetic) appearance discovery in the latent space of a pre-trained GAN generator, constructs an eyeglasses-specific (latent) subspace that the editing framework can utilize. Additionally, we introduce an appearance-constrained subspace initialization (SI) technique that centers the latent representation of the given input image in the well-defined part of the constructed subspace to improve the reliability of the learned edits. We test GlassesGAN on two (diverse) high-resolution datasets (CelebA-HQ and SiblingsDB-HQf) and compare it to three state-of-the-art competitors, i.e., InterfaceGAN, GANSpace, and MaskGAN. The reported results show that GlassesGAN convincingly outperforms all competing techniques, while offering additional functionality (e.g., fine-grained multi-style editing) not available with any of the competitors. The source code will be made freely available.},
keywords = {eyewear, eyewear personalization, face editing, GAN inversion, latent space editing, StyleGAN2, synthetic appearance discovery, targeted subspace modeling, virtual try-on},
pubstate = {published},
tppubtype = {inproceedings}
}
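For a rough picture of targeted subspace modeling, the sketch below fits a low-dimensional PCA basis to latent codes of discovered eyewear appearances and then moves an input latent code along the resulting directions. All tensor shapes and names are hypothetical assumptions, not the paper's implementation.

```python
# Minimal sketch of a PCA-based "targeted subspace" for latent-space editing
# (hypothetical shapes; not the GlassesGAN implementation).
import torch

def build_subspace(latents, k=5):
    """latents: (N, D) latent codes of synthetic eyewear appearances.
    Returns the mean code and a (D, k) basis spanning the targeted subspace."""
    mean = latents.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(latents - mean, q=k)
    return mean, v

def apply_edit(w, basis, coeffs):
    """w: (1, D) latent code of the input image; basis: (D, k);
    coeffs: (k,) edit magnitudes along the subspace directions."""
    return w + (basis @ coeffs).unsqueeze(0)
```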
Pernuš, Martin; Bhatnagar, Mansi; Samad, Badr; Singh, Divyanshu; Peer, Peter; Štruc, Vitomir; Dobrišek, Simon ChildNet: Structural Kinship Face Synthesis Model With Appearance Control Mechanisms Journal Article In: IEEE Access, pp. 1-22, 2023, ISSN: 2169-3536. @article{AccessMartin2023,
title = {ChildNet: Structural Kinship Face Synthesis Model With Appearance Control Mechanisms},
author = {Martin Pernuš and Mansi Bhatnagar and Badr Samad and Divyanshu Singh and Peter Peer and Vitomir Štruc and Simon Dobrišek},
url = {https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10126110},
doi = {10.1109/ACCESS.2023.3276877},
issn = {2169-3536},
year = {2023},
date = {2023-05-17},
journal = {IEEE Access},
pages = {1-22},
abstract = {Kinship face synthesis is an increasingly popular topic within the computer vision community, particularly the task of predicting the child's appearance using parental images. Previous work has been limited by model capacity and by inadequate training data, which consists of low-resolution and tightly cropped images, leading to lower synthesis quality. In this paper, we propose ChildNet, a method for kinship face synthesis that leverages the facial image generation capabilities of a state-of-the-art Generative Adversarial Network (GAN), and resolves the aforementioned problems. ChildNet is designed within the GAN latent space and is able to predict a child appearance that bears a high resemblance to real parents’ children. To ensure fine-grained control, we propose an age and gender manipulation module that allows precise manipulation of the child synthesis result. ChildNet is capable of generating multiple child images per parent pair input, while providing a way to control the image generation variability. Additionally, we introduce a mechanism to control the dominant parent image. Finally, to facilitate the task of kinship face synthesis, we introduce a new kinship dataset, called Next of Kin. This dataset contains 3690 high-resolution face images with a diverse range of ethnicities and ages. We evaluate ChildNet in comprehensive experiments against three competing kinship face synthesis models, using two kinship datasets. The experiments demonstrate the superior performance of ChildNet in terms of identity similarity, while exhibiting high perceptual image quality. The source code for the model is publicly available at: https://github.com/MartinPernus/ChildNet.},
keywords = {artificial intelligence, CNN, deep learning, face generation, face synthesis, GAN, GAN inversion, kinship, kinship synthesis, StyleGAN2},
pubstate = {published},
tppubtype = {article}
}
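As a hedged illustration of latent-space kinship synthesis with appearance control, the sketch below maps two parent latent codes and a noise vector to a child code and then shifts it along age/gender directions. The network layout, the dominance blending, and the direction vectors are hypothetical assumptions, not the ChildNet architecture.

```python
# Minimal sketch of latent-space kinship synthesis with appearance control
# (hypothetical architecture; not ChildNet itself).
import torch
import torch.nn as nn

class ChildLatentPredictor(nn.Module):
    def __init__(self, latent_dim=512, noise_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim + noise_dim, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, latent_dim),
        )

    def forward(self, w_mother, w_father, noise, dominance=0.5):
        # `dominance` blends the parents (dominant-parent control); `noise`
        # provides per-sample variability for multiple child predictions.
        blend = dominance * w_mother + (1.0 - dominance) * w_father
        x = torch.cat([blend, w_mother - w_father, noise], dim=-1)
        return self.net(x)

def control_appearance(w_child, age_dir, gender_dir, age=0.0, gender=0.0):
    """Shift the predicted child code along age/gender latent directions."""
    return w_child + age * age_dir + gender * gender_dir
```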
Meden, Blaž; Gonzalez-Hernandez, Manfred; Peer, Peter; Štruc, Vitomir Face deidentification with controllable privacy protection Journal Article In: Image and Vision Computing, vol. 134, no. 104678, pp. 1-19, 2023. @article{MedenDeID2023,
title = {Face deidentification with controllable privacy protection},
author = {Blaž Meden and Manfred Gonzalez-Hernandez and Peter Peer and Vitomir Štruc},
url = {https://www.sciencedirect.com/science/article/pii/S0262885623000525},
doi = {10.1016/j.imavis.2023.104678},
year = {2023},
date = {2023-04-01},
journal = {Image and Vision Computing},
volume = {134},
number = {104678},
pages = {1-19},
abstract = {Privacy protection has become a crucial concern in today’s digital age. Particularly sensitive here are facial images, which typically not only reveal a person’s identity, but also other sensitive personal information. To address this problem, various face deidentification techniques have been presented in the literature. These techniques try to remove or obscure personal information from facial images while still preserving their usefulness for further analysis. While a considerable amount of work has been proposed on face deidentification, most state-of-the-art solutions still suffer from various drawbacks: (a) they deidentify only a narrow facial area, leaving potentially important contextual information unprotected, (b) they modify facial images to such a degree that image naturalness and facial diversity suffer in the deidentified images, (c) they offer no flexibility in the level of privacy protection ensured, leading to suboptimal deployment in various applications, and (d) they often offer an unsatisfactory tradeoff between the ability to obscure identity information, the quality and naturalness of the deidentified images, and sufficient utility preservation. In this paper, we address these shortcomings with a novel controllable face deidentification technique that balances image quality, identity protection, and data utility for further analysis. The proposed approach utilizes a powerful generative model (StyleGAN2), multiple auxiliary classification models, and carefully designed constraints to guide the deidentification process. The approach is validated across four diverse datasets (CelebA-HQ, RaFD, XM2VTS, AffectNet) and in comparison to seven state-of-the-art competitors. The results of the experiments demonstrate that the proposed solution leads to: (a) a considerable level of identity protection, (b) valuable preservation of data utility, (c) sufficient diversity among the deidentified faces, and (d) encouraging overall performance.},
keywords = {CNN, deep learning, deidentification, face recognition, GAN, GAN inversion, privacy, privacy protection, StyleGAN2},
pubstate = {published},
tppubtype = {article}
}
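To make the constrained-optimization idea in the abstract concrete, the sketch below pushes an identity embedding away from the source identity while an auxiliary classifier (e.g., for expression) preserves data utility, with a `protection` weight controlling the privacy level. The generator, identity encoder, utility classifier, and weights are hypothetical stand-ins, not the authors' method.

```python
# Minimal sketch of controllable latent-space deidentification
# (hypothetical stand-ins; not the authors' implementation).
import torch
import torch.nn.functional as F

def deidentify(w_init, generator, id_encoder, util_classifier, source_img,
               protection=1.0, steps=200, lr=0.02, lam_util=1.0):
    """Larger `protection` pushes the synthesized identity further from the
    source identity; the utility term keeps auxiliary predictions unchanged."""
    src_id = F.normalize(id_encoder(source_img), dim=-1).detach()
    src_util = util_classifier(source_img).detach()
    w = w_init.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        img = generator(w)
        # identity suppression: minimize cosine similarity to the source identity
        id_sim = (F.normalize(id_encoder(img), dim=-1) * src_id).sum(dim=-1).mean()
        # utility preservation: keep auxiliary (e.g., expression) outputs close
        util_loss = F.mse_loss(util_classifier(img), src_util)
        loss = protection * id_sim + lam_util * util_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```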