Fatemeh Nazarieh


Postgraduate Research Student

Publications

Fatemeh Nazarieh, Josef Kittler, Muhammad Awais Rana, Diptesh Kanojia, Zhenhua Feng (2024) StableTalk: Advancing Audio-to-Talking Face Generation with Stable Diffusion and Vision Transformer, In: Apostolos Antonacopoulos, Subhasis Chaudhuri, Rama Chellappa, Cheng-Lin Liu, Saumik Bhattacharya, Umapada Pal (eds.), Pattern Recognition, pp. 271-286, Springer Nature Switzerland

Audio-to-talking face generation stands at the forefront of advancements in generative AI. It bridges the gap between audio and visual representations by generating synchronized and realistic talking faces. Despite recent progress, the lack of realism in animated faces, asynchronous audio-lip movements, and computational burden remain key barriers to practical applications. To address these challenges, we introduce a novel approach, StableTalk, leveraging the emerging capabilities of Stable Diffusion models and Vision Transformers for talking face generation. We also integrate a re-attention mechanism and an adversarial loss to improve the consistency of facial animations and their synchronization with a given audio input. More importantly, the computational efficiency of our method is notably enhanced by operating in the latent space and dynamically adjusting the focus on different parts of the visual content based on the provided conditions. Our experimental results demonstrate the superiority of StableTalk over existing approaches in image quality, audio-lip synchronization, and computational efficiency.
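
The abstract describes the overall recipe: a diffusion model that denoises face latents with a Transformer backbone conditioned on audio. The sketch below is a minimal, hypothetical illustration of that idea (latent tokens cross-attending to audio features, trained with a noise-prediction loss); all module names, dimensions, and the toy noise schedule are my own assumptions, and the paper's re-attention mechanism and adversarial loss are not reproduced here.

import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioConditionedDenoiser(nn.Module):
    """Transformer denoiser over latent tokens that cross-attends to audio features."""

    def __init__(self, latent_dim=64, audio_dim=128, d_model=256, n_heads=4):
        super().__init__()
        self.latent_proj = nn.Linear(latent_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.time_embed = nn.Embedding(1000, d_model)   # diffusion timestep embedding
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))
        self.out = nn.Linear(d_model, latent_dim)

    def forward(self, noisy_latents, timesteps, audio_feats):
        # noisy_latents: (B, N, latent_dim); timesteps: (B,); audio_feats: (B, T, audio_dim)
        x = self.latent_proj(noisy_latents) + self.time_embed(timesteps).unsqueeze(1)
        a = self.audio_proj(audio_feats)
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]      # attend over latent tokens
        x = x + self.cross_attn(self.norm2(x), a, a, need_weights=False)[0]  # attend to audio
        x = x + self.mlp(self.norm3(x))
        return self.out(x)                                           # predicted noise


# One illustrative training step: noise the clean latents, predict the noise, regress with MSE.
model = AudioConditionedDenoiser()
latents = torch.randn(2, 16, 64)            # stand-in for VAE-encoded face frames
audio = torch.randn(2, 32, 128)             # stand-in for audio encoder features
t = torch.randint(0, 1000, (2,))
noise = torch.randn_like(latents)
alpha_bar = torch.linspace(0.999, 0.01, 1000)[t].view(-1, 1, 1)     # toy noise schedule
noisy = alpha_bar.sqrt() * latents + (1 - alpha_bar).sqrt() * noise
loss = F.mse_loss(model(noisy, t, audio), noise)
loss.backward()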

Fatemeh Nazarieh, Zhenhua Feng, Muhammad Awais, Wenwu Wang, Josef Vaclav Kittler (2024) A Survey of Cross-Modal Visual Content Generation, In: IEEE Transactions on Circuits and Systems for Video Technology, Institute of Electrical and Electronics Engineers (IEEE)

Cross-modal content generation has attracted growing interest in recent years, and a variety of methods have been proposed to generate high-quality, realistic content. Among these approaches, visual content generation has drawn significant attention from academia and industry due to its vast potential in various applications. This survey provides an overview of recent advances in visual content generation conditioned on other modalities, such as text, audio, speech, and music, with a focus on their key contributions to the community. In addition, we summarize the existing publicly available datasets that can be used for training and benchmarking cross-modal visual content generation models. We provide an in-depth exploration of the datasets used for audio-to-visual content generation, filling a gap in the existing literature. Various evaluation metrics are also introduced along with the datasets. Furthermore, we discuss the challenges and limitations encountered in the area, such as modality alignment and semantic coherence. Lastly, we outline possible future directions for synthesizing visual content from other modalities, including the exploration of new modalities and the development of multi-task multi-modal networks. This survey serves as a resource for researchers interested in quickly gaining insights into this burgeoning field.
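
Among the evaluation metrics the survey covers, the Fréchet Inception Distance (FID) is one of the most widely reported measures of generated-image quality. The sketch below computes FID from precomputed feature vectors; in real evaluations those features come from a pretrained Inception network, so the random arrays here are purely illustrative stand-ins.

import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(feats_real, feats_fake):
    """FID between two sets of feature vectors, each of shape (num_samples, dim)."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    cov_mean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(cov_mean):       # numerical noise can produce tiny imaginary parts
        cov_mean = cov_mean.real
    diff = mu_r - mu_f
    # ||mu_r - mu_f||^2 + Tr(Sigma_r + Sigma_f - 2 * sqrt(Sigma_r Sigma_f))
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * cov_mean))


rng = np.random.default_rng(0)
real = rng.normal(size=(500, 64))             # placeholder "real" features
fake = rng.normal(loc=0.3, size=(500, 64))    # placeholder "generated" features
print(frechet_distance(real, fake))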