EAD-Net: Emotion-Aware Talking Head Generation with Spatial Refinement and Temporal Coherence
arXiv:2604.23325v1 Announce Type: cross
Abstract: Emotional talking head video generation aims to produce expressive portrait videos with accurate lip synchronization and emotional facial expressions. Current methods rely on simple emotion labels, which carry insufficient semantic information. Introducing high-level semantics enhances expressiveness but easily degrades lip synchronization. Furthermore, mainstream generation methods struggle to balance computational efficiency with global motion awareness in long videos and suffer from poor temporal coherence. We therefore propose an Emotion-Aware Diffusion model-based Network, called EAD-Net. We introduce SyncNet supervision and Temporal Representation Alignment (TREPA) to mitigate the lip-sync degradation caused by multi-modal fusion. To model complex spatio-temporal dependencies in long video sequences, we propose a Spatio-Temporal Directional Attention (STDA) mechanism that captures global motion patterns through strip attention. Additionally, we design a Temporal Frame graph Reasoning Module (TFRM) that explicitly models temporal coherence between video frames through graph structure learning. To enhance emotional semantic control, a large language model extracts textual descriptions from real videos, which serve as high-level semantic guidance. Experiments on the HDTF and MEAD datasets demonstrate that our method outperforms existing approaches in lip-sync accuracy, temporal consistency, and emotional accuracy.
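The abstract names the STDA mechanism without implementation details. As a rough illustration only, the sketch below shows how strip attention of the kind STDA describes might factor full spatio-temporal attention into cheaper per-axis (temporal, vertical, horizontal) attention over a video feature volume. The module names, tensor layout, and axial decomposition are assumptions for illustration, not the paper's actual architecture.

```python
# Minimal sketch of directional strip attention over a (B, T, H, W, C)
# feature volume, assuming STDA resembles axial attention along each axis.
# All names and shapes are hypothetical, not taken from the paper.
import torch
import torch.nn as nn


class StripAttention(nn.Module):
    """Self-attention along one axis of a (B, T, H, W, C) tensor."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, axis: int) -> torch.Tensor:
        # Move the chosen axis (1=T, 2=H, 3=W) next to the channel dim,
        # fold the remaining axes into the batch, and attend along it.
        perm = [0] + [d for d in (1, 2, 3) if d != axis] + [axis, 4]
        xp = x.permute(*perm)                       # (B, a, b, L, C)
        shape = xp.shape
        seq = xp.reshape(-1, shape[-2], shape[-1])  # (B*a*b, L, C)
        h = self.norm(seq)
        out, _ = self.attn(h, h, h)
        out = (seq + out).reshape(shape)            # residual connection
        inv = torch.argsort(torch.tensor(perm))     # undo the permutation
        return out.permute(*inv.tolist())


class STDABlock(nn.Module):
    """Chains strip attention along the temporal, height, and width axes
    to approximate global spatio-temporal attention at lower cost."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.t_attn = StripAttention(dim, heads)
        self.h_attn = StripAttention(dim, heads)
        self.w_attn = StripAttention(dim, heads)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, H, W, C) video feature volume
        x = self.t_attn(x, axis=1)   # motion along time
        x = self.h_attn(x, axis=2)   # vertical strips
        x = self.w_attn(x, axis=3)   # horizontal strips
        return x


if __name__ == "__main__":
    feats = torch.randn(2, 8, 16, 16, 64)  # (B, T, H, W, C)
    block = STDABlock(dim=64)
    print(block(feats).shape)  # torch.Size([2, 8, 16, 16, 64])
```

Under this factorization each token attends to O(T + H + W) positions rather than O(T * H * W), which is why strip-style attention is a common way to retain global motion awareness in long videos while keeping computation tractable.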



