arXiv:2604.11417v2 Announce Type: replace-cross
Abstract: Co-speech gestures increase engagement and improve speech understanding. Most data-driven robot gesture systems generate rhythmic, beat-like motion, yet few incorporate semantic emphasis. To address this, we propose a lightweight transformer that derives iconic gesture placement and intensity from text and emotion alone, requiring no audio input at inference time. The model outperforms GPT-4o on both semantic gesture-placement classification and intensity regression on the BEAT2 dataset, while remaining computationally compact and suitable for real-time deployment on embodied agents.
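The abstract does not detail the architecture, so the following is only a minimal sketch of the general idea it describes: a small transformer encoder over text tokens, conditioned on an utterance-level emotion label, with two per-token heads for gesture-placement classification and intensity regression. All class names, dimensions, and the emotion-conditioning scheme below are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class GesturePlacementSketch(nn.Module):
    """Hypothetical lightweight transformer with two per-token heads:
    gesture-placement classification and intensity regression.
    Sizes and conditioning are illustrative, not the paper's design."""

    def __init__(self, vocab_size=30522, n_emotions=8, d_model=256,
                 n_heads=4, n_layers=4, n_gesture_classes=2, max_len=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.emotion_emb = nn.Embedding(n_emotions, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.placement_head = nn.Linear(d_model, n_gesture_classes)  # per-token class logits
        self.intensity_head = nn.Linear(d_model, 1)                  # per-token intensity

    def forward(self, token_ids, emotion_id):
        # token_ids: (batch, seq), emotion_id: (batch,)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.token_emb(token_ids) + self.pos_emb(positions)
        # Condition every token on the utterance-level emotion label.
        x = x + self.emotion_emb(emotion_id).unsqueeze(1)
        h = self.encoder(x)
        return self.placement_head(h), self.intensity_head(h).squeeze(-1)

# Usage: per-token gesture-placement logits and intensity predictions.
model = GesturePlacementSketch()
tokens = torch.randint(0, 30522, (2, 16))   # dummy token ids
emotion = torch.tensor([3, 5])              # dummy emotion labels
placement_logits, intensity = model(tokens, emotion)
print(placement_logits.shape, intensity.shape)  # (2, 16, 2) (2, 16)
```

In a setup like this, the classification head would be trained with cross-entropy against annotated gesture placements and the regression head with an MSE or L1 loss on intensity labels; keeping the encoder to a few layers is what makes real-time inference on embodied agents plausible.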
Disclosure in the era of generative artificial intelligence
Generative artificial intelligence (AI) has rapidly become embedded in academic writing, assisting with tasks ranging from language editing to drafting text and producing evidence. Despite



