Think2Sing: Orchestrating Structured Motion Subtitles for Singing-Driven 3D Head Animation

Zikai Huang1, Yihan Zhou1, Xuemiao Xu1,3,4*, Cheng Xu2*, Xiaofen Xing1, Jing Qin2, Shengfeng He5
1South China University of Technology, 2The Hong Kong Polytechnic University, 3Guangdong Engineering Center for Large Model and GenAI Technology, 4Guangdong Provincial Key Lab of Computational Intelligence and Cyberspace Information, 5Singapore Management University
Best viewed with audio 🎧

Abstract

Singing-driven 3D head animation is a compelling yet underexplored task with broad applications in virtual avatars, entertainment, and education. Compared to speech, singing conveys richer emotional nuance, dynamic prosody, and lyric-conditioned semantics, necessitating the synthesis of fine-grained and temporally coherent facial motion. Existing speech-driven approaches, which typically map audio directly to motion through implicit phoneme-to-viseme correspondences, often yield over-smoothed, emotionally flat, and semantically inconsistent results. These limitations render them inadequate for the unique demands of singing-driven animation.

To address this challenge, we propose Think2Sing, a unified diffusion-based framework that integrates pretrained large language models to generate semantically consistent and temporally coherent 3D head animations conditioned on both lyrics and acoustics. Central to our approach is the introduction of an auxiliary semantic representation called motion subtitles, derived via a novel Singing Chain-of-Thought reasoning process augmented with acoustic-guided retrieval. These subtitles contain precise timestamps and region-specific motion descriptions, serving as interpretable and expressive motion priors that guide the animation process. Rather than learning a direct audio-to-motion mapping, we reformulate the task as a motion intensity prediction problem, which quantifies the dynamic behavior of key facial regions. This reformulation decomposes the complex mapping into tractable subtasks, facilitates region-wise control, and improves the modeling of subtle and expressive motion patterns. To support this task, we construct the first multimodal singing dataset comprising synchronized video clips, acoustic descriptors, and structured motion subtitles. This dataset enables expressive and diverse motion learning under rich acoustic and semantic conditioning. Extensive experiments demonstrate that Think2Sing significantly outperforms state-of-the-art methods in realism, expressiveness, and emotional fidelity. Furthermore, our framework supports flexible subtitle-conditioned editing, enabling precise and user-controllable animation synthesis.
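
As a concrete illustration of this reformulation, the minimal Python sketch below shows how timestamped, region-specific motion subtitles could be rasterized into per-frame, per-region intensity targets. The MotionSubtitle fields, the FacialRegion partition, and the intensity_targets helper are illustrative assumptions for exposition, not the exact representation used by Think2Sing.

from dataclasses import dataclass
from enum import Enum


class FacialRegion(str, Enum):
    # Key facial regions; the exact partition used by Think2Sing may differ.
    EYEBROWS = "eyebrows"
    EYES = "eyes"
    MOUTH = "mouth"
    NECK = "neck_pose"


@dataclass
class MotionSubtitle:
    # One timestamped, region-specific motion description (hypothetical schema).
    start: float          # start time in seconds
    end: float            # end time in seconds
    region: FacialRegion  # facial region the description refers to
    description: str      # e.g. "brows rise sharply on the sustained high note"


def intensity_targets(subtitles, num_frames, fps=25.0):
    # Rasterize subtitles into per-frame, per-region activity values.
    # A real model would predict continuous intensities; this sketch only
    # marks the frames covered by a subtitle for each region.
    targets = {region: [0.0] * num_frames for region in FacialRegion}
    for sub in subtitles:
        first = max(0, int(sub.start * fps))
        last = min(num_frames, int(sub.end * fps))
        for frame in range(first, last):
            targets[sub.region][frame] = 1.0
    return targets

In the full framework, such region-wise targets would condition the diffusion model together with the acoustic features and lyrics, rather than being consumed directly.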

Method

SingMoSub Dataset

We present samples from the proposed SingMoSub dataset, where the acoustic description is displayed at the top and the motion subtitles are provided at the bottom for the eyebrows, eyes, mouth, and neck pose, respectively.
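
For a concrete picture of how such a sample might be organized, the Python sketch below pairs an acoustic description with per-region subtitle entries. The keys, timestamps, and descriptive text are hypothetical placeholders chosen for illustration; they do not reproduce the released schema or any entry from the data.

# One plausible SingMoSub-style record (all values are illustrative placeholders).
sample = {
    "clip_id": "example_clip_000",
    "acoustic_description": "slow tempo, sustained high notes, sorrowful vocal timbre",
    "motion_subtitles": {
        "eyebrows":  [{"start": 0.0, "end": 2.4, "text": "brows knit and slowly rise"}],
        "eyes":      [{"start": 0.0, "end": 2.4, "text": "eyes narrow, gaze drifts upward"}],
        "mouth":     [{"start": 0.8, "end": 2.0, "text": "mouth opens wide on the held vowel"}],
        "neck_pose": [{"start": 1.2, "end": 2.4, "text": "head tilts back slightly"}],
    },
}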

Comparison

Visualizations with Motion Subtitles

Application

Generation for Long Sequences

Rigging with GaussianAvatars

We present rigging results using GaussianAvatars, demonstrating the practical applicability of our approach: the expressive, temporally coherent 3D head motions it generates can enhance virtual avatar interactions in gaming, entertainment, and virtual reality applications.

BibTeX

@article{huang2025think2sing,
  title   = {Think2Sing: Orchestrating Structured Motion Subtitles for Singing-Driven 3D Head Animation},
  author  = {Zikai Huang and Yihan Zhou and Xuemiao Xu and Cheng Xu and Xiaofen Xing and Jing Qin and Shengfeng He},
  year    = {2025},
  journal = {arXiv preprint arXiv:2509.02278}
}