Think2Sing: Orchestrating Structured Motion Subtitles for Singing-Driven 3D Head Animation

Zikai Huang1, Yihan Zhou1, Xuemiao Xu1,3,4,5*, Cheng Xu2*, Xiaofen Xing1, Jing Qin2, Shengfeng He6
1South China University of Technology, 2The Hong Kong Polytechnic University, 3Guangdong Engineering Center for Large Model and GenAI Technology, 4State Key Laboratory of Subtropical Building and Urban Science, 5Ministry of Education Key Laboratory of Big Data and Intelligent Robot, 6Singapore Management University
Best viewed with audio 🎧

Abstract

Singing-driven 3D head animation is a compelling yet underexplored task with broad applications in virtual avatars, entertainment, and education. Existing speech-driven approaches, which typically map audio directly to motion through implicit phoneme-to-viseme correspondences, often yield over-smoothed, emotionally flat, and semantically inconsistent results. These limitations render them inadequate for the unique demands of singing-driven animation. To address this challenge, we propose Think2Sing, a unified diffusion-based framework that integrates pretrained large language models to generate semantically consistent and temporally coherent 3D head animations conditioned on both lyrics and acoustics.

Central to our framework is the introduction of motion subtitles, a structured, time-aligned representation generated via a Singing Chain-of-Thought process with acoustic-guided retrieval. These subtitles provide region-specific expressive cues that serve as interpretable priors for animation synthesis. We further formulate head animation as motion intensity prediction over key facial regions, enabling fine-grained control and more faithful expressive modeling. To support this paradigm, we construct the first multimodal singing dataset with synchronized 3D motion, acoustic features, and aligned motion subtitles, enabling semantically grounded and expressive motion learning. Extensive experiments demonstrate that Think2Sing significantly outperforms state-of-the-art methods in realism, expressiveness, and emotional fidelity. Furthermore, our framework supports flexible subtitle-conditioned editing, enabling precise and user-controllable animation synthesis.
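To make the motion-subtitle idea concrete, the sketch below shows one possible way to represent time-aligned, region-specific subtitle cues and rasterize them into per-frame motion-intensity curves for key facial regions. The schema, field names, and the rasterization helper are illustrative assumptions only, not the actual Think2Sing implementation.

```python
from dataclasses import dataclass
from typing import Dict, List

# Facial regions covered by the motion subtitles (eyebrows, eyes, mouth, neck pose).
REGIONS = ("eyebrows", "eyes", "mouth", "neck")


@dataclass
class MotionSubtitle:
    """One time-aligned, region-specific expressive cue (hypothetical schema)."""
    start: float        # segment start time in seconds
    end: float          # segment end time in seconds
    region: str         # one of REGIONS
    description: str    # natural-language cue, e.g. "brows lift gently"
    intensity: float    # normalized motion intensity in [0, 1]


def subtitles_to_intensity_tracks(
    subtitles: List[MotionSubtitle], fps: int, duration: float
) -> Dict[str, List[float]]:
    """Rasterize subtitles into a per-region, per-frame intensity curve."""
    n_frames = int(round(duration * fps))
    tracks = {region: [0.0] * n_frames for region in REGIONS}
    for sub in subtitles:
        first = max(0, int(sub.start * fps))
        last = min(n_frames, int(sub.end * fps))
        for frame in range(first, last):
            # Overlapping cues for the same region keep the stronger intensity.
            tracks[sub.region][frame] = max(tracks[sub.region][frame], sub.intensity)
    return tracks


if __name__ == "__main__":
    subs = [
        MotionSubtitle(0.0, 1.2, "eyebrows", "brows lift gently on the rising melody", 0.4),
        MotionSubtitle(0.5, 2.0, "mouth", "wide-open vowel with sustained vibrato", 0.8),
    ]
    curves = subtitles_to_intensity_tracks(subs, fps=25, duration=2.0)
    print({region: curve[:5] for region, curve in curves.items()})
```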

Method

SingMoSub Dataset

We present samples from the proposed SingMoSub dataset, where acoustic descriptions are displayed at the top and motion subtitles are provided at the bottom for the eyebrows, eyes, mouth, and neck pose, respectively.
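For reference, a single sample could be organized along the following lines; every field name in this sketch is hypothetical and only illustrates the structure described above, not the released data format.

```python
import json

# Illustrative layout of one SingMoSub sample: a clip-level acoustic description
# plus time-aligned motion subtitles per facial region. Field names are assumptions.
sample = {
    "clip_id": "demo_0001",
    "acoustic_description": "soft female voice, slow tempo, melancholic timbre",
    "lyrics": "placeholder lyric line",
    "motion_subtitles": {
        "eyebrows": [{"start": 0.0, "end": 1.4, "text": "brows knit slightly"}],
        "eyes":     [{"start": 0.0, "end": 1.4, "text": "eyes half closed"}],
        "mouth":    [{"start": 0.0, "end": 1.4, "text": "lips part slowly on the vowel"}],
        "neck":     [{"start": 0.0, "end": 1.4, "text": "head tilts gently to one side"}],
    },
}

print(json.dumps(sample, indent=2))
```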

Comparison

Visualizations with Motion Subtitles

Application

Generation for Long Sequence

Different Languages

Different Music Genres

Rigging with GaussianAvatars

We present rigging results using GaussianAvatars, demonstrating the practical applicability of our approach: the generated expressive, temporally coherent 3D head motions can enhance virtual avatar interactions in gaming, entertainment, and virtual reality applications.

BibTeX

@article{huang2025think2sing,
  title   = {Think2Sing: Orchestrating Structured Motion Subtitles for Singing-Driven 3D Head Animation},
  author  = {Zikai Huang and Yihan Zhou and Xuemiao Xu and Cheng Xu and Xiaofen Xing and Jing Qin and Shengfeng He},
  year    = {2025},
  journal = {arXiv preprint arXiv:2509.02278}
}