Diffusion Model-Based Generative Pipeline for Children Song Video

Published in IEEE International Conference on Consumer Electronics (ICCE), 2025

Children's songs have long been essential in early childhood education, supporting cognitive development, language acquisition, and emotional expression. With the rise of digital media, traditional children's songs have evolved into multimedia experiences, including music videos. However, creating these videos is a resource-intensive process that requires a blend of artistic and technical expertise.

Figure 1: Diffusion model-based generative pipeline for children's song video generation

Meanwhile, recent advances in generative models, especially diffusion models, have shown impressive text-to-image capabilities, though they still face limitations in generating temporally coherent video content. This paper explores an approach to generating music videos for children's songs that converts song lyrics into visually appealing and contextually relevant video content. Our approach integrates natural language processing to interpret the lyrics and computer vision techniques to generate the corresponding animations and visuals.
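To make the lyrics-to-visuals idea concrete, the sketch below shows one frame being generated from a single lyric line with the Hugging Face diffusers library. This is only an illustrative stand-in, not the paper's exact pipeline: the checkpoint (`stabilityai/stable-diffusion-2-1`) and the fixed prompt template are assumptions, whereas the paper interprets lyrics with NLP before prompting.

```python
# Minimal sketch: one lyric line -> one generated frame (illustrative, not the authors' pipeline).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

lyric_line = "Twinkle, twinkle, little star, how I wonder what you are"

# Turn the lyric line into a scene description. A fixed template is used here;
# the paper instead interprets the lyrics with NLP to build the prompt.
prompt = (
    f"children's storybook illustration of: {lyric_line}, "
    "colorful, friendly, soft lighting"
)

image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("frame_000.png")
```

Repeating this step per lyric line (or per scene) yields the sequence of visuals that the pipeline assembles into the final video.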

Our experiments on 20 prompts demonstrate that the Cascade SD model outperforms the other four models across three evaluation measures. A qualitative analysis on 10 prompts further demonstrates the superiority of the Cascade SD model and highlights the effectiveness of negative prompting and secondary prompting techniques.
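The following is a hedged sketch of how negative prompting and secondary prompting might look with the diffusers StableCascadeCombinedPipeline. The prompts themselves are illustrative assumptions, and because the paper does not spell out its secondary prompting scheme here, the sketch simply treats it as appending supplementary style cues to the base prompt.

```python
# Hedged sketch of negative + secondary prompting with Stable Cascade (illustrative only).
import torch
from diffusers import StableCascadeCombinedPipeline

pipe = StableCascadeCombinedPipeline.from_pretrained(
    "stabilityai/stable-cascade", torch_dtype=torch.bfloat16
).to("cuda")

base_prompt = "a smiling sun rising over green hills, children's cartoon style"

# Negative prompting: steer the model away from artifacts unsuited to children's content.
negative_prompt = "blurry, distorted faces, text, watermark, dark, frightening"

# Secondary prompting (assumed form): reinforce the desired look with extra style cues.
secondary_prompt = base_prompt + ", bright pastel colors, soft outlines, storybook illustration"

image = pipe(
    prompt=secondary_prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=10,        # decoder steps
    prior_num_inference_steps=20,  # prior (stage C) steps
    width=1024,
    height=1024,
).images[0]
image.save("scene.png")
```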

Download paper here

Recommended citation: Lee, Sanghyuck, Timur Khairulov, and Jaesung Lee. "Diffusion Model-Based Generative Pipeline for Children Song Video." 2025 IEEE International Conference on Consumer Electronics (ICCE). IEEE, 2025.