Abstract:
Quantifying the impression made by customer service behavior is difficult because the correspondence between facial expressions and impressions is unclear. To establish this correspondence, multiple store clerks would have to perform customer service actions with precisely specified facial expressions and repeat the same behavior consistently for multiple evaluators, which is difficult for humans to do. This problem can be solved by generating facial expression videos with arbitrary expressions and using them for impression evaluation, for example in a VR environment. To this end, previous research proposed a method for generating diverse facial expression videos using image generative AI. The method is based on the First Order Motion Model (FOMM), which animates a single still image according to a reference video. Facial expressions are controlled arbitrarily by incorporating mechanisms into FOMM that allow the smile and gaze to be changed. For example, to control the degree of a smile, we prepare two images of a person's face, one neutral and one smiling. By extracting feature vectors from these two images and adding them to the feature vectors of the reference video, we can generate a video with the desired smile level from an arbitrary still image while transferring the facial movements. However, this method does not take speech information into account during generation, so the mouth sometimes looks unnatural when the person speaks at a high smile level; for instance, there are artifacts such as the mouth not closing fully during speech. This study proposes a facial expression video generation method that takes the speech audio into account to produce more natural videos. We incorporate IP_LAP, which can generate audio-driven talking heads, into the previous method and generate feature vectors that reflect the speech information. IP_LAP takes audio and a reference image (or video) as input and produces a video in which the mouth movements are driven by the audio. It is a two-stage framework: a Transformer-based stage that generates facial landmarks from audio, and a rendering stage that synthesizes video from those landmarks. We use the landmarks produced by the first stage. From two reference videos (smiling and neutral), we obtain two sets of audio-driven landmarks and compute their difference. The difference is converted into a FOMM feature vector, yielding an audio-driven facial expression video with an arbitrarily controlled smile level. The results confirm that our method generates more natural speech with a fully closed mouth, even in frames where the previous method does not close the mouth completely.
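To make the pipeline described above concrete, the following is a minimal Python sketch of the underlying feature and landmark arithmetic. It is an illustration under assumptions, not the authors' implementation: every function here (extract_fomm_keypoints, generate_audio_driven_landmarks, landmarks_to_fomm_offset) is a hypothetical placeholder standing in for the FOMM feature extractor, the first (Transformer) stage of IP_LAP, and the landmark-to-feature conversion, and the linear scaling by a smile-level coefficient is an assumption rather than something stated in the abstract.

    """Sketch of smile-level control with an audio-aware offset.
    All helper functions are hypothetical placeholders, not the real
    FOMM / IP_LAP APIs."""
    import numpy as np

    def extract_fomm_keypoints(image: np.ndarray) -> np.ndarray:
        """Placeholder for the FOMM keypoint/feature extractor on one image."""
        return np.zeros((10, 2))  # e.g. 10 keypoints, (x, y)

    def generate_audio_driven_landmarks(audio: np.ndarray,
                                        reference_video: np.ndarray) -> np.ndarray:
        """Placeholder for IP_LAP's first (Transformer) stage: per-frame landmarks."""
        num_frames = reference_video.shape[0]
        return np.zeros((num_frames, 68, 2))  # 68 facial landmarks per frame

    def landmarks_to_fomm_offset(landmark_diff: np.ndarray) -> np.ndarray:
        """Placeholder for converting a landmark difference into FOMM features."""
        return np.zeros((landmark_diff.shape[0], 10, 2))

    # Previous method: smile control via a feature-vector offset from two images.
    neutral_img = np.zeros((256, 256, 3))
    smile_img = np.zeros((256, 256, 3))
    smile_level = 0.7  # assumed scaling: 0 = neutral, 1 = full smile
    kp_neutral = extract_fomm_keypoints(neutral_img)
    kp_smile = extract_fomm_keypoints(smile_img)
    smile_offset = smile_level * (kp_smile - kp_neutral)  # added to driving features

    # Proposed method: audio-aware offset from two audio-driven landmark sequences.
    audio = np.zeros(16000)  # dummy speech waveform
    ref_neutral_video = np.zeros((100, 256, 256, 3))
    ref_smile_video = np.zeros((100, 256, 256, 3))
    lm_neutral = generate_audio_driven_landmarks(audio, ref_neutral_video)
    lm_smile = generate_audio_driven_landmarks(audio, ref_smile_video)
    audio_aware_offset = landmarks_to_fomm_offset(smile_level * (lm_smile - lm_neutral))
    # The per-frame offsets would then be added to the FOMM driving features before
    # rendering, so the mouth shape follows the speech audio at the chosen smile level.

The key design point the sketch tries to capture is that the proposed method replaces a single static offset (from two still images) with per-frame offsets derived from audio-driven landmarks, which is why the mouth can close correctly during speech.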
Type: 17th Asia-Pacific Workshop on Mixed and Augmented Reality (APMAR 2025), Pitch Your Work Presentation Track
Publication date: To be published in Sep 2025