Microsoft develops VASA-1 that creates realistic talking faces from a photo and audio in real time

22-04-2024 02:55 PM

PETALING JAYA: Microsoft researchers have developed a system called VASA-1 that can generate lifelike and expressive talking faces for virtual characters in real time, driven solely by audio input.

The technology allows for the creation of believable digital avatars with natural facial expressions, head movements and emotional nuances, according to Microsoft Research Asia’s blog post.

The VASA (Visual Affective Skills Avatar) framework uses a single static image and a speech audio clip to produce highly realistic video output of a virtual character’s face speaking the provided audio.

Not only can it accurately synchronise lip movements with the audio, but it can also capture a wide range of facial subtleties and natural head motions that contribute to an authentic and lifelike appearance.

At the core of VASA-1 are models that generate facial dynamics, head movements and expressions within a specialised face latent space developed using real video data.

This enables the system to control and edit various attributes of the generated avatar independently, such as appearance, 3D head pose and facial movements.

VASA-1 also offers control over the generation process, allowing users to adjust factors like eye gaze direction, head distance and emotion offsets as optional input conditions.

The system exhibits strong generalisation capabilities and can handle photo and audio inputs well outside its training distribution, such as artistic photos, singing audio and non-English speech.

From an efficiency standpoint, VASA-1 is capable of generating 512x512 resolution video frames at up to 45 frames per second in offline mode and up to 40 frames per second in online streaming mode with just 170ms of latency, running on a single high-end GPU.
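
To put those figures in perspective, a short back-of-the-envelope calculation converts the quoted frame rates into the time available to produce each frame, and the quoted latency into an equivalent number of frames. This is only an illustrative sketch: the function names are made up for this example and are not part of any Microsoft code, and the only inputs are the numbers quoted above.

```python
# Illustrative arithmetic only. The helper names below are hypothetical
# and do not come from VASA-1 or any Microsoft release.

def frame_budget_ms(fps: float) -> float:
    """Time available to produce one frame at the given rate, in milliseconds."""
    return 1000.0 / fps


def latency_in_frames(latency_ms: float, fps: float) -> float:
    """How many frame intervals fit into a given latency at the given rate."""
    return latency_ms * fps / 1000.0


# Figures quoted in the article.
OFFLINE_FPS = 45
ONLINE_FPS = 40
LATENCY_MS = 170

print(f"Offline budget per frame: {frame_budget_ms(OFFLINE_FPS):.1f} ms")  # 22.2 ms
print(f"Online budget per frame:  {frame_budget_ms(ONLINE_FPS):.1f} ms")   # 25.0 ms
print(f"Latency at 40 fps:        {latency_in_frames(LATENCY_MS, ONLINE_FPS):.1f} frames")  # 6.8 frames
```

In other words, at the quoted streaming rate each frame has to be ready roughly every 25 milliseconds, and the 170ms delay corresponds to just under seven frames of video.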

The researchers believe it brings positive potential into various domains, such as enhancing educational accessibility, providing companionship or therapeutic support and aiding those with communication challenges.

While Microsoft acknowledges the potential for misuse, such as impersonating real individuals, the company is opposed to creating misleading or harmful content and is interested in applying the technology to advance forgery detection efforts.

The tech company also has no plans to release an online demo, API or product utilising VASA-1 until it can ensure the responsible use of the technology and adherence to proper regulations.