Google’s video-to-audio (V2A) AI technology lets you add audio to any clip

The craziest AI development I’ve seen all year is Microsoft’s VASA-1 technology. The company developed AI models that can turn a single image of a person, paired with an audio file, into an animated video of that person speaking. The demos were stunning, even though VASA-1 is not available as a commercial product. It may never become one, since this type of AI tool is easy to abuse.

VASA-1 was presented in mid-April. Now, almost two months later, Google DeepMind has unveiled similar AI technology. It doesn’t have a commercial name; Google simply describes it as video-to-audio (V2A) technology. That also means it’s not a commercial AI product you can try for yourself.

V2A lets you generate audio from a single text prompt to match a silent video clip. Google’s demos are stunning.

The video-to-audio tool “makes synchronized audiovisual generation possible,” as Google explains in a blog post. Google offered plenty of examples to demonstrate the V2A technology. Some of them are included below, along with the prompts Google used to generate audio for the videos.

Audio Prompt: cinematic, thriller, horror film, music, tension, atmosphere, step on concrete

“V2A combines video pixels with natural language text prompts to generate rich soundscapes for on-screen action,” Google explains. The company emphasizes that V2A can be paired with Veo, the video generation model it unveiled at I/O 2024 and a direct competitor to OpenAI’s Sora and other similar products.

Google says V2A technology can deliver “a dramatic score, realistic sound effects, or dialogue that matches the characters and tone of a video.” The technology can be used to create soundtracks, and Google points to a particularly interesting potential use: video-to-audio could add sound to silent films, which would be amazing.

Audio Prompt: A drummer on stage at a concert surrounded by flashing lights and an enthusiastic crowd

However, voice generation is not perfect, as Google explains later in the blog post. Although V2A doesn’t require you to manually align audio and video, there are limitations, particularly when it comes to speech:

We’re also improving lip sync for videos involving speech. V2A attempts to generate speech from input transcriptions and synchronize it with the characters’ lip movements. But the coupled video generation model cannot be conditioned on transcriptions. This creates a lag, often resulting in strange lip sync, as the video model does not generate mouth movements to match the transcription.

Audio Prompt: Music, Transcription: “this turkey looks amazing, I’m so hungry”

Google also says it is seeking feedback from the creative community to ensure the video-to-audio technology has a positive impact. To prevent abuse, Google is adding its SynthID toolkit to its V2A research to watermark AI-generated content.

It’s unclear when V2A will be available to the public, with Google saying the new technology will undergo rigorous testing first. To see what’s possible with V2A at its current stage of development, you’ll find more demo clips at this link.