
Google’s new AI can generate audio for your silent videos

Sound is an essential element of a good video. That’s why, despite the realistic results from tools like Google’s Veo, OpenAI’s Sora, and Runway’s Gen-3 Alpha, AI-generated videos often feel lifeless: they’re silent. Google DeepMind’s latest AI model hopes to fill this void by generating synchronized soundtracks for your videos. It’s pretty wild.

Google’s V2A (video to audio) technology combines video pixels with optional text prompts to create audio closely aligned with the visuals. It can generate music, sound effects, and even dialogue that matches the on-screen action.

Under the hood, V2A uses a diffusion-based approach for realistic audio generation. The system encodes video input into a compressed representation, then iteratively refines the audio from random noise, guided by the visuals and optional text prompts. The generated audio is then decoded into a waveform and combined with the video.
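To make the pipeline described above concrete, here is a minimal toy sketch in Python. It is not DeepMind's actual model; every function here is a hypothetical stand-in that only mirrors the shape of the process: encode video into a compact representation, iteratively refine random noise conditioned on that representation, then decode to a waveform.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_video(frames):
    # Stand-in encoder: mean-pool pixels into a compact feature vector.
    return frames.mean(axis=(0, 1))

def denoise_step(latent, video_features, step, total_steps):
    # Toy refinement: pull the noisy latent toward a target derived
    # from the conditioning features, a little more each step.
    target = np.resize(video_features, latent.shape)
    alpha = (step + 1) / total_steps
    return (1 - alpha) * latent + alpha * target

def decode_audio(latent):
    # Stand-in decoder: squash the latent into [-1, 1] like a waveform.
    return np.tanh(latent)

frames = rng.random((8, 16, 4))    # fake video: 8 frames, 16 pixels, 4 channels
features = encode_video(frames)    # compressed representation
latent = rng.standard_normal(64)   # start from pure random noise
steps = 10
for t in range(steps):
    latent = denoise_step(latent, features, t, steps)
waveform = decode_audio(latent)
print(waveform.shape)  # (64,)
```

A real diffusion model would learn the denoising step from data and condition on text prompts as well, but the overall loop, noise in, repeated conditioned refinement, waveform out, follows this structure.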

Google DeepMind’s V2A system, taking video pixels and audio prompts to generate an audio waveform in sync with the underlying video.

To improve audio quality and enable more specific sound generation, DeepMind trained the model on additional data such as AI-generated sound annotations and dialogue transcriptions. This allows V2A to associate audio events with various visual scenes while responding to provided annotations or transcriptions.

However, V2A is not without limitations. Audio quality depends on input video quality, with artifacts or distortions causing noticeable drops. Lip sync for voice videos also needs improvement, as the coupled video generation model may not match mouth movements to the transcription.

Additionally, you should know that there are other tools in the generative AI space that solve this problem. Earlier this year, Pika Labs released a similar feature called Sound Effects, and ElevenLabs recently launched a sound effects generator of its own.

According to Google, what sets its V2A apart is its ability to understand raw video pixels. It also eliminates the tedious process of manually aligning generated sounds with visuals. Its integration with video generation models like Veo creates a cohesive audiovisual experience, making it ideal for entertainment and virtual reality applications.

Google is being very careful with the release of video AI tools. For now, much to the dismay of AI content creators, there are no immediate plans for public release. Instead, the company is focused on overcoming the model’s limitations and ensuring a positive impact on the creative community. As with its other generative models, Google will watermark V2A outputs with SynthID to protect against misuse.

Chris McKay is the founder and editor-in-chief of Maginative. His thought leadership in AI mastery and strategic AI adoption has been recognized by leading academic institutions, media and global brands.