Nvidia today launched Nvidia Maxine, a platform that provides developers with a suite of GPU-accelerated AI conferencing software to enhance video quality. The company describes Maxine as a “cloud-native” solution that makes it possible for service providers to bring AI effects — including gaze correction, super-resolution, noise cancellation, face relighting, and more — to end users.
Developers, software partners, and service providers can apply for early access to Maxine starting this week.
Videoconferencing has exploded during the pandemic, as it offers a way to communicate while minimizing infection risk. In late April, Zoom surpassed 300 million daily meeting participants, up from 200 million earlier in the month and 10 million in December. According to a report from App Annie, business conferencing apps topped 62 million downloads during the week of March 14-21.
Nvidia says Maxine “dramatically” reduces how much bandwidth is required for videoconferencing calls. Instead of streaming an entire screen of pixels, the platform analyzes the facial points of each person on a call and then algorithmically reanimates the face in the video on the other side. This ostensibly makes it possible to stream with far less data flowing back and forth across the internet. Nvidia claims developers using Maxine can cut bandwidth to one-tenth of what the H.264 video compression standard requires.
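To get a rough feel for why sending facial points is so much cheaper than sending pixels, the back-of-envelope sketch below compares per-frame payload sizes. It is an illustration only, not Nvidia's method: the function names, the 68-point landmark count, the two 4-byte coordinates per point, and the 720p/30 fps figures are all assumptions chosen for the example, and the comparison is against uncompressed frames rather than H.264, which the sketch does not model.

```python
# Back-of-envelope comparison: raw pixels vs. facial keypoints per frame.
# All numbers below are illustrative assumptions, not Maxine specifications.

def raw_frame_bytes(width, height, bytes_per_pixel=3):
    """Size of one uncompressed frame (assumes 3 bytes per pixel, e.g. RGB)."""
    return width * height * bytes_per_pixel

def keypoint_frame_bytes(num_keypoints, coords_per_point=2, bytes_per_coord=4):
    """Size of one frame's worth of facial keypoints (hypothetical encoding)."""
    return num_keypoints * coords_per_point * bytes_per_coord

def kbps(bytes_per_frame, fps):
    """Convert a per-frame payload into kilobits per second."""
    return bytes_per_frame * 8 * fps / 1000

if __name__ == "__main__":
    fps = 30                                   # assumed frame rate
    pixels = raw_frame_bytes(1280, 720)        # assumed 720p video
    points = keypoint_frame_bytes(68)          # assumed 68 facial landmarks
    print(f"raw pixels: {kbps(pixels, fps):,.0f} kbps")
    print(f"keypoints:  {kbps(points, fps):,.2f} kbps")
    print(f"ratio:      {pixels / points:,.0f}x")
```

A real system would also need to send an occasional reference image of the face and would compare against compressed video, so the practical saving is far smaller than this raw ratio suggests.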
To achieve this improved compression, Nvidia says it’s employing AI models called generative adversarial networks (GANs). GANs — two-part models consisting of a generator that creates samples and a discriminator that attempts to differentiate between these samples and real-world samples — have demonstrated impressive feats of media synthesis. Top-performing GANs can create realistic portraits of people who don’t exist, for instance, or snapshots of fictional apartment buildings.
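The generator-versus-discriminator setup can be made concrete with the standard GAN training losses. The sketch below is a generic textbook formulation, not the model Nvidia uses in Maxine; the function names are hypothetical, and the inputs are assumed to be the discriminator's probability (strictly between 0 and 1) that a sample is real.

```python
import math

# Generic GAN losses for a single real/fake pair.
# d_real: discriminator's probability that a real sample is real.
# d_fake: discriminator's probability that a generated sample is real.
# Both are assumed to lie strictly between 0 and 1.

def discriminator_loss(d_real, d_fake):
    """Low when real samples score near 1 and generated samples near 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: low when the discriminator is fooled."""
    return -math.log(d_fake)

if __name__ == "__main__":
    # A discriminator that separates real from fake well has a lower loss
    # than one that is guessing at 50/50.
    print(discriminator_loss(0.9, 0.1), discriminator_loss(0.5, 0.5))
    # The generator's loss drops as its samples become more convincing.
    print(generator_loss(0.1), generator_loss(0.9))
```

Training alternates between lowering each loss in turn, which is what pushes the generator toward samples the discriminator cannot tell apart from real ones.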
Maxine’s other spotlight feature is face alignment, which enables faces to be automatically adjusted so participants appear to be facing each other during a call. Gaze correction helps simulate eye contact, even if the camera isn’t aligned with the user’s screen. Auto-frame allows the video feed to follow a speaker as they move away from the screen. And developers can let call participants choose their own avatars, with animations automatically driven by their voice and tone.
Maxine also leverages Nvidia’s Jarvis SDK for conversational features, including AI language models for speech recognition, language understanding, and speech generation. Developers can use them to build videoconferencing assistants that take notes and answer questions in humanlike voices. Moreover, the toolsets can power translations and transcriptions to help participants understand what’s being discussed.
Avaya is an early adopter of the Maxine platform. Through the company’s Avaya Spaces videoconferencing app, customers will benefit from background noise removal, virtual green screen backgrounds, and features enabling presenters to be overlaid on top of presentation content, as well as live transcriptions that can recognize and differentiate voices.
According to Nvidia, the AI models powering Maxine’s infrastructure, audio, and visual components were developed through hundreds of thousands of training hours on Nvidia DGX systems. That training, combined with a backend built on microservices running in Kubernetes container clusters on GPUs, enables the platform to support up to hundreds of thousands of users even while running AI features simultaneously.