Today, we announced Make-A-Video, a new AI system that lets you turn text, image, or video prompts into brief, high-quality video clips. Make-A-Video builds on recent progress in generative technology research. The system learns what the world looks like from paired text-image data, such as LAION, and how the world moves from video footage with no associated text.

Generative AI research is pushing creative expression forward by giving people tools to quickly and easily create new content. With a few words or lines of text, Make-A-Video can bring imagination to life and create one-of-a-kind videos full of vivid colors, characters, and landscapes. The system can also create videos from images or take existing videos and create new ones that are similar.

Make-A-Video extends a text-to-image model with spatiotemporal modules in two steps. First, it decomposes the full temporal U-Net and attention tensors and approximates them in space and time; a sketch of this factorization follows below. Second, it uses a spatiotemporal pipeline to generate high-resolution, high-frame-rate videos with a video decoder, a frame interpolation model, and two super-resolution models, components that also enable applications beyond text-to-video. In spatial and temporal resolution, faithfulness to text, and overall quality, Make-A-Video sets a new state of the art in text-to-video generation, as determined by both qualitative and quantitative measures.
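To make the first step concrete, here is a minimal PyTorch sketch of one standard way to factorize a convolution over video in space and time: a 2D convolution over each frame followed by a 1D convolution over time. The class name and shapes are illustrative assumptions, not the system's released code.

```python
import torch
import torch.nn as nn


class Pseudo3DConv(nn.Module):
    """Factorized space-time convolution: a 2D convolution over each
    frame, then a 1D convolution over time at each spatial location.
    This approximates a full 3D convolution at far lower cost."""

    def __init__(self, channels: int, spatial_kernel: int = 3, temporal_kernel: int = 3):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, spatial_kernel,
                                 padding=spatial_kernel // 2)
        self.temporal = nn.Conv1d(channels, channels, temporal_kernel,
                                  padding=temporal_kernel // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time, height, width).
        b, c, t, h, w = x.shape
        # Spatial pass: fold the time axis into the batch axis.
        y = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        # Temporal pass: fold the spatial axes into the batch axis.
        y = y.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        y = self.temporal(y)
        # Restore the original (batch, channels, time, height, width) layout.
        return y.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)


if __name__ == "__main__":
    clip = torch.randn(2, 64, 16, 32, 32)  # 2 clips, 64 channels, 16 frames, 32x32
    out = Pseudo3DConv(64)(clip)
    print(out.shape)  # torch.Size([2, 64, 16, 32, 32])
```

One appeal of this factorization is that the temporal layers can be initialized to act as the identity, so training can start from a pretrained text-to-image model's behavior; the sketch above omits that initialization detail.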

Learn more and see the samples here.