Meta is using AI to create videos from just a few words

CNN Business

Artificial intelligence is getting better at creating an image in response to a few words, with publicly available AI image generators such as DALL-E 2 and Stable Diffusion. Now, Meta researchers are taking AI a step further: using it to create videos from a text prompt.

Meta CEO Mark Zuckerberg posted on Facebook Thursday about the research, called Make-A-Video, with a 20-second clip that compiled several text prompts used by Meta researchers and the resulting (very short) videos. The prompts include: “A bear painting a self-portrait,” “A spaceship landing on Mars,” “A baby sloth in a knit hat trying to figure out a laptop,” and “A robot surfing a wave in the ocean.”

The videos for each prompt are a few seconds long, and generally show what the prompt suggests (except for the baby sloth, which doesn’t look much like the real thing), in a rather low-resolution and slightly grainy style. Still, the clip shows the new direction AI research is taking as systems get better at creating images from words. If the technology eventually becomes widespread, however, it will raise many of the same concerns raised by text-to-image systems, such as the possibility that it could be used to spread disinformation through video.

A Make-A-Video web page includes these short clips and others, some of which look quite realistic, such as a video created in response to the prompt “Clownfish swimming through a coral reef” or a video meant to show “A young couple walking in the pouring rain.”

Zuckerberg noted in his Facebook post how difficult it is to create a moving image from just a few words.

“Video is much more difficult to create than photos because, beyond creating each pixel correctly, the system also has to predict how it will change over time,” he wrote.

A research paper describing the work explains that the project uses a text-to-image AI model to learn how words match images, and an AI technique called unsupervised learning (in which algorithms analyze unlabeled data to find patterns within it) to watch videos and determine what realistic movement looks like.

As with popular AI systems that generate images from text, the researchers noted that their text-to-image AI model was trained on internet data, meaning it learned “and likely exaggerated social biases, including harmful ones,” the researchers wrote. They noted that they filtered the data for “NSFW content and toxic words,” but since datasets can contain millions of images and large amounts of text, it may not be possible to remove all such content.

Zuckerberg wrote that Meta plans to share the Make-A-Video project as a demo in the future.