OpenAI Debuts 'Insane Quality' Video Generation Model
Today, OpenAI took the wraps off Sora, its video generation model, and astonishment has been pouring in from social media.
“This is insane quality,” posted Nate Chan. MIT podcaster Lex Fridman called it “truly remarkable,” while popular YouTuber MrBeast begged OpenAI CEO Sam Altman, “plz don’t make me homeless.”
The AI race remains in high gear. OpenAI debuted Sora on the same day that Google unveiled an update to its large multimodal model, Gemini 1.5, which had jaw-dropping news of its own: it boasts a context window of up to one million tokens, meaning users can input up to 700,000 words, or one hour of video, in a single query. Last month, Google launched Lumiere, a video generation model that was lauded for its realism.
OpenAI said in a blog post that Sora can take text or still images to generate videos up to a minute long “while maintaining visual quality and adherence to the user’s prompt.” Generated videos can include multiple camera angles of a scene and several characters expressing emotion.
Crucially, Sora “understands not only what the user asked for in the prompt, but also how those things exist in the physical world,” OpenAI claims.
See a screenshot below of a video Sora created from this prompt: “A cat waking up its sleeping owner demanding breakfast. The owner tries to ignore the cat, but the cat tries new tactics and finally the owner pulls out a secret stash of treats from under the pillow to hold the cat off a little longer.”
Credit: OpenAI
However, OpenAI said Sora has several weaknesses, including not understanding cause and effect. For example, if a person bites into a cookie, the cookie remains intact. Also, Sora can mix up right and left.
OpenAI said it has tasked red teamers with hunting for vulnerabilities in Sora so the model will be safe to release.
The secret sauce
OpenAI said Sora is a diffusion model, which uses a technique that adds random noise to training data and then learns to reverse the process to produce high-quality samples. It also uses a transformer architecture.
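For readers who want a sense of how that works in practice, here is a minimal, hypothetical sketch of a single diffusion training step in Python (PyTorch). The model, noise schedule, and tensor shapes are illustrative assumptions, not OpenAI’s actual implementation.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, clean_video, num_steps=1000):
    """One denoising-diffusion training step (illustrative sketch only).

    clean_video: tensor of shape (batch, frames, channels, height, width).
    model: any network that predicts the noise that was added, given the
           noisy video and the timestep.
    """
    batch = clean_video.shape[0]

    # Pick a random diffusion timestep for each sample in the batch.
    t = torch.randint(0, num_steps, (batch,), device=clean_video.device)

    # A simple linear noise schedule: later timesteps mean more noise.
    alpha = 1.0 - t.float() / num_steps      # shape (batch,)
    alpha = alpha.view(batch, 1, 1, 1, 1)    # broadcast over the video dims

    # Corrupt the clean video with random Gaussian noise.
    noise = torch.randn_like(clean_video)
    noisy_video = alpha.sqrt() * clean_video + (1.0 - alpha).sqrt() * noise

    # The model learns to predict the added noise, so at generation time
    # it can reverse the process and turn pure noise into a video.
    predicted_noise = model(noisy_video, t)
    return F.mse_loss(predicted_noise, noise)
```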
The startup said Sora can create an entire video all at once or lengthen videos it has already generated. The model processes many frames at a time, so the subject of the video stays consistent even when it temporarily goes out of view.
Sora was trained on ‘patches,’ smaller units of video and image data. Each patch is analogous to a token in OpenAI’s GPT language models. “By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions and aspect ratios,” according to OpenAI.
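As a rough illustration of what turning a video into patch ‘tokens’ could look like, here is a hypothetical Python sketch; the patch sizes and tensor layout are assumptions made for the example, not details OpenAI has published.

```python
import torch

def video_to_patches(video, patch_t=2, patch_h=16, patch_w=16):
    """Split a video tensor into a flat sequence of spacetime patches.

    video: tensor of shape (frames, channels, height, width); frames,
           height and width are assumed divisible by the patch sizes.
    Returns a tensor of shape (num_patches, patch_dim), analogous to a
    sequence of tokens fed to a transformer.
    """
    f, c, h, w = video.shape
    # Carve the clip into blocks of patch_t frames x patch_h x patch_w pixels.
    patches = video.reshape(
        f // patch_t, patch_t,
        c,
        h // patch_h, patch_h,
        w // patch_w, patch_w,
    )
    # Group the block indices together, then flatten each block into a vector.
    patches = patches.permute(0, 3, 5, 1, 2, 4, 6)
    return patches.reshape(-1, patch_t * c * patch_h * patch_w)

# Example: a 16-frame, 3-channel, 256x256 clip becomes 2,048 patch "tokens".
clip = torch.randn(16, 3, 256, 256)
tokens = video_to_patches(clip)
print(tokens.shape)  # torch.Size([2048, 1536])
```

In this toy setup, a short clip becomes a couple of thousand patch vectors that a transformer can attend over, much as a language model attends over word tokens.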
Sora builds on OpenAI’s text-to-image model, DALL-E 3, and its GPT language models. From DALL-E 3, Sora borrows the recaptioning technique used to generate “highly descriptive captions” for visual training data, OpenAI said. That helps Sora follow user prompts more faithfully.
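OpenAI has not released the recaptioning code, but the idea can be sketched in a few lines of Python; caption_model here is a hypothetical stand-in for a descriptive captioning model like the one used for DALL-E 3.

```python
def recaption_training_set(clips, caption_model):
    """Replace short, noisy labels with detailed generated captions (sketch).

    clips: iterable of (video, original_label) pairs from the training set.
    caption_model: hypothetical model that returns a highly descriptive
                   caption for a video, standing in for the DALL-E 3 style
                   recaptioner OpenAI describes.
    """
    recaptioned = []
    for video, _original_label in clips:
        # Discard the terse original label and pair the clip with a richer,
        # generated description, so the text-to-video model learns from
        # text that reads like a real user prompt.
        detailed_caption = caption_model(video)
        recaptioned.append((video, detailed_caption))
    return recaptioned
```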
Sora not only takes text and image inputs; it can also accept an existing video and extend it or fill in missing frames, the startup said.
OpenAI did not say whether Sora would be incorporated into ChatGPT, as it did with DALL-E 3, to make the chatbot truly multimodal. In contrast, Google's Gemini language model is multimodal from the ground up.
However, “Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI,” according to OpenAI.
A technical paper with more details on Sora is coming soon.