There’s a new model in town – and pretty soon, we’re all going to be talking about it … a lot!
OpenAI just announced its Sora model (named after the Japanese word for ‘sky’), and it looks set to make all sorts of video marketing projects obsolete almost immediately.
Want to generate a compelling video? You don’t need to hire dozens of people to run cameras, or stand in front of them. You don’t need to go ‘on set’ – just put some text into the model, and you’ll get amazing video that would have otherwise cost you tens of thousands of dollars to make.
It’s kind of hard to get your head around everything that Sora is going to do, but it shouldn’t take long to see the effects when OpenAI eventually releases this diffusion model.
When OpenAI’s explanation page says “Sora is capable of generating entire videos all at once, or extending generated videos to make them longer,” you sort of have an idea of how powerful this model is going to be!
So how does it work?
OpenAI explains that the diffusion model starts off with something that looks like static noise and removes that noise incrementally, over many steps. The authors also note that, like previous models, it builds its results from small units of data.
You can find this explanation on the page:
“Similar to GPT models, Sora uses a transformer architecture, unlocking superior scaling performance. We represent videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions and aspect ratios.”
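To make those two ideas a bit more concrete, the start-from-noise loop and the transformer operating on patch tokens, here’s a deliberately tiny, hypothetical sketch in PyTorch. Nothing in it (the class name, the sizes, the step count) comes from OpenAI; it only illustrates the general pattern of iteratively denoising a sequence of patch tokens with a transformer.

```python
import torch
import torch.nn as nn

# Toy denoiser: a transformer encoder that predicts the noise present in a
# sequence of "patch" tokens, much the way a GPT-style model operates on a
# sequence of text tokens. All sizes here are illustrative, not Sora's.
class PatchDenoiser(nn.Module):
    def __init__(self, patch_dim=256, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=patch_dim, nhead=num_heads, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.to_noise = nn.Linear(patch_dim, patch_dim)

    def forward(self, noisy_patches):
        # noisy_patches: (batch, num_patches, patch_dim)
        return self.to_noise(self.transformer(noisy_patches))

model = PatchDenoiser()
model.eval()  # no training here; we only run the sampling loop

# Start from pure noise and strip away a little of the predicted noise each
# step: the basic diffusion sampling loop described in the announcement.
patches = torch.randn(1, 64, 256)  # 64 patch tokens of pure random noise
for step in range(50):
    with torch.no_grad():
        predicted_noise = model(patches)
    patches = patches - 0.02 * predicted_noise  # crude, fixed step size

# In a real system the denoised patch tokens would then be decoded back into
# video frames; that decoder is omitted here.
```

The real model is, of course, vastly larger and trained on enormous amounts of video; the point is simply that “patches in, predicted noise out, repeat” is the whole shape of the loop.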
There’s more on these patches in a technical resource linked to the announcement:
“We take inspiration from large language models, which acquire generalist capabilities by training on Internet-scale data…The success of the LLM paradigm is enabled, in part, by the use of tokens that elegantly unify diverse modalities of text—code, math and various natural languages. … Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data. We find that patches are a highly-scalable and effective representation for training generative models on diverse types of videos and images.”
There’s also this bit, further clarifying:
“At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space, and subsequently decomposing the representation into spacetime patches.”
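In code, that two-step recipe might look something like the toy snippet below: pretend an encoder has already squeezed a clip into a small latent tensor, then slice that tensor into non-overlapping blocks that each cover a few time steps and a small spatial region. Every shape here is invented for illustration; OpenAI hasn’t published the actual dimensions or the encoder itself.

```python
import torch

# Pretend a pretrained video encoder has already compressed a clip into a
# lower-dimensional latent tensor. The shapes below are made up for clarity.
latent = torch.randn(16, 32, 32, 8)  # (time, height, width, channels) in latent space

# Carve the latent volume into non-overlapping "spacetime patches": blocks
# that span a few latent time steps and a small spatial region.
t_patch, s_patch = 2, 4               # patch extent in time and in space
T, H, W, C = latent.shape
patches = (
    latent.reshape(T // t_patch, t_patch, H // s_patch, s_patch, W // s_patch, s_patch, C)
          .permute(0, 2, 4, 1, 3, 5, 6)                   # group patch-grid dims together
          .reshape(-1, t_patch * s_patch * s_patch * C)   # one flat vector per patch
)

print(patches.shape)  # (number of patches, patch dimension): (512, 256)
```

Each flattened block becomes one “spacetime patch,” the visual analogue of a token, and a sequence of such patches is what the transformer in the earlier sketch would actually consume.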
The company is also up-front about some of the limitations of the technology. Take the cookie example, in which a person bites a cookie that afterward shows no bite mark: it’s the kind of ‘tell’ that will still clue us in that an AI created a particular video.
For more, let’s turn to our own MIT Technology Review, which ran a piece by William Douglas Heaven last week.
Heaven goes over some of Sora’s most impressive capabilities, including its handling of occlusion: the model keeps track of objects even as they pass out of view and re-emerge.
At the same time, he suggests the technology is “not perfect,” and raises the possibility that OpenAI is cherry-picking its video results to make the model look more capable than it is; because Sora hasn’t been released yet, we can’t be sure.
OpenAI says it’s working on safety and trying to limit inputs that would create harmful kinds of deepfakes, but if you’ve been following the AI revolution, you know this is easier said than done. Anyway, I wanted to get this out there so people are aware of what’s going on. As OpenAI writes:
“Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.”
Read the full article here