Real-Time AI Video Generation with StreamDiT: 16 FPS at 512p Resolution

A new AI system called StreamDiT generates video in real time from text descriptions, opening up new possibilities for gaming and interactive media.

Developed by researchers from Meta and the University of California, Berkeley, StreamDiT generates video at 16 frames per second on a single high-performance GPU. The 4-billion-parameter model outputs video at 512p resolution. Unlike earlier systems, which generated an entire clip before playback could begin, StreamDiT produces the video stream frame by frame, in real time.
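As a quick sanity check, the 16 FPS figure alone fixes the latency budget a streaming generator has to meet. A minimal Python sketch (the numbers follow directly from the frame rate; nothing here is from Meta's code):

```python
# Back-of-the-envelope latency budget for real-time playback at 16 FPS.
FPS = 16
frame_budget_ms = 1000 / FPS
print(f"Per-frame budget at {FPS} FPS: {frame_budget_ms:.1f} ms")  # 62.5 ms

# A clip-based generator keeps the viewer waiting until the whole clip is
# done; a streaming generator only needs to stay under this per-frame
# budget once the first frame is out.
```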

The team showcased several applications: StreamDiT can generate one-minute video clips on the fly, respond to interactive prompts, and even edit existing videos live. In one demonstration, a pig in the video turned into a cat while the background remained unchanged.

The system is built on an architecture designed for speed. StreamDiT uses a sliding buffer that processes several frames simultaneously: while one frame is being finished, the following frames are already being denoised. New frames enter the buffer as pure noise and are gradually cleaned up until they are ready for display. According to the paper, the system needs about half a second to produce two frames, which yields eight finished frames after post-processing.
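The sliding-buffer idea can be sketched in a few lines of Python. This is an illustration of the general technique, not Meta's actual implementation: `denoise_step` stands in for one call to the diffusion transformer, and the buffer length and latent resolution are assumed values.

```python
import torch

BUFFER = 8        # frames held in the buffer at once (illustrative)
H = W = 64        # latent resolution (illustrative)

def stream(denoise_step, prompt, n_frames):
    """Yield finished frames one at a time from a moving noise buffer."""
    # Slot 0 holds pure noise; the last slot is almost fully denoised.
    frames = torch.randn(BUFFER, 3, H, W)
    noise_levels = torch.linspace(1.0, 1.0 / BUFFER, BUFFER)

    for _ in range(n_frames):
        # One network call advances every buffered frame by one step,
        # so upcoming frames are denoised while the oldest one finishes.
        frames = denoise_step(frames, noise_levels, prompt)
        yield frames[-1]                       # oldest frame: ready to show
        # Slide the window: drop the emitted frame, append fresh noise.
        frames = torch.cat([torch.randn(1, 3, H, W), frames[:-1]])

# Control-flow demo with a no-op denoiser standing in for the model:
dummy = lambda f, levels, p: f
for frame in stream(dummy, "a pig in a meadow", n_frames=3):
    print(frame.shape)                         # torch.Size([3, 64, 64])
```

Because every frame passes through the whole buffer before it is emitted, each one still receives the full sequence of denoising steps; the pipeline simply overlaps those steps across neighboring frames.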

The training process was designed for versatility. Rather than committing to a single video generation scheme, the model was trained with multiple approaches on a dataset of 3,000 high-quality videos plus a larger collection of 2.6 million videos, using 128 Nvidia H100 GPUs. The researchers found that the best results came from mixing chunk sizes of 1 to 16 frames.
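A rough sketch of that mixed-chunk idea: on each training step the frame buffer is partitioned into chunks of a randomly drawn size, so a single model learns several streaming configurations at once. The chunk sizes of 1 to 16 frames come from the paper; the buffer length and the noise-level assignment below are assumptions made for illustration.

```python
import random

BUFFER_FRAMES = 16   # assumed buffer length for this sketch

def sample_chunk_schedule():
    chunk = random.choice([1, 2, 4, 8, 16])   # frames per chunk
    n_chunks = BUFFER_FRAMES // chunk
    # Frames inside a chunk share one noise level; later chunks are
    # noisier, mirroring the buffer order used at inference time.
    levels = [(i + 1) / n_chunks for i in range(n_chunks)]
    return [(i * chunk, (i + 1) * chunk, levels[i]) for i in range(n_chunks)]

print(sample_chunk_schedule())   # e.g. [(0, 4, 0.25), (4, 8, 0.5), ...]
```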

To reach real-time performance, the team applied an acceleration technique that cuts the number of denoising steps from 128 to 8, with minimal impact on image quality. The architecture is also optimized for efficiency: instead of every image region attending to all others, attention is computed only within local windows.
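The local-attention trick can be shown in a bare-bones form. The sketch below restricts self-attention to fixed-size windows, which drops the cost from quadratic in the full sequence length to quadratic only in the window size; the window size here is an illustrative choice, not the value used in StreamDiT.

```python
import torch
import torch.nn.functional as F

def window_attention(q, k, v, window=64):
    # q, k, v: (seq_len, dim); seq_len assumed to be a multiple of `window`.
    seq_len, dim = q.shape
    qw = q.view(-1, window, dim)              # (n_windows, window, dim)
    kw = k.view(-1, window, dim)
    vw = v.view(-1, window, dim)
    scores = qw @ kw.transpose(1, 2) / dim ** 0.5
    out = F.softmax(scores, dim=-1) @ vw      # attention stays local
    return out.reshape(seq_len, dim)

x = torch.randn(256, 32)
print(window_attention(x, x, x).shape)        # torch.Size([256, 32])
```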

In direct comparisons, StreamDiT outperformed existing methods such as ReuseDiffuse and FIFO-Diffusion, especially on videos with significant motion: where the other models tended to produce near-static scenes, StreamDiT generated more dynamic, natural movement.

Human evaluators rated the systems on motion smoothness, animation completeness, frame-to-frame consistency, and overall quality. StreamDiT came out ahead in all categories in tests on eight-second videos at 512p resolution.

The team also experimented with a much larger model of 30 billion parameters, which delivered even higher video quality but was not fast enough for real-time use. The finding suggests the approach can scale to larger systems.

There are still limitations, however, including StreamDiT's limited ability to "remember" earlier parts of the video and occasionally noticeable transitions between segments. The researchers are actively working to address these issues.

Other companies are also exploring real-time AI video generation. Odyssey, for example, recently demonstrated an autoregressive model that adapts video frame by frame in response to user actions, enabling interactive experiences.
