Meituan's video generation model is here! It is an open-source SOTA right out of the gate

Wallstreetcn
2025.10.27 06:50

Meituan has open-sourced a video generation model called LongCat-Video. With 13.6 billion parameters, it supports text-to-video and image-to-video generation, with durations of up to several minutes. The model shows significant improvements in the realism of generated videos and in physical understanding, outperforming other open-source models and competing with Google's closed-source Veo3. LongCat-Video is licensed under the MIT License, generates video at 720p and 30fps, and emphasizes understanding of the real world.

Meituan, you are addicted to cross-border ventures, aren't you? (doge)

That's right, the latest open-source SOTA video model comes from this "delivery" company.

The model, named LongCat-Video, has 13.6B parameters and supports text-to-video and image-to-video generation, with video lengths of up to several minutes.

From the demo released by the official source, the videos generated by the model are not only more realistic and natural but also have enhanced physical understanding capabilities.

Whether it's hoverboarding in the air:

Or transforming with special effects in one second:

Or a first-person cycling video that has to stay visually consistent for its entire duration _(over 4 minutes)_:

Upon closer inspection, the AI flavor of the video has indeed decreased significantly.

Moreover, based on the evaluation results, its performance is quite impressive—its text-to-video capabilities rank at the top among open-source models, with overall quality surpassing PixVerse-V5 and Wan2.2-T2V-A14B, and some core dimensions even rivaling Google's latest and most powerful closed-source model Veo3.

Additionally, since it is released under the commercially usable MIT License, even Hugging Face executives expressed their astonishment with three question marks:

A Chinese team actually released a foundational video model under the MIT License???

Its long-video generation capability _(stable output of 5 minutes)_ is also seen as proof that "we are one step closer to the ultimate form of video AI."

So, how good is a video model from a food delivery company, really? Let's take a look at more examples.

Open-Source SOTA for Both Text-to-Video and Image-to-Video, Generating Long Videos Like Shooting a Series!

Overall, Meituan's recently released and open-sourced LongCat-Video has the following features:

  • Text-to-Video: Can generate 720p, 30fps high-definition videos, with semantic understanding and visual presentation capabilities reaching open-source SOTA level;
  • Image-to-Video: Able to retain the main attributes of the reference image, background relationships, and overall style;
  • Video Extension: The core differentiating capability; it can extend video content conditioned on multiple prior frames.

For text-to-video, the examples provided by the official source show that the model puts particular emphasis on understanding the real world.

At a glance, the homepage features a series of videos on soccer, gymnastics, dancing, etc.:

Taking "Water Ballet" as an example, the challenges faced by the model are quite significant—it needs to have a high level of detail capture ability while also being able to handle complex lighting effects, environmental simulation, and dynamic scenes.

LongCat-Video has considered almost all of these aspects, achieving a high level of completion:

In terms of Image-to-Video, with Double Eleven approaching, major merchants can also use it to create more practical promotional videos:

Of course, since it provides the original reference image, we usually pay more attention to whether the image-to-video can maintain consistency.

When given an image of a robot working, LongCat-Video immediately generated a daily vlog of the robot "working from home."

At one moment it picks up a teddy bear from the table, at another it grabs a water cup, and even shuts down the computer after work... Under different actions, the desktop and surrounding environment remain unchanged, successfully overcoming the consistency challenge.

Once the "difficult problem" of consistency is solved, the possibilities for LongCat-Video expand even further.

During the day it can serve as a mural, and at night it comes out to play games _(isn't that breaking through the wall in its own way?)_.

It can even produce animated feature films:

Moreover, the core capability of LongCat-Video lies in video extension, allowing it to generate minute-long videos like making a series.

Once a video is completed, you just need to continue writing prompts, and ultimately a complete storyline or segment can be generated.

For example, the nearly half-minute video below was built up step by step with the following prompts _(translated)_:

  1. The kitchen is bright and well-ventilated, with white cabinets and wooden countertops complementing each other. A freshly baked loaf of bread sits on the cutting board, next to a glass and a carton of milk. A woman in a floral apron stands at the wooden countertop, skillfully slicing the golden loaf with a sharp knife; crumbs fly everywhere as she cuts.

  2. The camera pulls back, the woman puts down the knife in her hand, reaches for the milk carton, and then pours it into the glass on the table.

  3. The woman puts down the milk carton.

  4. The woman picks up the milk glass and takes a sip.

How is it? Does it feel like filming a movie or TV drama?

Key point: because LongCat-Video is natively pre-trained on the video continuation task, it can produce videos lasting several minutes without color drift or quality degradation _(stable output of 5-minute-level long videos with no loss of quality)_.

Meituan stated that it launched LongCat-Video primarily with the frontier field of world models in mind:

As an intelligent system capable of modeling physical laws, spatiotemporal evolution, and scene logic, world models empower AI with the ability to "see" the essence of how the world operates. Video generation models are expected to become a key pathway for constructing world models—by compressing various forms of knowledge such as geometry, semantics, and physics through video generation tasks, AI can simulate, deduce, and even rehearse the operation of the real world in digital space.

To build LongCat-Video, Meituan also made a series of technical innovations and breakthroughs.

Underlying Technical Principles

LongCat-Video has only 13.6B parameters, yet it integrates three major tasks: text-to-video, image-to-video, and video continuation.

Specifically, the entire model is designed based on the Diffusion Transformer _(DiT)_, where each Transformer block consists of a 3D self-attention layer, a cross-attention layer, and a feedforward network using the SwiGLU activation function.

It employs the AdaLN-Zero modulation mechanism, giving each Transformer block its own dedicated modulation multilayer perceptron, and applies RMSNorm in the self-attention and cross-attention modules to improve training stability. Additionally, 3D RoPE is used for the positional encoding of visual tokens.
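
To make this concrete, here is a minimal PyTorch sketch of what a DiT-style block with AdaLN-Zero modulation, RMSNorm on queries and keys, and a SwiGLU feedforward could look like. The dimensions, module choices, and wiring are illustrative assumptions rather than LongCat-Video's actual implementation, and the 3D RoPE applied to visual tokens is only noted in a comment.

```python
# Illustrative DiT-style block: AdaLN-Zero modulation, QK RMSNorm, SwiGLU FFN.
# All sizes and wiring are assumptions for exposition (needs PyTorch >= 2.4 for nn.RMSNorm).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate, self.w_up, self.w_down = nn.Linear(dim, hidden), nn.Linear(dim, hidden), nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class DiTBlock(nn.Module):
    def __init__(self, dim=1024, heads=16, ctx_dim=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=ctx_dim, vdim=ctx_dim, batch_first=True)
        self.ffn = SwiGLU(dim, 4 * dim)
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.q_norm, self.k_norm = nn.RMSNorm(dim), nn.RMSNorm(dim)  # RMSNorm on q/k for stability
        # AdaLN-Zero: a per-block modulation MLP emits shift/scale/gate for each sub-layer,
        # zero-initialized so the block starts out as an identity mapping.
        self.modulation = nn.Sequential(nn.SiLU(), nn.Linear(dim, 9 * dim))
        nn.init.zeros_(self.modulation[1].weight)
        nn.init.zeros_(self.modulation[1].bias)

    def forward(self, x, text_ctx, t_emb):
        # x: (B, N, dim) visual tokens; text_ctx: (B, M, ctx_dim); t_emb: (B, dim) timestep embedding
        shift1, scale1, gate1, shift2, scale2, gate2, shift3, scale3, gate3 = (
            self.modulation(t_emb).unsqueeze(1).chunk(9, dim=-1))
        h = self.norm1(x) * (1 + scale1) + shift1
        q, k = self.q_norm(h), self.k_norm(h)  # 3D RoPE would be applied to q/k here; omitted
        x = x + gate1 * self.self_attn(q, k, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale2) + shift2
        x = x + gate2 * self.cross_attn(h, text_ctx, text_ctx, need_weights=False)[0]
        h = self.norm3(x) * (1 + scale3) + shift3
        return x + gate3 * self.ffn(h)

block = DiTBlock()
out = block(torch.randn(2, 128, 1024), torch.randn(2, 77, 1024), torch.randn(2, 1024))
print(out.shape)  # torch.Size([2, 128, 1024])
```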

All tasks are framed as video continuation, distinguished only by the number of conditional frames:

  • Text-to-video: 0 conditional frames.
  • Image-to-video: 1 conditional frame.
  • Video continuation: multiple conditional frames.

After the mixed inputs are unified, the noise-free conditional frames and the noisy frames to be denoised are concatenated along the time axis and, together with the corresponding timestep configuration, allow a single model to natively support all three tasks.

To accommodate this type of input, the research team designed a block attention mechanism with a key-value cache (KVCache) in the architecture. It ensures that conditional tokens are not affected by noise tokens, so the KV features of the conditional tokens can be cached and reused, improving the efficiency of long video generation.
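
The sketch below illustrates the idea under assumed shapes and names: clean conditional frames are concatenated with noisy frames along the time axis, and a block attention mask keeps the conditional tokens blind to the noise tokens, which is exactly what makes their KV features safe to cache and reuse.

```python
# Toy illustration of the unified continuation input and the block attention mask.
# Shapes and helper names are assumptions, not the model's real interface.
import torch

def build_continuation_input(cond_frames, noisy_frames):
    # cond_frames: (B, T_cond, C, H, W) clean frames, where T_cond = 0 (text-to-video),
    # 1 (image-to-video), or many (video continuation); noisy_frames: (B, T_noise, C, H, W).
    return torch.cat([cond_frames, noisy_frames], dim=1)  # concatenate along the time axis

def block_attention_mask(n_cond_tokens, n_noise_tokens):
    n = n_cond_tokens + n_noise_tokens
    allowed = torch.zeros(n, n, dtype=torch.bool)
    # Conditional tokens attend only to conditional tokens, so they are unaffected by
    # noise tokens and their key/value features can be computed once and cached (KVCache).
    allowed[:n_cond_tokens, :n_cond_tokens] = True
    # Noise tokens attend to everything: the clean context plus themselves.
    allowed[n_cond_tokens:, :] = True
    return allowed

print(block_attention_mask(n_cond_tokens=4, n_noise_tokens=3).int())
```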

The most notable long video generation capability is primarily achieved through two core features: native pre-training design and interactive generation support.

First, LongCat-Video abandons the traditional training path of "first training basic video generation capabilities, then fine-tuning for long video tasks," and instead pre-trains directly on the video continuation task.

This approach directly addresses the cumulative error problem in long video generation, avoiding color drift and quality degradation while generating minute-long videos.

Additionally, LongCat-Video supports interactive long video generation, allowing users to set independent instructions for different segments, further expanding the flexibility of long video creation.
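
A rough sketch of what such an interactive driver loop could look like: each segment is generated from its own prompt plus the tail frames of what has been produced so far. `generate_segment` here is a hypothetical placeholder, not the released API.

```python
# Hypothetical interactive long-video loop; `generate_segment` is a stand-in callable.
from typing import Callable, List

def generate_long_video(generate_segment: Callable, prompts: List[str],
                        context_frames: int = 16) -> list:
    frames: list = []
    for prompt in prompts:
        condition = frames[-context_frames:]                # multi-frame condition (may be empty)
        frames.extend(generate_segment(condition, prompt))  # append the newly generated segment
    return frames

# Demo with a dummy generator that returns 8 placeholder "frames" per segment.
dummy = lambda cond, prompt: [f"{prompt}: frame {i}" for i in range(8)]
video = generate_long_video(dummy, ["woman slices bread in a bright kitchen",
                                    "camera pulls back, she pours milk into the glass"])
print(len(video))  # 16
```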

To improve inference efficiency, the team proposed a coarse-to-fine generation paradigm: first generate a low-resolution, low-frame-rate video at 480p and 15fps, then upsample it to 720p and 30fps via trilinear interpolation, with a refinement expert model trained with LoRA polishing the details.
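
The upsampling half of that pipeline amounts to a single trilinear interpolation over the (time, height, width) axes. A minimal sketch, with an assumed pixel-space tensor layout and a roughly one-second clip (the real pipeline presumably operates on latents):

```python
# Coarse-to-fine sketch: trilinearly upsample a 480p/15fps clip to 720p/30fps,
# after which a LoRA-trained refinement expert would polish the details.
import torch
import torch.nn.functional as F

coarse = torch.randn(1, 3, 15, 480, 854)  # (B, C, frames, H, W): ~1 s at 15 fps, 480p
fine = F.interpolate(coarse, size=(30, 720, 1280), mode="trilinear", align_corners=False)
print(fine.shape)  # torch.Size([1, 3, 30, 720, 1280]) -> ~1 s at 30 fps, 720p
```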

Introducing block sparse attention cuts the attention computation to less than 10% of the original, and a context-parallel, ring-style block sparse attention further improves high-resolution generation efficiency.
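
The toy example below shows only the selection logic behind block sparse attention, under assumed block sizes: tokens are pooled into blocks, each query block keeps its top-k most relevant key blocks, and everything else is masked out. A real implementation skips the masked computation with custom kernels; here, keeping 3 of 32 key blocks lands at roughly 9% of the full attention pattern.

```python
# Toy block sparse attention mask: keep only the top-k key blocks per query block.
import torch

def block_sparse_mask(q, k, block=64, keep=3):
    # q, k: (N, d) token features; returns an (N, N) boolean "allowed" mask.
    n, d = q.shape
    nb = n // block
    qb = q[: nb * block].reshape(nb, block, d).mean(dim=1)  # block-pooled queries
    kb = k[: nb * block].reshape(nb, block, d).mean(dim=1)  # block-pooled keys
    scores = qb @ kb.T                                      # (nb, nb) block-level relevance
    top = scores.topk(min(keep, nb), dim=-1).indices        # top-k key blocks per query block
    allowed = torch.zeros(n, n, dtype=torch.bool)
    for i in range(nb):
        rows = slice(i * block, (i + 1) * block)
        for j in top[i].tolist():
            allowed[rows, j * block:(j + 1) * block] = True
    return allowed

q, k = torch.randn(2048, 64), torch.randn(2048, 64)
mask = block_sparse_mask(q, k)
print(mask.float().mean().item())  # ~0.094, i.e. under 10% of the full attention pattern
```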

By combining CFG distillation and consistency model (CM) distillation, the number of sampling steps is reduced from 50 to 16, so a single 720p, 30fps video can be generated on one H800 GPU within minutes, an efficiency improvement of more than 10x.
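
For intuition on what CFG distillation removes: a standard guided sampler runs the network twice per step (conditional and unconditional) and blends the two predictions, while the distilled student learns to emit the blended prediction in a single pass, and CM distillation then shrinks the step count. The helper below is just the guidance arithmetic, with hypothetical inputs.

```python
# Classifier-free guidance blend that a CFG-distilled student learns to reproduce in one pass.
def cfg_blend(eps_cond, eps_uncond, w=7.5):
    # Push the conditional prediction away from the unconditional one by guidance weight w.
    return eps_uncond + w * (eps_cond - eps_uncond)

print(cfg_blend(eps_cond=1.0, eps_uncond=0.2))  # 0.2 + 7.5 * 0.8 = 6.2
```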

Additionally, the team adapted the Group Relative Policy Optimization (GRPO) algorithm to video generation scenarios, improving its convergence speed and the quality of the generated videos on these tasks.

During the training process, three types of dedicated reward models are used:

  • Visual Quality (VQ): Evaluated using HPSv3-general and HPSv3-percentile.
  • Motion Quality (MQ): Fine-tuned based on the VideoAlign model and trained using grayscale videos to avoid color bias.
  • Text-Video Alignment (TA): Also fine-tuned based on the VideoAlign model, but retains the original color input.

Then, multi-reward weighted fusion training is carried out to avoid the overfitting and reward hacking that a single reward can cause, achieving a balanced improvement in visual quality, motion quality, and alignment.
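
A small sketch of how such a fusion could feed GRPO, with made-up weights: the three reward scores for each sampled video are combined into one scalar, then normalized within the group of samples drawn from the same prompt to give group-relative advantages.

```python
# Multi-reward fusion plus GRPO-style group normalization; weights are illustrative assumptions.
import torch

def fused_group_advantages(vq, mq, ta, weights=(0.4, 0.3, 0.3)):
    # vq, mq, ta: (G,) reward scores for G videos sampled from the same prompt.
    r = weights[0] * vq + weights[1] * mq + weights[2] * ta
    return (r - r.mean()) / (r.std() + 1e-6)  # group-relative advantage

print(fused_group_advantages(torch.rand(8), torch.rand(8), torch.rand(8)))
```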

After completing data construction and model training, the research team first conducted internal benchmarking, mainly assessing the performance of text-to-video and image-to-video.

For text-to-video, it includes four dimensions: text alignment, visual quality, motion quality, and overall quality.

Experimental results show that LongCat-Video surpasses PixVerse-V5 and Wan2.2-T2V-A14B in overall quality scores, with visual quality close to Wan2.2-T2V-A14B, only slightly inferior to the closed-source model Veo3.

Image-to-video adds an image-alignment dimension to the evaluation. LongCat-Video achieves the highest visual quality score _(3.27)_, indicating competitive overall quality, though there is still room for improvement in image alignment and motion quality.

Additionally, the research team evaluated the model on the public VBench 2.0 benchmark, where LongCat-Video ranked third _(62.11%)_, behind only Veo3 _(66.72%)_ and Vidu Q1 _(62.7%)_.

It is worth noting that LongCat-Video leads in the commonsense dimensions _(motion plausibility, adherence to physical laws)_, highlighting the model's strong modeling of the physical world.

One More Thing

This is not the first time this delivery company has been "distracted"...

Since the end of August, Meituan's LongCat family of large models has been shipping one release after another, starting with the open-source foundational model LongCat-Flash-Chat.

With 560B total parameters, it activates only a small subset of them per token yet delivers performance comparable to mainstream models on the market, excelling in particular at complex agent tasks. It is now also available on the API platform~

In less than a month, the new LongCat-Flash-Thinking followed, reaching SOTA levels across logic, mathematics, coding, and agent tasks. It is the first LLM in China to possess both "deep thinking + tool invocation" and "informal + formal" reasoning capabilities, at lower cost and with better performance.

Subsequently, the team released LongCat-Audio-Codec, an audio codec built for speech LLMs that extracts semantic and acoustic tokens simultaneously at a low frame rate (16.7Hz/60ms), achieving efficient discretization while maintaining high clarity at extremely low bitrates.

Additionally, an Agent evaluation benchmark called VitaBench was created specifically for complex real-life scenarios (food delivery, restaurant ordering, travel) to systematically measure an Agent's capabilities in reasoning, tool usage, and adaptive interaction. (Tears, finally returning to my old profession.jpg)

……

Finally, today's video generation model undoubtedly shows that "cross-border" AI is becoming the new norm for this food delivery company.

Author of this article: Yishui Luyu, Source: Quantum Bits, Original Title: "Meituan's Video Generation Model is Here! A Move That Is Open Source SOTA"

Risk Warning and Disclaimer

The market has risks, and investment requires caution. This article does not constitute personal investment advice and does not take into account the specific investment goals, financial situation, or needs of individual users. Users should consider whether any opinions, views, or conclusions in this article align with their specific circumstances. Investing on this basis is at your own risk.