LongCat-Video is a foundation model for video generation. It handles text-to-video, image-to-video, and video continuation with a single setup. It's built by Meituan's LongCat team and packs 13.6 billion parameters. The same team previously released the LongCat-Flash-Chat language model under the LongCat name on Hugging Face.
It can generate 720p clips at 30 fps in minutes, using a coarse-to-fine generation scheme and block-sparse attention to keep inference fast. It's also trained to stay stable over longer runs, so artifacts like color drift or visual breakdown rarely show up. The model is released under the MIT license, and the code and weights are freely available, so you can use it for both commercial work and experimentation.
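To make the block-sparse idea concrete, here is a minimal sketch of what such an attention pattern can look like. This is not LongCat-Video's actual kernel; the specific pattern (a local block window plus a global anchor block) is an assumption chosen purely for illustration.

```python
import torch

def block_sparse_mask(seq_len: int, block_size: int, window_blocks: int) -> torch.Tensor:
    """Boolean mask where each query block attends only to key blocks in a
    local window, plus the first block as a global anchor. True = attend."""
    block_ids = torch.arange(seq_len) // block_size  # block index per token
    q_blk = block_ids.unsqueeze(1)                   # (seq_len, 1)
    k_blk = block_ids.unsqueeze(0)                   # (1, seq_len)
    local = (q_blk - k_blk).abs() <= window_blocks   # nearby blocks only
    anchor = k_blk == 0                              # everyone sees block 0
    return local | anchor

def sparse_attention(q, k, v, mask):
    """Plain scaled dot-product attention with the sparse mask applied."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy example: 16 tokens, blocks of 4, window of 1 block on each side.
q = k = v = torch.randn(1, 16, 8)
mask = block_sparse_mask(seq_len=16, block_size=4, window_blocks=1)
out = sparse_attention(q, k, v, mask)
print(out.shape)  # torch.Size([1, 16, 8])
```

The point is that each token only scores against a fraction of the sequence, which is what keeps long video token sequences affordable.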
The setup covers three modes: text-to-video, image-to-video, and adding more frames to an existing clip. They also added something called timeline prompting, which lets you map out a second-by-second plan, for example: at 2 s put on headphones, at 4 s close the laptop, at 6 s stand up. The model follows that timing while keeping the motion looking natural.
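Here's a rough sketch of how you might assemble that kind of second-by-second plan into a single prompt string. The exact format LongCat-Video expects is defined by its own prompt conventions, so treat the helper and its layout as hypothetical.

```python
def timeline_prompt(scene: str, events: dict[int, str]) -> str:
    """Compose a timeline-style prompt: a scene description followed by
    timestamped actions. The layout here is illustrative, not the model's
    required format."""
    lines = [scene]
    for second, action in sorted(events.items()):
        lines.append(f"At {second}s: {action}")
    return "\n".join(lines)

prompt = timeline_prompt(
    "A person working at a desk in a bright home office.",
    {2: "puts on headphones", 4: "closes the laptop", 6: "stands up"},
)
print(prompt)
```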
You can grab it from Hugging Face and run it on your own hardware, on a single GPU or across several. There's no official VRAM minimum, but you'll want a modern GPU. On an H100 (80 GB), the model uses about 42 GB during inference, so a 48 GB card can probably run it, while 80 GB or more gives headroom for full quality. With something like a 24 GB RTX 4090, you'll need tricks like offloading or splitting the model across devices.
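Pulling the weights down is straightforward with the huggingface_hub library. The repo id below is an assumption based on the release name, so check the model card for the exact id before running this.

```python
from huggingface_hub import snapshot_download

# Download all model files to a local folder for offline use.
local_dir = snapshot_download(
    repo_id="meituan-longcat/LongCat-Video",  # assumed repo id; verify on the model card
    local_dir="./LongCat-Video",
)
print(f"Model files downloaded to {local_dir}")
```

From there you follow the inference instructions in the repo itself for single- or multi-GPU runs.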
There's also a distilled version for quicker runs and a refined version that targets 720p at 30 fps with better lighting and smoother motion. On lower-VRAM setups, the base output may land closer to 480p at 15 fps, which is better suited to rough motion tests.
LongCat-Video is part of a wave of open-source tools pushing toward longer, more stable video generation. It's not perfect, but it gives folks a way to go past short clips. And since it’s open and free, more people can jump in and use it.
If you'd like to access this model, you can explore the following possibilities: