LongCat-Video-Avatar 1.5 was released at the end of May 2026 by the LongCat research project from Meituan, a Chinese tech company known for local services and a growing body of AI work. Over the last year the team has released language models, multimodal tools, video generators and AI agent research.
The model is based on the earlier LongCat-Video system, a 13.6B-parameter video model that supports text-to-video, image-to-video and video continuation. Avatar 1.5 is built for talking characters and virtual humans, and adds stronger audio control and better facial animation.
One major update is the move from the Wav2Vec2 audio encoder to Whisper Large. The technical report says this improves lip sync, speech timing and support for more languages and speaking styles. The team also sped up generation with an optimized process that uses about 8 diffusion steps.
Version 1.5 focuses on practical improvements instead of a completely new design. The team worked on identity consistency in longer videos, multi-person scenes, realistic body motion, singing, stylized characters and object interaction. Human testing in the report suggests results similar to several leading commercial avatar systems in selected benchmarks.
A key part of the release is that the model weights are publicly available. This lets developers and researchers run and test the system locally instead of relying only on an API. That makes LongCat-Video-Avatar 1.5 one of the more open high-end avatar video models available right now.
Main improvements in v1.5: a Whisper Large audio encoder, more accurate mouth movement, stronger temporal consistency, more stable long videos, support for multiple people, singing animation, better handling of anime and non-human characters, and a faster 8-step inference process.
Output types: audio-driven talking videos, human avatars, anime-style characters, multi-person conversations, singing performances and long-form videos, at up to 720p according to the LongCat documentation.
Hardware needs. The creators have not published clear GPU requirements for consumer hardware. The base model has about 13.6 billion parameters, which makes it one of the larger open-source video models. Estimated VRAM needs are around 30–40 GB or more for FP16, about 18–24 GB for 8-bit versions and possibly lower for optimized community builds.
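To show where numbers like these come from, here is a rough back-of-the-envelope sketch in Python. It only multiplies the 13.6 billion parameter count by the bytes each parameter takes at a given precision. The function names and the 30% overhead figure are illustrative assumptions, not values from the LongCat team, and real usage also depends on resolution, video length and any extra components loaded alongside the base model.

```python
# Back-of-the-envelope VRAM estimate. Illustrative only, not an official figure.

PARAMS = 13.6e9  # parameter count quoted for the base LongCat-Video model

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def estimate_with_overhead(num_params: float, precision: str,
                           overhead: float = 0.3) -> float:
    """Weights plus a fractional overhead for activations and buffers.

    The 0.3 default is a hypothetical placeholder, not a measured value.
    """
    return weight_memory_gb(num_params, precision) * (1.0 + overhead)


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        weights = weight_memory_gb(PARAMS, precision)
        total = estimate_with_overhead(PARAMS, precision)
        print(f"{precision}: weights ~{weights:.1f} GB, "
              f"with assumed overhead ~{total:.1f} GB")
```

The weights alone come to about 27 GB at FP16 and about 14 GB at 8-bit, so the ranges quoted above leave headroom for everything else that has to sit in memory during generation.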
MLX ports for Apple Silicon and other optimized versions are already appearing. More consumer-friendly options may become available over time. People on X have reported that generation speed is still fairly slow.
If you'd like to access this model, you can explore the following possibilities: