Wan 2.2-S2V-14B turns an image and an audio clip into a cinematic video. You give it a voice recording (speech or singing), a still picture, and optionally some text for extra detail. It produces a lip-synced video that looks like a short film.
It was built by Wan AI and Tongyi Lab as part of their Wan 2.2 update, which uses a Mixture-of-Experts setup to improve video quality without slowing generation down.
The model makes 480p and 720p videos at 24 fps. It works fine on high-end consumer GPUs like the RTX 4090.
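Since the video is driven by the audio clip, the 24 fps figure gives a quick way to estimate how many frames a clip of a given length works out to. The sketch below is only that arithmetic; `frames_for_audio` is a hypothetical helper, and it assumes the video covers the whole clip and rounds up to a whole frame, which may not match how the model actually chunks its output.

```python
import math

FPS = 24  # frame rate stated in the description above


def frames_for_audio(duration_s: float, fps: int = FPS) -> int:
    """Estimate the frame count for an audio clip of the given length.

    Assumption: the video spans the full clip and partial frames round up.
    """
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    return math.ceil(duration_s * fps)


print(frames_for_audio(10))   # 240
print(frames_for_audio(2.5))  # 60
```

A 10-second voice recording comes to roughly 240 frames at this rate, which is a handy number when guessing at generation time or memory use.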
You can specify attributes like lighting and contrast to tweak how the video looks.
The model is on Hugging Face as Wan-AI/Wan2.2-S2V-14B and is released under the Apache 2.0 license.
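One way to fetch the weights from that repo is the `huggingface_hub` Python package. This is a minimal sketch, not an official install path: it assumes `huggingface_hub` is installed, `download_weights` is a hypothetical wrapper, and the local directory name is arbitrary. It only downloads files; running the model needs separate inference code.

```python
from pathlib import Path

REPO_ID = "Wan-AI/Wan2.2-S2V-14B"  # repo id as given in the description


def download_weights(local_dir: str = "./Wan2.2-S2V-14B") -> Path:
    """Download the model repo into local_dir and return its path.

    Assumption: the default local_dir is just an example location.
    """
    # Imported lazily so this file loads even without the package installed.
    from huggingface_hub import snapshot_download

    path = snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
    return Path(path)
```

Calling `download_weights()` starts the download. A 14B-parameter model is a large download, so check free disk space first.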
If you'd like to access this model, here are some options: