LTX-2.3 is a multimodal AI model that generates video with synced audio, released in early March 2026. It works from text prompts, images, or audio input. The system is a newer version of the earlier LTX-2 engine. This update brings sharper visuals, steadier motion, better prompt following, and native vertical video.
The model is aimed at production-style video workflows. It can be accessed through an API, or the model weights can be downloaded and run locally on a user's own hardware.
LTX-2.3 belongs to the LTX family of generative video models from Lightricks, the company behind creative apps like Facetune and Videoleap. With the LTX project, the team aims to give creators and studios AI video tools that fit into real editing pipelines.
The system builds on the core design of LTX-2, a multimodal audiovisual generator. Instead of producing only images, the model creates video and sound together: motion, speech, and background audio are generated at the same time so they stay in sync.
Version 2.3 improves several pieces of that pipeline. One major change is an upgraded variational autoencoder trained on higher-quality footage, which helps the model produce clearer textures, cleaner edges, and more readable text inside scenes.
Prompt understanding is also stronger. The model now uses a larger text connector that can read longer and more complex instructions, so prompts that combine many objects, camera motion, or style hints tend to be followed more faithfully. For example, a single prompt can describe a subject, a camera move, and a visual style, and the model is more likely to honor all three.
Another addition is native portrait video generation. The model was trained directly on vertical footage instead of cropping landscape clips, so it can create vertical video at resolutions up to 1080×1920, a good fit for social platforms like TikTok, Instagram Reels, and YouTube Shorts.
People can use the model in a few ways. Developers can call it through an API to generate clips inside apps or workflows. Creators can run it through the LTX platform or the LTX Desktop app. The company also released open checkpoints on Hugging Face so researchers or developers can run the model locally.
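As a rough illustration of the local route, the open checkpoints could be fetched with the standard huggingface_hub client. This is only a sketch: the repository name below is a placeholder, not a confirmed identifier, and the code that actually loads and runs the model is omitted.

```python
# Minimal sketch: download open LTX checkpoints for local use.
# NOTE: the repo_id below is a placeholder, not a confirmed repository name.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Lightricks/LTX-2.3",                 # hypothetical; check the official model card
    allow_patterns=["*.safetensors", "*.json"],   # fetch weights and config files only
)
print(f"Checkpoints downloaded to: {local_dir}")
```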
The output focus is video with synced sound. The model can generate landscape or vertical clips at HD resolution, with portrait clips reaching 1080×1920. Earlier LTX systems supported clips of around twenty seconds, and this version continues that long-clip direction.
Inputs are multimodal. The model can start from text prompts, images, or audio. Those signals guide how the scene looks, how the camera moves, and what sound plays in the clip.
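To make the input side concrete, here is a sketch of what a generation request combining those signals could look like over an HTTP API. The endpoint URL, field names, and authentication scheme are illustrative assumptions, not the documented LTX API.

```python
# Hypothetical request shape for a multimodal generation call.
# The endpoint URL and all field names are assumptions for illustration only.
import base64
import requests

with open("reference_frame.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "prompt": "A slow dolly shot down a rainy neon-lit street, ambient city noise",
    "image": image_b64,          # optional starting image
    "resolution": "1080x1920",   # native vertical output
    "duration_seconds": 10,
}

resp = requests.post(
    "https://api.example.com/v1/generate",        # placeholder endpoint
    headers={"Authorization": "Bearer <YOUR_API_KEY>"},
    json=payload,
    timeout=600,
)
resp.raise_for_status()
print(resp.json())
```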
Key features people can use:
- Video generation: the system creates video clips with synced audio.
- Text-to-video: a written prompt describes the scene and motion.
- Image-to-video: a starting image becomes a moving scene.
- Audio-to-video: audio input can guide visuals or timing.
- Sound effects generation: background noise and environment sounds appear in the clip.
Control and scene tools:
- Camera controls: users describe camera moves or framing.
- Scene creation: prompts define environments and objects.
- Script generation: prompts can guide dialogue or narrative flow.
- Storyboards: multi-step prompts shape a sequence of shots.
Editing and enhancement tools:
- Video editor: clips can be adjusted or refined.
- Video outpainting: the model expands scenes past the original frame.
- Start-end frames: users define beginning and ending keyframes.
- Speed adjustment: motion timing can be changed during editing.
Image and visual tools:
- Image-to-image: one image becomes a new version with changes.
- Style transfer: the model shifts visual style while keeping the scene.
- Style presets: preset looks help shape the final output.
In simple terms, LTX-2.3 aims to be a video creation engine with sound built in. The system reads prompts, plans a scene, and then generates visuals and audio together so the clip feels more coherent and production-ready.
If you'd like to access this model, you can explore the following possibilities: