
Vidi2 video model

Name: Vidi
Version: 2
Creator: ByteDance

Vidi2 is a large AI model built for video understanding. It can handle clips with or without audio or text and work out what happens across time and screen space. You give it a prompt like 'the gorilla which is driving with two men' and it finds the matching part of the video and shows where the scene appears on screen. That means it can be used for things like searching video archives, auto-editing, and answering questions about videos.

ByteDance built Vidi2 and shared the work in a paper at the end of November 2025. It's a newer version of their earlier model, Vidi, with expanded capabilities. It can show both when and where something happens in a video: ask it something like “When does the dog show up?” and it returns timestamps plus bounding boxes around the dog. It also answers text questions about video content.
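The paper doesn't spell out a public API, but an answer of this kind can be pictured as a time range plus per-frame bounding boxes. The Python sketch below is purely illustrative; the class and field names are assumptions, not anything taken from Vidi2.

    from dataclasses import dataclass, field
    from typing import List

    # Hypothetical shape of a spatio-temporal grounding answer.
    # All names here are illustrative, not taken from the Vidi2 paper or repo.

    @dataclass
    class BoundingBox:
        frame_time: float  # seconds from the start of the video
        x: float           # left edge, normalized to [0, 1]
        y: float           # top edge, normalized to [0, 1]
        width: float
        height: float

    @dataclass
    class GroundingResult:
        query: str          # e.g. "When does the dog show up?"
        start_time: float   # start of the matching segment, in seconds
        end_time: float     # end of the matching segment, in seconds
        boxes: List[BoundingBox] = field(default_factory=list)

    # A made-up example of what an answer to the dog question might look like:
    example = GroundingResult(
        query="When does the dog show up?",
        start_time=12.4,
        end_time=18.9,
        boxes=[BoundingBox(frame_time=12.4, x=0.55, y=0.30, width=0.20, height=0.25)],
    )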

To test it, the team built a new benchmark called VUE-STG. It has videos from 10 seconds up to 30 minutes, with hand-annotated time ranges and on-screen locations. They also reworked an older benchmark into an updated version, VUE-TR-V2, with queries that read more like what regular people would actually search for.

In their tests, they say Vidi2 beats models like Gemini 3 Pro and GPT-5 on video search and spatio-temporal grounding. On video question answering, it does about as well as other big open-source models.

ByteDance says this tool isn't just for research. It's meant to help creators cut long videos down into short clips, like for TikTok. That puts it in the same group as other tools aimed at making video editing easier and quicker.

Vidi2 isn't a video generator: it doesn't create clips from scratch. It's more of an assistant that looks through long videos and finds the useful parts, focusing on understanding and working with what's already in the footage.

So what can it actually do? It can:

  • Search videos. You give it a query and it finds when and where the action happens in the video.
  • Help edit. It can cut long videos into smaller pieces, reframe shots, or switch views (see the sketch after this list).
  • Answer questions. You can ask what's in a video and get answers tied to specific timestamps and on-screen locations.
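As a rough illustration of the editing step: once the model has identified a time range for a scene, cutting that segment out of the source file is a job for a standard tool like ffmpeg. This is a sketch under the assumption that you already have start and end times (for example, from the hypothetical GroundingResult above); it is not part of Vidi2 or its repository.

    import subprocess

    def cut_clip(source: str, start_time: float, end_time: float, output: str) -> None:
        """Extract the segment [start_time, end_time] (in seconds) from source into output."""
        subprocess.run(
            [
                "ffmpeg",
                "-i", source,
                "-ss", str(start_time),  # segment start, in seconds
                "-to", str(end_time),    # segment end, in seconds
                "-c", "copy",            # copy streams without re-encoding
                output,
            ],
            check=True,  # raise if ffmpeg exits with an error
        )

    # For example, using the made-up result from the earlier sketch:
    # cut_clip("long_video.mp4", example.start_time, example.end_time, "dog_clip.mp4")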

That could make it easier to work with large amounts of footage and help people who make content for social media, education, or entertainment.

The model is out on GitHub under a Creative Commons license (CC BY-NC 4.0). That means you can use it for non-commercial purposes, like research, education, or personal projects. You can copy it, change it, and share it, as long as you give credit and don't make money from it.

But you can't use it in paid apps, ads, or anything that makes money. You also can't block others from using it the same way. And just because the code is free doesn’t mean the videos or data you use with it are. So be careful with what you upload.


