MAGI-1 from Sand AI builds videos one chunk at a time using your text and images. It's open-source and real-time, which means full control and easy access for creators.
MAGI-1 is a tool that takes your text and images and builds them into videos that look and feel clean. It was made by Sand AI, and they kept it open-source so anyone can mess with the code or plug it into other stuff. You can use it for content videos, stories, even live-stream effects.
But with its 24 billion parameters, MAGI-1 is currently too impractical to run on most consumer hardware. The requirement is quoted at 640GB+ of VRAM, with setups like 8x4090 cards or 4xH100s cited as the minimum just to get going. Rent a cloud setup instead? Get ready to fork out around $14 an hour for 8xH100s.
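A quick sanity check on those numbers, as a rough sketch. The per-card memory sizes are the standard specs (24 GB for an RTX 4090, 80 GB for the common H100 variant), and the $14 hourly rate is the approximate figure quoted above. The function names are made up for illustration.

```python
# Per-card memory in GB: RTX 4090 = 24, H100 (80 GB variant) = 80.
VRAM_GB = {"RTX 4090": 24, "H100": 80}

def total_vram_gb(card, count):
    """Combined VRAM of `count` identical cards."""
    return VRAM_GB[card] * count

def rental_cost(hours, rate_per_hour=14.0):
    """Cloud bill at the roughly $14/hour 8xH100 rate mentioned above."""
    return hours * rate_per_hour

for card, count in [("RTX 4090", 8), ("H100", 4), ("H100", 8)]:
    print(f"{count}x {card}: {total_vram_gb(card, count)} GB")

print(f"3 hours of 8xH100: ${rental_cost(3):.2f}")
```

By this arithmetic only the 8xH100 setup adds up to 640 GB. The 8x4090 and 4xH100 setups land at 192 GB and 320 GB, so they presumably refer to a lighter configuration than the full model.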
So here's how MAGI-1 works. Instead of trying to build a whole video in one go, MAGI-1 puts it together one second at a time, using 24 frames per chunk. That way every part fits with the next and it doesn't feel jumpy or off. This makes it good for things that need consistency, like telling stories or making animated explainers.
You can throw in text plus an image and MAGI-1 will figure out the video part from there. Want to keep it going past one second? It'll keep generating and blend each new chunk right into the last one.
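Here is a minimal sketch of that chunk-by-chunk idea, assuming one-second chunks of 24 frames as described above. This is not MAGI-1's real API: `generate_chunk` is a hypothetical stand-in for the model, and `fake_chunk` just returns frame numbers so the loop can run.

```python
FPS = 24  # frames per one-second chunk, per the description above

def generate_video(seconds, generate_chunk, existing=None):
    """Build (or extend) a clip one chunk at a time.

    Each new chunk is generated with access to every frame made so far,
    which is what lets it blend into the previous part.
    """
    frames = list(existing or [])
    first_chunk = len(frames) // FPS
    for i in range(first_chunk, first_chunk + seconds):
        start, end = i * FPS, (i + 1) * FPS
        frames.extend(generate_chunk(history=frames, start=start, end=end))
    return frames

def fake_chunk(history, start, end):
    """Hypothetical placeholder for the model: returns frame indices."""
    return list(range(start, end))

clip = generate_video(3, fake_chunk)                   # a 3-second clip
longer = generate_video(2, fake_chunk, existing=clip)  # extend by 2 seconds
print(len(clip), len(longer))
```

Extending a clip is the same loop picking up where the existing frames stop, which matches the "keep it going" behavior described above.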
MAGI-1 doesn't try to be flashy. It just works in a way that gives you control. You can adjust pacing, transitions, even how long each part lasts. It uses a thing called block-causal attention under the hood, which just means each new chunk can look back at the chunks that came before it, so it doesn't forget what it did earlier. I certainly enjoy being able to freely choose a length from 1 to 10 seconds. You can test a prompt for just 2 or 3 seconds to see if the AI understands it well, then extend only if it does.
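To make "block-causal" a bit more concrete, here is a toy sketch of the general idea, not MAGI-1's actual code. Tokens are grouped into chunks. A token can attend to everything in its own chunk and in earlier chunks, but nothing in later ones. The chunk and token counts are made up for illustration.

```python
def block_causal_mask(num_chunks, tokens_per_chunk):
    """mask[q][k] is True when query token q may attend to key token k."""
    n = num_chunks * tokens_per_chunk
    return [
        [(k // tokens_per_chunk) <= (q // tokens_per_chunk) for k in range(n)]
        for q in range(n)
    ]

mask = block_causal_mask(num_chunks=3, tokens_per_chunk=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

Each row is one token and each `x` is something it is allowed to see. The staircase shape is the "doesn't forget what it did earlier" part: later chunks see the whole history, earlier chunks never see the future.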
And it runs fast. Like real-time fast. That makes it work for livestreams and interactive setups.
Magi-Human, a model built on top of MAGI-1, turns one photo into a lifelike video. It adds voice, facial movements, and hand gestures so your photo comes alive. Just upload a photo and a script to see it work.
Up to 30 Seconds. It makes short talking clips, up to 30 seconds, good for stories, ads, or characters talking.
Use Your Voice. You can upload your own voice or train one that matches your script and vibe.
Full-Body Moves. Your photo can do more than talk. It can move, change poses, and switch scenes with smooth flow. The avatar looks like it knows what it's doing.
Educators and Trainers, Creative Professionals, Content Creators, Media and Film Makers, Marketing and Branding Specialists, Developers and Tech Creators, Nonprofit and Advocacy Creators, Small Business Owners, Entertainment and Performance Artists, Professional Content Creators
People are holding out hope for smaller, cheaper MAGI versions. Quantized models could cut down VRAM needs a lot. Time will tell if that happens fast enough, though.
Even if you cough up for all that hardware, MAGI still isn't perfect. Picture quality isn't blowing everyone's socks off. Some users are scratching their heads wondering why such heavy hardware is even needed.
Some are side-eyeing the MAGI team for "forgetting" to benchmark against strong players like Kling 2 and Veo 2.
Arc left, camera orbits around A woman in a vibrant swimsuit floats mid-air, completely still, her joyful expression serene despite the chaos unfolding around her. One hand holds a colorful cocktail, the other clutches a straw hat, while her hair drifts outward in perfect suspension. Surrounding her, a cataclysmic wave crashes — but time is slowed to a crawl.
Droplets, debris, and chunks of rock hang mid-air, frozen or drifting in hyper-slow motion.
The camera orbits smoothly around her in bullet-time, capturing every suspended detail from dynamic angles.
A flock of seagulls is suspended around her, wings spread mid-flap. Lightning forks across the dark sky.
The scene balances lighthearted wonder with awe and frozen devastation — a surreal, sculpted moment carved out of time.
A vain influencer is recording a video of herself cheerfully speaking from inside the open mouth of a shark. Shark keeps its mouth open and only blinks a couple of times.
excited granny rides a chariot pulled by the cats. high motion, speed, motion blur, dust rising from the road. Her eyes are closed, then she opens them wide. another elderly cat chariot rider competing with her, seen blurred out in the back to the side of her, a policeman comically running around waving at them to stop seen briefly as the chariots race forward
Bullet-time effect: A woman in a tunic lounges effortlessly, her body ((frozen mid-air)) as large crashing waves slowly rise behind her, threatening the shoreline. She holds a vibrant green cocktail spilling upwards in one hand while a lit cigarette casually dangles from her fingertips, her expression one of calm defiance against the chaos. Surrounding her, shards of broken buildings, flying debris, and a pink flamingo float slowly in suspension, capturing the beauty of destruction. Camera glides slowly and smoothly around her, revealing a dramatic but slow-motion panorama of the turbulent ocean illuminated by flashes of lightning that cut through the dark clouds. Intense emotional clash between serenity and impending doom, surreal spectacle of a quiet moment amidst the storm.
A muscular man stands confidently, arms crossed. Suddenly, a fist strikes him from the left, thrusting him into a moment of disbelief. In slow motion, his skin ripples, and strands of hair flicker as the impact unfolds, sending a shockwave through his body. The camera captures every nuance of his expression—surprise morphing into resilience, his eyes blazing with intensity.
The sky is alive with energy as two giant flying turtles soar majestically above a blurred landscape of mountains and rivers, carrying serene monks in meditation atop their shells. The turtles' eyes suddenly blaze with an intense blue glow, crackling bolts of electricity erupting into the air, illuminating the turbulent sky with flashes of supernatural power. As the lightning flickers, an intense wind bursts forth, thrashing the monks' flowing robes, creating a dynamic and chaotic atmosphere. The turtles split apart in a dramatic arc, veering left and right as they abandon the scene, leaving the monks momentarily isolated in the vast expanse. The focus shifts smoothly to the empty valley below, where mist thrums with energy, swirling dynamically
Double dolly shot: A fierce and determined woman stands in the center of a dusty Western street, her expression focused and intense as she points a shotgun towards an unseen foe. The camera glides backward, while the town around her drifts ominously, buildings and debris swirling in the distance. She remains unnaturally steady, her hair swirling in a slight breeze, amplifying the tension of the moment. This visual dissonance captures her psychological struggle, symbolizing her inner strength. The harsh sunlight casts long shadows, enhancing the gritty texture of her rugged attire and the arid landscape.
MAGI has done a decent job here. This was standard quality, generated for 3 seconds, then extended by another 2 with no additional prompt. Not sure why it's 4 seconds long.
Well, that's an unexpected hand! So 'hand with glossy red nails' also became a red hand, why not? At least this AI knows 1-1=0: the bag doesn't also remain on the shelf once it's picked up.
A 3-second video extended by 3 more seconds with the exact same prompt. Once the woman moves her head, the background behind her isn't consistent with the starry backdrop elsewhere.