HeartTranscriptor dropped in January 2026. It’s part of the HeartMula setup and handles the audio-to-text part.
You give it an audio clip and it spits out the words. It works on plain speech as well as lyrics, so calling it a lyrics transcriber undersells it.
You don’t need a powerful rig. It runs in about 6–8 GB of VRAM, and it even works on CPU if you don’t mind it being slow.
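If you want to check up front whether your card clears that bar, here is a rough Python sketch. It is not part of HeartTranscriptor or ComfyUI: `free_vram_gb` and `pick_device` are made-up helper names, and the 6 GB cutoff is just the low end of the range above. It assumes PyTorch may or may not be installed and falls back to CPU either way.

```python
MIN_VRAM_GB = 6.0  # assumption: low end of the 6-8 GB range mentioned above


def free_vram_gb():
    """Return free VRAM on the current CUDA device in GB, or None if unknown."""
    try:
        import torch  # optional dependency; absent means we cannot probe the GPU
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    return free_bytes / (1024 ** 3)


def pick_device(vram_gb):
    """Pick "cuda" when there is enough free VRAM, otherwise "cpu"."""
    if vram_gb is not None and vram_gb >= MIN_VRAM_GB:
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    print("Suggested device:", pick_device(free_vram_gb()))
```

The result is only a suggestion for how to set things up; whether the node itself exposes a device option is something to check in its settings.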
You use it in ComfyUI. Just drag the HeartTranscriptor node in, hook an audio input up to it, and connect its text output to a node that shows the words.
It does pretty well with clear audio. Singing can trip it up sometimes: if someone hits a high note it might mishear a word, so “VRAM” comes out as “VROM.” It’s better with normal talking.
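To put a number on how often it mishears things on your own clips, you can compare its output against a transcript you typed by hand. The sketch below computes word error rate (word-level edit distance divided by the number of reference words) in plain Python. The function name and the sample sentences are made up for illustration; nothing here comes from HeartTranscriptor itself.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the output
                curr[j - 1] + 1,     # extra word in the output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    typed_by_hand = "it runs on six gigs of VRAM"    # made-up reference
    model_output = "it runs on six gigs of VROM"     # made-up transcription
    print(f"WER: {word_error_rate(typed_by_hand, model_output):.2f}")
```

One wrong word out of seven gives a rate of about 0.14. Running the same check on a sung clip and a spoken clip is a quick way to see how big the gap is for your material.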
Easy to plug into your setup. You just swap out your old audio-to-text node for this one. It has fewer settings and looks cleaner too.
If you'd like to access this model, you can explore the following possibilities: