Featured project · self-running demo
Long-form audio becomes karaoke-captioned social video — transcribed by Whisper, corrected by Claude, rendered on the GPU into 9:16 and 16:9. 8–15 hrs of editing → under 30 min. Built end-to-end, solo.
Drop in a recording. faster-whisper transcribes it on CUDA in float16 — every word stamped with its own start, end, and confidence.
Whisper mishears homophones — a loan / alone. Claude rewrites the text against your source without touching a single word timing, so the karaoke stays perfectly synced.
Two aspects, six visual layers, per-aspect typography. Every change updates the live preview →
Worker threads composite frames in pure numpy and pipe raw RGB straight into h264_nvenc — no MoviePy on the hot path, ~13× realtime.
Captions, waveform, b-roll and ducked music baked in — plus a poster JPG and thumbnail PNG per aspect. Press play.
Rendered locally · nothing uploaded to the cloud