Audiogram Engine · live demo

Featured project · self-running demo

Audiogram Engine

Long-form audio becomes karaoke-captioned social video — transcribed by Whisper, corrected by Claude, rendered on the GPU into 9:16 and 16:9. 8–15 hrs of editing → under 30 min. Built end-to-end, solo.

NVIDIA CUDA · faster-whisper · h264_nvenc · 9:16 + 16:9 from one render
Stage 1 · Transcription

Speech in, timed text out

Drop in a recording. faster-whisper transcribes it on CUDA in float16 — every word stamped with its own start, end, and confidence.
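
A minimal sketch of what this step could look like with faster-whisper, assuming the `small` model named in the panel below. The function names and the record layout (`text`, `start`, `end`, `confidence`) are illustrative assumptions, not the engine's actual schema.

```python
from types import SimpleNamespace


def word_to_record(word):
    """Flatten one faster-whisper Word into a plain dict (hypothetical layout)."""
    return {
        "text": word.word.strip(),
        "start": round(word.start, 3),
        "end": round(word.end, 3),
        "confidence": round(word.probability, 3),
    }


def transcribe(path, model_size="small"):
    """Transcribe `path` on CUDA in float16 and return one record per word."""
    # Imported lazily so word_to_record can be used without a GPU install.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(path, word_timestamps=True, language="en")
    return [word_to_record(w) for seg in segments for w in seg.words]


# Stand-in for a faster-whisper Word, to show the record shape offline.
sample = SimpleNamespace(word=" alone", start=12.34, end=12.71, probability=0.41)
record = word_to_record(sample)
```

Calling `transcribe("founders_talk.mp3")` needs a CUDA-capable machine; `word_to_record` runs anywhere.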

transcribe · job a1b2c3d4 · whisper-small · CUDA · float16
founders_talk.mp3 · 38:42 · 44.1 kHz · 35.4 MB · sha256 cached
whisper ›
word_timestamps=True · language=en · voice_enhance: −16 LUFS · 0%
GPU throughput
0.0× realtime
Words timed
0
Low-confidence flags
0
Homophones and proper nouns are where Whisper slips. Words below the 0.55 confidence threshold are flagged for review — Stage 2 cleans them up.
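
The flagging rule can be sketched in a few lines. This assumes each word is a dict with a `confidence` field (a hypothetical layout) and that "below" means strictly less than 0.55.

```python
CONFIDENCE_THRESHOLD = 0.55  # threshold quoted in the text above


def flag_low_confidence(words, threshold=CONFIDENCE_THRESHOLD):
    """Return the indices of words whose confidence is strictly below threshold."""
    return [i for i, w in enumerate(words) if w["confidence"] < threshold]


words = [
    {"text": "we", "confidence": 0.98},
    {"text": "took", "confidence": 0.93},
    {"text": "alone", "confidence": 0.41},  # likely homophone slip
    {"text": "out", "confidence": 0.55},    # exactly at threshold: not flagged
]
flagged = flag_low_confidence(words)
```
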
Stage 2 · Review & Correction

A human reviews. Claude fixes the rest.

Whisper mishears homophones — a loan / alone. Claude rewrites the text against your source without touching a single word timing, so the karaoke stays perfectly synced.
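
One way to honor the "timings untouched" contract, sketched under the assumption that corrections map a word index to replacement text. A fix such as alone → a loan then stays inside the original word's time slot. The function and field names are illustrative, not the engine's own.

```python
import copy


def apply_corrections(words, corrections):
    """Replace word text by index; start/end values are never modified."""
    corrected = copy.deepcopy(words)
    for index, new_text in corrections.items():
        corrected[index]["text"] = new_text
    # Contract check: every timing must match the input exactly.
    for before, after in zip(words, corrected):
        if (before["start"], before["end"]) != (after["start"], after["end"]):
            raise ValueError("word timings changed during correction")
    return corrected


words = [
    {"text": "took", "start": 12.01, "end": 12.30},
    {"text": "alone", "start": 12.34, "end": 12.71},
]
fixed = apply_corrections(words, {1: "a loan"})
```
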

correction/transcript.json · 3 flagged terms
Manual paste · $0.00 · generate-prompt → paste into claude.ai → paste JSON back. Your own account pays nothing here.
Automated · $0.0148 · engine calls the Anthropic API directly · claude-sonnet-4-5 · chunked under max_tokens × 0.5
est. 0 in / 0 out tokens · $3 / $15 per MTok · $0.0000 ± 20%
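
The estimate is simple per-token arithmetic. A rough sketch using the $3 / $15 per million token rates and the ± 20% band shown above; the token counts in the example are made up for illustration and are not the demo's figures.

```python
INPUT_USD_PER_MTOK = 3.0    # rate shown in the panel above
OUTPUT_USD_PER_MTOK = 15.0  # rate shown in the panel above


def estimate_cost(input_tokens, output_tokens, margin=0.20):
    """Return (low, estimate, high) in USD for one correction run."""
    estimate = (
        input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000
    return estimate * (1 - margin), estimate, estimate * (1 + margin)


# Hypothetical token counts, chosen only to show the arithmetic.
low, estimate, high = estimate_cost(input_tokens=2_000, output_tokens=500)
```
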
word timings: untouched — by contract
3 corrections applied · word timings byte-identical · review_method → ai_automated
Stage 3 · Render Configuration

Configure the look

Two aspects, six visual layers, per-aspect typography. Every change updates the live preview →

Aspects & layers
9:16
Portrait
TikTok · Reels · Shorts
16:9
Landscape
YouTube
Caption style f14_by_aspect
font size: 60px · default
words / line: 4
position
highlight: #ff5e3a
Music ducking · −12 dB under speech
depth: −12 dB
▮ voice (RMS · 10 ms hops) · ▮ music · ducked
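
A simplified sketch of RMS-driven ducking: measure the voice level in 10 ms hops and drop the music by 12 dB wherever speech is present. The `speech_floor` threshold is an assumption, and a production ducker would also smooth the gain (attack and release) to avoid abrupt jumps.

```python
import math


def rms_per_hop(samples, sample_rate, hop_ms=10):
    """Root-mean-square level of each consecutive hop of audio samples."""
    hop = max(1, int(sample_rate * hop_ms / 1000))
    levels = []
    for i in range(0, len(samples), hop):
        window = samples[i:i + hop]
        levels.append(math.sqrt(sum(s * s for s in window) / len(window)))
    return levels


def duck_gains(voice_rms, depth_db=-12.0, speech_floor=0.02):
    """Linear music gain per hop: ducked where the voice exceeds the floor."""
    ducked = 10 ** (depth_db / 20)  # -12 dB is roughly 0.25 linear
    return [ducked if level > speech_floor else 1.0 for level in voice_rms]


# Toy signal at 1 kHz sample rate: 10 ms of silence, then 10 ms of speech.
voice = [0.0] * 10 + [0.5] * 10
levels = rms_per_hop(voice, sample_rate=1000)
gains = duck_gains(levels)
```
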
Live preview · burned-in
9:16
render-accurate · switch 9:16 / 16:9 above to edit each
Stage 4 · GPU Render

On the GPU: NumPy → NVENC

Worker threads composite frames in pure numpy and pipe raw RGB straight into h264_nvenc — no MoviePy on the hot path, ~13× realtime.
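
A hedged sketch of the pipe described above: build an ffmpeg command that reads raw RGB frames from stdin and encodes with `h264_nvenc`. The frame rate and the helper names are assumptions; the 8 Mbps bitrate and `+faststart` flag mirror the delivery stats elsewhere on this page.

```python
def nvenc_command(width, height, fps, out_path, bitrate="8M"):
    """ffmpeg argv that encodes raw RGB24 frames from stdin with NVENC."""
    return [
        "ffmpeg", "-y",
        "-f", "rawvideo", "-pix_fmt", "rgb24",      # input: headerless RGB frames
        "-s", f"{width}x{height}", "-r", str(fps),  # input geometry and rate
        "-i", "-",                                  # read frames from stdin
        "-c:v", "h264_nvenc", "-b:v", bitrate,
        "-pix_fmt", "yuv420p",                      # widely playable output format
        "-movflags", "+faststart",
        out_path,
    ]


def frame_bytes(width, height):
    """Size of one RGB24 frame; each stdin write must be exactly this long."""
    return width * height * 3


cmd = nvenc_command(1080, 1920, fps=30, out_path="final_audiogram.mp4")
# Typical use (not run here):
#   proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
#   proc.stdin.write(frame.tobytes())  # frame: uint8 array, shape (1920, 1080, 3)
```
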

0%

    GPU telemetry · RTX 3080

    GPU utilization 0%
    VRAM 2.1 / 10 GB
    NVENC sessions
    1/3
    Encode speed
    0.0×
    Frames piped
    0
    ETA
    AUDIOGRAM_FAST_COMPOSITOR=1 · workers: cpu−2 (cap 8)
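
The worker rule "cpu−2 (cap 8)" can be read as the following sketch. The floor of one worker on small machines is an assumption, as is the function name.

```python
import os


def worker_count(cpu_count=None, reserve=2, cap=8):
    """Leave `reserve` cores free, never exceed `cap`, never drop below 1."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(cap, cpus - reserve))


examples = {n: worker_count(n) for n in (2, 6, 10, 16)}
```
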
    Stage 5 · Deliver

    Two videos. One render.

    Captions, waveform, b-roll and ducked music baked in — plus a poster JPG and thumbnail PNG per aspect. Press play.

    9:16 · 1080×1920
    TikTok · Reels · Shorts
    16:9 · 1920×1080
    YouTube
    final_audiogram.mp4 ✓ · poster.jpg · thumbnail.png · per-aspect .srt · .vtt · .txt
    38:42 talk → 2:46 render · 28× realtime · H.264 · 8 Mbps · +faststart
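
One reading of these figures, offered as an assumption rather than a confirmed definition: the realtime multiple counts both output videos. A single 38:42 video rendered in 2:46 works out to about 14×, and two of them to about 28×.

```python
def to_seconds(mmss):
    """Convert an 'MM:SS' string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


talk = to_seconds("38:42")    # 2322 s of source audio
render = to_seconds("2:46")   # 166 s of wall-clock render time

single_output = talk / render      # about 14x for one aspect
both_outputs = 2 * talk / render   # about 28x counting 9:16 and 16:9
```
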
    ⤓ Request access to the live app
    The real app runs on a private GPU machine and is invite-only; this demo is the public window into it.

    Rendered locally · nothing uploaded to the cloud