Featured project · self-running demo

Audiogram Engine

/

Long-form audio becomes karaoke-captioned social video — transcribed by Whisper, corrected by Claude, rendered on the GPU into 9:16 and 16:9. 8–15 hrs of editing → under 30 min. Built end-to-end, solo.

NVIDIA CUDA · faster-whisper · h264_nvenc 9:16 + 16:9 from one render

Stage 1 · Transcription

Speech in, timed text out

Drop in a recording. faster-whisper transcribes it on CUDA in float16 — every word stamped with its own start, end, and confidence.

▮transcribe · job a1b2c3d4whisper-small · CUDA · float16

founders_talk.mp338:42 · 44.1 kHz · 35.4 MB · sha256 cached

whisper ›

word_timestamps=True · language=en · voice_enhance: −16 LUFS0%

GPU throughput

0.0× realtime

Words timed

0

Low-confidence flags

0

›Homophones and proper nouns are where Whisper slips. Words below the 0.55 confidence threshold are flagged for review — Stage 2 cleans them up.

Stage 2 · Review & Correction

A human reviews. Claude fixes the rest.

Whisper mishears homophones — a loan / alone. Claude rewrites the text against your source without touching a single word timing, so the karaoke stays perfectly synced.

▮correction/transcript.json3 flagged terms

Manual paste$0.00 generate-prompt → paste into claude.ai → paste JSON back. Your own account pays nothing here.

Automated$0.0148 engine calls the Anthropic API directly · claude-sonnet-4-5 · chunked under max_tokens·0.5

est. 0 in / 0 out tokens ·$3 / $15 per MTok ·≈ $0.0000 ± 20%

word timings: untouched — by contract

✓3 corrections applied · word timings byte-identical · review_method → ai_automated

Stage 3 · Render Configuration

Configure the look

Two aspects, six visual layers, per-aspect typography. Every change updates the live preview →

Aspects & layers

✓

9:16

Portrait

TikTokReelsShorts

✓

16:9

Landscape

YouTube

Caption style f14_by_aspect

font size60px·default

words / line4

position

highlight

#ff5e3a

Music ducking −12 dB under speech

depth−12 dB

▮ voice (RMS · 10 ms hops)▮ music · ducked

Live preview · burned-in

9:16

render-accurate · switch 9:16 / 16:9 above to edit each

Stage 4 · GPU Render

On the GPU: NumPy → NVENC

Worker threads composite frames in pure numpy and pipe raw RGB straight into h264_nvenc — no MoviePy on the hot path, ~13× realtime.

0%

GPU telemetry · RTX 3080

GPU utilization0%

VRAM2.1 / 10 GB

NVENC sessions

1/3

Encode speed

0.0×

Frames piped

0

ETA

—

AUDIOGRAM_FAST_COMPOSITOR=1 workers: cpu−2 (cap 8)

Stage 5 · Deliver

Two videos. One render.

Captions, waveform, b-roll and ducked music baked in — plus a poster JPG and thumbnail PNG per aspect. Press play.

9:16 · 1080×1920

TikTokReelsShorts

16:9 · 1920×1080

YouTube

0:00

final_audiogram.mp4 ✓ poster.jpg thumbnail.png · per-aspect .srt · .vtt · .txt

38:42 talk → 2:46 render·28× realtime·H.264 · 8 Mbps · +faststart

⤓ Request access to the live app The real app runs on a private GPU machine and is invite-only — this demo is the public window into it.

Rendered locally · nothing uploaded to the cloud