# OmniVoice-Studio
**Repository Path**: HesenjanJava/OmniVoice-Studio
## Basic Information
- **Project Name**: OmniVoice-Studio
- **License**: Apache-2.0
- **Default Branch**: main
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-04-29
- **Last Updated**: 2026-04-29
## README
- **Voice Clone** — Drop a 3-second clip → mirror any voice. 646 languages, zero-shot.
- **Voice Design** — Build new voices from scratch — gender, age, accent, pitch, style.
- **Video Dubbing** — Upload or paste a YouTube URL. Transcribe, translate, re-voice, export.
- **Voice Gallery** — Search YouTube, browse categories, download clips, build your library.
- **Settings → Models** — 15 models. One-click install. Auto-detects your platform (CUDA / MPS / CPU).
- **Projects** — Dub projects, voice profiles, generation history, exports — all searchable.
- **Settings → Logs** — Live backend, frontend, and Tauri runtime logs. Filter, refresh, clear.
---
## Why Open Source?
ElevenLabs charges **$5–$330/mo** and processes your audio on their servers. OmniVoice Studio runs **on your hardware, with no usage limits.**
| | **ElevenLabs** | **OmniVoice Studio** |
|---|---|---|
| **Pricing** | $5–$330/mo, per-character billing | Free for personal use · [Commercial license](#license) for business |
| **Voice Cloning** | ✅ 3s clip | ✅ 3s clip, zero-shot |
| **Voice Design** | ✅ Gender, age | ✅ Gender, age, accent, pitch, style, dialect |
| **Languages** | 32 | **646** |
| **Video Dubbing** | ✅ Cloud-only | ✅ Fully local |
| **Data Privacy** | Audio sent to cloud | **Nothing leaves your machine** |
| **API Keys** | Required | Not needed |
| **GPU Support** | N/A (cloud) | CUDA · Apple Silicon · ROCm · CPU |
| **Desktop App** | ❌ | ✅ macOS · Windows · Linux |
| **Customizable** | ❌ Closed | ✅ Fork it, extend it, ship it |
Built on the [OmniVoice](https://github.com/k2-fsa/OmniVoice) 600-language zero-shot diffusion TTS model. Upload a video, get broadcast-quality dubs in any language with the original speaker's voice preserved.
## Features
### Core Pipeline
- **Video Dubbing** — Transcribe → translate → synthesize → mux back to MP4. One-click end-to-end.
- **Vocal Isolation** — Demucs-powered speech/music separation. Background audio preserved automatically.
- **Voice Cloning** — Clone any voice from a 3-second clip. Zero-shot, 600+ languages.
- **Multi-Speaker Diarization** — Pyannote + WhisperX fusion auto-identifies speakers and assigns unique voice profiles.
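The end-to-end order of the stages above can be sketched as follows. This is illustrative only — every function name here is hypothetical; the real orchestration lives in the backend.

```python
# Illustrative sketch of the dubbing pipeline order described above.
# All stage names are hypothetical; the backend orchestrates the real pipeline.

def dub_video(video_path: str, target_lang: str) -> list[str]:
    """Run each stage in sequence and return an ordered stage log."""
    stages = ["extract_audio", "separate_vocals", "transcribe",
              "translate", "synthesize", "mux"]
    log = []
    for stage in stages:
        # Each stage would call the corresponding backend service here.
        log.append(f"{stage}({video_path} -> {target_lang})")
    return log
```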
### Studio Tools
- **Voice Capture** — Press `⌘+⇧+Space` **from any app** to dictate. Global system-wide hotkey records, transcribes, and auto-pastes into the active text field. Live partial results stream via WebSocket while you speak.
- **Speaker Casting** — Visual speaker-to-voice assignment grid. Auto-cast from video clones or assign saved profiles.
- **Voice Preview** — Floating widget for instant 8-step TTS testing. Try voices without leaving the workspace.
- **Real-time Dub Preview** — Edit a segment's text, preview the audio instantly without full re-render.
- **Multi-Language Batch** — Select multiple target languages, dub to all in one pass.
- **Batch Queue** — Drag-and-drop bulk video processing. Full pipeline: extract → transcribe → translate → generate → mix → export. Real-time progress bars per job.
- **Voice Library** — Browse, favorite, tag, and convert gallery clips into permanent voice profiles.
- **A/B Comparison** — Side-by-side voice audition for casting decisions.
### Production Export
- **Selective Track Export** — Choose which language tracks to include in the final MP4.
- **Subtitle Export** — SRT and VTT generation alongside dubbed video.
- **Stem Export** — Separate vocals and background audio as individual files.
- **Per-Segment Mixing** — 0–200% gain control per segment for broadcast-quality balancing.
### Technical
- **Cross-Platform GPU** — Auto-detects CUDA, Apple Silicon (MPS), ROCm, or CPU. Includes automatic cuDNN 8/9 compatibility handling.
- **VRAM-Aware** — Automatically offloads TTS to CPU during transcription on ≤8 GB GPUs. Zero config.
- **Streaming ASR** — WebSocket-based speech-to-text (`/ws/transcribe`) delivers live partial results during recording. 2s buffer interval, configurable.
- **Auto-Paste** — Dictated text is automatically pasted into the active app via system keyboard simulation (macOS Accessibility / Windows SendInput).
- **Live Telemetry** — Real-time CPU/RAM/VRAM stats with model warm-up indicator.
- **Keyboard-First** — `⌘+Enter` generate, `⌘+S` save, `⌘+Z`/`⌘+⇧+Z` undo/redo.
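The streaming ASR endpoint can be exercised with a small client. The `/ws/transcribe` path comes from the docs above, but the frame schema is not documented here — this sketch assumes partial results arrive as JSON with a `text` field.

```python
import json

WS_URL = "ws://localhost:3900/ws/transcribe"  # endpoint named above

def parse_partial(raw: str) -> str:
    """Pull the transcript text out of a partial-result frame.
    The {"type": ..., "text": ...} schema is an assumption, not documented API."""
    msg = json.loads(raw)
    return msg.get("text", "")

# Client loop sketch (requires `pip install websockets`):
# async with websockets.connect(WS_URL) as ws:
#     await ws.send(pcm_chunk)               # stream raw audio in ~2 s buffers
#     print(parse_partial(await ws.recv()))  # live partial transcript
```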
### AI Provenance
- **Invisible Watermark** — AudioSeal-powered (Meta) neural watermark embedded in every generated audio. Imperceptible, survives compression/editing.
- **Detection API** — Upload any audio to `/watermark/detect` to verify OmniVoice origin with confidence score.
- **Video Branding** — Optional logo overlay on exported MP4s (5s fade-out, bottom-right).
- **Configurable** — Toggle invisible/visible watermarks independently in Settings → Privacy.
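A verification round-trip against `/watermark/detect` might look like the sketch below. The endpoint path is from the docs above; the response keys (`detected`, `confidence`) are assumptions.

```python
API = "http://localhost:3900"  # default backend port from the Quickstart

def is_omnivoice(verdict: dict, threshold: float = 0.5) -> bool:
    """Interpret a /watermark/detect response.
    The {'detected': bool, 'confidence': float} shape is an assumption."""
    return bool(verdict.get("detected")) and float(verdict.get("confidence", 0.0)) >= threshold

# Upload sketch (requires `pip install requests`):
# import requests
# with open("clip.wav", "rb") as f:
#     verdict = requests.post(f"{API}/watermark/detect", files={"file": f}).json()
# print(is_omnivoice(verdict))
```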
### MCP Server (AI Agent Integration)
- **Model Context Protocol** — Expose OmniVoice as an AI agent tool for Claude, Cursor, and any MCP-compatible client.
- **5 Tools** — `generate_speech`, `list_voices`, `list_personalities`, `list_languages`, `check_health`.
- **stdio + SSE** — Works locally (Claude Desktop) or remotely (networked agents).
- **Zero config** — Drop `mcp.json` into your client config and go. See [`mcp.json`](mcp.json).
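For orientation, a Claude Desktop entry for a local stdio server typically looks like the fragment below. The `mcpServers` shape is the standard client config format, but the command, module name, and env var here are placeholders — use the shipped `mcp.json` for the real values.

```json
{
  "mcpServers": {
    "omnivoice": {
      "command": "python",
      "args": ["-m", "omnivoice_mcp"],
      "env": { "OMNIVOICE_URL": "http://localhost:3900" }
    }
  }
}
```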
### Audio Effects Chain
- **6 presets** — Broadcast 📻, Cinematic 🎬, Podcast 🎙️, Warm ☀️, Bright ✨, Raw 🔇.
- **Pedalboard-powered** — Spotify's production-grade DSP (EQ, compressor, reverb, noise gate, limiter).
- **API-driven** — `GET /tools/effects` returns presets; custom chains via `apply_effects_chain()`.
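Selecting a preset before applying a chain can be as simple as the sketch below. The preset slugs are assumed from the six names listed above; `GET /tools/effects` (documented above) returns the authoritative list.

```python
API = "http://localhost:3900"
# Assumed slugs for the six presets listed above; confirm via GET /tools/effects.
PRESETS = {"broadcast", "cinematic", "podcast", "warm", "bright", "raw"}

def pick_preset(name: str) -> str:
    """Normalize and validate a preset name (slug spelling is an assumption)."""
    slug = name.strip().lower()
    if slug not in PRESETS:
        raise ValueError(f"unknown preset: {name!r}")
    return slug

# Listing presets from the documented endpoint (requires `pip install requests`):
# import requests
# print(requests.get(f"{API}/tools/effects").json())
```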
### Plugin SDK (Third-Party TTS Engines)
- **Abstract interface** — Subclass `TTSPlugin` to add any TTS engine in ~50 lines.
- **Built-in plugins** — ElevenLabs (cloud) and Bark (local) ship out of the box.
- **Auto-discovery** — Drop a `.py` file in `backend/plugins/`, it registers automatically.
- **API** — `GET /tools/plugins` lists all engines and their availability status.
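A minimal plugin might look like the sketch below. The real `TTSPlugin` base class ships with the backend and its interface may differ; the stand-in here just keeps the example self-contained.

```python
# Stand-in for the backend's TTSPlugin base class (interface is an assumption).
class TTSPlugin:
    name: str = "base"

    def synthesize(self, text: str, voice: str) -> bytes:
        raise NotImplementedError

class EchoTTS(TTSPlugin):
    """Toy engine: returns UTF-8 bytes of the text instead of real audio."""
    name = "echo"

    def synthesize(self, text: str, voice: str) -> bytes:
        # A real plugin would call its engine here and return WAV/PCM bytes.
        return f"[{voice}] {text}".encode("utf-8")
```

Dropped into `backend/plugins/`, a file like this would be picked up by the auto-discovery described above.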
### GPU Safety
- **Crash sandbox** — GPU-intensive ops can run in subprocess isolation. A CUDA OOM or driver crash kills the worker, not the server.
### Theming
- **6 color themes** — Gruvbox (default), Midnight Blue, Nord, Solarized, Rosé Pine, Catppuccin Mocha.
---
## Quickstart
### Docker (recommended)
```bash
git clone https://github.com/debpalash/OmniVoice-Studio.git
cd OmniVoice-Studio
# CPU mode
docker compose up --build -d
# Or with NVIDIA GPU
docker compose --profile gpu up --build -d
```
Open [http://localhost:3900](http://localhost:3900) once the health check passes. First run downloads ~4 GB of model weights — progress is shown in `docker compose logs -f`.
> **Network access:** the container binds to `127.0.0.1` only. To reach OmniVoice from another machine on your LAN, change the port mapping in `docker-compose.yml` to `"0.0.0.0:3900:3900"`. OmniVoice ships no built-in authentication — when exposing it beyond your machine, put it behind a reverse proxy with auth (Caddy `basic_auth`, nginx + htpasswd, Tailscale, etc.).
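A minimal Caddy setup of the kind the note above suggests might look like this. The hostname is a placeholder, and the password hash must be replaced with your own (generated via `caddy hash-password`).

```caddyfile
# Sketch: reverse-proxy OmniVoice behind HTTP basic auth (Caddy v2).
voice.example.com {
    basic_auth {
        admin $2a$14$REPLACE_WITH_CADDY_HASH
    }
    reverse_proxy 127.0.0.1:3900
}
```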
### Local Development
**Prerequisites:** [ffmpeg](https://ffmpeg.org/), [Bun](https://bun.sh/), [uv](https://docs.astral.sh/uv/)
```bash
git clone https://github.com/debpalash/OmniVoice-Studio.git
cd OmniVoice-Studio
bun install
bun run dev
```
This boots both services:
| Service | URL | Stack |
|---------|-----|-------|
| **Backend** | `localhost:3900` | FastAPI · 97 endpoints · WhisperX · Demucs · OmniVoice |
| **Frontend** | `localhost:3901` | React · Vite · Waveform timeline · Glassmorphism UI |
> [!NOTE]
> First run downloads model weights (~2.4 GB). This works out of the box — no account needed. For faster downloads, optionally set `HF_TOKEN=hf_...` in your environment ([get a free token here](https://huggingface.co/settings/tokens)).
>
> **Having issues?** Join our [Discord](https://discord.gg/aRRdVj3de7) for setup help and troubleshooting.
### Desktop App
Pre-built installers (~6–8 MB) are available on the [**Releases**](https://github.com/debpalash/OmniVoice-Studio/releases/latest) page. On first launch, the app bootstraps a Python environment and downloads model weights automatically — the splash screen shows progress.
To build from source instead:
```bash
bun run desktop # Launches Tauri native app (macOS / Windows / Linux)
```