From 45224d7ccd561a27b8f37aed33aab55ed10009b6 Mon Sep 17 00:00:00 2001 From: DrMelone <27028174+Classic298@users.noreply.github.com> Date: Mon, 29 Dec 2025 23:44:37 +0100 Subject: [PATCH] audio --- .../audio/speech-to-text/env-variables.md | 133 +++++- .../mistral-voxtral-integration.md | 125 +++++ .../speech-to-text/openai-stt-integration.md | 136 ++++++ .../audio/speech-to-text/stt-config.md | 53 ++- .../Kokoro-FastAPI-integration.md | 39 ++ .../chatterbox-tts-api-integration.md | 28 ++ .../text-to-speech/kokoro-web-integration.md | 23 + .../openai-edge-tts-integration.md | 63 +++ .../text-to-speech/openai-tts-integration.md | 120 +++++ .../openedai-speech-integration.md | 34 ++ docs/features/index.mdx | 8 +- docs/getting-started/env-configuration.mdx | 62 +++ docs/troubleshooting/audio.mdx | 430 ++++++++++++++++++ docs/troubleshooting/microphone-error.mdx | 38 -- 14 files changed, 1225 insertions(+), 67 deletions(-) create mode 100644 docs/features/audio/speech-to-text/mistral-voxtral-integration.md create mode 100644 docs/features/audio/speech-to-text/openai-stt-integration.md create mode 100644 docs/features/audio/text-to-speech/openai-tts-integration.md create mode 100644 docs/troubleshooting/audio.mdx delete mode 100644 docs/troubleshooting/microphone-error.mdx diff --git a/docs/features/audio/speech-to-text/env-variables.md b/docs/features/audio/speech-to-text/env-variables.md index 3104108d9..0f002057d 100644 --- a/docs/features/audio/speech-to-text/env-variables.md +++ b/docs/features/audio/speech-to-text/env-variables.md @@ -11,20 +11,119 @@ For a complete list of all Open WebUI environment variables, see the [Environmen ::: -The following is a summary of the environment variables for speech to text (STT). - -# Environment Variables For Speech To Text (STT) - -| Variable | Description | -|----------|-------------| -| `WHISPER_MODEL` | Sets the Whisper model to use for local Speech-to-Text | -| `WHISPER_MODEL_DIR` | Specifies the directory to store Whisper model files | -| `WHISPER_COMPUTE_TYPE` | Sets the compute type for Whisper model inference (e.g., `int8`, `float16`) | -| `WHISPER_LANGUAGE` | Specifies the ISO 639-1 (ISO 639-2 for Hawaiian and Cantonese) Speech-to-Text language to use for Whisper (language is predicted unless set) | -| `AUDIO_STT_ENGINE` | Specifies the Speech-to-Text engine to use (empty for local Whisper, or `openai`) | -| `AUDIO_STT_MODEL` | Specifies the Speech-to-Text model for OpenAI-compatible endpoints | -| `AUDIO_STT_OPENAI_API_BASE_URL` | Sets the OpenAI-compatible base URL for Speech-to-Text | -| `AUDIO_STT_OPENAI_API_KEY` | Sets the OpenAI API key for Speech-to-Text | -| `AUDIO_STT_AZURE_API_KEY` | Sets the Azure API key for Speech-to-Text | -| `AUDIO_STT_AZURE_REGION` | Sets the Azure region for Speech-to-Text | -| `AUDIO_STT_AZURE_LOCALES` | Sets the Azure locales for Speech-to-Text | +The following is a summary of the environment variables for speech to text (STT) and text to speech (TTS). + +:::tip UI Configuration +Most of these settings can also be configured in the **Admin Panel → Settings → Audio** tab. Environment variables take precedence on startup but can be overridden in the UI. 
+::: + +## Speech To Text (STT) Environment Variables + +### Local Whisper + +| Variable | Description | Default | +|----------|-------------|---------| +| `WHISPER_MODEL` | Whisper model size | `base` | +| `WHISPER_MODEL_DIR` | Directory to store Whisper model files | `{CACHE_DIR}/whisper/models` | +| `WHISPER_COMPUTE_TYPE` | Compute type for inference (see note below) | `int8` | +| `WHISPER_LANGUAGE` | ISO 639-1 language code (empty = auto-detect) | empty | +| `WHISPER_MODEL_AUTO_UPDATE` | Auto-download model updates | `false` | +| `WHISPER_VAD_FILTER` | Enable Voice Activity Detection filter | `false` | + +:::info WHISPER_COMPUTE_TYPE Options +- `int8` — CPU default, fastest but may not work on older GPUs +- `float16` — **Recommended for CUDA/GPU** +- `int8_float16` — Hybrid mode (int8 weights, float16 computation) +- `float32` — Maximum compatibility, slowest + +If using the `:cuda` Docker image with an older GPU, set `WHISPER_COMPUTE_TYPE=float16` to avoid errors. +::: + +### OpenAI-Compatible STT + +| Variable | Description | Default | +|----------|-------------|---------| +| `AUDIO_STT_ENGINE` | STT engine: empty (local Whisper), `openai`, `azure`, `deepgram`, `mistral` | empty | +| `AUDIO_STT_MODEL` | STT model for external providers | empty | +| `AUDIO_STT_OPENAI_API_BASE_URL` | OpenAI-compatible API base URL | `https://api.openai.com/v1` | +| `AUDIO_STT_OPENAI_API_KEY` | OpenAI API key | empty | +| `AUDIO_STT_SUPPORTED_CONTENT_TYPES` | Comma-separated list of supported audio MIME types | empty | + +### Azure STT + +| Variable | Description | Default | +|----------|-------------|---------| +| `AUDIO_STT_AZURE_API_KEY` | Azure Cognitive Services API key | empty | +| `AUDIO_STT_AZURE_REGION` | Azure region | `eastus` | +| `AUDIO_STT_AZURE_LOCALES` | Comma-separated locales (e.g., `en-US,de-DE`) | auto | +| `AUDIO_STT_AZURE_BASE_URL` | Custom Azure base URL (optional) | empty | +| `AUDIO_STT_AZURE_MAX_SPEAKERS` | Max speakers for diarization | `3` | + +### Deepgram STT + +| Variable | Description | Default | +|----------|-------------|---------| +| `DEEPGRAM_API_KEY` | Deepgram API key | empty | + +### Mistral STT + +| Variable | Description | Default | +|----------|-------------|---------| +| `AUDIO_STT_MISTRAL_API_KEY` | Mistral API key | empty | +| `AUDIO_STT_MISTRAL_API_BASE_URL` | Mistral API base URL | `https://api.mistral.ai/v1` | +| `AUDIO_STT_MISTRAL_USE_CHAT_COMPLETIONS` | Use chat completions endpoint | `false` | + +## Text To Speech (TTS) Environment Variables + +### General TTS + +| Variable | Description | Default | +|----------|-------------|---------| +| `AUDIO_TTS_ENGINE` | TTS engine: empty (disabled), `openai`, `elevenlabs`, `azure`, `transformers` | empty | +| `AUDIO_TTS_MODEL` | TTS model | `tts-1` | +| `AUDIO_TTS_VOICE` | Default voice | `alloy` | +| `AUDIO_TTS_SPLIT_ON` | Split text on: `punctuation` or `none` | `punctuation` | +| `AUDIO_TTS_API_KEY` | API key for ElevenLabs or Azure TTS | empty | + +### OpenAI-Compatible TTS + +| Variable | Description | Default | +|----------|-------------|---------| +| `AUDIO_TTS_OPENAI_API_BASE_URL` | OpenAI-compatible TTS API base URL | `https://api.openai.com/v1` | +| `AUDIO_TTS_OPENAI_API_KEY` | OpenAI TTS API key | empty | +| `AUDIO_TTS_OPENAI_PARAMS` | Additional JSON params for OpenAI TTS | empty | + +### Azure TTS + +| Variable | Description | Default | +|----------|-------------|---------| +| `AUDIO_TTS_AZURE_SPEECH_REGION` | Azure Speech region | `eastus` | +| `AUDIO_TTS_AZURE_SPEECH_BASE_URL` | Custom Azure Speech 
base URL (optional) | empty | +| `AUDIO_TTS_AZURE_SPEECH_OUTPUT_FORMAT` | Audio output format | `audio-24khz-160kbitrate-mono-mp3` | + +## Tips for Configuring Audio + +### Using Local Whisper STT + +For GPU acceleration issues or older GPUs, try setting: +```yaml +environment: + - WHISPER_COMPUTE_TYPE=float16 +``` + +### Using External TTS Services + +When running Open WebUI in Docker with an external TTS service: + +```yaml +environment: + - AUDIO_TTS_ENGINE=openai + - AUDIO_TTS_OPENAI_API_BASE_URL=http://host.docker.internal:5050/v1 + - AUDIO_TTS_OPENAI_API_KEY=your-api-key +``` + +:::tip +Use `host.docker.internal` on Docker Desktop (Windows/Mac) to access services on the host. On Linux, use the host IP or container networking. +::: + +For troubleshooting audio issues, see the [Audio Troubleshooting Guide](/troubleshooting/audio). diff --git a/docs/features/audio/speech-to-text/mistral-voxtral-integration.md b/docs/features/audio/speech-to-text/mistral-voxtral-integration.md new file mode 100644 index 000000000..f844d0ec2 --- /dev/null +++ b/docs/features/audio/speech-to-text/mistral-voxtral-integration.md @@ -0,0 +1,125 @@ +--- +sidebar_position: 2 +title: "Mistral Voxtral STT" +--- + +# Using Mistral Voxtral for Speech-to-Text + +This guide covers how to use Mistral's Voxtral model for Speech-to-Text with Open WebUI. Voxtral is Mistral's speech-to-text model that provides accurate transcription. + +## Requirements + +- A Mistral API key +- Open WebUI installed and running + +## Quick Setup (UI) + +1. Click your **profile icon** (bottom-left corner) +2. Select **Admin Panel** +3. Click **Settings** → **Audio** tab +4. Configure the following: + +| Setting | Value | +|---------|-------| +| **Speech-to-Text Engine** | `MistralAI` | +| **API Key** | Your Mistral API key | +| **STT Model** | `voxtral-mini-latest` (or leave empty for default) | + +5. Click **Save** + +## Available Models + +| Model | Description | +|-------|-------------| +| `voxtral-mini-latest` | Default transcription model (recommended) | + +## Environment Variables Setup + +If you prefer to configure via environment variables: + +```yaml +services: + open-webui: + image: ghcr.io/open-webui/open-webui:main + environment: + - AUDIO_STT_ENGINE=mistral + - AUDIO_STT_MISTRAL_API_KEY=your-mistral-api-key + - AUDIO_STT_MODEL=voxtral-mini-latest + # ... other configuration +``` + +### All Mistral STT Environment Variables + +| Variable | Description | Default | +|----------|-------------|---------| +| `AUDIO_STT_ENGINE` | Set to `mistral` | empty (uses local Whisper) | +| `AUDIO_STT_MISTRAL_API_KEY` | Your Mistral API key | empty | +| `AUDIO_STT_MISTRAL_API_BASE_URL` | Mistral API base URL | `https://api.mistral.ai/v1` | +| `AUDIO_STT_MISTRAL_USE_CHAT_COMPLETIONS` | Use chat completions endpoint | `false` | +| `AUDIO_STT_MODEL` | STT model | `voxtral-mini-latest` | + +## Transcription Methods + +Mistral supports two transcription methods: + +### Standard Transcription (Default) +Uses the dedicated transcription endpoint. This is the recommended method. + +### Chat Completions Method +Set `AUDIO_STT_MISTRAL_USE_CHAT_COMPLETIONS=true` to use Mistral's chat completions API for transcription. This method: +- Requires audio in mp3 or wav format (automatic conversion is attempted) +- May provide different results than the standard endpoint + +## Using STT + +1. Click the **microphone icon** in the chat input +2. Speak your message +3. Click the microphone again or wait for silence detection +4. 
Your speech will be transcribed and appear in the input box + +## Supported Audio Formats + +Voxtral accepts common audio formats. The system defaults to accepting `audio/*` and `video/webm`. + +If using the chat completions method, audio is automatically converted to mp3. + +## Troubleshooting + +### API Key Errors + +If you see "Mistral API key is required": +1. Verify your API key is entered correctly +2. Check the API key hasn't expired +3. Ensure your Mistral account has API access + +### Transcription Not Working + +1. Check container logs: `docker logs open-webui -f` +2. Verify the STT Engine is set to `MistralAI` +3. Try the standard transcription method (disable chat completions) + +### Audio Format Issues + +If using chat completions method and audio conversion fails: +- Ensure FFmpeg is available in the container +- Try recording in a different format (wav or mp3) +- Switch to the standard transcription method + +For more troubleshooting, see the [Audio Troubleshooting Guide](/troubleshooting/audio). + +## Comparison with Other STT Options + +| Feature | Mistral Voxtral | OpenAI Whisper | Local Whisper | +|---------|-----------------|----------------|---------------| +| **Cost** | Per-minute pricing | Per-minute pricing | Free | +| **Privacy** | Audio sent to Mistral | Audio sent to OpenAI | Audio stays local | +| **Model Options** | voxtral-mini-latest | whisper-1 | tiny → large | +| **GPU Required** | No | No | Recommended | + +## Cost Considerations + +Mistral charges per minute of audio for STT. Check [Mistral's pricing page](https://mistral.ai/products/la-plateforme#pricing) for current rates. + +:::tip +For free STT, use **Local Whisper** (the default) or the browser's **Web API** for basic transcription. +::: diff --git a/docs/features/audio/speech-to-text/openai-stt-integration.md b/docs/features/audio/speech-to-text/openai-stt-integration.md new file mode 100644 index 000000000..12dc9e60f --- /dev/null +++ b/docs/features/audio/speech-to-text/openai-stt-integration.md @@ -0,0 +1,136 @@ +--- +sidebar_position: 0 +title: "OpenAI STT Integration" +--- + +# Using OpenAI for Speech-to-Text + +This guide covers how to use OpenAI's Whisper API for Speech-to-Text with Open WebUI. This provides cloud-based transcription without needing local GPU resources. + +:::tip Looking for TTS? +See the companion guide: [Using OpenAI for Text-to-Speech](/features/audio/text-to-speech/openai-tts-integration) +::: + +## Requirements + +- An OpenAI API key with access to the Audio API +- Open WebUI installed and running + +## Quick Setup (UI) + +1. Click your **profile icon** (bottom-left corner) +2. Select **Admin Panel** +3. Click **Settings** → **Audio** tab +4. Configure the following: + +| Setting | Value | +|---------|-------| +| **Speech-to-Text Engine** | `OpenAI` | +| **API Base URL** | `https://api.openai.com/v1` | +| **API Key** | Your OpenAI API key | +| **STT Model** | `whisper-1` | +| **Supported Content Types** | Leave empty for defaults, or set `audio/wav,audio/mpeg,audio/webm` | + +5. Click **Save** + +## Available Models + +| Model | Description | +|-------|-------------| +| `whisper-1` | OpenAI's Whisper large-v2 model, hosted in the cloud | + +:::info +OpenAI currently only offers `whisper-1`. For more model options, use Local Whisper (built into Open WebUI) or other providers like Deepgram. 
+::: + +## Environment Variables Setup + +If you prefer to configure via environment variables: + +```yaml +services: + open-webui: + image: ghcr.io/open-webui/open-webui:main + environment: + - AUDIO_STT_ENGINE=openai + - AUDIO_STT_OPENAI_API_BASE_URL=https://api.openai.com/v1 + - AUDIO_STT_OPENAI_API_KEY=sk-... + - AUDIO_STT_MODEL=whisper-1 + # ... other configuration +``` + +### All STT Environment Variables (OpenAI) + +| Variable | Description | Default | +|----------|-------------|---------| +| `AUDIO_STT_ENGINE` | Set to `openai` | empty (uses local Whisper) | +| `AUDIO_STT_OPENAI_API_BASE_URL` | OpenAI API base URL | `https://api.openai.com/v1` | +| `AUDIO_STT_OPENAI_API_KEY` | Your OpenAI API key | empty | +| `AUDIO_STT_MODEL` | STT model | `whisper-1` | +| `AUDIO_STT_SUPPORTED_CONTENT_TYPES` | Allowed audio MIME types | `audio/*,video/webm` | + +### Supported Audio Formats + +By default, Open WebUI accepts `audio/*` and `video/webm` for transcription. If you need to restrict or expand supported formats, set `AUDIO_STT_SUPPORTED_CONTENT_TYPES`: + +```yaml +environment: + - AUDIO_STT_SUPPORTED_CONTENT_TYPES=audio/wav,audio/mpeg,audio/webm +``` + +OpenAI's Whisper API supports: `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `wav`, `webm` + +## Using STT + +1. Click the **microphone icon** in the chat input +2. Speak your message +3. Click the microphone again or wait for silence detection +4. Your speech will be transcribed and appear in the input box + +## OpenAI vs Local Whisper + +| Feature | OpenAI Whisper API | Local Whisper | +|---------|-------------------|---------------| +| **Latency** | Network dependent | Faster for short clips | +| **Cost** | Per-minute pricing | Free (uses your hardware) | +| **Privacy** | Audio sent to OpenAI | Audio stays local | +| **GPU Required** | No | Recommended for speed | +| **Model Options** | `whisper-1` only | tiny, base, small, medium, large | + +Choose **OpenAI** if: +- You don't have a GPU +- You want consistent performance +- Privacy isn't a concern + +Choose **Local Whisper** if: +- You want free transcription +- You need audio to stay private +- You have a GPU for acceleration + +## Troubleshooting + +### Microphone Not Working + +1. Ensure you're using HTTPS or localhost +2. Check browser microphone permissions +3. See [Microphone Access Issues](/troubleshooting/audio#microphone-access-issues) + +### Transcription Errors + +1. Check your OpenAI API key is valid +2. Verify the API Base URL is correct +3. Check container logs for error messages + +### Language Issues + +OpenAI's Whisper API automatically detects language. If you need to force a specific language, consider using Local Whisper with the `WHISPER_LANGUAGE` environment variable. + +For more troubleshooting, see the [Audio Troubleshooting Guide](/troubleshooting/audio). + +## Cost Considerations + +OpenAI charges per minute of audio for STT. See [OpenAI Pricing](https://platform.openai.com/docs/pricing) for current rates. + +:::tip +For free STT, use **Local Whisper** (the default) or the browser's **Web API** for basic transcription. +::: diff --git a/docs/features/audio/speech-to-text/stt-config.md b/docs/features/audio/speech-to-text/stt-config.md index fcc3d6fc9..1e709adb5 100644 --- a/docs/features/audio/speech-to-text/stt-config.md +++ b/docs/features/audio/speech-to-text/stt-config.md @@ -3,22 +3,25 @@ sidebar_position: 1 title: "Configuration" --- -Open Web UI supports both local, browser, and remote speech to text. 
+Open WebUI supports local, browser-based, and remote speech-to-text.
 
 ![alt text](/images/tutorials/stt/image.png)
 
 ![alt text](/images/tutorials/stt/stt-providers.png)
 
-## Cloud / Remote Speech To Text Proivders
+## Cloud / Remote Speech To Text Providers
 
-The following cloud speech to text providers are currently supported. API keys can be configured as environment variables (OpenAI) or in the admin settings page (both keys).
+The following speech-to-text providers are supported:
 
- | Service | API Key Required |
- | ------------- | ------------- |
- | OpenAI | ✅ |
- | DeepGram | ✅ |
+| Service | API Key Required | Guide |
+|---------|------------------|-------|
+| Local Whisper (default) | ❌ | Built-in, see [Environment Variables](/features/audio/speech-to-text/env-variables) |
+| OpenAI (Whisper API) | ✅ | [OpenAI STT Guide](/features/audio/speech-to-text/openai-stt-integration) |
+| Mistral (Voxtral) | ✅ | [Mistral Voxtral Guide](/features/audio/speech-to-text/mistral-voxtral-integration) |
+| Deepgram | ✅ | — |
+| Azure | ✅ | — |
 
- WebAPI provides STT via the built-in browser STT provider.
+**Web API** provides STT via the browser's built-in speech recognition (no API key needed, configured in user settings).
 
 ## Configuring Your STT Provider
 
@@ -59,3 +62,37 @@ Once your recording has begun you can:
 - If you wish to abort the recording (for example, you wish to start a fresh recording) you can click on the 'x' icon to escape the recording interface
 
 ![alt text](/images/tutorials/stt/endstt.png)
+
+## Troubleshooting
+
+### Common Issues
+
+#### "int8 compute type not supported" Error
+
+If you see an error like `Requested int8 compute type, but the target device or backend do not support efficient int8 computation`, this usually means your GPU doesn't support the requested compute operations.
+
+**Solutions:**
+- **Switch to the standard Docker image** instead of the `:cuda` image — older GPUs (Maxwell architecture, ~2014-2016) may not be supported
+- **Change the compute type** using the `WHISPER_COMPUTE_TYPE` environment variable:
+  ```yaml
+  environment:
+    - WHISPER_COMPUTE_TYPE=float16 # or float32
+  ```
+
+:::tip
+For smaller models like Whisper, CPU mode often provides comparable performance without GPU compatibility issues. The `:cuda` image primarily accelerates RAG embeddings and won't significantly impact STT speed for most users.
+:::
+
+#### Microphone Not Working
+
+1. **Check browser permissions** — ensure your browser has microphone access
+2. **Use HTTPS** — some browsers require secure connections for microphone access
+3. **Try another browser** — Chrome typically has the best support for web audio APIs
+
+#### Poor Recognition Accuracy
+
+- **Set the language explicitly** using `WHISPER_LANGUAGE=en` (uses ISO 639-1 codes)
+- **Use a larger Whisper model** — options: `tiny`, `base`, `small`, `medium`, `large`
+- Larger models are more accurate but slower (a combined example is sketched below)
+
+For more detailed troubleshooting, see the [Audio Troubleshooting Guide](/troubleshooting/audio).
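+
+### Example: Accuracy-Focused Whisper Configuration
+
+As a minimal sketch, the accuracy options above can be combined in Docker Compose. The model and language values here are illustrative placeholders; pick the ones that fit your hardware and your users:
+
+```yaml
+services:
+  open-webui:
+    image: ghcr.io/open-webui/open-webui:main
+    environment:
+      - WHISPER_MODEL=small       # larger than the default `base` for better accuracy
+      - WHISPER_LANGUAGE=en       # skip auto-detection when the language is known
+      - WHISPER_VAD_FILTER=true   # filter out non-speech segments before transcription
+```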
diff --git a/docs/features/audio/text-to-speech/Kokoro-FastAPI-integration.md b/docs/features/audio/text-to-speech/Kokoro-FastAPI-integration.md index 101ab230f..248d2c941 100644 --- a/docs/features/audio/text-to-speech/Kokoro-FastAPI-integration.md +++ b/docs/features/audio/text-to-speech/Kokoro-FastAPI-integration.md @@ -138,3 +138,42 @@ docker compose up --build **That's it!** For more information on building the Docker container, including changing ports, please refer to the [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI) repository + +## Troubleshooting + +### NVIDIA GPU Not Detected + +If the GPU version isn't using your GPU: + +1. **Install NVIDIA Container Toolkit:** + ```bash + # Ubuntu/Debian + distribution=$(. /etc/os-release;echo $ID$VERSION_ID) + curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - + curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list + sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit + sudo systemctl restart docker + ``` + +2. **Verify GPU access:** + ```bash + docker run --rm --gpus all nvidia/cuda:12.2.0-base nvidia-smi + ``` + +### Connection Issues from Open WebUI + +If Open WebUI can't reach Kokoro: + +- Use `host.docker.internal:8880` instead of `localhost:8880` (Docker Desktop) +- If both are in Docker Compose, use `http://kokoro-fastapi-gpu:8880/v1` +- Verify the service is running: `curl http://localhost:8880/health` + +### CPU Version Performance + +The CPU version uses ONNX optimization and performs well for most use cases. If speed is a concern: + +- Consider upgrading to the GPU version +- Ensure no other heavy processes are running on the CPU +- The CPU version is recommended for systems without compatible NVIDIA GPUs + +For more troubleshooting tips, see the [Audio Troubleshooting Guide](/troubleshooting/audio). diff --git a/docs/features/audio/text-to-speech/chatterbox-tts-api-integration.md b/docs/features/audio/text-to-speech/chatterbox-tts-api-integration.md index 1b2c3c8d1..c65bf6186 100644 --- a/docs/features/audio/text-to-speech/chatterbox-tts-api-integration.md +++ b/docs/features/audio/text-to-speech/chatterbox-tts-api-integration.md @@ -234,3 +234,31 @@ For more information on `chatterbox-tts-api`, you can visit the [GitHub repo](ht - 📖 **Documentation**: See [API Documentation](https://github.com/travisvn/chatterbox-tts-api/blob/main/docs/API_README.md) and [Docker Guide](https://github.com/travisvn/chatterbox-tts-api/blob/main/docs/DOCKER_README.md) - 💬 **Discord**: [Join the Discord for this project](http://chatterboxtts.com/discord) + +## Troubleshooting + +### Memory Requirements + +Chatterbox has higher memory requirements than other TTS solutions: +- **Minimum:** 4GB RAM +- **Recommended:** 8GB+ RAM +- **GPU:** NVIDIA CUDA or Apple M-series (MPS) recommended + +If you experience memory issues, consider using a lighter alternative like [OpenAI Edge TTS](/features/audio/text-to-speech/openai-edge-tts-integration) or [Kokoro-FastAPI](/features/audio/text-to-speech/Kokoro-FastAPI-integration). + +### Docker Networking + +If Open WebUI can't connect to Chatterbox: + +- **Docker Desktop:** Use `http://host.docker.internal:4123/v1` +- **Docker Compose:** Use `http://chatterbox-tts-api:4123/v1` +- **Linux:** Use your host machine's IP address + +### First-Time Startup + +The first TTS request takes significantly longer as the model loads. 
Check logs with: +```bash +docker logs chatterbox-tts-api -f +``` + +For more troubleshooting tips, see the [Audio Troubleshooting Guide](/troubleshooting/audio). diff --git a/docs/features/audio/text-to-speech/kokoro-web-integration.md b/docs/features/audio/text-to-speech/kokoro-web-integration.md index 580161840..7c66f61bd 100644 --- a/docs/features/audio/text-to-speech/kokoro-web-integration.md +++ b/docs/features/audio/text-to-speech/kokoro-web-integration.md @@ -89,4 +89,27 @@ Visit the [**Kokoro Web Demo**](https://voice-generator.pages.dev) to preview al For additional options, voice customization guides, and advanced settings, visit the [GitHub repository](https://github.com/eduardolat/kokoro-web). +## Troubleshooting + +### Connection Issues + +If Open WebUI can't reach Kokoro Web: + +- **Docker Desktop (Windows/Mac):** Use `http://host.docker.internal:3000/api/v1` +- **Docker Compose (same network):** Use `http://kokoro-web:3000/api/v1` +- **Linux Docker:** Use your host machine's IP address + +### Voice Not Working + +1. Verify the secret API key matches in both the Kokoro Web config and Open WebUI settings +2. Test the API directly: + ```bash + curl -X POST http://localhost:3000/api/v1/audio/speech \ + -H "Authorization: Bearer your-api-key" \ + -H "Content-Type: application/json" \ + -d '{"input": "Hello world", "voice": "af_heart"}' + ``` + +For more troubleshooting tips, see the [Audio Troubleshooting Guide](/troubleshooting/audio). + **Enjoy natural AI voices in your OpenWebUI conversations!** diff --git a/docs/features/audio/text-to-speech/openai-edge-tts-integration.md b/docs/features/audio/text-to-speech/openai-edge-tts-integration.md index 7bd30a307..232a1993d 100644 --- a/docs/features/audio/text-to-speech/openai-edge-tts-integration.md +++ b/docs/features/audio/text-to-speech/openai-edge-tts-integration.md @@ -261,3 +261,66 @@ For direct support, you can visit the [Voice AI & TTS Discord](https://tts.travi ## 🎙️ Voice Samples [Play voice samples and see all available Edge TTS voices](https://tts.travisvn.com/) + +## Troubleshooting + +### Connection Issues + +#### "localhost" Not Working from Docker + +If Open WebUI runs in Docker and can't reach the TTS service at `localhost:5050`: + +**Solutions:** +- Use `host.docker.internal:5050` instead of `localhost:5050` (Docker Desktop on Windows/Mac) +- On Linux, use the host's IP address, or add `--network host` to your Docker run command +- If both services are in Docker Compose, use the container name: `http://openai-edge-tts:5050/v1` + +**Example Docker Compose for both services on the same network:** + +```yaml +services: + open-webui: + image: ghcr.io/open-webui/open-webui:main + environment: + - AUDIO_TTS_ENGINE=openai + - AUDIO_TTS_OPENAI_API_BASE_URL=http://openai-edge-tts:5050/v1 + - AUDIO_TTS_OPENAI_API_KEY=your_api_key_here + networks: + - webui-network + + openai-edge-tts: + image: travisvn/openai-edge-tts:latest + ports: + - "5050:5050" + environment: + - API_KEY=your_api_key_here + networks: + - webui-network + +networks: + webui-network: + driver: bridge +``` + +#### Testing the TTS Service + +Verify the TTS service is working independently: + +```bash +curl -X POST http://localhost:5050/v1/audio/speech \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer your_api_key_here" \ + -d '{"input": "Test message", "voice": "alloy"}' \ + --output test.mp3 +``` + +If this works but Open WebUI still can't connect, the issue is network-related between containers. 
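+
+To narrow that down, you can check whether both containers are actually attached to the same Docker network. This sketch assumes the compose network is named `webui-network` as in the example above; note that Docker Compose may prefix the name with your project name:
+
+```bash
+# List all networks, then print the names of containers attached to the suspected one
+docker network ls
+docker network inspect webui-network --format '{{range .Containers}}{{.Name}} {{end}}'
+```
+
+If only one of the two containers appears in the output, attach both to the same network in your Compose file.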
+ +### No Audio Output in Open WebUI + +1. Check that the API Base URL ends with `/v1` +2. Verify the API key matches between both services (or remove the requirement) +3. Check Open WebUI container logs: `docker logs open-webui` +4. Check openai-edge-tts logs: `docker logs openai-edge-tts` (or your container name) + +For more troubleshooting tips, see the [Audio Troubleshooting Guide](/troubleshooting/audio). diff --git a/docs/features/audio/text-to-speech/openai-tts-integration.md b/docs/features/audio/text-to-speech/openai-tts-integration.md new file mode 100644 index 000000000..5742c4d7d --- /dev/null +++ b/docs/features/audio/text-to-speech/openai-tts-integration.md @@ -0,0 +1,120 @@ +--- +sidebar_position: 0 +title: "OpenAI TTS Integration" +--- + +# Using OpenAI for Text-to-Speech + +This guide covers how to use OpenAI's official Text-to-Speech API with Open WebUI. This is the simplest setup if you already have an OpenAI API key. + +:::tip Looking for STT? +See the companion guide: [Using OpenAI for Speech-to-Text](/features/audio/speech-to-text/openai-stt-integration) +::: + +## Requirements + +- An OpenAI API key with access to the Audio API +- Open WebUI installed and running + +## Quick Setup (UI) + +1. Click your **profile icon** (bottom-left corner) +2. Select **Admin Panel** +3. Click **Settings** → **Audio** tab +4. Configure the following: + +| Setting | Value | +|---------|-------| +| **Text-to-Speech Engine** | `OpenAI` | +| **API Base URL** | `https://api.openai.com/v1` | +| **API Key** | Your OpenAI API key | +| **TTS Model** | `tts-1` or `tts-1-hd` | +| **TTS Voice** | Choose from available voices | + +5. Click **Save** + +## Available Models + +| Model | Description | Best For | +|-------|-------------|----------| +| `tts-1` | Standard quality, lower latency | Real-time applications, faster responses | +| `tts-1-hd` | Higher quality audio | Pre-recorded content, premium audio quality | + +## Available Voices + +OpenAI provides 6 built-in voices: + +| Voice | Description | +|-------|-------------| +| `alloy` | Neutral, balanced | +| `echo` | Warm, conversational | +| `fable` | Expressive, British accent | +| `onyx` | Deep, authoritative | +| `nova` | Friendly, upbeat | +| `shimmer` | Soft, gentle | + +:::tip +Try different voices to find the one that best suits your use case. You can preview voices in OpenAI's documentation. +::: + +## Environment Variables Setup + +If you prefer to configure via environment variables: + +```yaml +services: + open-webui: + image: ghcr.io/open-webui/open-webui:main + environment: + - AUDIO_TTS_ENGINE=openai + - AUDIO_TTS_OPENAI_API_BASE_URL=https://api.openai.com/v1 + - AUDIO_TTS_OPENAI_API_KEY=sk-... + - AUDIO_TTS_MODEL=tts-1 + - AUDIO_TTS_VOICE=alloy + # ... other configuration +``` + +### All TTS Environment Variables + +| Variable | Description | Default | +|----------|-------------|---------| +| `AUDIO_TTS_ENGINE` | Set to `openai` | empty | +| `AUDIO_TTS_OPENAI_API_BASE_URL` | OpenAI API base URL | `https://api.openai.com/v1` | +| `AUDIO_TTS_OPENAI_API_KEY` | Your OpenAI API key | empty | +| `AUDIO_TTS_MODEL` | TTS model (`tts-1` or `tts-1-hd`) | `tts-1` | +| `AUDIO_TTS_VOICE` | Voice to use | `alloy` | + +## Testing TTS + +1. Start a new chat +2. Send a message to any model +3. Click the **speaker icon** on the AI response to hear it read aloud + +## Troubleshooting + +### No Audio Plays + +1. Check your OpenAI API key is valid and has Audio API access +2. Verify the API Base URL is correct (`https://api.openai.com/v1`) +3. 
Check browser console (F12) for errors + +### Audio Quality Issues + +- Switch from `tts-1` to `tts-1-hd` for higher quality +- Note: `tts-1-hd` has slightly higher latency + +### Rate Limits + +OpenAI has rate limits on the Audio API. If you're hitting limits: +- Consider caching common phrases +- Use `tts-1` instead of `tts-1-hd` (uses fewer tokens) + +For more troubleshooting, see the [Audio Troubleshooting Guide](/troubleshooting/audio). + +## Cost Considerations + +OpenAI charges per character for TTS. See [OpenAI Pricing](https://platform.openai.com/docs/pricing) for current rates. Note that `tts-1-hd` costs more than `tts-1`. + +:::info +For a free alternative, consider [OpenAI Edge TTS](/features/audio/text-to-speech/openai-edge-tts-integration) which uses Microsoft's free Edge browser TTS. +::: diff --git a/docs/features/audio/text-to-speech/openedai-speech-integration.md b/docs/features/audio/text-to-speech/openedai-speech-integration.md index b4813e71f..f866fe605 100644 --- a/docs/features/audio/text-to-speech/openedai-speech-integration.md +++ b/docs/features/audio/text-to-speech/openedai-speech-integration.md @@ -190,6 +190,40 @@ If you encounter any problems integrating `openedai-speech` with Open WebUI, fol - If you're still experiencing issues, try restarting the `openedai-speech` service or the entire Docker environment. - If the problem persists, consult the `openedai-speech` GitHub repository or seek help on a relevant community forum. +### GPU Memory Issues (XTTS) + +If XTTS fails to load or causes out-of-memory errors: +- XTTS requires approximately 4GB of GPU VRAM +- Consider using the minimal Piper-only image (`docker-compose.min.yml`) which runs on CPU +- Reduce other GPU memory usage before starting the container + +### AMD GPU (ROCm) Notes + +When using AMD GPUs: +1. Uncomment `USE_ROCM=1` in your `speech.env` file +2. Use the `docker-compose.rocm.yml` file +3. Ensure ROCm drivers are properly installed on the host + +### ARM64 / Apple Silicon + +- XTTS has CPU-only support on ARM64 and will be **very slow** +- Use the Piper-only image (`docker-compose.min.yml`) for acceptable performance on ARM devices +- Apple M-series chips work but benefit from the minimal image + +### Container Networking + +If using Docker networks: +```yaml +# Add to your Docker Compose +networks: + webui-network: + driver: bridge +``` + +Then reference `http://openedai-speech:8000/v1` instead of `localhost`. + +For more troubleshooting tips, see the [Audio Troubleshooting Guide](/troubleshooting/audio). + ## FAQ **How can I control the emotional range of the generated audio?** diff --git a/docs/features/index.mdx b/docs/features/index.mdx index f7c62bbfc..40417975a 100644 --- a/docs/features/index.mdx +++ b/docs/features/index.mdx @@ -324,16 +324,16 @@ import { TopBanners } from "@site/src/components/TopBanners"; ### 🎙️ Audio, Voice, & Accessibility - 🗣️ **Voice Input Support with Multiple Providers**: Engage with your model through voice interactions using multiple Speech-to-Text providers: Local Whisper (default, with VAD filtering), OpenAI-compatible endpoints, Deepgram, and Azure Speech Services. Enjoy the convenience of talking to your model directly with automatic voice input after 3 seconds of silence for a streamlined experience. [Explore Audio Features](/category/speech-to-text--text-to-speech). 
- - Microphone access requires manually setting up a secure connection over HTTPS to work, or [manually whitelisting your URL at your own risk](https://docs.openwebui.com/troubleshooting/microphone-error). + - Microphone access requires manually setting up a secure connection over HTTPS to work, or [manually whitelisting your URL at your own risk](/troubleshooting/audio#solutions-for-non-https-connections). - 😊 **Emoji Call**: Toggle this feature on from the `Settings` > `Interface` menu, allowing LLMs to express emotions using emojis during voice calls for a more dynamic interaction. - - Microphone access requires manually setting up a secure connection over HTTPS to work, or [manually whitelisting your URL at your own risk](https://docs.openwebui.com/troubleshooting/microphone-error). + - Microphone access requires manually setting up a secure connection over HTTPS to work, or [manually whitelisting your URL at your own risk](/troubleshooting/audio#solutions-for-non-https-connections). - 🎙️ **Hands-Free Voice Call Feature**: Initiate voice calls without needing to use your hands, making interactions more seamless. - - Microphone access requires manually setting up a secure connection over HTTPS to work, or [manually whitelisting your URL at your own risk](https://docs.openwebui.com/troubleshooting/microphone-error). + - Microphone access requires manually setting up a secure connection over HTTPS to work, or [manually whitelisting your URL at your own risk](/troubleshooting/audio#solutions-for-non-https-connections). - 📹 **Video Call Feature**: Enable video calls with supported vision models like LlaVA and GPT-4o, adding a visual dimension to your communications. - - Both Camera & Microphone access is required using a secure connection over HTTPS for this feature to work, or [manually whitelisting your URL at your own risk](https://docs.openwebui.com/troubleshooting/microphone-error). + - Both Camera & Microphone access is required using a secure connection over HTTPS for this feature to work, or [manually whitelisting your URL at your own risk](/troubleshooting/audio#solutions-for-non-https-connections). - 👆 **Tap to Interrupt**: Stop the AI’s speech during voice conversations with a simple tap on mobile devices, ensuring seamless control over the interaction. diff --git a/docs/getting-started/env-configuration.mdx b/docs/getting-started/env-configuration.mdx index 0671cb1f1..93c44ca08 100644 --- a/docs/getting-started/env-configuration.mdx +++ b/docs/getting-started/env-configuration.mdx @@ -3522,6 +3522,20 @@ Note: If none of the specified languages are available and `en` was not in your - Description: Specifies the locales to use for Azure Speech-to-Text. - Persistence: This environment variable is a `PersistentConfig` variable. +#### `AUDIO_STT_AZURE_BASE_URL` + +- Type: `str` +- Default: `None` +- Description: Specifies a custom Azure base URL for Speech-to-Text. Use this if you have a custom Azure endpoint. +- Persistence: This environment variable is a `PersistentConfig` variable. + +#### `AUDIO_STT_AZURE_MAX_SPEAKERS` + +- Type: `int` +- Default: `3` +- Description: Sets the maximum number of speakers for Azure Speech-to-Text diarization. +- Persistence: This environment variable is a `PersistentConfig` variable. + ### Speech-to-Text (Deepgram) #### `DEEPGRAM_API_KEY` @@ -3531,6 +3545,38 @@ Note: If none of the specified languages are available and `en` was not in your - Description: Specifies the Deepgram API key to use for Speech-to-Text. 
- Persistence: This environment variable is a `PersistentConfig` variable. +### Speech-to-Text (Mistral) + +#### `AUDIO_STT_MISTRAL_API_KEY` + +- Type: `str` +- Default: `None` +- Description: Specifies the Mistral API key to use for Speech-to-Text. +- Persistence: This environment variable is a `PersistentConfig` variable. + +#### `AUDIO_STT_MISTRAL_API_BASE_URL` + +- Type: `str` +- Default: `https://api.mistral.ai/v1` +- Description: Specifies the Mistral API base URL to use for Speech-to-Text. +- Persistence: This environment variable is a `PersistentConfig` variable. + +#### `AUDIO_STT_MISTRAL_USE_CHAT_COMPLETIONS` + +- Type: `bool` +- Default: `False` +- Description: When enabled, uses the chat completions endpoint for Mistral Speech-to-Text instead of the dedicated transcription endpoint. +- Persistence: This environment variable is a `PersistentConfig` variable. + +### Speech-to-Text (General) + +#### `AUDIO_STT_SUPPORTED_CONTENT_TYPES` + +- Type: `str` +- Default: `None` +- Description: Comma-separated list of supported audio MIME types for Speech-to-Text (e.g., `audio/wav,audio/mpeg,video/*`). Leave empty to use defaults. +- Persistence: This environment variable is a `PersistentConfig` variable. + ### Text-to-Speech #### `AUDIO_TTS_API_KEY` @@ -3583,9 +3629,17 @@ Note: If none of the specified languages are available and `en` was not in your #### `AUDIO_TTS_AZURE_SPEECH_OUTPUT_FORMAT` - Type: `str` +- Default: `audio-24khz-160kbitrate-mono-mp3` - Description: Sets the output format for Azure Text to Speech. - Persistence: This environment variable is a `PersistentConfig` variable. +#### `AUDIO_TTS_AZURE_SPEECH_BASE_URL` + +- Type: `str` +- Default: `None` +- Description: Specifies a custom Azure Speech base URL for Text-to-Speech. Use this if you have a custom Azure endpoint. +- Persistence: This environment variable is a `PersistentConfig` variable. + ### Voice Mode #### `VOICE_MODE_PROMPT_TEMPLATE` @@ -3610,6 +3664,14 @@ Note: If none of the specified languages are available and `en` was not in your - Description: Sets the API key to use for text-to-speech. - Persistence: This environment variable is a `PersistentConfig` variable. +#### `AUDIO_TTS_OPENAI_PARAMS` + +- Type: `str` (JSON) +- Default: `{}` +- Description: Additional parameters for OpenAI-compatible TTS API in JSON format. Allows customization of API-specific settings. +- Example: `{"speed": 1.0}` +- Persistence: This environment variable is a `PersistentConfig` variable. + ### Elevenlabs Text-to-Speech #### `ELEVENLABS_API_BASE_URL` diff --git a/docs/troubleshooting/audio.mdx b/docs/troubleshooting/audio.mdx new file mode 100644 index 000000000..97de3b0c3 --- /dev/null +++ b/docs/troubleshooting/audio.mdx @@ -0,0 +1,430 @@ +--- +sidebar_position: 3 +title: "Audio Troubleshooting" +--- + +import { TopBanners } from "@site/src/components/TopBanners"; + + + +# Audio Troubleshooting Guide + +This page covers common issues with Speech-to-Text (STT) and Text-to-Speech (TTS) functionality in Open WebUI, along with their solutions. + +## Where to Find Audio Settings + +### Admin Settings (Server-Wide) + +Admins can configure server-wide audio defaults: + +1. Click your **profile icon** (bottom-left corner) +2. Select **Admin Panel** +3. Click **Settings** in the top navigation +4. 
Select the **Audio** tab + +Here you can configure: +- **Speech-to-Text Engine** — Choose between local Whisper, OpenAI, Azure, Deepgram, or Mistral +- **Whisper Model** — Select model size for local STT (tiny, base, small, medium, large) +- **Text-to-Speech Engine** — Choose between OpenAI-compatible, ElevenLabs, Azure, or local Transformers +- **TTS Voice** — Select the default voice +- **API Keys and Base URLs** — Configure external service connections + +### User Settings (Per-User) + +Individual users can customize their audio experience: + +1. Click your **profile icon** (bottom-left corner) +2. Select **Settings** +3. Click the **Audio** tab + +User-level options include: +- **STT Engine Override** — Use "Web API" for browser-based speech recognition +- **STT Language** — Set preferred language for transcription +- **TTS Engine** — Choose "Browser Kokoro" for local in-browser TTS +- **TTS Voice** — Select from available voices +- **Auto-playback** — Automatically play AI responses +- **Playback Speed** — Adjust audio speed +- **Conversation Mode** — Enable hands-free voice interaction + +:::tip +User settings override admin defaults. If you're having issues, check both locations to ensure settings aren't conflicting. +::: + +## Quick Setup Guide + +### Fastest Setup: OpenAI (Paid) + +If you have an OpenAI API key, this is the simplest setup: + +**In Admin Panel → Settings → Audio:** +- **STT Engine:** `OpenAI` | **Model:** `whisper-1` +- **TTS Engine:** `OpenAI` | **Model:** `tts-1` | **Voice:** `alloy` +- Enter your OpenAI API key in both sections + +Or via environment variables: +```yaml +environment: + - AUDIO_STT_ENGINE=openai + - AUDIO_STT_OPENAI_API_KEY=sk-... + - AUDIO_TTS_ENGINE=openai + - AUDIO_TTS_OPENAI_API_KEY=sk-... + - AUDIO_TTS_MODEL=tts-1 + - AUDIO_TTS_VOICE=alloy +``` + +→ See full guides: [Speech-to-Text](/category/speech-to-text) | [Text-to-Speech](/category/text-to-speech) + +### Free Setup: Local Whisper + Edge TTS + +For a completely free setup: + +**STT:** Leave engine empty (uses built-in Whisper) +```yaml +environment: + - WHISPER_MODEL=base # Options: tiny, base, small, medium, large +``` + +**TTS:** Use OpenAI Edge TTS (free Microsoft voices) +```yaml +services: + openai-edge-tts: + image: travisvn/openai-edge-tts:latest + ports: + - "5050:5050" + + open-webui: + environment: + - AUDIO_TTS_ENGINE=openai + - AUDIO_TTS_OPENAI_API_BASE_URL=http://openai-edge-tts:5050/v1 + - AUDIO_TTS_OPENAI_API_KEY=not-needed +``` + +→ See full guide: [OpenAI Edge TTS](/features/audio/text-to-speech/openai-edge-tts-integration) + +### Browser-Only Setup (No Config Needed) + +For basic functionality without any server-side setup: + +**In User Settings → Audio:** +- **STT Engine:** `Web API` (uses browser's built-in speech recognition) +- **TTS Engine:** `Web API` (uses browser's built-in text-to-speech) + +:::note +Browser-based audio has limited accuracy and voice options compared to server-side solutions. +::: + +## Microphone Access Issues + +### Understanding Secure Contexts 🔒 + +For security reasons, accessing the microphone is restricted to pages served over HTTPS or locally from `localhost`. This requirement is meant to safeguard your data by ensuring it is transmitted over secure channels. + +### Common Permission Issues 🚫 + +Browsers like Chrome, Brave, Microsoft Edge, Opera, and Vivaldi, as well as Firefox, restrict microphone access on non-HTTPS URLs. 
This typically becomes an issue when accessing a site from another device within the same network (e.g., using a mobile phone to access a desktop server).
+
+### Solutions for Non-HTTPS Connections
+
+1. **Set Up HTTPS (Recommended):**
+   - Configure your server to support HTTPS. This not only resolves permission issues but also enhances the security of your data transmissions.
+   - You can use a reverse proxy like Nginx or Caddy with Let's Encrypt certificates.
+
+2. **Temporary Browser Flags (Use with caution):**
+   - These settings force your browser to treat certain insecure URLs as secure. This is useful for development purposes but poses significant security risks.
+
+   **Chromium-based Browsers (e.g., Chrome, Brave):**
+   - Open `chrome://flags/#unsafely-treat-insecure-origin-as-secure`
+   - Enter your non-HTTPS address (e.g., `http://192.168.1.35:3000`)
+   - Restart the browser to apply the changes
+
+   **Firefox-based Browsers:**
+   - Open `about:config`
+   - Search and modify (or create) the string value `dom.securecontext.allowlist`
+   - Add your IP addresses separated by commas (e.g., `http://127.0.0.1:8080`)
+
+:::warning
+While browser flags offer a quick fix, they bypass important security checks which can expose your device and data to vulnerabilities. Always prioritize proper security measures, especially when planning for a production environment.
+:::
+
+### Microphone Not Working
+
+If the microphone icon doesn't respond even on HTTPS:
+
+1. **Check browser permissions:** Ensure your browser has microphone access for the site
+2. **Check system permissions:** On Windows/Mac, ensure the browser has microphone access in system settings
+3. **Check browser compatibility:** Some browsers have limited STT support
+4. **Try a different browser:** Chrome typically has the best support for web audio APIs
+
+---
+
+## Text-to-Speech (TTS) Issues
+
+### TTS Loading Forever / Not Working
+
+If clicking the play button on chat responses causes endless loading, try the following solutions:
+
+#### 1. Hugging Face Dataset Library Conflict (Local Transformers TTS)
+
+**Symptoms:**
+- TTS keeps loading forever
+- Container logs show: `RuntimeError: Dataset scripts are no longer supported, but found cmu-arctic-xvectors.py`
+
+**Cause:** This occurs when using local Transformers TTS (`AUDIO_TTS_ENGINE=transformers`). The `datasets` library is pulled in as an indirect dependency of the `transformers` package and isn't pinned to a specific version in Open WebUI's requirements. Newer versions of `datasets` removed support for dataset loading scripts, causing this error when loading speaker embeddings.
+
+**Solutions:**
+
+**Temporary fix** (must be re-applied after each container restart):
+```bash
+docker exec open-webui bash -lc "pip install datasets==3.6.0" && docker restart open-webui
+```
+
+**Permanent fix using environment variable:**
+Add this to your `docker-compose.yml`:
+```yaml
+environment:
+  - EXTRA_PIP_PACKAGES=datasets==3.6.0
+```
+
+**Verify the installed version:**
+```bash
+docker exec open-webui bash -lc "pip show datasets"
+```
+
+:::tip
+Consider using an external TTS service like [OpenAI Edge TTS](/features/audio/text-to-speech/openai-edge-tts-integration) or [Kokoro](/features/audio/text-to-speech/Kokoro-FastAPI-integration) instead of local Transformers TTS to avoid these dependency conflicts.
+:::
+
+#### 2. Using External TTS Instead of Local
+
+If you continue to have issues with local TTS, configuring an external TTS service is often more reliable.
See the example Docker Compose configuration below that uses `openai-edge-tts`: + +```yaml +services: + open-webui: + image: ghcr.io/open-webui/open-webui:main + environment: + - AUDIO_TTS_ENGINE=openai + - AUDIO_TTS_OPENAI_API_KEY=your-api-key-here + - AUDIO_TTS_OPENAI_API_BASE_URL=http://openai-edge-tts:5050/v1 + depends_on: + - openai-edge-tts + # ... other configuration + + openai-edge-tts: + image: travisvn/openai-edge-tts:latest + ports: + - "5050:5050" + environment: + - API_KEY=your-api-key-here + restart: unless-stopped +``` + +### TTS Voice Not Found / No Audio Output + +**Checklist:** +1. Verify the TTS engine is correctly configured in **Admin Panel → Settings → Audio** +2. Check that the voice name matches an available voice for your chosen engine +3. For external TTS services, verify the API Base URL is accessible from the Open WebUI container +4. Check container logs for any error messages + +### Docker Networking Issues with TTS + +If Open WebUI can't reach your TTS service: + +**Problem:** Using `localhost` in the API Base URL doesn't work from within Docker. + +**Solutions:** +- Use `host.docker.internal` instead of `localhost` (works on Docker Desktop for Windows/Mac) +- Use the container name if both services are on the same Docker network (e.g., `http://openai-edge-tts:5050/v1`) +- Use the host machine's IP address + +--- + +## Speech-to-Text (STT) Issues + +### Whisper STT Not Working / Compute Type Error + +**Symptoms:** +- Error message: `Requested int8 compute type, but the target device or backend do not support efficient int8 computation` +- STT fails to process audio + +**Cause:** This typically occurs when using the `:cuda` Docker image with an older NVIDIA GPU that doesn't support the required compute operations (e.g., Maxwell architecture GPUs like Tesla M60). + +**Solutions:** + +#### Switch to the Standard Image + +Older GPUs (Maxwell architecture, ~2014-2016) may not be supported by modern ML libraries with CUDA acceleration. Switch to the standard Docker image instead: + +```bash +# Instead of: +# ghcr.io/open-webui/open-webui:cuda + +# Use: +ghcr.io/open-webui/open-webui:main +``` + +:::info +The CUDA image primarily accelerates RAG embedding/reranking models and Whisper STT. For smaller models like Whisper, CPU mode often provides comparable performance without the compatibility issues. +::: + +#### Adjust Whisper Compute Type + +If you want to keep GPU acceleration, try changing the compute type: + +```yaml +environment: + - WHISPER_COMPUTE_TYPE=float16 # Recommended for GPU +``` + +**Available compute types (from faster-whisper):** + +| Compute Type | Best For | Notes | +|--------------|----------|-------| +| `int8` | **CPU (default)** | Fastest, but doesn't work on older GPUs | +| `float16` | **CUDA/GPU (recommended)** | Best balance of speed and compatibility for GPUs | +| `int8_float16` | GPU with hybrid precision | Uses int8 for weights, float16 for computation | +| `float32` | Maximum compatibility | Slowest, but works on all hardware | + +:::info Default Behavior +- **CPU mode:** Defaults to `int8` for best performance +- **CUDA mode:** The `:cuda` image may default to `int8`, which can cause errors on older GPUs. Set `float16` explicitly for GPUs. +::: + +### STT Not Recognizing Speech Correctly + +**Tips for better recognition:** + +1. **Set the correct language:** + ```yaml + environment: + - WHISPER_LANGUAGE=en # Use ISO 639-1 language code + ``` + +2. 
**Try a larger Whisper model** for better accuracy (at the cost of speed): + ```yaml + environment: + - WHISPER_MODEL=medium # Options: tiny, base, small, medium, large + ``` + +3. **Check microphone permissions** in your browser (see above) + +4. **Use the Web API engine** as an alternative: + - Go to user settings (not admin panel) + - Under STT Settings, try switching Speech-to-Text Engine to "Web API" + - This uses the browser's built-in speech recognition + +--- + +## ElevenLabs Integration + +ElevenLabs is natively supported in Open WebUI. To configure: + +1. Go to **Admin Panel → Settings → Audio** +2. Select **ElevenLabs** as the TTS engine +3. Enter your ElevenLabs API key +4. Select the voice and model +5. Save settings + +**Using environment variables:** + +```yaml +environment: + - AUDIO_TTS_ENGINE=elevenlabs + - AUDIO_TTS_API_KEY=sk_... # Your ElevenLabs API key + - AUDIO_TTS_VOICE=EXAVITQu4vr4xnSDxMaL # Voice ID from ElevenLabs dashboard + - AUDIO_TTS_MODEL=eleven_multilingual_v2 +``` + +:::note +You can find your Voice ID in the ElevenLabs dashboard under the voice settings. Common model options are `eleven_multilingual_v2` or `eleven_monolingual_v1`. +::: + +--- + +## General Debugging Tips + +### Check Container Logs + +```bash +# View Open WebUI logs +docker logs open-webui -f + +# View logs for external TTS service (if applicable) +docker logs openai-edge-tts -f +``` + +### Check Browser Console + +1. Open browser developer tools (F12 or right-click → Inspect) +2. Go to the Console tab +3. Look for error messages when attempting to use audio features + +### Verify Service Health + +For external TTS services, test directly: + +```bash +# Test OpenAI Edge TTS +curl -X POST http://localhost:5050/v1/audio/speech \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer your_api_key_here" \ + -d '{"input": "Hello, this is a test.", "voice": "alloy"}' \ + --output test.mp3 +``` + +### Network Connectivity + +Verify the Open WebUI container can reach external services: + +```bash +# Enter the container +docker exec -it open-webui bash + +# Test connectivity (if curl is available) +curl http://your-tts-service:port/health +``` + +--- + +## Quick Reference: Environment Variables + +### TTS Environment Variables + +| Variable | Description | +|----------|-------------| +| `AUDIO_TTS_ENGINE` | TTS engine: empty (disabled), `openai`, `elevenlabs`, `azure`, `transformers` | +| `AUDIO_TTS_MODEL` | TTS model to use (default: `tts-1`) | +| `AUDIO_TTS_VOICE` | Default voice for TTS (default: `alloy`) | +| `AUDIO_TTS_API_KEY` | API key for ElevenLabs or Azure TTS | +| `AUDIO_TTS_OPENAI_API_BASE_URL` | Base URL for OpenAI-compatible TTS | +| `AUDIO_TTS_OPENAI_API_KEY` | API key for OpenAI-compatible TTS | + +### STT Environment Variables + +| Variable | Description | +|----------|-------------| +| `WHISPER_MODEL` | Whisper model: `tiny`, `base`, `small`, `medium`, `large` (default: `base`) | +| `WHISPER_COMPUTE_TYPE` | Compute type: `int8`, `float16`, `int8_float16`, `float32` (default: `int8`) | +| `WHISPER_LANGUAGE` | ISO 639-1 language code (empty = auto-detect) | +| `AUDIO_STT_ENGINE` | STT engine: empty (local Whisper), `openai`, `azure`, `deepgram` | +| `AUDIO_STT_OPENAI_API_BASE_URL` | Base URL for OpenAI-compatible STT | +| `AUDIO_STT_OPENAI_API_KEY` | API key for OpenAI-compatible STT | +| `DEEPGRAM_API_KEY` | Deepgram API key | + +For a complete list of audio environment variables, see [Environment Variable Configuration](/getting-started/env-configuration#audio). 
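+
+---
+
+## Enabling HTTPS for Microphone Access
+
+As noted under [Microphone Access Issues](#microphone-access-issues), the durable fix for blocked microphones is serving Open WebUI over HTTPS. A minimal sketch using Caddy as a reverse proxy, assuming Open WebUI listens on port 3000 and the placeholder domain `webui.example.com` points at your server:
+
+```
+# Caddyfile - Caddy obtains and renews Let's Encrypt certificates automatically
+webui.example.com {
+    reverse_proxy localhost:3000
+}
+```
+
+Run it with `caddy run` (or the official `caddy` Docker image with ports 80 and 443 published) and access Open WebUI at `https://webui.example.com`.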
+ +--- + +## Still Having Issues? + +If you've tried the above solutions and still experience problems: + +1. **Search existing issues** on GitHub for similar problems +2. **Check the discussions** for community solutions +3. **Create a new issue** with: + - Open WebUI version + - Docker image being used + - Complete error logs + - Very detailed steps to reproduce + - Your environment details (OS, GPU if applicable) diff --git a/docs/troubleshooting/microphone-error.mdx b/docs/troubleshooting/microphone-error.mdx deleted file mode 100644 index 59446d57f..000000000 --- a/docs/troubleshooting/microphone-error.mdx +++ /dev/null @@ -1,38 +0,0 @@ ---- -sidebar_position: 2 -title: "Troubleshooting Microphone Access" ---- - -Ensuring your application has the proper microphone access is crucial for functionality that depends on audio input. This guide covers how to manage and troubleshoot microphone permissions, particularly under secure contexts. - -## Understanding Secure Contexts 🔒 - -For security reasons, accessing the microphone is restricted to pages served over HTTPS or locally from `localhost`. This requirement is meant to safeguard your data by ensuring it is transmitted over secure channels. - -## Common Permission Issues 🚫 - -Browsers like Chrome, Brave, Microsoft Edge, Opera, and Vivaldi, as well as Firefox, restrict microphone access on non-HTTPS URLs. This typically becomes an issue when accessing a site from another device within the same network (e.g., using a mobile phone to access a desktop server). Here's how you can manage these issues: - -### Solutions for Non-HTTPS Connections - -1. **Set Up HTTPS:** - - It is highly recommended to configure your server to support HTTPS. This not only resolves permission issues but also enhances the security of your data transmissions. - -2. **Temporary Browser Flags (Use with caution):** - - These settings force your browser to treat certain insecure URLs as secure. This is useful for development purposes but poses significant security risks. Here's how to adjust these settings for major browsers: - - #### Chromium-based Browsers (e.g., Chrome, Brave) - - Open `chrome://flags/#unsafely-treat-insecure-origin-as-secure`. - - Enter your non-HTTPS address (e.g., `http://192.168.1.35:3000`). - - Restart the browser to apply the changes. - - #### Firefox-based Browsers - - Open `about:config`. - - Search and modify (or create) the string value `dom.securecontext.allowlist`. - - Add your IP addresses separated by commas (e.g., `http://127.0.0.1:8080`). - -### Considerations and Risks 🚨 - -While browser flags offer a quick fix, they bypass important security checks which can expose your device and data to vulnerabilities. Always prioritize proper security measures, especially when planning for a production environment. - -By following these best practices, you can ensure that your application properly accesses the microphone while maintaining the security and integrity of your data. \ No newline at end of file