Merged

audio #929

133 changes: 116 additions & 17 deletions docs/features/audio/speech-to-text/env-variables.md
@@ -11,20 +11,119 @@ For a complete list of all Open WebUI environment variables, see the [Environmen

:::

The following is a summary of the environment variables for speech to text (STT).

# Environment Variables For Speech To Text (STT)

| Variable | Description |
|----------|-------------|
| `WHISPER_MODEL` | Sets the Whisper model to use for local Speech-to-Text |
| `WHISPER_MODEL_DIR` | Specifies the directory to store Whisper model files |
| `WHISPER_COMPUTE_TYPE` | Sets the compute type for Whisper model inference (e.g., `int8`, `float16`) |
| `WHISPER_LANGUAGE` | Specifies the ISO 639-1 (ISO 639-2 for Hawaiian and Cantonese) Speech-to-Text language to use for Whisper (language is predicted unless set) |
| `AUDIO_STT_ENGINE` | Specifies the Speech-to-Text engine to use (empty for local Whisper, or `openai`) |
| `AUDIO_STT_MODEL` | Specifies the Speech-to-Text model for OpenAI-compatible endpoints |
| `AUDIO_STT_OPENAI_API_BASE_URL` | Sets the OpenAI-compatible base URL for Speech-to-Text |
| `AUDIO_STT_OPENAI_API_KEY` | Sets the OpenAI API key for Speech-to-Text |
| `AUDIO_STT_AZURE_API_KEY` | Sets the Azure API key for Speech-to-Text |
| `AUDIO_STT_AZURE_REGION` | Sets the Azure region for Speech-to-Text |
| `AUDIO_STT_AZURE_LOCALES` | Sets the Azure locales for Speech-to-Text |
The following is a summary of the environment variables for speech-to-text (STT) and text-to-speech (TTS).

:::tip UI Configuration
Most of these settings can also be configured in the **Admin Panel → Settings → Audio** tab. These are persistent settings: environment variables set the initial values on startup, and any changes saved in the UI persist and take precedence afterwards.
:::

## Speech To Text (STT) Environment Variables

### Local Whisper

| Variable | Description | Default |
|----------|-------------|---------|
| `WHISPER_MODEL` | Whisper model size | `base` |
| `WHISPER_MODEL_DIR` | Directory to store Whisper model files | `{CACHE_DIR}/whisper/models` |
| `WHISPER_COMPUTE_TYPE` | Compute type for inference (see note below) | `int8` |
| `WHISPER_LANGUAGE` | ISO 639-1 language code (empty = auto-detect) | empty |
| `WHISPER_MODEL_AUTO_UPDATE` | Auto-download model updates | `false` |
| `WHISPER_VAD_FILTER` | Enable Voice Activity Detection filter | `false` |

:::info WHISPER_COMPUTE_TYPE Options
- `int8` — CPU default, fastest but may not work on older GPUs
- `float16` — **Recommended for CUDA/GPU**
- `int8_float16` — Hybrid mode (int8 weights, float16 computation)
- `float32` — Maximum compatibility, slowest

If using the `:cuda` Docker image with an older GPU, set `WHISPER_COMPUTE_TYPE=float16` to avoid errors.
:::
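
The Local Whisper variables above can be combined in a Docker Compose file. A minimal sketch for a CPU-only deployment (the model size and language are illustrative choices, not required values):

```yaml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      # Use the small model with the CPU-friendly compute type
      - WHISPER_MODEL=small
      - WHISPER_COMPUTE_TYPE=int8
      # Force English instead of auto-detection
      - WHISPER_LANGUAGE=en
      # Filter out non-speech segments before transcription
      - WHISPER_VAD_FILTER=true
```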

### OpenAI-Compatible STT

| Variable | Description | Default |
|----------|-------------|---------|
| `AUDIO_STT_ENGINE` | STT engine: empty (local Whisper), `openai`, `azure`, `deepgram`, `mistral` | empty |
| `AUDIO_STT_MODEL` | STT model for external providers | empty |
| `AUDIO_STT_OPENAI_API_BASE_URL` | OpenAI-compatible API base URL | `https://api.openai.com/v1` |
| `AUDIO_STT_OPENAI_API_KEY` | OpenAI API key | empty |
| `AUDIO_STT_SUPPORTED_CONTENT_TYPES` | Comma-separated list of supported audio MIME types | empty (falls back to `audio/*,video/webm`) |
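
Because the base URL is configurable, any OpenAI-compatible transcription server can be used, not just OpenAI itself. A sketch pointing Open WebUI at a self-hosted server on the Docker host (the URL, port, and key below are placeholders):

```yaml
services:
  open-webui:
    environment:
      - AUDIO_STT_ENGINE=openai
      # Any OpenAI-compatible /v1 endpoint works; this URL is illustrative
      - AUDIO_STT_OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1
      - AUDIO_STT_OPENAI_API_KEY=sk-placeholder
      - AUDIO_STT_MODEL=whisper-1
```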

### Azure STT

| Variable | Description | Default |
|----------|-------------|---------|
| `AUDIO_STT_AZURE_API_KEY` | Azure Cognitive Services API key | empty |
| `AUDIO_STT_AZURE_REGION` | Azure region | `eastus` |
| `AUDIO_STT_AZURE_LOCALES` | Comma-separated locales (e.g., `en-US,de-DE`) | auto |
| `AUDIO_STT_AZURE_BASE_URL` | Custom Azure base URL (optional) | empty |
| `AUDIO_STT_AZURE_MAX_SPEAKERS` | Max speakers for diarization | `3` |
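
A compose sketch for Azure STT (the region and locales are illustrative; substitute your own key):

```yaml
services:
  open-webui:
    environment:
      - AUDIO_STT_ENGINE=azure
      - AUDIO_STT_AZURE_API_KEY=your-azure-key
      - AUDIO_STT_AZURE_REGION=westeurope
      # Restrict recognition to two locales instead of auto-detection
      - AUDIO_STT_AZURE_LOCALES=en-US,de-DE
```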

### Deepgram STT

| Variable | Description | Default |
|----------|-------------|---------|
| `DEEPGRAM_API_KEY` | Deepgram API key | empty |
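
Deepgram only needs the engine selector and a key, for example:

```yaml
services:
  open-webui:
    environment:
      - AUDIO_STT_ENGINE=deepgram
      - DEEPGRAM_API_KEY=your-deepgram-key
```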

### Mistral STT

| Variable | Description | Default |
|----------|-------------|---------|
| `AUDIO_STT_MISTRAL_API_KEY` | Mistral API key | empty |
| `AUDIO_STT_MISTRAL_API_BASE_URL` | Mistral API base URL | `https://api.mistral.ai/v1` |
| `AUDIO_STT_MISTRAL_USE_CHAT_COMPLETIONS` | Use chat completions endpoint | `false` |

## Text To Speech (TTS) Environment Variables

### General TTS

| Variable | Description | Default |
|----------|-------------|---------|
| `AUDIO_TTS_ENGINE` | TTS engine: empty (disabled), `openai`, `elevenlabs`, `azure`, `transformers` | empty |
| `AUDIO_TTS_MODEL` | TTS model | `tts-1` |
| `AUDIO_TTS_VOICE` | Default voice | `alloy` |
| `AUDIO_TTS_SPLIT_ON` | Split text on: `punctuation` or `none` | `punctuation` |
| `AUDIO_TTS_API_KEY` | API key for ElevenLabs or Azure TTS | empty |

### OpenAI-Compatible TTS

| Variable | Description | Default |
|----------|-------------|---------|
| `AUDIO_TTS_OPENAI_API_BASE_URL` | OpenAI-compatible TTS API base URL | `https://api.openai.com/v1` |
| `AUDIO_TTS_OPENAI_API_KEY` | OpenAI TTS API key | empty |
| `AUDIO_TTS_OPENAI_PARAMS` | Additional JSON params for OpenAI TTS | empty |
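
`AUDIO_TTS_OPENAI_PARAMS` takes a JSON object that is merged into the TTS request. As a sketch (whether a given parameter has any effect depends on the backend; `speed` is one parameter OpenAI's speech endpoint accepts, and the key below is a placeholder):

```yaml
services:
  open-webui:
    environment:
      - AUDIO_TTS_ENGINE=openai
      - AUDIO_TTS_OPENAI_API_KEY=sk-placeholder
      # Quoted so the JSON value survives YAML parsing
      - 'AUDIO_TTS_OPENAI_PARAMS={"speed": 1.1}'
```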

### Azure TTS

| Variable | Description | Default |
|----------|-------------|---------|
| `AUDIO_TTS_AZURE_SPEECH_REGION` | Azure Speech region | `eastus` |
| `AUDIO_TTS_AZURE_SPEECH_BASE_URL` | Custom Azure Speech base URL (optional) | empty |
| `AUDIO_TTS_AZURE_SPEECH_OUTPUT_FORMAT` | Audio output format | `audio-24khz-160kbitrate-mono-mp3` |
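
A compose sketch for Azure TTS (the region and voice name are illustrative; note that the generic `AUDIO_TTS_API_KEY` carries the Azure Speech key):

```yaml
services:
  open-webui:
    environment:
      - AUDIO_TTS_ENGINE=azure
      - AUDIO_TTS_API_KEY=your-azure-speech-key
      - AUDIO_TTS_AZURE_SPEECH_REGION=westeurope
      - AUDIO_TTS_VOICE=en-US-AvaMultilingualNeural
```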

## Tips for Configuring Audio

### Using Local Whisper STT

For GPU acceleration issues or older GPUs, try setting:
```yaml
environment:
- WHISPER_COMPUTE_TYPE=float16
```

### Using External TTS Services

When running Open WebUI in Docker with an external TTS service:

```yaml
environment:
- AUDIO_TTS_ENGINE=openai
- AUDIO_TTS_OPENAI_API_BASE_URL=http://host.docker.internal:5050/v1
- AUDIO_TTS_OPENAI_API_KEY=your-api-key
```

:::tip
Use `host.docker.internal` on Docker Desktop (Windows/Mac) to access services on the host. On Linux, use the host IP or container networking.
:::

For troubleshooting audio issues, see the [Audio Troubleshooting Guide](/troubleshooting/audio).
125 changes: 125 additions & 0 deletions docs/features/audio/speech-to-text/mistral-voxtral-integration.md
@@ -0,0 +1,125 @@
---
sidebar_position: 2
title: "Mistral Voxtral STT"
---

# Using Mistral Voxtral for Speech-to-Text

This guide covers how to use Mistral's Voxtral model for Speech-to-Text with Open WebUI. Voxtral is Mistral's audio model family, providing cloud-based transcription without local GPU resources.

## Requirements

- A Mistral API key
- Open WebUI installed and running

## Quick Setup (UI)

1. Click your **profile icon** (bottom-left corner)
2. Select **Admin Panel**
3. Click **Settings** → **Audio** tab
4. Configure the following:

| Setting | Value |
|---------|-------|
| **Speech-to-Text Engine** | `MistralAI` |
| **API Key** | Your Mistral API key |
| **STT Model** | `voxtral-mini-latest` (or leave empty for default) |

5. Click **Save**

## Available Models

| Model | Description |
|-------|-------------|
| `voxtral-mini-latest` | Default transcription model (recommended) |

## Environment Variables Setup

If you prefer to configure via environment variables:

```yaml
services:
open-webui:
image: ghcr.io/open-webui/open-webui:main
environment:
- AUDIO_STT_ENGINE=mistral
- AUDIO_STT_MISTRAL_API_KEY=your-mistral-api-key
- AUDIO_STT_MODEL=voxtral-mini-latest
# ... other configuration
```

### All Mistral STT Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `AUDIO_STT_ENGINE` | Set to `mistral` | empty (uses local Whisper) |
| `AUDIO_STT_MISTRAL_API_KEY` | Your Mistral API key | empty |
| `AUDIO_STT_MISTRAL_API_BASE_URL` | Mistral API base URL | `https://api.mistral.ai/v1` |
| `AUDIO_STT_MISTRAL_USE_CHAT_COMPLETIONS` | Use chat completions endpoint | `false` |
| `AUDIO_STT_MODEL` | STT model | `voxtral-mini-latest` |

## Transcription Methods

Mistral supports two transcription methods:

### Standard Transcription (Default)
Uses the dedicated transcription endpoint. This is the recommended method.

### Chat Completions Method
Set `AUDIO_STT_MISTRAL_USE_CHAT_COMPLETIONS=true` to use Mistral's chat completions API for transcription. This method:
- Requires audio in mp3 or wav format (automatic conversion is attempted)
- May provide different results than the standard endpoint

## Using STT

1. Click the **microphone icon** in the chat input
2. Speak your message
3. Click the microphone again or wait for silence detection
4. Your speech will be transcribed and appear in the input box

## Supported Audio Formats

Voxtral accepts common audio formats. The system defaults to accepting `audio/*` and `video/webm`.

If using the chat completions method, audio is automatically converted to mp3.
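
If you want to limit uploads to formats that need no conversion, the accepted MIME types can be narrowed with `AUDIO_STT_SUPPORTED_CONTENT_TYPES` (the list below is illustrative):

```yaml
environment:
  - AUDIO_STT_SUPPORTED_CONTENT_TYPES=audio/wav,audio/mpeg
```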

## Troubleshooting

### API Key Errors

If you see "Mistral API key is required":
1. Verify your API key is entered correctly
2. Check the API key hasn't expired
3. Ensure your Mistral account has API access

### Transcription Not Working

1. Check container logs: `docker logs open-webui -f`
2. Verify the STT Engine is set to `MistralAI`
3. Try the standard transcription method (disable chat completions)

### Audio Format Issues

If using chat completions method and audio conversion fails:
- Ensure FFmpeg is available in the container
- Try recording in a different format (wav or mp3)
- Switch to the standard transcription method

For more troubleshooting, see the [Audio Troubleshooting Guide](/troubleshooting/audio).

## Comparison with Other STT Options

| Feature | Mistral Voxtral | OpenAI Whisper | Local Whisper |
|---------|-----------------|----------------|---------------|
| **Cost** | Per-minute pricing | Per-minute pricing | Free |
| **Privacy** | Audio sent to Mistral | Audio sent to OpenAI | Audio stays local |
| **Model Options** | voxtral-mini-latest | whisper-1 | tiny → large |
| **GPU Required** | No | No | Recommended |

## Cost Considerations

Mistral charges per minute of audio for STT. Check [Mistral's pricing page](https://mistral.ai/products/la-plateforme#pricing) for current rates.

:::tip
For free STT, use **Local Whisper** (the default) or the browser's **Web API** for basic transcription.
:::
136 changes: 136 additions & 0 deletions docs/features/audio/speech-to-text/openai-stt-integration.md
@@ -0,0 +1,136 @@
---
sidebar_position: 0
title: "OpenAI STT Integration"
---

# Using OpenAI for Speech-to-Text

This guide covers how to use OpenAI's Whisper API for Speech-to-Text with Open WebUI. This provides cloud-based transcription without needing local GPU resources.

:::tip Looking for TTS?
See the companion guide: [Using OpenAI for Text-to-Speech](/features/audio/text-to-speech/openai-tts-integration)
:::

## Requirements

- An OpenAI API key with access to the Audio API
- Open WebUI installed and running

## Quick Setup (UI)

1. Click your **profile icon** (bottom-left corner)
2. Select **Admin Panel**
3. Click **Settings** → **Audio** tab
4. Configure the following:

| Setting | Value |
|---------|-------|
| **Speech-to-Text Engine** | `OpenAI` |
| **API Base URL** | `https://api.openai.com/v1` |
| **API Key** | Your OpenAI API key |
| **STT Model** | `whisper-1` |
| **Supported Content Types** | Leave empty for defaults, or set `audio/wav,audio/mpeg,audio/webm` |

5. Click **Save**

## Available Models

| Model | Description |
|-------|-------------|
| `whisper-1` | OpenAI's Whisper large-v2 model, hosted in the cloud |

:::info
OpenAI currently only offers `whisper-1`. For more model options, use Local Whisper (built into Open WebUI) or other providers like Deepgram.
:::

## Environment Variables Setup

If you prefer to configure via environment variables:

```yaml
services:
open-webui:
image: ghcr.io/open-webui/open-webui:main
environment:
- AUDIO_STT_ENGINE=openai
- AUDIO_STT_OPENAI_API_BASE_URL=https://api.openai.com/v1
- AUDIO_STT_OPENAI_API_KEY=sk-...
- AUDIO_STT_MODEL=whisper-1
# ... other configuration
```

### All STT Environment Variables (OpenAI)

| Variable | Description | Default |
|----------|-------------|---------|
| `AUDIO_STT_ENGINE` | Set to `openai` | empty (uses local Whisper) |
| `AUDIO_STT_OPENAI_API_BASE_URL` | OpenAI API base URL | `https://api.openai.com/v1` |
| `AUDIO_STT_OPENAI_API_KEY` | Your OpenAI API key | empty |
| `AUDIO_STT_MODEL` | STT model | `whisper-1` |
| `AUDIO_STT_SUPPORTED_CONTENT_TYPES` | Allowed audio MIME types | `audio/*,video/webm` |

### Supported Audio Formats

By default, Open WebUI accepts `audio/*` and `video/webm` for transcription. If you need to restrict or expand supported formats, set `AUDIO_STT_SUPPORTED_CONTENT_TYPES`:

```yaml
environment:
- AUDIO_STT_SUPPORTED_CONTENT_TYPES=audio/wav,audio/mpeg,audio/webm
```

OpenAI's Whisper API supports: `mp3`, `mp4`, `mpeg`, `mpga`, `m4a`, `wav`, `webm`

## Using STT

1. Click the **microphone icon** in the chat input
2. Speak your message
3. Click the microphone again or wait for silence detection
4. Your speech will be transcribed and appear in the input box

## OpenAI vs Local Whisper

| Feature | OpenAI Whisper API | Local Whisper |
|---------|-------------------|---------------|
| **Latency** | Network dependent | Hardware dependent; can be faster for short clips |
| **Cost** | Per-minute pricing | Free (uses your hardware) |
| **Privacy** | Audio sent to OpenAI | Audio stays local |
| **GPU Required** | No | Recommended for speed |
| **Model Options** | `whisper-1` only | tiny, base, small, medium, large |

Choose **OpenAI** if:
- You don't have a GPU
- You want consistent performance
- Privacy isn't a concern

Choose **Local Whisper** if:
- You want free transcription
- You need audio to stay private
- You have a GPU for acceleration

## Troubleshooting

### Microphone Not Working

1. Ensure you're using HTTPS or localhost
2. Check browser microphone permissions
3. See [Microphone Access Issues](/troubleshooting/audio#microphone-access-issues)

### Transcription Errors

1. Check your OpenAI API key is valid
2. Verify the API Base URL is correct
3. Check container logs for error messages

### Language Issues

OpenAI's Whisper API automatically detects language. If you need to force a specific language, consider using Local Whisper with the `WHISPER_LANGUAGE` environment variable.
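
For example, to switch back to the built-in engine and pin transcription to German (the language code is illustrative):

```yaml
environment:
  # Empty value selects the built-in Local Whisper engine
  - AUDIO_STT_ENGINE=
  # ISO 639-1 code; skips Whisper's language auto-detection
  - WHISPER_LANGUAGE=de
```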

For more troubleshooting, see the [Audio Troubleshooting Guide](/troubleshooting/audio).

## Cost Considerations

OpenAI charges per minute of audio for STT. See [OpenAI Pricing](https://platform.openai.com/docs/pricing) for current rates.

:::tip
For free STT, use **Local Whisper** (the default) or the browser's **Web API** for basic transcription.
:::