Speech-to-Text Providers
STT (Speech-to-Text) providers convert user voice input to text for processing.
Supported Providers
| Provider | Real-time | Accuracy | Price (USD per audio minute) |
|---|---|---|---|
| OpenAI Whisper | No | Excellent | $0.006/min |
| Deepgram | Yes | Very Good | $0.0077/min |
Feature Comparison
| Feature | Whisper | Deepgram |
|---|---|---|
| Real-time streaming | ❌ | ✅ |
| Speaker diarization | ❌ | ✅ |
| Custom vocabulary | ❌ | ✅ |
| Punctuation | ✅ | ✅ |
| Multi-language | ✅ | ✅ |
| Word timestamps | ✅ | ✅ |
When to Use Each
Choose Whisper if:
- You prioritize transcription accuracy
- Latency isn't critical (batch mode is fine)
- You already have OpenAI API access
- Cost is the primary concern
Choose Deepgram if:
- You need real-time transcription
- You want streaming for live captions
- You need speaker identification
- You have specialized vocabulary
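
The criteria above can be sketched as a small selection helper. This is an illustrative sketch only: `SttRequirements` and `choose_stt_provider` are hypothetical names, not part of any provider SDK, and the rule simply picks Deepgram whenever a Deepgram-only feature from the comparison table is required.

```python
from dataclasses import dataclass


@dataclass
class SttRequirements:
    """Hypothetical description of what an application needs from STT."""
    needs_realtime: bool = False           # live captions / streaming
    needs_diarization: bool = False        # speaker identification
    needs_custom_vocabulary: bool = False  # specialized terms


def choose_stt_provider(req: SttRequirements) -> str:
    """Return "deepgram" if any Deepgram-only feature is required, else "whisper"."""
    if req.needs_realtime or req.needs_diarization or req.needs_custom_vocabulary:
        return "deepgram"
    # Batch transcription with no special features: Whisper is cheaper per minute.
    return "whisper"


print(choose_stt_provider(SttRequirements(needs_realtime=True)))  # deepgram
print(choose_stt_provider(SttRequirements()))                     # whisper
```

A real decision may also weigh accuracy requirements and existing API access, which this sketch leaves out.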
Cost Comparison
For a 3-minute conversation with ~1 minute of user speech (only the user's speech is transcribed, so about 1 minute is billed):
| Provider | Cost |
|---|---|
| Whisper | $0.006 |
| Deepgram | $0.0077 |
Both are extremely affordable: at these rates the difference is $0.0017 per minute of speech. The choice typically comes down to features rather than cost.
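
The figures in the cost table follow from multiplying the per-minute price by the minutes of transcribed speech. A minimal sketch of that arithmetic, assuming billing is linear in audio duration with no minimum charge or rounding increment (an assumption; check each provider's billing rules). The function name `estimate_stt_cost` is hypothetical, and the rates are copied from the table above.

```python
# Prices in USD per minute of audio, taken from the Supported Providers table.
PRICE_PER_MINUTE_USD = {
    "whisper": 0.006,
    "deepgram": 0.0077,
}


def estimate_stt_cost(provider: str, speech_seconds: float) -> float:
    """Estimate transcription cost, assuming simple linear per-minute billing."""
    rate = PRICE_PER_MINUTE_USD[provider]
    return rate * (speech_seconds / 60.0)


# ~1 minute of user speech within a 3-minute conversation.
for provider in PRICE_PER_MINUTE_USD:
    print(f"{provider}: ${estimate_stt_cost(provider, 60):.4f}")
# whisper: $0.0060
# deepgram: $0.0077
```

Scaling the same call to 1,000 such conversations gives roughly $6.00 for Whisper and $7.70 for Deepgram, which shows how small the gap stays at moderate volume.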