Speech and Voice Processing Theory
Speech and voice processing enables machines to understand, generate, and manipulate human speech. This chapter covers automatic speech recognition (ASR), text-to-speech synthesis (TTS), voice activity detection (VAD), speaker diarization, and voice cloning.
Speech Processing Pipeline
┌──────────┐ ┌─────┐ ┌─────────────┐ ┌──────────┐
│ Audio │───▶│ VAD │───▶│ ASR/Speaker │───▶│ Output │
│ Input │ │ │ │ Recognition │ │ Text/ID │
└──────────┘ └─────┘ └─────────────┘ └──────────┘
Voice Activity Detection (VAD)
Detect when speech is present in audio:
Energy-Based VAD
Simple threshold on frame energy:
energy[t] = Σ(samples[t:t+frame]²)
is_speech[t] = energy[t] > threshold
Pros: fast, no model needed. Cons: sensitive to noise.
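A minimal NumPy sketch of the energy rule above; the frame length and threshold are illustrative values, not tuned defaults:

```python
import numpy as np

def energy_vad(samples, sr, frame_ms=25, threshold=1e-3):
    """Flag each frame as speech if its mean energy exceeds a threshold."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)  # mean (not sum) so the threshold
    return energy > threshold              # is independent of frame length
```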
Neural VAD (Silero-style)
Audio → Mel Spectrogram → LSTM/Conv → speech probability ∈ [0.0, 1.0]
Pros: robust to noise. Cons: requires model inference.
VAD Parameters
| Parameter | Typical Value | Effect |
|---|---|---|
| Frame length | 20-30 ms | Time resolution of decisions |
| Threshold | 0.5 | Sensitivity vs. false triggers |
| Min speech | 250 ms | Drop shorter segments (noise bursts) |
| Min silence | 300 ms | Merge segments separated by shorter gaps |
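One way to apply the last two parameters: convert the per-frame mask into segments, merge segments separated by short gaps, then drop segments that are too short. A sketch (the 20 ms frame length is an assumption):

```python
def frames_to_segments(is_speech, frame_ms=20, min_speech_ms=250, min_silence_ms=300):
    """Turn a per-frame boolean mask into (start_ms, end_ms) speech segments."""
    segments, start = [], None
    for i, flag in enumerate(is_speech):
        if flag and start is None:
            start = i * frame_ms                    # segment opens
        elif not flag and start is not None:
            segments.append((start, i * frame_ms))  # segment closes
            start = None
    if start is not None:
        segments.append((start, len(is_speech) * frame_ms))

    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] < min_silence_ms:
            merged[-1] = (merged[-1][0], seg[1])    # merge across short gap
        else:
            merged.append(seg)

    # Discard segments shorter than min_speech_ms (likely noise bursts).
    return [s for s in merged if s[1] - s[0] >= min_speech_ms]
```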
Automatic Speech Recognition (ASR)
Convert speech to text:
Traditional Pipeline
Audio → MFCC → Acoustic Model → HMM → Language Model → Text
End-to-End (Whisper-style)
Audio → Mel Spectrogram → Encoder → Decoder → Text
                          └─ Transformer ─┘
Whisper Architecture
Audio (30s max)
│
▼
Mel Spectrogram (80 mel, 3000 frames)
│
▼
┌─────────────────────┐
│ Encoder │ (Transformer)
│ - Conv stem │
│ - Positional enc │
│ - N layers │
└─────────────────────┘
│
▼
┌─────────────────────┐
│ Decoder │ (Transformer)
│ - Text tokens │
│ - Cross-attention │
│ - Autoregressive │
└─────────────────────┘
│
▼
Text tokens → Text
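A sketch of the front end with librosa, using Whisper-like settings (16 kHz input, 80 mel bins, 25 ms window, 10 ms hop); Whisper's exact log scaling and normalization differ slightly:

```python
import numpy as np
import librosa

def log_mel(path, sr=16000, n_mels=80):
    """Compute a log-mel spectrogram with Whisper-like parameters."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)  # 25 ms / 10 ms
    log_mel = np.log10(np.maximum(mel, 1e-10))  # floor avoids log(0)
    return log_mel  # shape (80, n_frames); 30 s of 16 kHz audio -> 3000 frames
```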
Word-Level Timestamps
Cross-attention alignment. For each word:
1. Find the decoder step(s) that generated the word's tokens
2. Extract the cross-attention weights at those steps
3. Find the encoder frame where the attention peaks
4. Map that frame index to an audio timestamp
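A sketch of steps 3-4, assuming you already have a cross-attention matrix of shape (decoder_steps, encoder_frames), e.g. averaged over heads (how to extract it is implementation-specific). Production systems typically run dynamic time warping over this matrix rather than a raw argmax:

```python
import numpy as np

def token_timestamps(cross_attn, frame_sec=0.02):
    """Map each decoder step to the time of its attention peak.

    frame_sec=0.02 assumes Whisper-style encoder frames:
    10 ms mel hop x 2x convolutional downsampling = 20 ms per frame.
    """
    peak_frames = cross_attn.argmax(axis=1)  # peak encoder frame per token
    return peak_frames * frame_sec           # seconds into the audio
```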
Speaker Diarization
"Who spoke when?"
Pipeline
Audio → VAD → Embedding → Clustering → Timeline
                  │            │
                  ▼            ▼
           Speaker Vectors  Speakers
Speaker Embeddings
X-Vector:
Audio → Frame-level features (TDNN) → Statistics pooling → DNN → 512-dim embedding
ECAPA-TDNN:
Audio → SE-Res2Net blocks → Attentive statistics pooling → 192-dim embedding
Clustering Methods
| Method | Requires K? | Notes |
|---|---|---|
| K-Means | Yes | Simple, fast |
| Spectral | Yes | Better for non-spherical |
| Agglomerative | No | Can auto-detect speakers |
| VBx | No | Bayesian, state-of-the-art |
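A sketch of the agglomerative option with scikit-learn (≥ 1.2; older versions name the `metric` argument `affinity`): with `n_clusters=None`, the distance threshold decides how many speakers emerge. The threshold value is an illustrative assumption that must be tuned per embedding model:

```python
from sklearn.cluster import AgglomerativeClustering

def cluster_speakers(embeddings, distance_threshold=0.7):
    """Assign a speaker label to each segment embedding, inferring K.

    embeddings: (n_segments, dim) array, one vector per VAD segment.
    """
    clusterer = AgglomerativeClustering(
        n_clusters=None,                      # let the threshold decide K
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    )
    return clusterer.fit_predict(embeddings)  # int label per segment
```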
Text-to-Speech (TTS)
Convert text to speech:
Two-Stage Pipeline
Text → Acoustic Model → Mel Spectrogram → Vocoder → Waveform
             │                               │
             ▼                               ▼
    Tacotron/FastSpeech              HiFi-GAN/WaveGlow
FastSpeech 2
Non-autoregressive for fast synthesis:
Phonemes → Encoder → Variance Adaptor → Mel Decoder → Mel
                            │
           Duration, Pitch, Energy predictors
Variance Adaptor predicts:
- Duration: how many mel frames each phoneme spans
- Pitch: the F0 contour
- Energy: frame-level loudness
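The duration predictor drives FastSpeech's length regulator: each phoneme's hidden state is repeated for its predicted number of mel frames, which is what lets the decoder run non-autoregressively. A minimal NumPy sketch:

```python
import numpy as np

def length_regulate(phoneme_states, durations):
    """Expand phoneme-rate encodings to frame rate.

    phoneme_states: (n_phonemes, dim) encoder outputs.
    durations: (n_phonemes,) integer frame counts from the duration predictor.
    Returns (sum(durations), dim), time-aligned with the target mel frames.
    """
    return np.repeat(phoneme_states, durations, axis=0)

# durations [3, 5, 2] -> 10 mel-frame states for 3 phonemes
```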
Vocoders
Convert mel spectrogram to waveform:
| Vocoder | Quality | Speed |
|---|---|---|
| Griffin-Lim | Low | Fast |
| WaveNet | High | Very slow |
| HiFi-GAN | High | Fast |
| WaveGlow | High | Moderate |
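Griffin-Lim is the one entry that needs no trained model, which makes it a handy debugging baseline. A sketch using librosa's mel inversion; the STFT settings are assumptions and must match those used to compute the mel spectrogram:

```python
import librosa

def mel_to_wav(mel, sr=22050, n_fft=1024, hop_length=256):
    """Invert a (power) mel spectrogram to a waveform via Griffin-Lim.

    Audibly worse than a neural vocoder, but model-free.
    """
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length)
```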
Voice Cloning
Clone a voice from samples:
Zero-Shot Cloning (YourTTS)
Reference Audio → Speaker Encoder → Style Vector
                                         │
                                         ▼
Text → TTS Model ────────────────────────▶ Cloned Speech
Only needs 3-5 seconds of reference audio.
Fine-Tuning Based
- Pre-train TTS on large corpus
- Fine-tune on target speaker (15-30 min audio)
- Generate with fine-tuned model
Trade-off: better quality than zero-shot cloning, but it needs more target-speaker data.
Voice Conversion
Change voice identity while preserving content:
PPG-Based
Source Audio → ASR → PPG (Content) ─────┐
                                        │
Target Speaker → Embedding ────────────▶┼───▶ Converted
                                        │
Prosody Extraction ─────────────────────┘
PPG = Phonetic Posteriorgram (content representation)
Autoencoder-Based
Audio → Content Encoder → Content ─────┐
                                       │
Audio → Speaker Encoder → Speaker ────▶┼───▶ Decoder → Audio'
                                       │
Audio → Prosody Encoder → Prosody ─────┘
Voice Isolation
Separate voice from background:
Spectral Subtraction
|Y(f)| ≈ |Speech(f)| + |Noise(f)|
|Speech(f)| ≈ max(|Y(f)| − E[|Noise(f)|], 0)
Estimate E[|Noise(f)|] from speech-free segments, subtract it from the mixture magnitude, and reuse the mixture phase for reconstruction.
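A sketch of magnitude-domain spectral subtraction with SciPy, assuming the first few STFT frames are noise-only (a common but fragile assumption):

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(y, sr, noise_frames=10):
    """Subtract an estimated noise magnitude; keep the mixture phase."""
    _, _, Y = stft(y, fs=sr, nperseg=512)
    mag, phase = np.abs(Y), np.angle(Y)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)  # E[|Noise(f)|]
    clean_mag = np.maximum(mag - noise_mag, 0.0)  # floor negative magnitudes
    _, y_clean = istft(clean_mag * np.exp(1j * phase), fs=sr, nperseg=512)
    return y_clean
```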
Neural Source Separation
Mixture → U-Net/Conv-TasNet → Separated Sources
                │
      Mask estimation per source
Speaker Verification
"Is this the claimed speaker?"
Pipeline
Enrollment:    Audio → Embedding Model → Reference Vector ──┐
                                                            │
Verification:  Audio → Embedding Model → Query Vector ──────┤
                                                            ▼
                                                    Cosine Similarity
                                                            │
                                                            ▼
                                                      Accept/Reject
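A sketch of the decision step; the embedding vectors come from any speaker encoder (e.g. ECAPA-TDNN), and the threshold is an illustrative assumption, normally chosen on a development set:

```python
import numpy as np

def verify(reference_vec, query_vec, threshold=0.6):
    """Accept if cosine similarity to the enrolled vector is high enough."""
    ref = reference_vec / np.linalg.norm(reference_vec)
    qry = query_vec / np.linalg.norm(query_vec)
    score = float(ref @ qry)           # cosine similarity in [-1, 1]
    return score >= threshold, score   # (decision, raw score)
```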
Metrics
| Metric | Description |
|---|---|
| EER | Equal error rate: operating point where FAR = FRR |
| minDCF | Minimum detection cost function |
| TAR@FAR | True accept rate at a fixed false accept rate |
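EER can be read off the ROC curve as the point where the false accept rate equals the false reject rate. A sketch with scikit-learn, assuming binary labels (1 = same speaker) and similarity scores:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the error rate at the point where FAR (=FPR) equals FRR (=1-TPR)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    frr = 1 - tpr
    idx = np.nanargmin(np.abs(fpr - frr))  # closest point to FAR == FRR
    return (fpr[idx] + frr[idx]) / 2
```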
Prosody Transfer
Transfer speaking style:
Source Audio → Style Encoder → Style Vector
                                    │
               ┌────────────────────┘
               ▼
Target Text → TTS → New Audio with Source Style
Style includes:
- Speaking rate
- Pitch patterns
- Emphasis
- Emotion
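Two of these cues can be extracted directly with librosa: an F0 contour via pYIN and a frame-level energy (RMS) contour. Speaking rate and emphasis need word alignments, which this sketch skips; the pitch range is an assumption for typical speech:

```python
import librosa

def prosody_features(y, sr=16000):
    """Extract simple prosody descriptors: pitch contour and energy."""
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=65.0, fmax=1000.0, sr=sr)   # f0 is NaN on unvoiced frames
    energy = librosa.feature.rms(y=y)[0]    # frame-level loudness
    return f0, energy
```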
Quality Metrics
| Metric | Measures | Range |
|---|---|---|
| WER | ASR accuracy | 0-∞ (lower=better) |
| MOS | Subjective quality | 1-5 |
| PESQ | Perceptual quality | -0.5 to 4.5 |
| STOI | Intelligibility | 0-1 |
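A self-contained sketch of WER as word-level edit distance (substitutions + deletions + insertions divided by reference length), which is why it can exceed 1; libraries such as jiwer implement the same computation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# wer("the cat sat", "the cat sat down") -> 1/3
```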
References
- Radford, A., et al. (2023). "Robust Speech Recognition via Large-Scale Weak Supervision." ICML. (Whisper)
- Ren, Y., et al. (2021). "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech." ICLR.
- Kong, J., et al. (2020). "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis." NeurIPS.
- Desplanques, B., et al. (2020). "ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification." Interspeech.