# How TTS Works, and Building a Python TTS Library from Scratch

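> Text-to-Speech (TTS) is the technology of turning arbitrary text into natural, fluent, intelligible speech. This article first explains the principles, then builds a **minimal but complete** TTS library skeleton from scratch (text frontend + acoustic model + vocoder + inference). All code is written to run with plain `python`.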

---

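## 1. What TTS is and why it is hard

| Dimension | Description |
| --- | --- |
| Input | Arbitrary text (numbers, abbreviations, polyphonic characters, prosody, emotion, ...) |
| Output | A PCM waveform at a sampling rate such as 16 kHz or 24 kHz |
| Challenges | Textual ambiguity, long-sentence prosody, accent and emotion, lip-sync alignment (video dubbing), real-time constraints |

The human ear is very sensitive to speech: a stall of a few tens of milliseconds, an odd pause, or mechanical prosody is easy to notice. So **TTS is both a "language problem" and a "signal problem"**.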

---

## 2. Overall architecture of a TTS system

### 2.1 The traditional three-stage pipeline

```text
┌───────────────┐    ┌───────────────┐    ┌───────────────┐    ┌───────────────┐
│ Text Frontend │ →  │ Acoustic      │ →  │ Vocoder       │ →  │ Waveform      │
│               │    │ Model         │    │               │    │ Output        │
└───────────────┘    └───────────────┘    └───────────────┘    └───────────────┘
  normalization,       text features →      mel-spectrogram      .wav byte stream
  segmentation,        mel-spectrogram      → waveform (PCM)
  prosody, phonemes
```

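1. **Text frontend**: turns a string such as `"2024 年花了 ¥1234.5"` into a **phoneme sequence plus prosody markers**. Chinese additionally needs **word segmentation, polyphone disambiguation, and tone sandhi**.
2. **Acoustic model**: converts the phoneme sequence into a **mel-spectrogram** (a time-frequency map on a roughly logarithmic frequency scale, and the "intermediate representation" of mainstream TTS).
3. **Vocoder**: converts the mel-spectrogram back into a **waveform** (raw PCM).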
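The last box of the pipeline, "raw PCM in a .wav byte stream", can be made concrete with the standard library alone. The sketch below is illustrative: the helper names `float_to_pcm16` and `write_wav` are made up for this example, and a synthetic 440 Hz tone stands in for real vocoder output.

```python
import io
import math
import struct
import wave


def float_to_pcm16(samples):
    """Clip floats to [-1, 1] and pack them as little-endian 16-bit PCM."""
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def write_wav(samples, sample_rate=24000):
    """Return the bytes of a mono 16-bit WAV file holding `samples`."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)            # 2 bytes per sample = 16 bit
        wf.setframerate(sample_rate)
        wf.writeframes(float_to_pcm16(samples))
    return buf.getvalue()


# One second of a 440 Hz tone standing in for vocoder output.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(24000)]
wav_bytes = write_wav(tone)
print(len(wav_bytes))  # 44-byte header + 2 bytes * 24000 samples = 48044
```

Clipping before scaling matters in practice: a vocoder can overshoot the [-1, 1] range, and an unclipped value would wrap around when packed as a 16-bit integer.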

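### 2.2 Modern end-to-end architectures

Modern systems increasingly merge stages 2 and 3 into a single model that maps text directly to a waveform. The table lists representative models along that path; note that Tacotron 2 and FastSpeech 2 still rely on a separate vocoder:

| Model | Approach | Characteristics |
| --- | --- | --- |
| **Tacotron 2** | Encoder-decoder + attention, paired with a WaveNet vocoder | Early SOTA; slow, prone to skipped or repeated words |
| **FastSpeech 2** | Non-autoregressive Transformer + explicit duration predictor | Stable, fast, controllable prosody |
| **VITS / VITS2** | VAE + normalizing flows + adversarial training | End-to-end, good audio quality |
| **NaturalSpeech 2/3** | Diffusion model + neural audio codec | SOTA audio quality, slow |
| **GPT-SoVITS** | Speaker reference prompt + SoVITS fine-tuning | Few-shot cloning (about 5 seconds of audio) |
| **CosyVoice / CosyVoice 2** | LLM over discrete speech tokens + streaming | Multilingual, emotion, accents |
| **Bark** | GPT-style autoregressive generation of speech tokens | Can laugh, sigh, and sing |
| **Tortoise** | Autoregressive token model + diffusion decoder, conditioned on reference clips | Slow but high quality, supports zero-shot |

> "**End-to-end**" does not mean "**a single network**". VITS is still a composite of a **text encoder, a posterior encoder, a flow, and a waveform decoder**; what makes it end-to-end is that the parts are **trained jointly and inference produces no intermediate mel-spectrogram**.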

---

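## 3. Key concepts at a glance

### 3.1 Mel-spectrogram

Take the waveform `x(t)`, apply the short-time Fourier transform (STFT), warp the frequency axis onto the **mel scale** (which mirrors the ear's finer resolution at low frequencies than at high ones), and take the logarithm. The result is `mel(T, 80)`: `T` frames by, typically, 80 channels. It is the "**lingua franca**" of TTS.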

```python
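import torch, torchaudio

# Stand-in input: one second of noise, shape (1, num_samples); load real audio in practice.
waveform = torch.randn(1, 22050)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80
)(waveform)              # (1, n_mels, T)
log_mel = torch.log(mel.clamp(min=1e-5))   # clamp avoids log(0)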
```
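To make the mel warping and the frame count `T` tangible, here is a small stdlib-only sketch. It assumes the HTK mel formula (the `mel_scale="htk"` default in torchaudio) and a centered STFT; the helper names are invented for illustration.

```python
import math


def hz_to_mel(f_hz):
    """HTK-style mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_centers(n_mels=80, f_min=0.0, f_max=11025.0):
    """Center frequencies (Hz) of n_mels bands equally spaced on the mel axis."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


def num_frames(n_samples, hop_length=256):
    """Frame count of a centered STFT (center=True)."""
    return 1 + n_samples // hop_length


centers = mel_centers()
# Bands are narrow at low frequencies and wide at high frequencies.
print(centers[1] - centers[0] < centers[-1] - centers[-2])  # True
print(round(hz_to_mel(1000.0)))                              # 1000
print(num_frames(22050))                                     # 87 frames for one second
```

With `hop_length=256` at 22.05 kHz, each mel frame covers about 11.6 ms, which is the time resolution at which the acoustic model has to place phonemes.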

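### 3.2 Phonemes

English typically uses ARPAbet; Chinese uses pinyin or an initial/final inventory (for example via `pypinyin`, optionally with `jieba` for word segmentation). The same character can be read differently depending on context: this is the **polyphone** problem.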

```python
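from pypinyin import lazy_pinyin, Style
# The comments give the correct readings; what pypinyin returns depends on its phrase dictionary.
lazy_pinyin("重庆", style=Style.NORMAL)      # correct: chong qing (not zhong qing)
lazy_pinyin("行长", style=Style.NORMAL)      # correct: hang zhang ("bank president")
lazy_pinyin("银行行长", style=Style.NORMAL)  # correct: yin hang hang zhang; needs context to disambiguate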
```
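The simplest form of context-based disambiguation is a phrase dictionary with greedy longest match: a phrase-level reading wins over the per-character default. The sketch below uses two tiny hand-written lexicons (`PHRASES` and `CHARS` are toy data for illustration, not a real dictionary), and `g2p` is a hypothetical helper name.

```python
# Toy lexicons for illustration only; a real system would load a full dictionary.
PHRASES = {
    "银行": ["yin", "hang"],
    "行长": ["hang", "zhang"],
    "重庆": ["chong", "qing"],
}
CHARS = {"银": "yin", "行": "xing", "长": "chang", "重": "zhong", "庆": "qing", "很": "hen"}


def g2p(text, max_len=4):
    """Greedy forward longest match: phrase readings beat per-character defaults."""
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate phrase first, down to length 2.
        for n in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + n] in PHRASES:
                out += PHRASES[text[i:i + n]]
                i += n
                break
        else:  # no phrase matched at position i: fall back to the default reading
            out.append(CHARS.get(text[i], "UNK"))
            i += 1
    return out


print(g2p("银行行长"))  # ['yin', 'hang', 'hang', 'zhang']
print(g2p("很长"))      # ['hen', 'chang']
```

Greedy matching fails when phrase boundaries are themselves ambiguous, which is why production frontends combine a dictionary with a learned disambiguation model.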

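### 3.3 Duration and alignment

Phoneme durations determine rhythm. FastSpeech 2 explicitly predicts **the number of frames for each phoneme**; Tacotron learns the alignment implicitly through **attention**. Learning the alignment end to end is the biggest pitfall in TTS training: **a misalignment shows up as dropped words, repetitions, or stalls**.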
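The link between an attention-style alignment and per-phoneme frame counts can be sketched in a few lines: assign each mel frame to its highest-weight phoneme and count. The matrix below is made up, and the helper names are hypothetical; this is an illustration, not the procedure of any particular model.

```python
def frame_path(attn):
    """For each mel frame, the index of the phoneme with the largest weight."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in attn]


def durations_from_alignment(attn):
    """attn[t][i] = weight of mel frame t on phoneme i. Returns frames per phoneme."""
    durations = [0] * len(attn[0])
    for idx in frame_path(attn):
        durations[idx] += 1
    return durations


def is_monotonic(attn):
    """True if the aligned phoneme index never moves backwards over time."""
    path = frame_path(attn)
    return all(a <= b for a, b in zip(path, path[1:]))


# Made-up soft alignment: 6 mel frames over 3 phonemes.
attn = [
    [0.9, 0.1, 0.0],
    [0.7, 0.3, 0.0],
    [0.2, 0.8, 0.0],
    [0.1, 0.6, 0.3],
    [0.0, 0.4, 0.6],
    [0.0, 0.1, 0.9],
]
print(durations_from_alignment(attn))  # [2, 2, 2]
print(is_monotonic(attn))              # True
```

In this picture the failure modes have simple signatures: a phoneme with duration 0 is a dropped sound, and a non-monotonic path means the model jumped back and repeated something.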

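### 3.4 Prosody

`F0` (fundamental frequency, which sets pitch), `energy` (which sets loudness), and `duration` (which sets rhythm) together make up prosody. Modern TTS systems usually **predict them separately and feed them in as conditions**, which makes transfer and control easier.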
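As a rough illustration of what the `F0` and `energy` targets are, the sketch below estimates both from a single frame: RMS amplitude for energy, and the autocorrelation peak for pitch. It is a textbook baseline under simplifying assumptions (clean, voiced, single-pitch input), not what production systems use; the function names are invented for this example.

```python
import math


def rms_energy(frame):
    """Root-mean-square amplitude of one frame (a proxy for loudness)."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def estimate_f0(frame, sample_rate, f_min=50.0, f_max=400.0):
    """Pick the autocorrelation peak within the plausible pitch-period range."""
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


SR = 8000
# Synthetic 200 Hz tone, 50 ms long, standing in for a voiced speech frame.
frame = [0.5 * math.sin(2 * math.pi * 200 * n / SR) for n in range(400)]
print(estimate_f0(frame, SR))        # 200.0
print(round(rms_energy(frame), 3))   # 0.354
```

Real speech also needs a voiced/unvoiced decision, because consonants and silence have no meaningful F0; this sketch always returns some pitch value.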

---

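## 4. Building a TTS library from scratch that actually runs

> Goal: understand the end-to-end flow. The model is small enough to train on a CPU within seconds on a toy dataset, and it produces speech with a **recognizable pitch contour** (audio quality only needs to be listenable). For production, use FastSpeech2 / VITS / CosyVoice.

### 4.1 Directory layout

```text
minitts/
├── __init__.py
├── frontend.py        # text frontend (normalization → phonemes)
├── mel.py             # mel-spectrogram / waveform utilities
├── acoustic.py        # acoustic model (a small Transformer here)
├── vocoder.py         # vocoder (Griffin-Lim here)
├── duration.py        # duration predictor
├── dataset.py         # text-audio dataset
├── train.py           # training entry point
└── synthesize.py      # inference entry point
```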

### 4.2 Text frontend [frontend.py](007_frontend.py)

```python
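# filepath: minitts/frontend.py
"""Minimal text frontend: mixed English + Chinese text → phoneme id sequence."""
import re
from pypinyin import lazy_pinyin, Style

# Simplified letter → CMU-style symbol table (use a full G2P in practice).
CMU = {
    "a":"AA","b":"B","c":"K","d":"D","e":"EY","f":"F","g":"G","h":"HH",
    "i":"IH","j":"JH","k":"K","l":"L","m":"M","n":"N","o":"OW","p":"P",
    "q":"K","r":"R","s":"S","t":"T","u":"UW","v":"V","w":"W","x":"K",
    "y":"Y","z":"Z",
}
PAD, BOS, EOS, UNK = 0, 1, 2, 3

# Fixed symbol inventory: CMU symbols, pinyin letters, and the space marker.
# A fixed table keeps ids identical between training and inference (Python's
# str hash() is randomized per process) and inside the model's embedding range.
SYMBOLS = sorted(set(CMU.values()) | set(CMU.keys()) | {"SPACE"})
SYM2ID = {s: i + 4 for i, s in enumerate(SYMBOLS)}   # ids 0..3 are reserved

def normalize(text: str) -> str:
    # Crude normalization: lowercase and strip punctuation/symbols
    # (use WeTextProcessing / tn in production). Digits survive here but are
    # dropped by text_to_phonemes, so verbalize numbers before calling it.
    text = text.lower()
    text = re.sub(r"[^\w\s\u4e00-\u9fff]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def text_to_phonemes(text: str) -> list[str]:
    text = normalize(text)
    phs: list[str] = []
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":                 # Chinese character
            # Per-character lookup ignores context, so polyphones get their default reading.
            for p in lazy_pinyin(ch, style=Style.NORMAL, errors=lambda x: [["UNK"]]):
                phs += list(p)                         # split the pinyin into letters
        elif ch.isalpha():
            phs.append(CMU.get(ch, "UNK"))
        elif ch == " ":
            phs.append("SPACE")
    return phs

def encode(phs: list[str]) -> list[int]:
    # Look up each symbol in the fixed table; unknown symbols map to UNK.
    return [SYM2ID.get(p, UNK) for p in phs]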
```
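The `normalize` function above only strips symbols, so a string like `"2024 年花了 ¥1234.5"` would lose its numbers. A real frontend verbalizes them first. The sketch below shows the idea for two cases only, years and yuan amounts up to 9999; `int_to_zh`, `read_digits`, and `verbalize` are hypothetical helpers written for this example, and a production system should rely on a dedicated text-normalization library instead.

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]


def int_to_zh(n):
    """Read an integer 0..9999 as a Mandarin cardinal, e.g. 1005 -> 一千零五."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only covers 0..9999")
    if n == 0:
        return DIGITS[0]
    text, out, zero_pending = str(n), "", False
    for i, ch in enumerate(text):
        d = int(ch)
        if d == 0:
            zero_pending = True       # emit a single 零 only if a digit follows
            continue
        if zero_pending:
            out += DIGITS[0]
            zero_pending = False
        out += DIGITS[d] + UNITS[len(text) - 1 - i]
    return out[1:] if 10 <= n <= 19 else out   # 一十二 -> 十二


def read_digits(s):
    """Read a digit string one digit at a time."""
    return "".join(DIGITS[int(c)] for c in s)


def verbalize(text):
    # Years are read digit by digit: 2024 年 -> 二零二四年
    text = re.sub(r"(\d{4})\s*年", lambda m: read_digits(m.group(1)) + "年", text)

    # Yuan amounts: cardinal integer part, digit-by-digit fraction.
    def money(m):
        frac = "点" + read_digits(m.group(2)) if m.group(2) else ""
        return int_to_zh(int(m.group(1))) + frac + "元"

    return re.sub(r"[¥￥](\d+)(?:\.(\d+))?", money, text)


print(verbalize("2024 年花了 ¥1234.5"))  # 二零二四年花了 一千二百三十四点五元
```

The same number is read differently depending on its role (year versus amount), which is why normalization has to run on the raw text before punctuation and symbols are stripped.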

### 4.3 Mel utilities [mel.py](007_mel.py)

```python
# filepath: minitts/mel.py
import torch, torchaudio

SAMPLE_RATE = 22050
N_FFT, HOP, WIN, N_MELS = 1024, 256, 1024, 80

mel_spec = torchaudio.transforms.MelSpectrogram(
    SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP, win_length=WIN,
    n_mels=N_MELS, power=2.0,
)
inv_mel = torchaudio.transforms.InverseMelScale(
    n_stft=N_FFT // 2 + 1, n_mels=N_MELS, sample_rate=SAMPLE_RATE
)

def wav_to_mel(wav: torch.Tensor) -> torch.Tensor:
    # (..., samples) → log-mel (..., n_mels, frames); clamp avoids log(0)
    return torch.log(mel_spec(wav).clamp(min=1e-5))

def mel_to_wav(mel: torch.Tensor, n_iter: int = 32) -> torch.Tensor:
    """Griffin-Lim: mel → linear spectrogram → phase reconstruction → waveform.
    Slow and metallic-sounding, but needs no pretrained model."""
    linear = inv_mel(mel.exp())          # undo the log, then invert the mel filterbank
    griffin = torchaudio.transforms.GriffinLim(
        n_fft=N_FFT, hop_length=HOP, win_length=WIN, n_iter=n_iter,
        power=2.0,                       # matches the power spectrogram from mel_spec
    )
    return griffin(linear)
```

### 4.4 Acoustic model [acoustic.py](007_acoustic.py)

```python
# filepath: minitts/acoustic.py
"""A 2-layer Transformer encoder as the acoustic model: takes phoneme ids
and outputs mel frames of the same length."""
import torch, torch.nn as nn

class AcousticModel(nn.Module):
    def __init__(self, vocab=1000, d_model=128, nhead=4, layers=2, n_mels=80):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(2048, d_model)     # learned positions, max length 2048
        # batch_first=True because inputs are (B, T, d_model);
        # the default layout is (T, B, d_model), which would attend across the batch.
        enc_layer = nn.TransformerEncoderLayer(
            d_model, nhead, dim_feedforward=512, batch_first=True
        )
        self.enc = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.proj = nn.Linear(d_model, n_mels)

    def forward(self, phn_ids: torch.Tensor) -> torch.Tensor:
        # phn_ids: (B, T_phn)  →  (B, T_phn, n_mels)
        # Simplification: padded positions are not masked out of attention.
        B, T = phn_ids.shape
        pos = torch.arange(T, device=phn_ids.device).unsqueeze(0).expand(B, T)
        x = self.emb(phn_ids) + self.pos(pos)
        x = self.enc(x)                # key point: sequence length is preserved
        return self.proj(x)
```

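> Simplification: a real FastSpeech2 also needs a **duration predictor + Length Regulator** (which repeats each phoneme vector N times to reach mel-frame resolution). Without them, phonemes map to mel frames 1:1 and the rhythm is stiff. For a complete implementation, see [ming024/FastSpeech2](https://github.com/ming024/FastSpeech2).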
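The Length Regulator itself is a very small operation. The sketch below shows it on plain Python lists so the behavior is easy to inspect; `length_regulate` and `scale_durations` are illustrative names, and the durations are made up rather than predicted.

```python
def length_regulate(phoneme_vectors, durations):
    """Repeat each phoneme vector durations[i] times to reach mel-frame length."""
    if len(phoneme_vectors) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames = []
    for vec, d in zip(phoneme_vectors, durations):
        frames.extend(list(vec) for _ in range(d))   # d independent copies
    return frames


def scale_durations(durations, speed=1.0):
    """speed > 1 shortens every phoneme; each keeps at least one frame."""
    return [max(1, round(d / speed)) for d in durations]


hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # 3 phonemes, 2-dim vectors
durations = [2, 1, 3]                            # frames per phoneme (made up)

frames = length_regulate(hidden, durations)
print(len(frames))   # 6 mel-frame positions
print(frames[0] == frames[1], frames[2])   # True [0.3, 0.4]
```

On tensors the same expansion is a single call, `torch.repeat_interleave(x, durations, dim=0)`, and scaling the durations before expanding is what gives this family of models its speed control.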

### 4.5 Dataset [dataset.py](007_dataset.py)

```python
# filepath: minitts/dataset.py
import os
import torch, torchaudio
from torch.utils.data import Dataset
from .frontend import text_to_phonemes, encode
from .mel import wav_to_mel, SAMPLE_RATE

class TtsDataset(Dataset):
    """LJSpeech-style dataset. `items` holds (wav_id, text) pairs taken from
    metadata.csv rows of the form id|text|normalized_text."""
    def __init__(self, root: str, items: list[tuple[str, str]]):
        self.root, self.items = root, items

    def __len__(self): return len(self.items)

    def __getitem__(self, i):
        wav_id, text = self.items[i]
        wav, sr = torchaudio.load(os.path.join(self.root, "wavs", f"{wav_id}.wav"))
        assert sr == SAMPLE_RATE, f"expected {SAMPLE_RATE} Hz, got {sr}"
        mel = wav_to_mel(wav[:1]).squeeze(0).T   # first channel only → (T_mel, n_mels)
        phn = torch.tensor(encode(text_to_phonemes(text)), dtype=torch.long)
        return phn, mel
```

### 4.6 Training [train.py](007_train.py)

```python
# filepath: minitts/train.py
import torch
from torch.utils.data import DataLoader
from .acoustic import AcousticModel
from .dataset import TtsDataset

def collate(batch):
    phns, mels = zip(*batch)
    # pad every sequence in the batch to the same length
    phn_lens = [len(p) for p in phns]
    mel_lens = [m.shape[0] for m in mels]
    phn_pad = torch.nn.utils.rnn.pad_sequence(phns, batch_first=True, padding_value=0)
    mel_pad = torch.nn.utils.rnn.pad_sequence(mels, batch_first=True, padding_value=0)
    return phn_pad, mel_pad, torch.tensor(phn_lens), torch.tensor(mel_lens)

def train(items, root, epochs=20, lr=3e-4, device="cpu"):   # or "mps" / "cuda"
    model = AcousticModel().to(device)
    opt   = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(TtsDataset(root, items), batch_size=8, shuffle=True,
                        collate_fn=collate, num_workers=2)

    for ep in range(epochs):
        for phn, mel, phn_len, mel_len in loader:
            phn, mel = phn.to(device), mel.to(device)
            # Simplification: force phoneme length == mel length by truncation
            # (a real model needs a LengthRegulator).
            T = min(phn.size(1), mel.size(1))
            pred, target = model(phn[:, :T]), mel[:, :T]     # both (B, T, n_mels)
            # Score only real positions: padded log-mel frames are zeros, not silence.
            valid = torch.minimum(phn_len, mel_len).clamp(max=T).to(device)
            mask = torch.arange(T, device=device)[None, :] < valid[:, None]   # (B, T)
            loss = (pred - target).abs()[mask].mean()        # masked L1
            opt.zero_grad(); loss.backward(); opt.step()
        print(f"epoch {ep:02d} loss={loss.item():.4f}")      # loss of the last batch
    return model
```

### 4.7 Inference [synthesize.py](007_synthesize.py)

```python
# filepath: minitts/synthesize.py
import torch, torchaudio
from .frontend import text_to_phonemes, encode
from .mel import mel_to_wav, SAMPLE_RATE

@torch.no_grad()
def synth(model, text: str, out: str = "out.wav", device="cpu"):   # or "mps" / "cuda"
    model = model.to(device).eval()      # eval() disables dropout for inference
    phn = torch.tensor([encode(text_to_phonemes(text))], device=device)
    mel = model(phn).squeeze(0).T        # (n_mels, T)
    wav = mel_to_wav(mel.cpu())          # Griffin-Lim, run on the CPU
    torchaudio.save(out, wav.unsqueeze(0), SAMPLE_RATE)
    return out
```

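### 4.8 Running it end to end

```bash
# Data: LJSpeech, or about 50 self-recorded clips (wav + text), is enough.
# These commands assume train.py and synthesize.py each define an argparse
# entry point, which the listings above do not show.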
python -m minitts.train --root data/ljspeech --epochs 20
python -m minitts.synthesize --text "你好 world" --out hello.wav
```
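The commands above pass `--root`, `--epochs`, `--text`, and `--out`, so each module needs a small command-line wrapper around `train()` and `synth()`. One possible shape is sketched below with `argparse`. The `--ckpt` option and the `minitts.pt` default are assumptions added for illustration (the article does not say where the trained model is stored), and `parse_metadata` is a hypothetical helper for the `id|text|normalized_text` format.

```python
import argparse


def build_train_parser():
    p = argparse.ArgumentParser(prog="minitts.train")
    p.add_argument("--root", required=True, help="dataset dir with metadata.csv and wavs/")
    p.add_argument("--epochs", type=int, default=20)
    p.add_argument("--ckpt", default="minitts.pt", help="assumed checkpoint path")
    return p


def build_synth_parser():
    p = argparse.ArgumentParser(prog="minitts.synthesize")
    p.add_argument("--text", required=True)
    p.add_argument("--out", default="out.wav")
    p.add_argument("--ckpt", default="minitts.pt", help="assumed checkpoint path")
    return p


def parse_metadata(lines):
    """Parse rows 'id|text|normalized_text' into (id, text) pairs.

    The last column is used, so the normalized text wins when present."""
    items = []
    for line in lines:
        parts = line.rstrip("\n").split("|")
        if len(parts) >= 2:
            items.append((parts[0], parts[-1]))
    return items


args = build_train_parser().parse_args(["--root", "data/ljspeech", "--epochs", "20"])
print(args.root, args.epochs, args.ckpt)
print(parse_metadata(["utt001|Dr. Who|Doctor Who\n"]))
```

Under these assumptions, `train.py` would end with an `if __name__ == "__main__":` block that parses the arguments, reads `metadata.csv` through `parse_metadata`, calls `train(...)`, and saves `model.state_dict()` to the checkpoint path; `synthesize.py` would load that checkpoint before calling `synth(...)`.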

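> **Griffin-Lim** serves as the vocoder here so that the demo needs no pretrained model. In production, a neural vocoder such as `HiFi-GAN` can be on the order of 100× faster, depending on hardware, and sounds close to a real speaker.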

---

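## 5. Upgrading the minimal library to production grade

| Component to replace | Recommended implementation | Benefit |
| --- | --- | --- |
| Vocoder | `HiFi-GAN` / `iSTFT-Net` / `Vocos` | Real-time factor < 0.1 |
| Acoustic model | `FastSpeech 2` / `VITS2` / `F5-TTS` | Stable, controllable alignment |
| Text frontend | `WeTextProcessing` (Chinese) / `g2p_en` (English) | Handles numbers, years, units |
| Polyphones | Rules + neural disambiguation (`g2pW`, custom `pypinyin` dictionaries) | 99%+ accuracy |
| Prosody | `ProsodyPredictor`: explicit F0/energy/duration prediction | Style transfer |
| Zero-shot cloning | `CosyVoice`, `GPT-SoVITS`, `F5-TTS` | Clone a voice from a 5-second sample |
| Streaming | `CosyVoice`, chunk-wise FastSpeech 2 | Low-latency dialogue |
| Quantization | `torch.compile` + `int8` ONNX Runtime | About 100 ms to first audio on device |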
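The "real-time factor" in the table is the time spent synthesizing divided by the duration of the audio produced, so a value below 1 means faster than real time. A minimal way to measure it is sketched below; `measure_rtf` is a hypothetical helper, and the stand-in synthesizer just returns silence.

```python
import time


def real_time_factor(synth_seconds, n_samples, sample_rate):
    """RTF = time spent synthesizing / duration of the audio produced."""
    return synth_seconds / (n_samples / sample_rate)


def measure_rtf(synthesize, text, sample_rate):
    """Time a callable that returns a sequence of samples and report its RTF."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, len(samples), sample_rate)


# 0.5 s of compute for 10 s of 24 kHz audio.
print(real_time_factor(0.5, 240_000, 24_000))   # 0.05

# Stand-in synthesizer: returns one second of silence almost instantly.
print(measure_rtf(lambda text: [0.0] * 24_000, "hello", 24_000) < 1.0)   # True
```

For streaming use, RTF alone is not enough: the latency until the first audio chunk matters as well, which is what the last row of the table refers to.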

---

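## 6. Open-source projects worth studying (in learning order)

1. **keithito/tacotron** – an early open-source Tacotron implementation (TensorFlow)
2. **NVIDIA/waveglow** / **jik876/hifi-gan** – classic vocoders
3. **ming024/FastSpeech2** – minimal and readable
4. **jaywalnut310/vits** – the pioneering end-to-end model
5. **RVC-Boss/GPT-SoVITS** – first choice for few-shot Chinese voice cloning
6. **FunAudioLLM/CosyVoice** – multilingual, streaming, emotion
7. **SWivid/F5-TTS** – the new flow-matching paradigm
8. **hexgrad/kokoro** – ultra-lightweight TTS with 82 million parameters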

---

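## 7. A suggested shortest learning path

```text
Day 1  Reproduce the minimal library in this article and get text→wav working
Day 3  Swap Griffin-Lim for a pretrained HiFi-GAN; audio quality improves immediately
Day 7  Replace the acoustic model with FastSpeech 2 and add a Length Regulator
Day 14 Train your first custom voice on LJSpeech or 1 hour of your own recordings
Day 30 Learn GPT-SoVITS and experiment with few-shot cloning
Day 60 Read the CosyVoice / F5-TTS papers and move on to SOTA work
```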

---

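## 8. Summary in one sentence

> **TTS = text frontend (linguistics) + acoustic model (spectrogram regression) + vocoder (phase reconstruction)**.
> The key to building a library from scratch is not model size but **a closed pipeline, clean data, and stable alignment**.
> Once the roughly 200 lines above run, you have the "skeleton" behind about 80% of modern TTS.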


# Piper

https://rhasspy.github.io/piper-samples/#zh_CN-huayan-medium
https://github.com/OHF-Voice/piper1-gpl/tree/main

    pip install piper-tts
    python3 -m piper.download_voices
    python3 -m piper.download_voices zh_CN-huayan-medium
    python3 -m piper -m zh_CN-huayan-medium -f test.wav -- '很高兴认识你.'
    ffplay test.wav

    python3 -m pip install 'piper-tts[http]'
    python3 -m piper.http_server -m zh_CN-huayan-medium --port 5001
    curl -s -X POST -H 'Content-Type: application/json' -d '{ "text": "你从哪里来，我的朋友" }' -o test.wav localhost:5001
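The `curl` call above can also be issued from Python with the standard library. The sketch below mirrors it: a JSON body with a `text` field posted to `localhost:5001`. That the response body is WAV data is inferred from the `-o test.wav` in the curl command; `build_request` and `synthesize_to_file` are names made up for this example.

```python
import json
import urllib.request


def build_request(text, url="http://localhost:5001"):
    """Build the same POST the curl example sends: a JSON body with a "text" field."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def synthesize_to_file(text, path="test.wav", url="http://localhost:5001"):
    """Send the request and save the response body (assumed to be WAV bytes)."""
    with urllib.request.urlopen(build_request(text, url), timeout=30) as resp:
        data = resp.read()
    with open(path, "wb") as f:
        f.write(data)
    return path


req = build_request("你从哪里来，我的朋友")
print(req.get_method(), req.full_url)
# With the Piper HTTP server running:
# synthesize_to_file("你从哪里来，我的朋友")
```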

# kokoro

https://github.com/hexgrad/kokoro

    pip install kokoro
    unset HF_ENDPOINT
    export http_proxy=http://127.0.0.1:10808
    export https_proxy=http://127.0.0.1:10808
    export all_proxy=socks5://127.0.0.1:10808

    export HTTP_PROXY=http://127.0.0.1:10808
    export HTTPS_PROXY=http://127.0.0.1:10808
    export ALL_PROXY=socks5://127.0.0.1:10808

    hf download hexgrad/Kokoro-82M
    du -sh ~/.cache/huggingface/hub/models--hexgrad--Kokoro-82M
        347M	/Users/huhao/.cache/huggingface/hub/models--hexgrad--Kokoro-82M

    pip install ordered_set
    pip install cn2an

Code

    % cat 1.py
    from kokoro import KPipeline
    import soundfile as sf

    pipeline = KPipeline(
        lang_code="z",
        repo_id="hexgrad/Kokoro-82M"
    )

    text = "今天我们来聊一个有趣的话题。"

    generator = pipeline(
        text,
        voice="zf_xiaoni"
    )

    for i, (gs, ps, audio) in enumerate(generator):
        sf.write(f"output_{i}.wav", audio, 24000)

    print("done")

Convert the format

    ffmpeg -i output_0.wav -c:a libmp3lame -q:a 2 output_0.mp3

Voices


    cat ~/.cache/huggingface/hub/models--hexgrad--Kokoro-82M/snapshots/f3ff3571791e39611d31c381e3a41a3af07b4987/VOICES.md

- `lang_code='z'` in [`misaki[zh]`](https://github.com/hexgrad/misaki)
- Total Mandarin Chinese training data: H hours

| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| ---- | ------ | -------------- | ----------------- | ------------- | ------ |
| zf_xiaobei | 🚺 | C | MM minutes | D | `9b76be63` |
| zf_xiaoni | 🚺 | C | MM minutes | D | `95b49f16` |
| zf_xiaoxiao | 🚺 | C | MM minutes | D | `cfaf6f2d` |
| zf_xiaoyi | 🚺 | C | MM minutes | D | `b5235dba` |
| zm_yunjian | 🚹 | C | MM minutes | D | `76cbf8ba` |
| zm_yunxi | 🚹 | C | MM minutes | D | `dbe6e1ce` |
| zm_yunxia | 🚹 | C | MM minutes | D | `bb2b03b0` |
| zm_yunyang | 🚹 | C | MM minutes | D | `5238ac22` |

# links

https://presenc.ai/research/best-open-weight-text-to-speech-models-2026