语音合成经典模型结构介绍
创始人
2024-01-16 23:13:52
0

(以下内容搬运自 PaddleSpeech)

Models introduction

TTS system mainly includes three modules: Text Frontend, Acoustic model and Vocoder. We introduce a rule-based Chinese text frontend in cn_text_frontend.md. Here, we will introduce acoustic models and vocoders, which are trainable.

The main processes of TTS include:

  1. Convert the original text into characters/phonemes, through the text frontend module.
  2. Convert characters/phonemes into acoustic features, such as linear spectrogram, mel spectrogram, LPC features, etc. through Acoustic models.
  3. Convert acoustic features into waveforms through Vocoders.

A simple text frontend module can be implemented by rules. Acoustic models and vocoders need to be trained. The models provided by PaddleSpeech TTS are acoustic models and vocoders.

Acoustic Models

Modeling Objectives of Acoustic Models

Modeling the mapping relationship between text sequences and speech features:

text X = {x1,...,xM}
specch Y = {y1,...yN}

Modeling Objectives:

Ω = argmax p(Y|X,Ω)

Modeling process of Acoustic Models

At present, there are two mainstream acoustic model structures.

  • Frame level acoustic model:
    • Duration model (M Tokens - > N Frames).
    • Acoustic decoder (N Frames - > N Frames).

  • Sequence to sequence acoustic model:
    • M Tokens - > N Frames.

Tacotron2

Tacotron is the first end-to-end acoustic model based on deep learning, and it is also the most widely used acoustic model.

Tacotron2 is the Improvement of Tacotron.

Tacotron

Features of Tacotron:

  • Encoder.
    • CBHG.
    • Input: character sequence.
  • Decoder.
    • Global soft attention.
    • unidirectional RNN.
    • Autoregressive teacher force training (input real speech feature).
    • Multi frame prediction.
    • CBHG postprocess.
    • Vocoder: Griffin-Lim.

Advantage of Tacotron:

  • No need for complex text frontend analysis modules.
  • No need for an additional duration model.
  • Greatly simplifies the acoustic model construction process and reduces the dependence of speech synthesis tasks on domain knowledge.

Disadvantages of Tacotron:

  • The CBHG is complex and the amount of parameters is relatively large.
  • Global soft attention.
  • Poor stability for speech synthesis tasks.
  • In training, the less the number of speech frames predicted at each moment, the more difficult it is to train.
  • Phase problem in Griffin-Lim causes speech distortion during wave reconstruction.
  • The autoregressive decoder cannot be stopped during the generation process.

Tacotron2

Features of Tacotron2:

  • Reduction of parameters.
    • CBHG -> PostNet (3 Conv layers + BLSTM or 5 Conv layers).
    • remove Attention RNN.
  • Speech distortion caused by Griffin-Lim.
    • WaveNet.
  • Improvements of PostNet.
    • CBHG -> 5 Conv layers.
    • The input and output of the PostNet calculate L2 loss with real Mel spectrogram.
    • Residual connection.
  • Bad stop in an autoregressive decoder.
    • Predict whether it should stop at each moment of decoding (stop token).
    • Set a threshold to determine whether to stop generating when decoding.
  • Stability of attention.
    • Location-aware attention.
    • The alignment matrix of the previous time is considered at step t of the decoder.

You can find PaddleSpeech TTS’s tacotron2 with LJSpeech dataset example at examples/ljspeech/tts0.

TransformerTTS

Disadvantages of the Tacotrons:

  • Encoder and decoder are relatively weak at global information modeling
    • Vanishing gradient of RNN.
    • Fixed-length context modeling problem in CNN kernel.
  • Training is relatively inefficient.
  • The attention is not robust enough and the stability is poor.

Transformer TTS is a combination of Tacotron2 and Transformer.

Transformer

Transformer is a seq2seq model based entirely on an attention mechanism.

Features of Transformer:

  • Encoder.
    • N blocks based on self-attention mechanism.
    • Positional Encoding.
  • Decoder.
    • N blocks based on self-attention mechanism.
    • Add Mask to the self-attention in blocks to cover up the information after the t step.
    • Attentions between encoder and decoder.
    • Positional Encoding.

Transformer TTS

Transformer TTS is a seq2seq acoustic model based on Transformer and Tacotron2.

Motivations:

  • RNNs in Tacotron2 make the inefficiency of training.
  • Vanishing gradient of RNN makes the model’s ability to model long-term contexts weak.
  • Self-attention doesn’t contain any recursive structure which can be trained in parallel.
  • Self-attention can model global context information well.

Features of Transformer TTS:

  • Add conv based PreNet in encoder and decoder.
  • Stop Token in decoder controls when to stop autoregressive generation.
  • Add PostNet after decoder to improve the quality of synthetic speech.
  • Scaled position encoding.
    • Uniform scale position encoding may have a negative impact on input or output sequences.

Disadvantages of Transformer TTS:

  • The ability of position encoding for timing information is still relatively weak.
  • The ability to perceive local information is weak, and local information is more related to pronunciation.
  • Stability is worse than Tacotron2.

You can find PaddleSpeech TTS’s Transformer TTS with LJSpeech dataset example at examples/ljspeech/tts1.

FastSpeech2

Disadvantage of seq2seq models:

  • In the seq2seq model based on attention, no matter how to improve the attention mechanism, it’s difficult to avoid generation errors in the decoding stage.

Frame-level acoustic models use duration models to determine the pronunciation duration of phonemes, and the frame-level mapping does not have the uncertainty of sequence generation.

In seq2saq models, the concept of duration models is used as the alignment module of two sequences to replace attention, which can avoid the uncertainty in attention, and significantly improve the stability of the seq2saq models.

FastSpeech

Instead of using the encoder-attention-decoder based architecture as adopted by most seq2seq based autoregressive and non-autoregressive generation, FastSpeech is a novel feed-forward structure, which can generate a target mel spectrogram sequence in parallel.

Features of FastSpeech:

  • Encoder: based on Transformer.
  • Change FFN to CNN in self-attention.
    • Model local dependency.
  • Length regulator.
    • Use real phoneme durations to expand the output frame of the encoder during training.
  • Non-autoregressive decode.
    • Improve generation efficiency.

Length predictor:

  • Pretrain a TransformerTTS model.
  • Get alignment matrix of train data.
  • Calculate the phoneme durations according to the probability of the alignment matrix.
  • Use the output of the encoder to predict the phoneme durations and calculate the MSE loss.
  • Use real phoneme durations to expand the output frame of the encoder during training.
  • Use phoneme durations predicted by the duration model to expand the frame during prediction.
    • Attentrion can not control phoneme durations. The explicit duration modeling can control durations through duration coefficient (duration coefficient is 1 during training).

Advantages of non-autoregressive decoder:

  • The built-in duration model of the seq2seq model has converted the input length M to the output length N.
  • The length of the output is known, stop token is no longer used, avoiding the problem of being unable to stop.
    • Can be generated in parallel (decoding time is less affected by sequence length)

FastPitch

FastPitch follows FastSpeech. A single pitch value is predicted for every temporal location, which improves the overall quality of synthesized speech.


FastSpeech2

Disadvantages of FastSpeech:

  • The teacher-student distillation pipeline is complicated and time-consuming.
  • The duration extracted from the teacher model is not accurate enough.
  • The target mel spectrograms distilled from the teacher model suffer from information loss due to data simplification.

FastSpeech2 addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS.

Features of FastSpeech2:

  • Directly train the model with the ground-truth target instead of the simplified output from the teacher.
  • Introducing more variation information of speech as conditional inputs, extract duration, pitch, and energy from speech waveform and directly take them as conditional inputs in training and use predicted values in inference.

FastSpeech2 is similar to FastPitch but introduces more variation information of the speech.


You can find PaddleSpeech TTS’s FastSpeech2/FastPitch with CSMSC dataset example at examples/csmsc/tts3, We use token-averaged pitch and energy values introduced in FastPitch rather than frame-level ones in FastSpeech2.

SpeedySpeech

SpeedySpeech simplify the teacher-student architecture of FastSpeech and provide a fast and stable training procedure.

Features of SpeedySpeech:

  • Use a simpler, smaller, and faster-to-train convolutional teacher model (Deepvoice3 and DCTTS) with a single attention layer instead of Transformer used in FastSpeech.
  • Show that self-attention layers in the student network are not needed for high-quality speech synthesis.
  • Describe a simple data augmentation technique that can be used early in the training to make the teacher network robust to sequential error propagation.

You can find PaddleSpeech TTS’s SpeedySpeech with CSMSC dataset example at examples/csmsc/tts2.

Vocoders

In speech synthesis, the main task of the vocoder is to convert the spectral parameters predicted by the acoustic model into the final speech waveform.

Taking into account the short-term change frequency of the waveform, the acoustic model usually avoids direct modeling of the speech waveform, but firstly models the spectral features extracted from the speech waveform, and then reconstructs the waveform by the decoding part of the vocoder.

A vocoder usually consists of a pair of encoders and decoders for speech analysis and synthesis. The encoder estimates the parameters, and then the decoder restores the speech.

Vocoders based on neural networks usually is speech synthesis, which learns the mapping relationship from spectral features to waveforms through training data.

Categories of neural vocodes

  • Autoregression

    • WaveNet
    • WaveRNN
    • LPCNet
  • Flow

    • WaveFlow
    • WaveGlow
    • FloWaveNet
    • Parallel WaveNet
  • GAN

    • WaveGAN
    • Parallel WaveGAN
    • MelGAN
    • Style MelGAN
    • Multi Band MelGAN
    • HiFi GAN
  • VAE

    • Wave-VAE
  • Diffusion

    • WaveGrad
    • DiffWave

Motivations of GAN-based vocoders:

  • Modeling speech signals by estimating probability distribution usually has high requirements for the expression ability of the model itself. In addition, specific assumptions need to be made about the distribution of waveforms.
  • Although autoregressive neural vocoders can obtain high-quality synthetic speech, such models usually have a slow generation speed.
  • The training of inverse autoregressive flow vocoders is complex, and they also require the modeling capability of long-term context information.
  • Vocoders based on Bipartite Transformation converge slowly and are complex.
  • GAN-based vocoders don’t need to make assumptions about the speech distribution and train through adversarial learning.

Here, we introduce a Flow-based vocoder WaveFlow and a GAN-based vocoder Parallel WaveGAN.

WaveFlow

WaveFlow is proposed by Baidu Research.

Features of WaveFlow:

  • It can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an Nvidia V100 GPU without engineered inference kernels, which is faster than WaveGlow and several orders of magnitude faster than WaveNet.
  • It is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M).
  • It is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in Parallel WaveNet and ClariNet, which simplifies the training pipeline and reduces the cost of development.

You can find PaddleSpeech TTS’s WaveFlow with LJSpeech dataset example at examples/ljspeech/voc0.

Parallel WaveGAN

Parallel WaveGAN trains a non-autoregressive WaveNet variant as a generator in a GAN-based training method.

Features of Parallel WaveGAN:

  • Use non-causal convolution instead of causal convolution.
  • The input is random Gaussian white noise.
  • The model is non-autoregressive both in training and prediction, which is fast
  • Multi-resolution STFT loss.

You can find PaddleSpeech TTS’s Parallel WaveGAN with CSMSC example at examples/csmsc/voc1.

P.S. 欢迎关注我们的 github repo PaddleSpeech, 是基于飞桨 PaddlePaddle 的语音方向的开源模型库,用于语音和音频中的各种关键任务的开发,包含大量基于深度学习前沿和有影响力的模型。

相关内容

热门资讯

关于春天的优美诗歌朗诵稿5篇 春天的到来,整个世界都恢复了生机,万物复苏,百花盛开。关于春天的诗歌朗诵你知道有哪些吗?下面是小编整...
依然是你诗歌 依然是你诗歌  是“你”  这样逗留在我的文字里不肯离去  成为我蓦然回首时的`那盏灯火  是“你”...
八一的励志诗词 关于八一的励志诗词  诗词,是指以古体诗、近体诗和格律词为代表的中国汉族传统诗歌。亦是汉字文化圈的特...
3月21日是什么节日 3月21日是什么节日  据说3月21日是节日最多的一天。知道今天是世界睡眠日的是普通青年;知道今天是...
唯美的外国现代诗歌 唯美的外国现代诗歌  在日常学习、工作或生活中,大家都看到过许多经典的诗歌吧,诗歌是表现诗人思想感情...
诗经采薇教学 采问答访稿推荐度:诗经中的女孩灵动名字1500个推荐度:教学总结推荐度:教学反思推荐度:《翠鸟》教学...
天国的记忆诗歌 天国的记忆诗歌  在日常学习、工作和生活中,大家一定没少看到经典的诗歌吧,诗歌具有精炼含蓄的特点,起...
拉萨河之恋诗歌 拉萨河之恋诗歌  拉萨河之恋  记得,那是个正午  敌不过念乡之苦  茫然不知所措的我  像一片绿叶...
思念是一种莫名的痛诗歌外 思念是一种莫名的痛诗歌外一首  今夜我一如既往等你  夜已深等你良久  相思的苦痛爬满心底  你终究...
天兵怒气冲霄汉 “天兵怒气冲霄汉”出处 出自 现代 毛泽东 的《渔家傲·反第一次大“围剿”》“天兵怒气冲霄汉”平仄韵...
陶渊明比较著名的诗词 陶渊明比较著名的诗词(精选10首)   陶渊明是东晋大诗人,也是中国历史上成就最高的诗人之一。宋代的...
唐诗之李商隐:无题·其二 唐诗三百首之李商隐:无题·其二唐诗三百首全集《无题·其二》作者:李商隐飒飒东风细雨来,芙蓉塘外有轻雷...
鲁迅《淡淡的血痕中》原文及读... 鲁迅《淡淡的血痕中》原文及读后感  【鲁迅《淡淡的血痕中》原文】  —纪念几个死者和生者和未生者— ...
适合儿童的唐诗 适合儿童的唐诗  唐诗的'辉煌成就,引起后人学习的兴趣和研究的热望。下面就是小编给大家整理的幼儿唐诗...
嫦娥 李商隐 嫦娥 李商隐嫦娥 李商隐,此其诗诗题为“嫦娥”,实际上抒写的是处境孤寂的主人公对于环境的感受和心灵独...
咏秋古诗词 中秋将至,秋风猎猎,秋高气爽。作为我国的传统节日,古代诗词中有很多千古传颂的咏秋之作,以诗抒情怀,以...
谷雨的古诗句 谷雨的古诗句  谷雨的雨,现身在古代文人墨客笔下,是相当愁人的。小编今天为大家带来关于谷雨的.古诗句...
表达奉献精神的诗句 表达奉献精神的诗句  在日常学习、工作和生活中,大家总少不了接触一些耳熟能详的诗句吧,诗句能使人们自...
经典爱情现代诗   经典爱情现代诗  1、《炉中煤》  啊,我年青的女郎!  我不辜负你的殷勤,  你也不要辜负了我...
描写兰花的诗句分享 描写兰花的诗句分享  丛兰生幽谷,莓莓遍林薄。不纫亦何伤,已胜当门托。辇至逾关山,滋培珍几阁。  冬...