This research paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text.
Here, the authors present the architecture and overall composition of the system, detailing the key steps needed for a successful implementation. The system has two components: a recurrent sequence-to-sequence feature prediction network and a modified version of WaveNet used to generate time-domain waveforms from mel-scale spectrograms. The paper also covers the training setup and the approach to audio quality evaluation.
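To make the term "mel-scale spectrogram" more concrete, here is a minimal sketch of how frequency bands can be spaced evenly on the mel scale. It uses the common HTK-style conversion formula, which is an assumption on our part and is not stated in the text above. The function names and the example values (80 bands between 125 Hz and 7600 Hz) are likewise illustrative, not taken from this summary.

```python
import math
from typing import List


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_band_centers(n_bands: int, f_min: float, f_max: float) -> List[float]:
    """Center frequencies in Hz of n_bands spaced evenly on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_bands)]


# Illustrative values only: 80 bands between 125 Hz and 7600 Hz.
centers = mel_band_centers(n_bands=80, f_min=125.0, f_max=7600.0)
```

Because the bands are evenly spaced in mels rather than in Hz, neighbouring centers sit close together at low frequencies and further apart at high frequencies, which gives a compact representation that emphasizes the range most relevant to speech.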
Code implementations of the proposed system can be found here.
The system is based on a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from these spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded