https://arxiv.org/abs/2010.05646
J. Kong et al., “HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis,” NeurIPS, 2020.
A paper submitted by Kakao!
Goal
- Proposal of a GAN-based model to efficiently generate high-fidelity speech
Contribution
- Higher computational efficiency and better sample quality than autoregressive (AR) or flow-based models
- Proposing a discriminator that consists of small sub-discriminators, each of which observes only a specific periodic part of raw waveforms
Method
Generator
- Fully convolutional neural network
- Multi-Receptive Field Fusion (MRF)
- Observing patterns of various lengths in parallel
- Output: sum of the outputs of multiple residual blocks with different kernel sizes and dilation rates (see the sketch below)
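
A minimal PyTorch sketch of the MRF idea. The kernel sizes (3, 7, 11) and dilations (1, 3, 5) roughly follow the paper's V1 configuration, but the residual-block internals here are simplified placeholders, not the paper's exact layer stack:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """One residual block of dilated 1D convolutions (simplified internals)."""
    def __init__(self, channels, kernel_size, dilations):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size, dilation=d,
                      padding=(kernel_size - 1) * d // 2)  # "same"-length output
            for d in dilations
        ])

    def forward(self, x):
        for conv in self.convs:
            x = x + conv(F.leaky_relu(x, 0.1))  # residual connection
        return x

class MRF(nn.Module):
    """Multi-Receptive Field Fusion: residual blocks with different kernel
    sizes and dilation rates observe the input in parallel; outputs are summed."""
    def __init__(self, channels, kernel_sizes=(3, 7, 11), dilations=(1, 3, 5)):
        super().__init__()
        self.blocks = nn.ModuleList(
            [ResBlock(channels, k, dilations) for k in kernel_sizes])

    def forward(self, x):
        return sum(block(x) for block in self.blocks)

# e.g. MRF(channels=128)(torch.randn(1, 128, 100)).shape -> (1, 128, 100)
```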

Figure 1: The generator upsamples mel-spectrograms up to |k_u| times to match the temporal resolution of raw waveforms. An MRF module adds features from |k_r| residual blocks of different kernel sizes and dilation rates. Lastly, the n-th residual block with kernel size k_r[n] and dilation rates D_r[n] in an MRF module is depicted.
Discriminator
- Speech audio consists of sinusoidal signals with various periods
- Identifying the various periodic patterns underlying speech data is therefore important

Figure 2: (a) The second sub-discriminator of MSD. (b) The second sub-discriminator of MPD with period 3.
Multi-Period Discriminator (MPD)
- A mixture of sub-discriminators, each accepting only equally spaced samples of the input audio (the spacing is the sub-discriminator's period)
- Designed to capture different implicit structures by looking at different parts of the input audio (see the sketch below)
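
A minimal sketch of the period reshaping at the heart of MPD: 1D audio of length T is reshaped into a 2D tensor of shape (T/p, p), so each column holds samples spaced p apart, and 2D convolutions with k×1 kernels process them. The period set {2, 3, 5, 7, 11} follows the paper; the conv stack and channel widths are simplified placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PeriodSubDiscriminator(nn.Module):
    """One MPD sub-discriminator with a given period p (simplified)."""
    def __init__(self, period, channels=32):
        super().__init__()
        self.period = period
        self.convs = nn.ModuleList([
            nn.Conv2d(1, channels, (5, 1), stride=(3, 1), padding=(2, 0)),
            nn.Conv2d(channels, channels, (5, 1), stride=(3, 1), padding=(2, 0)),
        ])
        self.out = nn.Conv2d(channels, 1, (3, 1), padding=(1, 0))

    def forward(self, x):                    # x: (batch, 1, T)
        b, c, t = x.shape
        pad = (self.period - t % self.period) % self.period
        x = F.pad(x, (0, pad), mode="reflect")
        x = x.view(b, c, -1, self.period)    # (batch, 1, T/p, p)
        feats = []
        for conv in self.convs:
            x = F.leaky_relu(conv(x), 0.1)
            feats.append(x)                  # kept for the feature matching loss
        return self.out(x), feats

# Periods from the paper: 2, 3, 5, 7, 11
mpd = nn.ModuleList(PeriodSubDiscriminator(p) for p in (2, 3, 5, 7, 11))
```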
Multi-Scale Discriminator (MSD)
- Since each MPD sub-discriminator accepts only disjoint (decomposed) samples, MSD is added to evaluate the continuous audio sequence
- Consists of 3 sub-discriminators operating on different input scales: raw audio, ×2 average-pooled audio, and ×4 average-pooled audio (see the sketch below)
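
A sketch of the multi-scale setup, assuming average pooling between scales (as in the paper, which follows MelGAN here); the 1D conv stack inside each sub-discriminator is a simplified placeholder:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleSubDiscriminator(nn.Module):
    """One MSD sub-discriminator operating on raw 1D audio (simplified)."""
    def __init__(self, channels=32):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(1, channels, 15, stride=1, padding=7),
            nn.Conv1d(channels, channels, 41, stride=4, padding=20, groups=4),
        ])
        self.out = nn.Conv1d(channels, 1, 3, padding=1)

    def forward(self, x):
        for conv in self.convs:
            x = F.leaky_relu(conv(x), 0.1)
        return self.out(x)

class MSD(nn.Module):
    """Three sub-discriminators on raw, x2-pooled, and x4-pooled audio."""
    def __init__(self):
        super().__init__()
        self.subs = nn.ModuleList(ScaleSubDiscriminator() for _ in range(3))
        self.pool = nn.AvgPool1d(4, stride=2, padding=2)

    def forward(self, x):          # x: (batch, 1, T)
        outs = []
        for sub in self.subs:
            outs.append(sub(x))
            x = self.pool(x)       # downsample before the next scale
        return outs
```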
Training Loss
GAN loss
- Replace the binary cross-entropy terms of the standard GAN objective with least-squares loss functions for non-vanishing gradient flow [X. Mao et al., 2017]
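
Written out as in the paper, where x is the ground truth audio and s its mel-spectrogram condition:

```latex
\mathcal{L}_{Adv}(D;G) = \mathbb{E}_{(x,s)}\!\left[(D(x)-1)^2 + \big(D(G(s))\big)^2\right]
\mathcal{L}_{Adv}(G;D) = \mathbb{E}_{s}\!\left[\big(D(G(s))-1\big)^2\right]
```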

Mel-spectrogram loss
- Added to improve the training efficiency of the generator and the fidelity of the generated audio
- L1 distance between the mel-spectrogram of a waveform synthesized by the generator and that of a ground truth waveform
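
Following the paper, with φ denoting the function that maps a waveform to its mel-spectrogram:

```latex
\mathcal{L}_{Mel}(G) = \mathbb{E}_{(x,s)}\!\left[\lVert \phi(x) - \phi(G(s)) \rVert_1\right]
```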

Feature Matching loss
- A learned similarity metric measured as the difference in discriminator features between a ground truth sample and a conditionally generated sample
- Computed as the L1 distance between the feature maps extracted from every intermediate layer of the discriminator
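
As formulated in the paper, where T is the number of layers in the discriminator, and D^i and N_i are the features and the number of features in the i-th layer:

```latex
\mathcal{L}_{FM}(G;D) = \mathbb{E}_{(x,s)}\!\left[\sum_{i=1}^{T} \frac{1}{N_i}\,\lVert D^i(x) - D^i(G(s)) \rVert_1\right]
```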

Final loss
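
The final objectives combine the three terms; the paper sets λ_fm = 2 and λ_mel = 45, and the adversarial and feature matching terms are summed over all MPD and MSD sub-discriminators:

```latex
\mathcal{L}_G = \mathcal{L}_{Adv}(G;D) + \lambda_{fm}\,\mathcal{L}_{FM}(G;D) + \lambda_{mel}\,\mathcal{L}_{Mel}(G)
\mathcal{L}_D = \mathcal{L}_{Adv}(D;G)
```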

Experimental Results
- Improved audio quality and synthesis speed
- Outperforms the best publicly available models in terms of synthesis quality

