{"id":4414,"date":"2021-02-19T06:34:24","date_gmt":"2021-02-19T06:34:24","guid":{"rendered":"https:\/\/www.aiproblog.com\/index.php\/2021\/02\/19\/voice-cloning-corentins-improvisation-on-sv2tts\/"},"modified":"2021-02-19T06:34:24","modified_gmt":"2021-02-19T06:34:24","slug":"voice-cloning-corentins-improvisation-on-sv2tts","status":"publish","type":"post","link":"https:\/\/www.aiproblog.com\/index.php\/2021\/02\/19\/voice-cloning-corentins-improvisation-on-sv2tts\/","title":{"rendered":"Voice Cloning: Corentin&#8217;s Improvisation On SV2TTS"},"content":{"rendered":"<p>Author: Mahmud Hossain Farsim<\/p>\n<div>\n<p><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8570748477?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8570748477?profile=RESIZE_710x\" width=\"720\" class=\"align-full\"><\/a><\/p>\n<p><span style=\"font-weight: 400;\">Working in the audio production and engineering industry, I often wonder what the future of the voice talent market will look like with the assistance of artificial intelligence. The development of voice cloning technology started a while back, but it did not reach today&#8217;s level overnight. The debate over the misuse of this technology is another conversation and not the agenda of this article. However, I would like to say that there are more opportunities out there than fears. And just like any other technology, we need to know how it works, both to explore the avenues it opens and to detect abuse.<\/span><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Statistical Parametric Speech Synthesis (SPSS), WaveNet, and Tacotron played a huge role in the development of Text-To-Speech (TTS) models. 
In 2018, Google researchers published a paper called &#8220;Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis&#8221; (SV2TTS), which came to be regarded as the state-of-the-art algorithm for voice cloning. It describes a framework for zero-shot voice cloning that only requires 5 seconds of reference speech. The three stages of SV2TTS are a speaker encoder, a synthesizer, and a vocoder. However, no public implementation of this paper existed until the work of Corentin Jemine, a student at the University of Li\u00e8ge. Corentin wrote his Master&#8217;s thesis on SV2TTS, called &#8220;Real-time Voice Cloning&#8221;, with some improvisation of his own, and also implemented a user interface. His work later led to his side project called &#8220;Resemble&#8221;, which runs as a voice cloning solution for various platforms.\u00a0<\/span><\/p>\n<p><span style=\"text-decoration: underline;\"><b>Mathematical Intuition and Algorithm<\/b><\/span><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In a dataset of utterances, Corentin denotes by <b><i>u<sub>ij<\/sub><\/i><\/b> the <i>j<\/i>th utterance of the <i>i<\/i>th speaker, and by <b><i>x<sub>ij<\/sub><\/i><\/b> the log-mel spectrogram of the utterance <b><i>u<sub>ij<\/sub><\/i><\/b>. A log-mel spectrogram is a representation that extracts speech features from a waveform. 
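To make the log-mel representation concrete, here is a minimal NumPy sketch of computing a 40-channel log-mel spectrogram. The frame size, hop length, and mel count are illustrative choices, not the exact values of any particular implementation; real pipelines typically rely on an audio library instead.

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):                 # rising slope
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                # falling slope
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def log_mel_spectrogram(wav, sr=16000, n_fft=400, hop=160, n_mels=40):
    # Frame the waveform, window each frame, take the magnitude FFT,
    # apply the mel filterbank, then compress with a log.
    n_frames = 1 + (len(wav) - n_fft) // hop
    frames = np.stack([wav[i * hop: i * hop + n_fft] for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), n=n_fft, axis=1))
    mel = spec @ mel_filterbank(n_mels, n_fft, sr).T  # (n_frames, n_mels)
    return np.log(mel + 1e-6)
```

Each row of the result is the feature vector for one short frame of audio; a one-second clip at 16 kHz with these settings yields 98 frames of 40 mel channels.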
The encoder<\/span> <b><i>E<\/i><\/b> <span style=\"font-weight: 400;\">computes the embedding <b><i>e<sub>ij<\/sub><\/i><\/b> = <b><i>E<\/i><\/b>(<b><i>x<sub>ij<\/sub><\/i><\/b>; <b><i>w<sub>E<\/sub><\/i><\/b>) corresponding to the utterance <b><i>u<sub>ij<\/sub><\/i><\/b>, where <b><i>w<sub>E<\/sub><\/i><\/b> are the parameters of the encoder. A speaker embedding <b><i>c<sub>i<\/sub><\/i><\/b> is the centroid of the embeddings of the speaker\u2019s utterances. This is the mathematical model of the embedding:<\/span><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\"><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8569858084?profile=original\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8569858084?profile=RESIZE_710x\" class=\"align-center\" width=\"127\" height=\"70\"><\/a><\/span><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Now, the synthesizer <b><i>S<\/i><\/b> will try to model <b><i>x<sub>ij<\/sub><\/i><\/b> given <b><i>c<sub>i<\/sub><\/i><\/b> and<\/span> 
<b><i>t<sub>ij<\/sub><\/i><\/b> <span style=\"font-weight: 400;\">, the transcript of utterance <b><i>u<sub>ij<\/sub><\/i><\/b>. We have <b><i>x\u02c6<sub>ij<\/sub><\/i><\/b> = <b><i>S<\/i><\/b>(<b><i>c<sub>i<\/sub><\/i><\/b>, <b><i>t<sub>ij<\/sub><\/i><\/b>; <b><i>w<sub>S<\/sub><\/i><\/b>).\u00a0<\/span><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In Corentin&#8217;s implementation, he uses the utterance embedding rather than the speaker embedding, giving instead <b><i>x\u02c6<sub>ij<\/sub><\/i><\/b> = <b><i>S<\/i><\/b>(<b><i>e<sub>ij<\/sub><\/i><\/b>, <b><i>t<sub>ij<\/sub><\/i><\/b>; <b><i>w<sub>S<\/sub><\/i><\/b>).<\/span><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The vocoder <b><i>V<\/i><\/b> will approximate <b><i>u<sub>ij<\/sub><\/i><\/b> given <b><i>x\u02c6<sub>ij<\/sub><\/i><\/b>. We have <b><i>u\u02c6<sub>ij<\/sub><\/i><\/b> = <b><i>V<\/i><\/b>(<b><i>x\u02c6<sub>ij<\/sub><\/i><\/b>; <b><i>w<sub>V<\/sub><\/i><\/b>), where <b><i>w<sub>V<\/sub><\/i><\/b> are the parameters of the vocoder. 
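As a small sketch, the centroid speaker embedding described above can be computed from a set of utterance embeddings like this, assuming each embedding is a unit-norm vector produced by the encoder:

```python
import numpy as np

def speaker_embedding(utterance_embeddings):
    # Speaker embedding c_i: the centroid of the utterance embeddings
    # e_ij of one speaker, re-normalized to unit length.
    c = np.asarray(utterance_embeddings).mean(axis=0)
    return c / np.linalg.norm(c)
```

The re-normalization keeps the speaker embedding on the unit hypersphere, so it remains directly comparable to individual utterance embeddings with a cosine similarity.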
Training end to end would mean minimizing a loss in the waveform domain, where<\/span> <b><i>L<sub>V<\/sub><\/i><\/b> <span style=\"font-weight: 400;\">denotes the waveform-domain loss function:<\/span><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p style=\"text-align: center;\"><b><i>min<sub>w<sub>E<\/sub>, w<sub>S<\/sub>, w<sub>V<\/sub><\/sub> L<sub>V<\/sub>(u<sub>ij<\/sub>, V(S(E(x<sub>ij<\/sub>; w<sub>E<\/sub>), t<sub>ij<\/sub>; w<sub>S<\/sub>); w<sub>V<\/sub>))<\/i><\/b><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">However, Corentin uses a per-module objective that takes less time to train. For the synthesizer:<\/span><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p style=\"text-align: center;\"><b><i>min<sub>w<sub>S<\/sub><\/sub> L<sub>S<\/sub>(x<sub>ij<\/sub>, S(e<sub>ij<\/sub>, t<sub>ij<\/sub>; w<sub>S<\/sub>))<\/i><\/b><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">He suggests training the synthesizer and vocoder separately. Assuming a pre-trained encoder, the synthesizer can be trained to directly predict the mel spectrograms of the target audio. 
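The two per-module objectives can be sketched as plain distance functions. Squared error stands in here as an illustrative choice of distance; the actual losses differ (Tacotron-style synthesizers combine several spectrogram loss terms, and WaveRNN treats waveform samples as categorical and uses a cross-entropy loss):

```python
import numpy as np

def synthesizer_loss(mel_true, mel_pred):
    # L_S: distance between the ground-truth mel spectrogram x_ij and
    # the prediction S(e_ij, t_ij; w_S).
    return np.mean((mel_true - mel_pred) ** 2)

def vocoder_loss(wav_true, wav_pred):
    # L_V: distance in the waveform domain between u_ij and V(x_ij; w_V).
    return np.mean((wav_true - wav_pred) ** 2)
```

Splitting the objective this way is what lets each module be trained independently, given the outputs of an already-trained upstream module.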
Then the vocoder is trained directly on spectrograms, using either the ground-truth spectrograms or the ones generated by the synthesizer:\u00a0<\/span><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p style=\"text-align: center;\"><b><i>min<sub>w<sub>V<\/sub><\/sub> L<sub>V<\/sub>(u<sub>ij<\/sub>, V(x<sub>ij<\/sub>; w<sub>V<\/sub>))<\/i><\/b><\/p>\n<p style=\"text-align: center;\"><span style=\"font-weight: 400;\">or<\/span><\/p>\n<p style=\"text-align: center;\"><b><i>min<sub>w<sub>V<\/sub><\/sub> L<sub>V<\/sub>(u<sub>ij<\/sub>, V(x\u02c6<sub>ij<\/sub>; w<sub>V<\/sub>))<\/i><\/b><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Now the speaker encoder. This encoder produces embeddings that characterize the voice in an utterance. Since there is no ground truth for an embedding, its quality can only be assessed through the downstream models that consume it.<\/span><\/p>\n<p><span style=\"font-weight: 400;\"><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8569870256?profile=original\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8569870256?profile=RESIZE_710x\" class=\"align-center\" width=\"394\" height=\"192\"><\/a><\/span><\/p>\n<p style=\"text-align: center;\"><span style=\"font-weight: 400;\">The sequential three-stage training of SV2TTS. Source: Corentin&#8217;s thesis<\/span><\/p>\n<p><span style=\"font-weight: 400;\">With the GE2E loss function for the encoder, the three stages are trained separately. 
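The core of the GE2E loss, the similarity matrix with exclusive centroids, can be sketched as follows. The embeddings are assumed to be L2-normalized, and the scaling parameters `w` and `b` are learned in practice; the fixed values here are purely for illustration:

```python
import numpy as np

def ge2e_similarity(embeds, w=10.0, b=-5.0):
    # embeds: (N speakers, M utterances, D), L2-normalized, so that the
    # cosine similarity reduces to a dot product.
    # Returns S[i, j, k] = w * cos(e_ij, c_k) + b, where the centroid of
    # a speaker excludes e_ij itself when k == i (the exclusive centroid,
    # which removes the bias of comparing an utterance against itself).
    N, M, D = embeds.shape
    centroids = embeds.mean(axis=1)                    # (N, D), inclusive
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    sim = np.einsum("ijd,kd->ijk", embeds, centroids)  # (N, M, N)
    for i in range(N):
        for j in range(M):
            excl = (embeds[i].sum(axis=0) - embeds[i, j]) / (M - 1)
            excl /= np.linalg.norm(excl)
            sim[i, j, i] = embeds[i, j] @ excl
    return w * sim + b
```

The loss then applies a row-wise softmax over the speaker axis, pushing each utterance towards its own speaker's centroid and away from the others. The Python loop over `(i, j)` keeps the sketch readable; a training implementation would vectorize it, as Corentin stresses.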
The synthesizer needs the embeddings from a trained encoder, and when the vocoder is not trained on the ground truth, spectrograms from a trained synthesizer need to be fed into it.\u00a0<\/span><\/p>\n<p><b>Let&#8217;s take a look at the architecture and Corentin&#8217;s modifications to that three-stage pipeline.<\/b><\/p>\n<p><b><span style=\"text-decoration: underline;\">Speaker Encoder<\/span>:<\/b> <span style=\"font-weight: 400;\">Corentin built this encoder with LSTM layers of 256 units to make the training load lighter; in the original paper, Google trained their model for 50 million steps. The inputs are 40-channel log-mel spectrograms. Corentin improvised with a ReLU layer before the L2 normalization of the last layer&#8217;s output, a vector of 256 elements. The GE2E loss function optimizes the model over batches of embeddings<\/span> <b><i>e<sub>ij<\/sub><\/i><\/b> <b>(1 \u2264 <i>i<\/i> \u2264 <i>N<\/i>, 1 \u2264 <i>j<\/i> \u2264 <i>M<\/i>)<\/b> <span style=\"font-weight: 400;\">of <i>M<\/i> utterances of fixed duration from <i>N<\/i> speakers.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A two-by-two comparison of all embeddings<\/span> <b><i>e<sub>ij<\/sub><\/i><\/b> <span style=\"font-weight: 400;\">against every speaker embedding<\/span> <b><i>c<sub>k<\/sub><\/i><\/b> <b><i>(1 \u2264 k \u2264 N)<\/i><\/b> <span style=\"font-weight: 400;\">in the batch gives a similarity matrix:<\/span><\/p>\n<p style=\"text-align: center;\"><b><i>S<sub>ij,k<\/sub> = w \u00b7 cos(e<sub>ij<\/sub>, c<sub>k<\/sub>) + b = w \u00b7 e<sub>ij<\/sub> \u00b7 c<sub>k<\/sub> \/ ||c<sub>k<\/sub>||<sub>2<\/sub> + b<\/i><\/b><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">High similarity values are expected from an optimal model. The loss is computed with row-wise softmax losses over this matrix. When an utterance is compared against its own speaker&#8217;s embedding, it is excluded from the centroid of that speaker, to avoid a bias towards the speaker that is independent of the model&#8217;s accuracy. The similarity matrix then looks like this:<\/span><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\"><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8569879284?profile=original\" target=\"_blank\" rel=\"noopener\"><img decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8569879284?profile=RESIZE_710x\" width=\"411\" class=\"align-center\"><\/a><\/span><\/p>\n<p><span style=\"font-weight: 400;\"><b><i>c<sub>i<\/sub><sup>(\u2212j)<\/sup><\/i><\/b> is the exclusive centroid here. Utterances are sampled as 1.6-second segments from the longer recordings in the dataset; the encoder processes each segment individually, and the mean of the outputs is normalized to produce the utterance embedding. Corentin proposes to keep 1.6-second segments both for inference and training. He also stresses vectorizing all operations when computing the similarity matrix, for efficient and fast computation.\u00a0<\/span><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For the speaker encoder experiments, Corentin preprocessed the samples, removing silences from the utterances with the help of a Python voice-activity-detection package. He applied a binary dilation of width s + 1 to the voice-activity mask, where s is the maximum silence duration tolerated; s was 0.2 seconds in his experiment. 
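The silence-removal step can be sketched as follows. A simple energy threshold stands in for the voice-activity-detection package used in the thesis, and `np.convolve` implements the binary dilation of the voiced mask; the frame size and threshold are illustrative values:

```python
import numpy as np

def trim_long_silences(wav, sr=16000, frame_ms=20, max_silence_s=0.2,
                       threshold=0.01):
    # Split the waveform into short frames and flag each frame as voiced
    # using an energy threshold (a stand-in for a real VAD).
    frame_len = sr * frame_ms // 1000
    n_frames = len(wav) // frame_len
    wav = wav[: n_frames * frame_len]
    frames = wav.reshape(n_frames, frame_len)
    voiced = np.sqrt((frames ** 2).mean(axis=1)) > threshold
    # Dilate the voiced mask with a window of width s + 1, where s is the
    # maximum tolerated silence in frames, so short pauses are preserved
    # and only long stretches of silence are cut out.
    s = int(max_silence_s * 1000 / frame_ms)
    voiced = np.convolve(voiced.astype(int), np.ones(s + 1), mode="same") > 0
    return wav[np.repeat(voiced, frame_len)]
```

Long silent gaps are removed while pauses shorter than the dilation window survive, which keeps the prosody of natural speech intact.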
Finally, he normalized the audio waveform. He used the same datasets as the Google authors &#8211; LibriSpeech, VoxCeleb1, and VoxCeleb2 &#8211; except for an internal set that Corentin did not have access to. Together these datasets contain thousands of hours of speech from thousands of speakers, including celebrities, collected from both clean recordings and noisy sources such as YouTube.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">After training the speaker encoder for 1 million steps, during which the loss decreased slowly, Corentin examined how well the model clusters speakers. He used UMAP to project the embeddings, and the model separated genders and speakers linearly in the projected space.\u00a0<\/span><\/p>\n<p><span style=\"text-decoration: underline;\"><b>Synthesizer<\/b><\/span><b>:<\/b> <span style=\"font-weight: 400;\">The synthesizer is Tacotron 2 without WaveNet. It is a recurrent sequence-to-sequence model that predicts spectrograms from text. SV2TTS, however, modifies the process: in Tacotron, the input frames are passed through a bidirectional LSTM encoder; in SV2TTS, the speaker embedding is concatenated to every frame output by that encoder. Each concatenated vector then goes through two unidirectional LSTM layers and is projected to a single mel spectrogram frame.<\/span><\/p>\n<p><span style=\"font-weight: 400;\"><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8569881687?profile=original\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8569881687?profile=RESIZE_710x\" class=\"align-center\" width=\"405\" height=\"245\"><\/a><\/span><\/p>\n<p style=\"text-align: center;\"><span style=\"font-weight: 400;\">The Modified Tacotron. 
Source: Corentin&#8217;s thesis<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For text preprocessing, Corentin replaced abbreviations and numbers with their full textual form, forced all characters to ASCII, normalized whitespace, and made all characters lowercase. For training, he used the LibriSpeech dataset, which he expected to give the best voice cloning similarity on unseen speakers; the LibriTTS dataset, however, did not yield notable results. He also used the Montreal Forced Aligner to align transcripts with the audio, and the LogMMSE algorithm to reduce background noise in the synthesized spectrograms.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">As for the embeddings used to train the synthesizer, Corentin preferred the utterance embedding of the target utterance itself, instead of the speaker embedding proposed in the SV2TTS paper.\u00a0<\/span><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p style=\"text-align: center;\"><span style=\"font-weight: 400;\"><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8569884275?profile=original\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8569884275?profile=RESIZE_710x\" class=\"align-center\" width=\"430\" height=\"188\"><\/a><\/span><span style=\"font-weight: 400; font-size: 8pt;\">Encoder and decoder steps aligned (left). The predicted spectrogram is a smoother version of the ground-truth mel. Source: Corentin&#8217;s thesis<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The results were excellent, aside from pauses in some cases. 
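The text preprocessing can be sketched like this. The abbreviation table is a small illustrative subset, and number-to-word expansion (typically done with a dedicated package) is omitted:

```python
import re
import unicodedata

# A few illustrative expansions; the real cleaning pipeline uses fuller
# tables and also spells out numbers.
ABBREVIATIONS = {"mr.": "mister", "mrs.": "misess", "dr.": "doctor",
                 "no.": "number", "st.": "saint"}

def normalize_text(text):
    # 1) lowercase, 2) expand abbreviations, 3) force ASCII by dropping
    # accents and other non-ASCII marks, 4) collapse whitespace.
    text = text.lower()
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    return re.sub(r"\s+", " ", text).strip()
```

Normalizing like this shrinks the symbol vocabulary the synthesizer has to model, at the cost of discarding casing and punctuation cues.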
This happened with embeddings from slow-speaking speakers, and the limits Corentin placed on utterance duration in the dataset (1.6s &#8211; 11.25s) were also likely to cause those issues.\u00a0<\/span><\/p>\n<p><b><span style=\"text-decoration: underline;\">Vocoder<\/span>:<\/b> <span style=\"font-weight: 400;\">Corentin uses a PyTorch implementation of WaveRNN as the vocoder, improvised by GitHub user fatchord. The authors of SV2TTS reduced the computation time and overhead of sampling by implementing the sampling operation as a custom GPU kernel; Corentin did not do that, and instead used batched sampling with the &#8220;alternative&#8221; WaveRNN.\u00a0<\/span><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Corentin finds that for utterances shorter than 12.5 seconds, this alternative WaveRNN runs slower than real time, and that inference time depends strongly on the number of folds used in batched sampling. For the sparse WaveRNN model, he finds that multiplying a sparse matrix with a dense vector only matches the speed of a dense-only matrix multiplication at sparsity levels above 91%; below this threshold, the sparse tensors slow down the forward pass. He finds that a sparsity level of 96.4% would lower the real-time threshold to 7.86 seconds, and a level of 97.8% to 4.44 seconds.<\/span><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p><span style=\"text-decoration: underline;\"><b>Conclusion<\/b><\/span><\/p>\n<p><span style=\"font-weight: 400;\">Without a doubt, Corentin Jemine did a great job implementing the SV2TTS paper with some thoughts of his own. As his thesis was published more than a year ago, more recent developments have surely appeared since. 
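The folding step behind batched sampling can be sketched as follows; the segment length and overlap are illustrative values (the implementation also crossfades the overlapping regions when the generated audio segments are joined back together):

```python
import numpy as np

def fold(mel, target=200, overlap=25):
    # Split a long mel spectrogram (T, n_mels) into overlapping,
    # equal-length segments so that the vocoder can generate them all
    # in one batch. Neighboring segments share `overlap` frames, which
    # are crossfaded when the audio is unfolded again.
    step = target - overlap
    segs = [mel[start: start + target]
            for start in range(0, max(len(mel) - overlap, 1), step)]
    # Zero-pad the last segment to the common length for batching.
    segs[-1] = np.pad(segs[-1], ((0, target - len(segs[-1])), (0, 0)))
    return np.stack(segs)
```

More folds mean a larger batch and fewer sequential sampling steps, which is why the number of folds dominates inference time for an autoregressive vocoder like WaveRNN.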
But I consider his thesis a breakthrough in this technology, and I am sure it has helped shape the latest innovations in voice cloning. I encourage everyone to read his thesis at this link:<\/span> <a href=\"https:\/\/matheo.uliege.be\/handle\/2268.2\/6801\"><span style=\"font-weight: 400;\">https:\/\/matheo.uliege.be\/handle\/2268.2\/6801<\/span><\/a><\/p>\n<p><span style=\"font-size: 8pt;\">\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Corentin also made a user interface for this project, which he demonstrates in this YouTube video:<\/span> <a href=\"https:\/\/www.youtube.com\/watch?v=-O_hYhToKoA\"><span style=\"font-weight: 400;\">https:\/\/www.youtube.com\/watch?v=-O_hYhToKoA<\/span><\/a><\/p>\n<p><span style=\"font-size: 8pt;\"><a href=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8569887071?profile=original\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/storage.ning.com\/topology\/rest\/1.0\/file\/get\/8569887071?profile=RESIZE_710x\" class=\"align-center\" width=\"383\" height=\"225\"><\/a><\/span><\/p>\n<p style=\"text-align: center;\"><span style=\"font-weight: 400;\">SV2TTS Toolbox: the user interface by Corentin Jemine<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Corentin also mentioned in a YouTube comment that &#8220;Resemble&#8221;, a project of his that came after this thesis, produces better results than what he achieved in his experiment, and he invites everyone to use that instead. However, I particularly loved the improvisations he made on the original SV2TTS paper, which opened the door for innovation. 
For example, his speaker encoder model performed well, with an Equal Error Rate of<\/span> <b>4.5%<\/b><span style=\"font-weight: 400;\">, which is<\/span> <b>5.64<\/b> <span style=\"font-weight: 400;\">percentage points lower than the Google researchers&#8217;, who trained their model for 50 times more steps than Corentin&#8217;s!\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Needless to say, this technology will go beyond our imagination in the near future. New marketplaces will grow, and the media, entertainment, and information industries will act as key stakeholders in this technology. Voiceover talents will find new ways to get themselves heard. It is also only a matter of time before models that reverse-engineer and detect cloned voices show up. By wiping out &#8216;deep fakes&#8217;, hopefully those models will ease the tensions and clear the way for voice cloning technology to serve the greater good.\u00a0<\/span><\/p>\n<\/div>\n<p><a href=\"https:\/\/www.datasciencecentral.com\/xn\/detail\/6448529:BlogPost:1031177\">Go to Source<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Author: Mahmud Hossain Farsim Working with the audio production and engineering industry, I often wonder how the future of the voice talent market will look [&hellip;] <span class=\"read-more-link\"><a class=\"read-more\" href=\"https:\/\/www.aiproblog.com\/index.php\/2021\/02\/19\/voice-cloning-corentins-improvisation-on-sv2tts\/\">Read 
More<\/a><\/span><\/p>\n","protected":false},"author":1,"featured_media":461,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"footnotes":""},"categories":[26],"tags":[],"_links":{"self":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4414"}],"collection":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/comments?post=4414"}],"version-history":[{"count":0,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/posts\/4414\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media\/470"}],"wp:attachment":[{"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/media?parent=4414"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/categories?post=4414"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aiproblog.com\/index.php\/wp-json\/wp\/v2\/tags?post=4414"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}