xVASynth Shenmue character voice models (Update 29/10/22)

Shendu · Jul 14, 2022

xVASynth is an AI tool for generating high-quality speech, which is available on Steam/Nexus for free.

I plan on training some voice models from the Shenmue series for xVASynth. They may be useful for anyone who wants to add custom voiced lines to the game (provided you can re-import the audio, that is), or you could simply use them to play around with.

A basic rundown on how to use the software.

Installation: Copy and extract the voice model zip into the root directory of xVASynth.
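If you would rather script the installation step than drag files around, here is a minimal sketch. The function name and the example paths are made up for illustration; it simply unpacks the archive into whatever folder you point it at, on the assumption that the zip already contains the folder layout xVASynth expects.

```python
import zipfile
from pathlib import Path


def install_voice_model(zip_path: Path, xvasynth_root: Path) -> list[str]:
    """Extract a voice model archive into the xVASynth root folder.

    Returns the names of the archive entries so you can see where they landed.
    """
    if not xvasynth_root.is_dir():
        raise FileNotFoundError(f"xVASynth folder not found: {xvasynth_root}")
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(xvasynth_root)
        return archive.namelist()


# Example call (both paths are placeholders, adjust them to your own setup):
# install_voice_model(Path("ryo_hazuki.zip"), Path("C:/Games/xVASynth"))
```

Extracting by hand does exactly the same thing; the script is only handy if you are installing a lot of models at once.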

Update: Ren was a failure. Will have to retrain at some point.

Bug: models marked with a ' * ' have quality degradation issues when generating sentences in succession (might just be my system). As a temporary fix: generate audio for the desired model, type in a new sentence, change to another model and generate, then switch back.

Gui Zhang * added
Ine-san * added
Nozomi Harasaki * added
Mark Kimberly * added
Joy added
Shenhua added

NON-Shenmue:

Metal Gear Solid:
Colonel Campbell (Twin Snakes) added
Nastasha Romanenko (Twin Snakes) added
Naomi Hunter (MGS4) added
Raiden (MGRR) added
Sunny (MGRR) added

Dragon Ball:
Android 18 (Casual) added
Vegeta added

Resident Evil:
Jill Valentine (Resident Evil 3 1999) added
Leon S. Kennedy (RE4) added
Ingrid Hannigan (RE4) added

Mortal Kombat:
Cassie Cage added
Mileena added

Other:
ADA (Zone of the Enders) added
Madara Uchiha (Naruto) added
Balalaika (Black Lagoon) added
Alyx Vance (HL: A) added
Maleficent (Disney Infinity) added

Ryo Hazuki

Xiuying Hong

Ren (wip)

Gui Zhang *

Ine-san *

Nozomi Harasaki *

Mark Kimberly *

Joy

Shenhua

METAL GEAR SOLID:

Solid Snake (MGS1)

Colonel Campbell (MGS1)

Colonel Campbell (Twin Snakes)

Meryl Silverburgh (MGS1)

Mei Ling (MGS1)

Nastasha Romanenko (MGS1)

Nastasha Romanenko (Twin Snakes)

Naomi Hunter (MGS1)

Naomi Hunter (Twin Snakes)

Naomi Hunter (MGS4)

The Boss (Snake Eater)

Drebin 893 (MGS4)

Raiden (MGRR)

Sunny (MGRR)

DRAGON BALL:

Caulifla

Android 18

Android 18 (Casual)

Hercule

Chi-Chi

Chi-Chi (Calm)

Vegeta

LEGACY OF KAIN:

Kain

Umah

TEKKEN:

Nina Williams

Craig Marduk

RESIDENT EVIL:

Claire Redfield (Resident Evil 2 1998)

Jill Valentine (Resident Evil 3 1999)

Leon S. Kennedy (RE4)

Ingrid Hannigan (RE4)

MORTAL KOMBAT:

Cassie Cage

Mileena

OTHER:

Yuna (Final Fantasy)

Lara Croft (Tomb Raider II & III)

D-Tritus (Scrapland)

Elise Schwarzer (The Legend of Heroes)

ADA (Zone of the Enders)

Madara Uchiha (Naruto)

Balalaika (Black Lagoon)

Alyx Vance (HL: A)

Maleficent (Disney Infinity)

15.59 MB folder on MEGA

57 files and 11 subfolders

mega.nz

Contains some references to Discord users, in case that throws you off.

Shendu · Jul 14, 2022

If you have any issues with pronunciation, it's because you have to enable the words you want to use in something called CMUDict. It's basically a large dictionary of words that tells the model how to say things using phonemes, which are the individual sounds used to construct words.
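To give a rough idea of what "enabling a word" means, here is a toy sketch. The two pronunciations are real CMUdict entries, but the function, the `enabled` set, and the curly-brace marker are simplifications made up for this example and are not how xVASynth stores things internally.

```python
# Two real CMUdict entries. Everything else in this sketch is illustrative.
PRONUNCIATIONS = {
    "hello": "HH AH0 L OW1",
    "world": "W ER1 L D",
}


def to_phonemes(sentence: str, enabled: set[str]) -> str:
    """Swap enabled dictionary words for their phonemes; leave the rest as letters."""
    out = []
    for word in sentence.lower().split():
        if word in enabled and word in PRONUNCIATIONS:
            # Braces are just a marker used here to make phoneme spans visible.
            out.append("{" + PRONUNCIATIONS[word] + "}")
        else:
            out.append(word)
    return " ".join(out)


print(to_phonemes("hello world", enabled={"hello"}))
```

Only "hello" comes out as phonemes in this call, because "world" was not in the enabled set. That is the same difference you should see in the editor between a word shown in letter form and one shown in phoneme form.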

Mini-tutorial: How to enable words in CMUDict.

Image 1: Click on the 'ae' on the top-right of the main window.

Image 2: Click CMUDict on the left.

Image 3: Type the word you want in the search box at the top and search through the list until you find it, then enable it by ticking the box on the left.

Image 4: It shouldn't be displayed in letter form like this.

Image 5: It should look like this. Displayed in phoneme form.

Tips:

Use the little guide shown on the right of CMUDict to construct your own words, e.g. xiuying = SH UW1 IH0 NG. This sounds like 'shewing'.

Add spaces at the beginning and end of the sentence if you have any cutoff. This can also improve the sound of the sentence in general.

If a word sounds over or under-emphasized, adjust the phonemes of the letters by clicking on the right side of the pitch slider and dragging left or right to either shorten or lengthen it.

Alt + left-click a word to select the whole word for editing (e.g. 5.png).

Ctrl + A to select all words for a universal edit.

Try alternating between different vocoders (middle-right of the GUI), mostly Bespoke HiFi GAN and Quick-and-dirty. You may get better results.

Eishen · Jul 16, 2022

I'm very interested in this. Thank you very much for the tutorial. Can you imitate any famous voice? Even voice actors from some anime?

Shendu · Jul 16, 2022

If you can obtain a sufficient amount of relatively clean audio files for the character, then in theory, yes. You can train a model from pretty much anything.

Ideally, ten minutes plus of voice samples is good, but most of the models I've posted have less than ten minutes worth. The Nina Williams model for instance has only 1 minute and 33 seconds worth of voice samples. Xiuying has 4 minutes 50 seconds.

Gravelly voices and high-pitched voices can be difficult to get right, but I've been experimenting with them for a while and have had pretty decent success with various types of characters. Some of the models posted are the result of me experimenting with datasets of different sizes (small, large, etc.). Some sound good, some sound awful.

These are v2 models, and soon (hopefully) I will upgrade some of them to v3. I think v3 models have emotion control, but don't quote me on that.

Eishen · Jul 16, 2022

Awesome, thanks again, incredible what the technology is capable of these days.

Shendu · Jul 16, 2022

I agree. Let me know if you need any help.

Shendu · Jul 17, 2022

Would you all prefer to wait for me to complete the models, THEN upload them? Or would you like me to provide WIP Shenmue models to mess around with in the meantime, THEN release the full models? The quality likely won't be as good as the full model, though, and mileage may vary.

Training models is a slow process I'm afraid:

Stage 1 - 3: Gives you a basic model. (relatively quick)
Stage 4: Fine tuning to normalize speech patterns and make it less "pitchy" (SLOW)
Stage 5: HiFiGAN. Attempts to enhance the quality of the model. (usually quick)

It's just to give you something to play with really. So you don't have to wait so long.

The WIP model in question is Ryo, in case you're wondering.

Eishen · Jul 17, 2022

In my opinion you can do it as you see fit, both options are good.

I would like to try the program next week, if I can.

ShenSun · Jul 19, 2022

Awesome thread. I'm gonna look into this program when I get some spare time

Thanks for letting us all know @Shendu

Shendu · Jul 19, 2022

ShenSun said:
Awesome thread. I'm gonna look into this program when I get some spare time

Thanks for letting us all know @Shendu

Not a problem. Happy to help!

LemonHaze · Jul 19, 2022

I'm sure @Dewey has some nice stuff to mention about this. As far as I know we're doing as much as we can with the dialogue for the UE project and it's very likely that it could be built so that it would also work with the original HD release (without the UE mod).

Eishen · Oct 20, 2022

Hello, @Shendu, at last I can begin with this; I have the program installed. How can I create a voice? I suppose that I must have an .mp3 file with all the voice lines, but how do I load it into the program? I don't see any way to do it. Thanks for the help.

Seaman · Oct 20, 2022

So if I read correctly, this isn't restricted to English voices, right? Would it be too hard to train a Japanese voice?
Thank you very much for sharing this with us.
One last thing: looking ahead, where do you think this tech could take us? Leaving aside sci-fi-like predictions (I'd love to read some, though), what could the next stage be?
I'm gonna get Stevie Nicks.

Shendu · Oct 28, 2022

Eishen said:
Hello, @Shendu, at last I can begin with this; I have the program installed. How can I create a voice? I suppose that I must have an .mp3 file with all the voice lines, but how do I load it into the program? I don't see any way to do it. Thanks for the help.

Sorry for the delay. The xVASynth application is for text-to-speech. Training voices requires xVATrainer. They're both available on Steam for free.

Training a voice requires that the audio samples be converted to WAV at 22050 Hz (the trainer has a built-in tool for this).
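If you would rather convert files outside the trainer, the same thing can be sketched with ffmpeg. This assumes ffmpeg is installed and on your PATH, and the function names are made up for this example; it is an alternative to the built-in tool, not what the trainer does internally.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(src: Path, dst: Path, sample_rate: int = 22050) -> list[str]:
    """Build an ffmpeg call that resamples src and writes it to dst."""
    # -y overwrites dst if it already exists, -ar sets the output sample rate.
    return ["ffmpeg", "-y", "-i", str(src), "-ar", str(sample_rate), str(dst)]


def convert(src: Path, dst: Path) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg reports a failure."""
    subprocess.run(ffmpeg_command(src, dst), check=True)


# Example call (file names are placeholders):
# convert(Path("ryo_line_001.mp3"), Path("ryo_line_001.wav"))
```

ffmpeg picks the output format from the .wav extension, so no extra format flag is needed here.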

The audio samples need to be individual sentences, no longer than 10 seconds and no shorter than 1 second (again, a feature is built in for this purpose).
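To check a folder of clips against those limits before training, something like the following works for plain PCM WAV files. The function names are made up for this example, and the 1 and 10 second defaults are just the limits mentioned above. It also adds up the total length, which tells you how much audio your dataset really has.

```python
import wave
from pathlib import Path


def clip_seconds(path: Path) -> float:
    """Length of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_clips(folder: Path, shortest: float = 1.0, longest: float = 10.0):
    """Return (total_seconds, out_of_range) for every .wav file in folder.

    out_of_range is a list of (file name, seconds) for clips outside the limits.
    """
    total = 0.0
    out_of_range = []
    for path in sorted(folder.glob("*.wav")):
        seconds = clip_seconds(path)
        total += seconds
        if not shortest <= seconds <= longest:
            out_of_range.append((path.name, seconds))
    return total, out_of_range


# Example call (the folder name is a placeholder):
# total, bad = check_clips(Path("datasets/ryo/wavs"))
# print(f"{total / 60:.1f} minutes of audio, {len(bad)} clips need fixing")
```

Python's wave module only reads uncompressed WAV, so run this after converting, not on the original .mp3 files.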

Then, you need to type word-for-word what is said in each sentence. You can use the built-in auto-transcribe tool to automatically populate the text field. After that, go through each file and correct spelling, punctuation, etc.

You need decent hardware to be honest. It can take over a day to make one voice with a 3090.

Shendu · Oct 28, 2022

Seaman said:
So if I read correctly, this isn't restricted to English voices, right? Would it be too hard to train a Japanese voice?
Thank you very much for sharing this with us.
One last thing: looking ahead, where do you think this tech could take us? Leaving aside sci-fi-like predictions (I'd love to read some, though), what could the next stage be?
I'm gonna get Stevie Nicks.

The Trainer application has support for transcribing different languages, I believe, but as far as text-to-speech is concerned, that's still being worked on. I hear v3 voice models will be multilingual.

Current voice models can be pretty good, but they can sound unnatural and aren't very flexible (shouting, peaks and valleys in speech, etc.). As far as the near future is concerned, there could be issues with people deepfaking others' voices. On a personal level, to mitigate this somewhat, I choose to train only fictional characters' voices.

Eishen · Oct 28, 2022

No problem, thank you very much. I will get to it when I can, then.
