xVASynth Shenmue character voice models (Update 29/10/22)

xVASynth is an AI tool for generating high-quality speech, which is available on Steam/Nexus for free.

I plan on training some voice models from the Shenmue series for xVASynth. They may be useful for anyone who wants to add custom voiced lines to the game (provided you can re-import the audio, that is), or you could simply play around with them.

A basic rundown on how to use the software.

Installation: Copy and extract the voice model zip into the root directory of xVASynth.
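If you would rather script the installation step than extract by hand, here is a minimal Python sketch. It assumes the zip is laid out so that extracting it straight into the xVASynth root puts the files where they belong, as described above. The function name and the example paths are placeholders for illustration, not part of xVASynth.

```python
import zipfile
from pathlib import Path


def install_voice_model(model_zip: Path, xvasynth_root: Path) -> list[str]:
    """Extract a voice model zip into the xVASynth root folder.

    Returns the names of the extracted entries so they can be checked.
    """
    if not xvasynth_root.is_dir():
        raise FileNotFoundError(f"xVASynth folder not found: {xvasynth_root}")
    with zipfile.ZipFile(model_zip) as archive:
        archive.extractall(xvasynth_root)
        return archive.namelist()


# Example call (both paths are hypothetical; use your own locations):
# install_voice_model(Path("Downloads/joy.zip"), Path("C:/Games/xVASynth"))
```

Check the returned list if a model does not show up in the app afterwards; it tells you where the files actually landed.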


Update: Ren was a failure. Will have to retrain at some point.

Bug: models marked with a ' * ' suffer quality degradation when generating sentences in succession (might just be my system). As a temporary fix: generate audio for the desired model, type in a new sentence, change to another model and generate, then switch back.


Gui Zhang * added
Ine-san * added
Nozomi Harasaki * added
Mark Kimberly * added
Joy added
Shenhua added


NON-Shenmue:

Metal Gear Solid:
Colonel Campbell (Twin Snakes) added
Nastasha Romanenko (Twin Snakes) added

Naomi Hunter (MGS4) added
Raiden (MGRR) added
Sunny (MGRR) added

Dragon Ball:
Android 18 (Casual) added

Vegeta added

Resident Evil:

Jill Valentine (Resident Evil 3 1999) added
Leon S. Kennedy (RE4) added
Ingrid Hannigan (RE4) added

Mortal Kombat:

Cassie Cage added
Mileena added

Other:

ADA (Zone of the Enders) added
Madara Uchiha (Naruto) added
Balalaika (Black Lagoon) added
Alyx Vance (HL: A) added
Maleficent (Disney Infinity) added






Contains some references to Discord users, in case that throws you off.
 

Attachments

  • gui_zhang_sample.zip
    169.1 KB
  • ine-san_sample.zip
    369.1 KB
  • nozomi_harasaki_sample.zip
    178.7 KB
  • mark_kimberly_sample.zip
    136.9 KB
  • joy_sample.zip
    173.7 KB
  • shenhua_sample.zip
    417.7 KB
If you have any issues with pronunciation, it's because you have to enable the words you want to use in something called CMUDict. It's basically a large dictionary that tells the model how to say words using phonemes, which are the individual sounds used to construct words.


Mini-tutorial: How to enable words in CMUDict.

Image 1: Click on the 'ae' on the top-right of the main window.

Image 2: Click CMUDict on the left.

Image 3: Type the word you want in the search box at the top and search through the list until you find it, then enable it by ticking the box on the left.

Image 4: It shouldn't be displayed in letter form like this.

Image 5: It should look like this. Displayed in phoneme form.


Tips:

Use the little guide shown on the right of CMUDict to construct your own words, e.g. xiuying = SH UW1 IH0 NG, which sounds like 'shewing'.

Add spaces at the beginning and end of the sentence if you have any cutoff. This can also improve the sound of the sentence in general.

If a word sounds over- or under-emphasized, adjust its phonemes by clicking on the right side of the pitch slider and dragging left or right to shorten or lengthen them.

Alt + left-click a word to select the whole word for editing (e.g. 5.png).

Ctrl + A to select all words for a universal edit.

Try alternating between different vocoders (middle-right of the GUI), mostly Bespoke HiFi GAN and Quick-and-dirty. You may get better results.
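For the first tip, it can help to sanity-check a hand-written pronunciation before pasting it in. The sketch below checks a string such as SH UW1 IH0 NG against the standard 39-phoneme ARPAbet set used by CMUDict, where vowels carry a stress digit (0, 1 or 2) and consonants do not. The function name is made up for illustration, and xVASynth's own editor may accept more or fewer symbols than this, so treat it as a rough check only.

```python
# Standard CMUDict (ARPAbet) symbols. Vowels take a stress digit, consonants do not.
VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}
CONSONANTS = {
    "B", "CH", "D", "DH", "F", "G", "HH", "JH", "K", "L", "M", "N",
    "NG", "P", "R", "S", "SH", "T", "TH", "V", "W", "Y", "Z", "ZH",
}


def check_pronunciation(pronunciation: str) -> list[str]:
    """Split a pronunciation into phonemes and reject anything malformed."""
    tokens = pronunciation.split()
    if not tokens:
        raise ValueError("empty pronunciation")
    for token in tokens:
        # A trailing 0, 1 or 2 is the stress marker.
        if token[-1] in "012":
            base, has_stress = token[:-1], True
        else:
            base, has_stress = token, False
        if base in VOWELS and has_stress:
            continue
        if base in CONSONANTS and not has_stress:
            continue
        raise ValueError(f"not a valid phoneme: {token!r}")
    return tokens


# The custom word from the tip above passes the check.
print(check_pronunciation("SH UW1 IH0 NG"))
```

A vowel without a stress digit (UW instead of UW1) is rejected here, which is the usual slip when typing these by hand.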
 

Attachments

  • 1.png
    2 MB
  • 2.png
    474.6 KB
  • 3.png
    502.5 KB
  • 4.png
    1.6 MB
  • 5.png
    1.6 MB
I'm very interested in this. Thank you very much for the tutorial. Can you imitate any famous voice? Even voice actors from anime?
 
If you can obtain a sufficient amount of relatively clean audio files for the character, then in theory, yes. You can train a model from pretty much anything.

Ideally, ten minutes plus of voice samples is good, but most of the models I've posted have less than ten minutes' worth. The Nina Williams model, for instance, has only 1 minute and 33 seconds' worth of voice samples. Xiuying has 4 minutes 50 seconds.

Gravelly voices and high-pitched voices can be difficult to get right, but I've been experimenting for a while and have had pretty decent success with various types of characters. Some of the models posted are the result of me experimenting with datasets, small, large etc. Some sound good, some sound awful.

These are v2 models, and soon (hopefully) I will upgrade some of them to v3. I think v3 models have emotion control, but don't quote me on that. :D
 
Would you all prefer to wait for me to complete the models and THEN upload them? Or would you like me to provide WIP Shenmue models to mess around with in the meantime, and THEN release the full models? The quality likely won't be as good as the full model, though, and mileage may vary.

Training models is a slow process I'm afraid:

Stage 1 - 3: Gives you a basic model. (relatively quick)
Stage 4: Fine-tuning to normalize speech patterns and make it less "pitchy". (SLOW)
Stage 5: HiFiGAN. Attempts to enhance the quality of the model. (usually quick)

It's just to give you something to play with really, so you don't have to wait so long.

The WIP model in question is Ryo, in case you're wondering.
 
In my opinion you can do it as you see fit; both options are good. ;) I would like to try the program next week, if I can.
 
I'm sure @Dewey has some nice stuff to mention about this. As far as I know we're doing as much as we can with the dialogue for the UE project and it's very likely that it could be built so that it would also work with the original HD release (without the UE mod).
 
Hello, @Shendu, at last I can begin with this; I have the program installed. How can I create a voice? I suppose I must have an .mp3 file with all the voice lines, but how do I load it into the program? I don't see any option for it. Thanks for the help.

 
So if I read correctly, this isn't restricted to English voices, right? Would it be too hard to train a Japanese voice?
Thank you very much for sharing this with us.
One last thing: looking ahead, where do you think this tech could take us? Leaving sci-fi-like predictions aside (I'd love to read some, though), what could be the next stage?
I'm gonna get Stevie Nicks.
 
Sorry for the delay. The xVASynth application is for text-to-speech. Training voices requires xVATrainer. They're both available on Steam for free.

Training a voice requires that the audio samples be converted to WAV at 22050 Hz (the trainer has a built-in tool for this).

The audio samples need to be individual sentences, no longer than 10 seconds and no shorter than 1 second (again, a feature is built in for this purpose).

Then, you need to type word-for-word what is said in each sentence. You can use the built-in auto-transcribe tool to populate the text field automatically. After that, go through each file and correct spelling, punctuation etc.

You need decent hardware to be honest. It can take over a day to make one voice with a 3090.
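If you prepare clips outside the trainer, a quick check against the requirements above (WAV at 22050 Hz, between 1 and 10 seconds each) can save a failed run. This is only a rough sketch using Python's standard wave module, not anything taken from xVATrainer; the function name and the example folder are placeholders. Note that the wave module reads plain PCM WAV files only, so other encodings will raise an error instead of being checked.

```python
import wave
from pathlib import Path

TARGET_RATE = 22050   # Hz, as stated above
MIN_SECONDS = 1.0
MAX_SECONDS = 10.0


def check_clip(path: Path) -> list[str]:
    """Return a list of problems found in one WAV clip (empty if none)."""
    with wave.open(str(path), "rb") as clip:
        rate = clip.getframerate()
        seconds = clip.getnframes() / rate
    problems = []
    if rate != TARGET_RATE:
        problems.append(f"sample rate is {rate} Hz, expected {TARGET_RATE} Hz")
    if seconds < MIN_SECONDS:
        problems.append(f"too short ({seconds:.2f} s)")
    elif seconds > MAX_SECONDS:
        problems.append(f"too long ({seconds:.2f} s)")
    return problems


# Example (the folder name is hypothetical):
# for wav_path in sorted(Path("my_dataset/wavs").glob("*.wav")):
#     for problem in check_clip(wav_path):
#         print(f"{wav_path.name}: {problem}")
```

Anything it flags can then be fixed with the trainer's built-in tools before you start transcribing.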
 
The Trainer application has support for transcribing different languages, I believe, but as far as text-to-speech is concerned, that's still being worked on. I hear v3 voice models will be multilingual.

Current voice models can be pretty good, but they can sound unnatural and are not very flexible (shouting, peaks and valleys in speech etc.). As far as the near future is concerned, there could be issues with people deepfaking others' voices. On a personal level, to mitigate this somewhat, I choose to train only fictional characters' voices.
 