Merge pull request #6 from blaise-tk/main

Better readme
main
RVC-Boss 2 years ago committed by GitHub
commit 2078ad1177

@ -1,157 +1,118 @@
# GPT-SoVITS - Voice Conversion and Text-to-Speech WebUI
## Demo Video and Features
Check out our demo video in Chinese: [Bilibili Demo](https://www.bilibili.com/video/BV12g4y1m7Uw/)
Few-shot fine-tuning demo:
https://github.com/RVC-Boss/GPT-SoVITS/assets/129054828/05bee1fa-bdd8-4d85-9350-80c060ab47fb
### Features:
1. **Zero-shot TTS:** Input a 5-second vocal sample and get text-to-speech output without any training.
2. **Few-shot TTS:** Fine-tune the model with just 1 minute of training data. The fine-tuned model matches the speaker's voice with significantly better similarity and realism than zero-shot.
3. **Cross-lingual Support:** Run inference in a language different from that of the training dataset. English, Japanese, and Chinese are currently supported.
4. **WebUI Tools:** Integrated tools include vocal/accompaniment separation, automatic training set segmentation, Chinese ASR, and text labeling, which help beginners create their own training datasets and GPT/SoVITS models.
## Todo List
0. **High Priority:**
   - Localization in Japanese and English.
   - User guide.
1. **Features:**
   - Zero-shot voice conversion (5s) / few-shot voice conversion (1min).
   - TTS speaking speed control.
   - Enhanced TTS emotion control.
   - Experiment with changing SoVITS token inputs to a probability distribution over the vocabulary.
   - Improve English and Japanese text frontend.
   - Develop tiny and larger-sized TTS models.
   - Colab scripts.
   - Expand training dataset (2k -> 10k).
## Requirements (How to Install)
### Python and PyTorch Version
Tested with Python 3.9, PyTorch 2.0.1, and CUDA 11.
### Pip Packages
```bash
pip install torch numpy scipy tensorboard librosa==0.9.2 numba==0.56.4 pytorch-lightning gradio==3.14.0 ffmpeg-python onnxruntime tqdm==4.59.0 cn2an pypinyin pyopenjtalk g2p_en
```
### Additional Requirements
If you need Chinese ASR (supported by FunASR), install:
```bash
pip install modelscope torchaudio sentencepiece funasr
```
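Several packages in the pip command are pinned to exact versions, so a mismatch is a likely source of install problems. The sketch below is one way to compare the installed versions against those pins using only the standard library. The `PINNED` table and the `check_pins` helper are illustrative names, not part of GPT-SoVITS, and the pins are copied from the command above.

```python
from importlib import metadata

# Pins copied from the pip command in this README; unpinned packages are skipped.
PINNED = {
    "librosa": "0.9.2",
    "numba": "0.56.4",
    "gradio": "3.14.0",
    "tqdm": "4.59.0",
}


def check_pins(pins=PINNED):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins().items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every pinned package matches. It does not verify the PyTorch or CUDA setup.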
### FFmpeg
#### Ubuntu/Debian Users
```bash
sudo apt install ffmpeg
```
#### macOS Users
```bash
brew install ffmpeg
```
#### Windows Users
Download [ffmpeg.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe) and [ffprobe.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe) and place them in the GPT-SoVITS root.
### Pretrained Models
Download the pretrained models (GPT-SoVITS models, SSL feature model, and Chinese BERT model) from [GPT-SoVITS Models](https://huggingface.co/lj1995/GPT-SoVITS) and place them in `GPT_SoVITS\pretrained_models`.
For Chinese ASR (optional), download the following models and place them in `tools/damo_asr/models`:
- https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/files
- https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/files
- https://modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/files

![image](https://github.com/RVC-Boss/GPT-SoVITS/assets/129054828/aa376752-9f9d-4101-9a09-867bf4df6f6a)
For UVR5 (vocals/accompaniment separation and reverberation removal, optional), download the models you need from [UVR5 Weights](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/uvr5_weights) and place them in `tools/uvr5/uvr5_weights`.
## Dataset Format
The TTS annotation .list file format:
```
vocal_path|speaker_name|language|text
```
Example:
```
D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.
```
Language dictionary:
- 'zh': Chinese
- 'ja': Japanese
- 'en': English
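To make the format concrete, here is a minimal sketch of parsing and validating one annotation line in Python. The `Utterance` type and `parse_list_line` function are illustrative and not taken from the GPT-SoVITS code. Splitting at most three times, so that a `|` inside the text field survives, is a choice of this sketch rather than documented behavior.

```python
from typing import NamedTuple

# Language codes from the dictionary above.
LANGUAGES = {"zh": "Chinese", "ja": "Japanese", "en": "English"}


class Utterance(NamedTuple):
    vocal_path: str
    speaker_name: str
    language: str
    text: str


def parse_list_line(line: str) -> Utterance:
    """Parse one `vocal_path|speaker_name|language|text` line."""
    # Assumption of this sketch: split at most three times so that a "|"
    # inside the text field is kept as part of the text.
    parts = line.rstrip("\r\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}: {line!r}")
    entry = Utterance(*parts)
    if entry.language not in LANGUAGES:
        raise ValueError(f"unknown language code {entry.language!r}")
    return entry
```

For the example line above, `parse_list_line` yields `speaker_name == "xxx"`, `language == "en"`, and `text == "I like playing Genshin."`. A line with fewer than four fields or an unlisted language code raises `ValueError`.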
## Credits
Special thanks to the following projects and contributors:
- [ar-vits](https://github.com/innnky/ar-vits)
- [SoundStorm](https://github.com/yangdongchao/SoundStorm/tree/master/soundstorm/s1/AR)
- [vits](https://github.com/jaywalnut310/vits)
- [TransferTTS](https://github.com/hcy71o/TransferTTS/blob/master/models.py#L556)
- [Chinese Speech Pretrain](https://github.com/TencentGameMate/chinese_speech_pretrain)
- [contentvec](https://github.com/auspicious3000/contentvec/)
- [hifi-gan](https://github.com/jik876/hifi-gan)
- [Chinese-Roberta-WWM-Ext-Large](https://huggingface.co/hfl/chinese-roberta-wwm-ext-large)
- [fish-speech](https://github.com/fishaudio/fish-speech/blob/main/tools/llama/generate.py#L41)
- [ultimatevocalremovergui](https://github.com/Anjok07/ultimatevocalremovergui)
- [audio-slicer](https://github.com/openvpi/audio-slicer)
- [SubFix](https://github.com/cronrpc/SubFix)
- [Paraformer ASR (ModelScope)](https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch)
- [FFmpeg](https://github.com/FFmpeg/FFmpeg)
- [gradio](https://github.com/gradio-app/gradio)