# GPT-SoVITS - Voice Conversion and Text-to-Speech WebUI
## Demo Video and Features
Check out our demo video in Chinese: [Bilibili Demo](https://www.bilibili.com/video/BV12g4y1m7Uw/)

Few-shot fine-tuning demo:
https://github.com/RVC-Boss/GPT-SoVITS/assets/129054828/05bee1fa-bdd8-4d85-9350-80c060ab47fb
### Features:
1. **Zero-shot TTS:** Input a 5-second vocal sample and experience instant text-to-speech conversion (a sketch for preparing such a clip follows this list).
2. **Few-shot TTS:** Fine-tune the model with just 1 minute of training data for improved voice similarity and realism.
3. **Cross-lingual Support:** Inference in languages different from the training dataset, currently supporting English, Japanese, and Chinese.
4. **WebUI Tools:** Integrated tools include voice accompaniment separation, automatic training set segmentation, Chinese ASR, and text labeling, assisting beginners in creating training datasets and GPT/SoVITS models.
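
For example, a 5-second reference clip for zero-shot TTS can be trimmed from a longer recording with FFmpeg (a minimal sketch; `input.wav`, the start offset, and the mono downmix are illustrative placeholders, not requirements stated by the project):

```bash
# Hypothetical preprocessing: cut 5 seconds starting at 0:00 and downmix to mono.
ffmpeg -i input.wav -ss 0 -t 5 -ac 1 reference_5s.wav
```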
## Todo List
0. **High Priority:**
- Localization in Japanese and English.
- User guide.
1. **Features:**
- Zero-shot voice conversion (5s) / few-shot voice conversion (1min).
- TTS speaking speed control.
- Enhanced TTS emotion control.
- Experiment with changing SoVITS token inputs to probability distribution of vocabs.
- Improve English and Japanese text frontend.
- Develop tiny and larger-sized TTS models.
- Colab scripts.
- Expand training dataset (2k -> 10k).
## Requirements (How to Install)
### Python and PyTorch Version
Tested with Python 3.9, PyTorch 2.0.1, and CUDA 11.
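
To confirm your environment roughly matches, a quick check (assuming `torch` is already installed):

```bash
# Print the Python, PyTorch, and CUDA versions in the active environment.
python -c "import sys, torch; print(sys.version.split()[0], torch.__version__, torch.version.cuda)"
```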
### Pip Packages
```bash
pip install torch numpy scipy tensorboard librosa==0.9.2 numba==0.56.4 pytorch-lightning gradio==3.14.0 ffmpeg-python onnxruntime tqdm==4.59.0 cn2an pypinyin pyopenjtalk g2p_en
```
### Additional Requirements
If you need Chinese ASR (supported by FunASR), install:
```bash
pip install modelscope torchaudio sentencepiece funasr
```
### FFmpeg
#### Ubuntu/Debian Users
```bash
sudo apt install ffmpeg
```
#### MacOS Users
```bash
brew install ffmpeg
```
#### Windows Users
Download and place [ffmpeg.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe) and [ffprobe.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe) in the GPT-SoVITS root.
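
On any platform, you can verify that both binaries are reachable before launching the WebUI (run from the GPT-SoVITS root on Windows):

```bash
# Both commands should print version information; if either fails, FFmpeg is not installed or not on the PATH.
ffmpeg -version
ffprobe -version
```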
### Pretrained Models

Download pretrained models from [GPT-SoVITS Models](https://huggingface.co/lj1995/GPT-SoVITS) and place them in `GPT_SoVITS\pretrained_models`.

For Chinese ASR, download models from [Damo ASR Model](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/files), [Damo VAD Model](https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/files), and [Damo Punc Model](https://modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/files), then place them in `tools/damo_asr/models`.

For UVR5 (Vocals/Accompaniment Separation & Reverberation Removal), download models from [UVR5 Weights](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/uvr5_weights) and place them in `tools/uvr5/uvr5_weights`.
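
One way to fetch the Hugging Face checkpoints is a plain `git` clone (a sketch, assuming `git` and `git-lfs` are installed; the temporary directory name is arbitrary, and the same approach works for the UVR5 weights repo):

```bash
git lfs install
# Clone the model repo to a temporary folder, then copy its files into place.
git clone https://huggingface.co/lj1995/GPT-SoVITS pretrained_tmp
cp -r pretrained_tmp/* GPT_SoVITS/pretrained_models/
```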
## Dataset Format
The TTS annotation .list file format:
```
vocal_path|speaker_name|language|text
```
Example:
```
D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.
```
Language dictionary:
- 'zh': Chinese
- 'ja': Japanese
- 'en': English
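
Since each annotation line must contain exactly four pipe-separated fields, a quick sanity check can catch malformed lines early (a minimal sketch, not part of the project's tooling; `your_dataset.list` is a placeholder):

```bash
# Report any line that does not have exactly 4 pipe-separated fields.
awk -F'|' 'NF != 4 { printf "line %d is malformed: %s\n", NR, $0 }' your_dataset.list
```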
## Credits
Special thanks to the following projects and contributors:
- [ar-vits](https://github.com/innnky/ar-vits)
- [SoundStorm](https://github.com/yangdongchao/SoundStorm/tree/master/soundstorm/s1/AR)
- [vits](https://github.com/jaywalnut310/vits)
- [TransferTTS](https://github.com/hcy71o/TransferTTS/blob/master/models.py#L556)
- [Chinese Speech Pretrain](https://github.com/TencentGameMate/chinese_speech_pretrain)
- [contentvec](https://github.com/auspicious3000/contentvec/)
- [hifi-gan](https://github.com/jik876/hifi-gan)
- [Chinese-Roberta-WWM-Ext-Large](https://huggingface.co/hfl/chinese-roberta-wwm-ext-large)
- [fish-speech](https://github.com/fishaudio/fish-speech/blob/main/tools/llama/generate.py#L41)
- [ultimatevocalremovergui](https://github.com/Anjok07/ultimatevocalremovergui)
- [audio-slicer](https://github.com/openvpi/audio-slicer)
- [SubFix](https://github.com/cronrpc/SubFix)
- [Damo Paraformer ASR](https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch)
- [FFmpeg](https://github.com/FFmpeg/FFmpeg)
- [gradio](https://github.com/gradio-app/gradio)