Merge pull request #6 from blaise-tk/main

Better readme
2 years ago · 2078ad1177
parent d158fcb6fa bd63487c76
commit 2078ad1177
1 changed file with 77 additions and 116 deletions
--- a/README.md
+++ b/README.md
@@ -1,157 +1,118 @@
 # GPT-SoVITS - Voice Conversion and Text-to-Speech WebUI
-# demo video and features
+## Demo Video and Features
-demo video in Chinese: https://www.bilibili.com/video/BV12g4y1m7Uw/
+Check out our demo video in Chinese: [Bilibili Demo](https://www.bilibili.com/video/BV12g4y1m7Uw/)
 few shot fine tuning demo:
 https://github.com/RVC-Boss/GPT-SoVITS/assets/129054828/05bee1fa-bdd8-4d85-9350-80c060ab47fb
-features:
+### Features:
 1、input 5s vocal, zero shot TTS
 2、1min training dataset, fine tune (few shot TTS. The TTS model trained using few-shot techniques exhibits significantly better similarity and realism in the speaker's voice compared to zero-shot.)
 3、Cross lingual (inference another language that is different from the training dataset language), now support English, Japanese and Chinese
-4、This WebUI integrates tools such as voice accompaniment separation, automatic segmentation of training sets, Chinese ASR, text labeling, etc., to help beginners quickly create their own training datasets and GPT/SoVITS models.
+1. **Zero-shot TTS:** Input a 5-second vocal sample and experience instant text-to-speech conversion.
-# todolist
+2. **Few-shot TTS:** Fine-tune the model with just 1 minute of training data for improved voice similarity and realism.
-0、High priority: Localization in Japanese and English. User guide. 
+3. **Cross-lingual Support:** Inference in languages different from the training dataset, currently supporting English, Japanese, and Chinese.
-1、zero shot voice conversion(5s) /few shot voice converion(1min)
+4. **WebUI Tools:** Integrated tools include voice accompaniment separation, automatic training set segmentation, Chinese ASR, and text labeling, assisting beginners in creating training datasets and GPT/SoVITS models.
-2、TTS speaking speed control
+## Todo List
-3、more TTS emotion control
+0. **High Priority:**
   - Localization in Japanese and English.
   - User guide.
-4、experiment about change sovits token inputs to probability distribution of vocabs
+1. **Features:**
   - Zero-shot voice conversion (5s) / few-shot voice conversion (1min).
   - TTS speaking speed control.
   - Enhanced TTS emotion control.
   - Experiment with changing SoVITS token inputs to probability distribution of vocabs.
   - Improve English and Japanese text frontend.
   - Develop tiny and larger-sized TTS models.
   - Colab scripts.
   - Expand training dataset (2k -> 10k).
-5、better English and Japanese text frontend
+## Requirements (How to Install)
-6、tiny version and larger-sized TTS models
+### Python and PyTorch Version
-7、colab scripts
+Tested with Python 3.9, PyTorch 2.0.1, and CUDA 11.
-8、more training dataset(2k->10k)
+### Pip Packages
-# Requirments (How to install)
+```bash
 ## python and pytorch version
 py39+pytorch2.0.1+cu11 passed the test.
 ## pip packages
 pip install torch numpy scipy tensorboard librosa==0.9.2 numba==0.56.4 pytorch-lightning gradio==3.14.0 ffmpeg-python onnxruntime tqdm==4.59.0 cn2an pypinyin pyopenjtalk g2p_en
 ```
 ### Additional Requirements
-## additionally
+If you need Chinese ASR (supported by FunASR), install:
 If you need the Chinese ASR feature supported by funasr, you should
 ```bash
 pip install modelscope torchaudio sentencepiece funasr
 ```
 ### FFmpeg
-## You need ffmpeg.
+#### Ubuntu/Debian Users
 ### Ubuntu/Debian users
 ```bash
 sudo apt install ffmpeg
 ```
-### MacOS users
+
 #### MacOS Users
 ```bash
 brew install ffmpeg
 ```
 ### Windows users
 download and put them in the GPT-SoVITS root.
 - download [ffmpeg.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe)
 - download [ffprobe.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe)
 ## You need download some pretrained models
 ### pretrained GPT-SoVITS models/SSL feature model/Chinese BERT model
 put these files
 https://huggingface.co/lj1995/GPT-SoVITS
 to 
 GPT_SoVITS\pretrained_models
 ### Chinese ASR (Additionally)
 put these files
 https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/files
 https://modelscope.cn/models/damo/speech_fsmn_vad_zh-cn-16k-common-pytorch/files
 https://modelscope.cn/models/damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch/files
- to 
+#### Windows Users
-tools/damo_asr/models
+Download and place [ffmpeg.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffmpeg.exe) and [ffprobe.exe](https://huggingface.co/lj1995/VoiceConversionWebUI/blob/main/ffprobe.exe) in the GPT-SoVITS root.
- ![image](https://github.com/RVC-Boss/GPT-SoVITS/assets/129054828/aa376752-9f9d-4101-9a09-867bf4df6f6a)
+### Pretrained Models
-### UVR5 (Vocals/Accompaniment Separation & Reverberation Removal. Additionally) 
+Download pretrained models from [GPT-SoVITS Models](https://huggingface.co/lj1995/GPT-SoVITS) and place them in `GPT_SoVITS\pretrained_models`.
-put the models you need from 
+For Chinese ASR, download models from [Damo ASR Models](https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/files) and place them in `tools/damo_asr/models`.
-https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/uvr5_weights
+For UVR5 (Vocals/Accompaniment Separation & Reverberation Removal), download models from [UVR5 Weights](https://huggingface.co/lj1995/VoiceConversionWebUI/tree/main/uvr5_weights) and place them in `tools/uvr5/uvr5_weights`.
-to
+## Dataset Format
-tools/uvr5/uvr5_weights
+The TTS annotation .list file format:
-# dataset format
+```
-
+vocal_path|speaker_name|language|text
-The format of the TTS annotation .list file:
+```
 vocal path|speaker_name|language|text
 e.g. D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.
 language dictionary:
    'zh': Chinese
    "ja": Japanese
    'en': English
 # Credits
 https://github.com/innnky/ar-vits
 https://github.com/yangdongchao/SoundStorm/tree/master/soundstorm/s1/AR
 https://github.com/jaywalnut310/vits
 https://github.com/hcy71o/TransferTTS/blob/master/models.py#L556
 https://github.com/TencentGameMate/chinese_speech_pretrain
 https://github.com/auspicious3000/contentvec/
 https://github.com/jik876/hifi-gan
 https://huggingface.co/hfl/chinese-roberta-wwm-ext-large
 https://github.com/fishaudio/fish-speech/blob/main/tools/llama/generate.py#L41
 https://github.com/Anjok07/ultimatevocalremovergui
 https://github.com/openvpi/audio-slicer
 https://github.com/cronrpc/SubFix
 https://modelscope.cn/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch
 https://github.com/FFmpeg/FFmpeg
-https://github.com/gradio-app/gradio
+Example:
 ```
 D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.
 ```
 Language dictionary:
 - 'zh': Chinese
 - 'ja': Japanese
 - 'en': English
 ## Credits
 Special thanks to the following projects and contributors:
 - [ar-vits](https://github.com/innnky/ar-vits)
 - [SoundStorm](https://github.com/yangdongchao/SoundStorm/tree/master/soundstorm/s1/AR)
 - [vits](https://github.com/jaywalnut310/vits)
 - [TransferTTS](https://github.com/hcy71o/TransferTTS/blob/master/models.py#L556)
 - [Chinese Speech Pretrain](https://github.com/TencentGameMate/chinese_speech_pretrain)
 - [contentvec](https://github.com/auspicious3000/contentvec/)
 - [hifi-gan](https://github.com/jik876/hifi-gan)
 - [Chinese-Roberta-WWM-Ext-Large](https://huggingface.co/hfl/chinese-roberta-wwm-ext-large)
 - [fish-speech](https://github.com/fishaudio/fish-speech/blob/main/tools/llama/generate.py#L41)
 - [ultimatevocalremovergui](https://github.com/Anjok07/ultimatevocalremovergui)
 - [audio-slicer](https://github.com/openvpi/audio-slicer)
 - [SubFix](https://github.com/cronrpc/SubFix)
 - [FFmpeg](https://github.com/FFmpeg/FFmpeg)
 - [gradio](https://github.com/gradio-app/gradio)
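
The Dataset Format section of the updated README describes each line of the TTS annotation `.list` file as `vocal_path|speaker_name|language|text`, with `zh`, `ja`, and `en` as the language codes. A minimal sketch of reading one such line follows. The names `Annotation`, `LANGUAGES`, and `parse_list_line` are hypothetical and not part of GPT-SoVITS; the diff does not show how the project itself reads these files. Two behaviours are assumptions of this sketch: unknown language codes are rejected, and any `|` that appears after the third separator is kept as part of the text.

```python
from dataclasses import dataclass

# Language codes taken from the README's language dictionary.
LANGUAGES = {"zh": "Chinese", "ja": "Japanese", "en": "English"}


@dataclass
class Annotation:
    """One annotation line (hypothetical container, not a GPT-SoVITS type)."""
    vocal_path: str
    speaker_name: str
    language: str
    text: str


def parse_list_line(line: str) -> Annotation:
    """Parse one 'vocal_path|speaker_name|language|text' line."""
    # Split on the first three pipes only, so a '|' inside the text is kept.
    parts = line.rstrip("\r\n").split("|", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}: {line!r}")
    vocal_path, speaker_name, language, text = parts
    # Assumption of this sketch: only the three documented codes are accepted.
    if language not in LANGUAGES:
        raise ValueError(f"unknown language code {language!r}")
    return Annotation(vocal_path, speaker_name, language, text)


# The example line from the README.
example = parse_list_line(r"D:\GPT-SoVITS\xxx/xxx.wav|xxx|en|I like playing Genshin.")
print(example.speaker_name, example.language, example.text)
```

Limiting the split to three separators keeps the transcript intact if it ever contains a pipe character, at the cost of not detecting a line that has extra fields.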