diff --git a/GPT_SoVITS/f5_tts/model/backbones/README.md b/GPT_SoVITS/f5_tts/model/backbones/README.md new file mode 100644 index 0000000..155671e --- /dev/null +++ b/GPT_SoVITS/f5_tts/model/backbones/README.md @@ -0,0 +1,20 @@ +## Backbones quick introduction + + +### unett.py +- flat unet transformer +- structure same as in e2-tts & voicebox paper except using rotary pos emb +- update: allow possible abs pos emb & convnextv2 blocks for embedded text before concat + +### dit.py +- adaln-zero dit +- embedded timestep as condition +- concatted noised_input + masked_cond + embedded_text, linear proj in +- possible abs pos emb & convnextv2 blocks for embedded text before concat +- possible long skip connection (first layer to last layer) + +### mmdit.py +- sd3 structure +- timestep as condition +- left stream: text embedded and applied a abs pos emb +- right stream: masked_cond & noised_input concatted and with same conv pos emb as unett