Using the Text Encoder

Text encoding is the process of converting a prompt's words and phrases into numerical embeddings, and training it well is what lets the model interpret the image dataset accurately.

The CLIP model was originally trained on hundreds of millions of image–text pairs scraped from the web, and the words and phrases behind its tokens come from the images' accompanying metadata (captions and alt text).
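The two-step idea above, turning a prompt into token IDs and then into embedding vectors, can be sketched in miniature. This is a toy illustration, not CLIP's real tokenizer or weights; the vocabulary and vectors are made up for demonstration.

```python
import random

random.seed(0)

# Hypothetical tiny vocabulary; real CLIP uses a byte-pair-encoding
# vocabulary of roughly 49k tokens.
vocab = {"<unk>": 0, "a": 1, "photo": 2, "of": 3, "cat": 4}
embed_dim = 4

# One vector per token ID; in a real encoder these are learned weights.
embeddings = {
    i: [random.uniform(-1, 1) for _ in range(embed_dim)]
    for i in vocab.values()
}

def tokenize(prompt):
    """Map each word to its token ID, falling back to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in prompt.lower().split()]

def encode(prompt):
    """Look up the embedding vector for each token ID."""
    return [embeddings[tid] for tid in tokenize(prompt)]

ids = tokenize("a photo of dogs")      # "dogs" is out of vocabulary
vectors = encode("a photo of dogs")
```

Here `"dogs"` falls back to the unknown token, which is exactly the failure mode that better metadata (and fine-tuning) helps avoid.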

Fine-tuning the text encoder seems to produce the best results, especially with faces.

  • It generates images that are truer to the training data.
  • It improves prompt interpretability, i.e. the model can handle more complex prompts.

Overall, a fine-tuned text encoder improves the performance of your model, enabling it to generate more accurate and detailed images from richer prompts.
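Mechanically, fine-tuning the text encoder just means marking its parameters as trainable so optimizer updates reach them as well as the image model's. The sketch below is framework-free and the parameter names are illustrative, not a real training API.

```python
# Toy parameter store: each entry records a value and whether the
# optimizer may update it. By default only the image model trains.
params = {
    "unet.weight": {"value": 0.5, "trainable": True},
    "text_encoder.weight": {"value": 0.2, "trainable": False},
}

def enable_text_encoder_training(parameters):
    """Unfreeze every parameter belonging to the text encoder."""
    for name, p in parameters.items():
        if name.startswith("text_encoder."):
            p["trainable"] = True

def sgd_step(parameters, grads, lr=0.1):
    """Apply one gradient step; frozen parameters are skipped."""
    for name, p in parameters.items():
        if p["trainable"]:
            p["value"] -= lr * grads[name]

enable_text_encoder_training(params)
sgd_step(params, {"unet.weight": 1.0, "text_encoder.weight": 1.0})
```

After the step both components have moved, whereas with the text encoder left frozen only the image model's weight would change. Real training frameworks expose the same switch, typically as a flag that controls whether the text encoder's parameters are passed to the optimizer.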