Guide to Manually Captioning Your Dataset for LoRA Training on SDXL

Welcome to the world of AI model training! Today, we're diving into how to manually caption a dataset for training a LoRA model on SDXL.

This guide is tailored for intermediate users. All default settings were created with automated captioning in mind, so if you attempt manual captioning and the results are not what you expect, consider keeping the auto-captions generated when you upload your dataset.

Introduction

Image captioning in generative AI lets users disentangle the parts of a dataset they want to be able to prompt from the parts that are not crucial to training. These captions teach the AI model to understand and interpret the various elements within an image.

The Basics of Captioning for LoRA on SDXL

1. Understand the Caption Style

It can be helpful to think of captions as three separate elements that you may wish to utilize: trigger, class, and descriptors.

  • The trigger is a specific keyword used to activate the subject or style in model generations. The trigger should be a unique token that is not already recognized by the foundation model. You can test how unique a trigger is by prompting your candidate word with SDXL under "Foundational" in the model library and confirming that nothing discernible is generated.
  • A common way to create a unique trigger is to choose a memorable word and remove all of its vowels; a minimal helper for this appears in the sketch after this list. For example, "chicken" might become the unique trigger word "chkn".
  • Class describes any important subjects in an image, such as a man, woman, sword, or otherwise.
  • Descriptors include unique details like actions, colors, or emotions that are not present in all images. This can also include styles of art.
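
To make these pieces concrete, here is a minimal Python sketch of the vowel-stripping trick and of assembling a caption from a trigger, a class, and descriptors. The helper name and example words are illustrative only, not part of any tool:

    def make_trigger(word: str) -> str:
        """Strip the vowels from a memorable word to form a candidate trigger."""
        return "".join(ch for ch in word if ch.lower() not in "aeiou")

    trigger = make_trigger("cloudlike")  # -> "cldlk", one plausible source for the trigger used below
    subject_class = "airship"            # the class: what the subject is
    descriptors = "in the sky with red and blue patterns on it"  # details unique to this image

    print(f"a {trigger} {subject_class} {descriptors}")
    # a cldlk airship in the sky with red and blue patterns on it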

When embarking on advanced training, it is recommended to make note of any words, including trigger words, that appear in your captions so they can be reused later. Saving these words in the tags section of the model's page can be helpful.

2. Creating Effective Captions

Effective captions vary depending on your training goals. Using the image below as a reference, here are some rules of thumb for approaching captioning based on the goal of your training.


[Image: a steam-powered airship hovering over a misty, floating island, rendered in a 2010s cartoon style]

  • If the goal is to train the style of the image, a user might describe various aspects of the scene, including the airship, the clouds, and the houses on the ground below. In this type of caption, a unique trigger word is not needed. A caption might read:

    an airship flying through the clouds above islands and houses
  • If the goal is to train the airship as the subject of a LoRA model, a user typically needs fewer descriptive words. In this case, a trigger word is optional and can be used to indicate a particular kind of airship. It is less important to describe every detail in the scene and more important to describe details as they relate to the airship. It is also useful to include any words you may want to prompt with again, such as colors, since the model will build a stronger association with them through their inclusion. An example of this type of caption could be:

    a cldlk airship in the sky with a big yellow blob and red and blue patterns on it
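
Once you have captions like the ones above, they need to be stored where your trainer can find them. If you train locally, many LoRA training tools, kohya-ss sd-scripts among them, read each image's caption from a plain-text file that shares the image's basename; if you caption through a web uploader, this layout may not apply. A minimal sketch with hypothetical filenames:

    from pathlib import Path

    dataset = Path("dataset")
    dataset.mkdir(exist_ok=True)

    # Hypothetical image-to-caption mapping; substitute your own files.
    captions = {
        "airship_01.png": "a cldlk airship in the sky with a big yellow blob and red and blue patterns on it",
        "airship_02.png": "an airship flying through the clouds above islands and houses",
    }

    for image_name, caption in captions.items():
        # The caption file sits next to the image, with a .txt extension.
        (dataset / image_name).with_suffix(".txt").write_text(caption, encoding="utf-8")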

Captioning can be leveraged in nuanced and creative ways. These tips are meant to inspire; the general principles can be expanded upon for more advanced captioning.

3. Consistency is Key

Ensure consistency in the captioning approach across the dataset. This helps the AI model learn and apply concepts more effectively.
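
A low-effort way to enforce this is to scan your caption files and flag any that are missing the trigger word. A minimal sketch, assuming the sidecar .txt layout from the previous section and the hypothetical "cldlk" trigger:

    from pathlib import Path

    TRIGGER = "cldlk"  # hypothetical trigger word

    for caption_file in sorted(Path("dataset").glob("*.txt")):
        words = caption_file.read_text(encoding="utf-8").split()
        if TRIGGER not in words:
            print(f"{caption_file.name} is missing the trigger '{TRIGGER}'")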

Best Practices

- Adapt Based on Desired Results

The captioning approach should align with the type of model you wish to train. The text encoder can also have a heavy impact on the efficacy of captions, and it may be useful to increase the text encoder's learning rate if the captions are not being learned.
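
How you raise the text encoder's learning rate depends on your trainer, but under the hood it usually amounts to giving the text encoder its own optimizer parameter group. A rough PyTorch sketch, in which the two nn.Linear modules are stand-ins for the real LoRA adapter weights:

    import torch
    from torch import nn

    # Placeholders for the UNet and text-encoder LoRA parameters.
    unet_lora = nn.Linear(8, 8)
    text_encoder_lora = nn.Linear(8, 8)

    # Separate groups let you raise the text-encoder rate on its own
    # if your captions are not being learned.
    optimizer = torch.optim.AdamW([
        {"params": unet_lora.parameters(), "lr": 1e-4},
        {"params": text_encoder_lora.parameters(), "lr": 5e-5},
    ])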

- Work with the Foundational Model

One of the easiest ways to leverage captioning is to find where the foundational model (SDXL) already associates concepts, styles, and aesthetics, and use those in your captions and prompts. The foundational model is strong, and it can be easier to work with it than to fight against its training.
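
If you prefer to probe the foundational model locally rather than through the model library, the diffusers library can load the base SDXL checkpoint; the same approach works for checking that a candidate trigger word generates nothing discernible. A minimal sketch (assumes a CUDA GPU and the torch and diffusers packages):

    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    # Probe what SDXL already associates with a word, concept, or candidate trigger.
    image = pipe("cldlk", num_inference_steps=25).images[0]
    image.save("trigger_probe.png")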

- Clean Up Images

Before captioning, it’s advisable to clean up your images. Remove any unwanted elements like stray objects or watermarks.
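
Removing stray objects and watermarks is manual work, but basic normalization can be scripted. A small sketch using Pillow that converts every image to RGB and scales it so the shorter side matches SDXL's 1024-pixel training resolution (the folder path is hypothetical):

    from pathlib import Path
    from PIL import Image

    TARGET = 1024  # SDXL's native training resolution (shorter side)

    for path in Path("dataset").glob("*.png"):
        with Image.open(path) as src:
            img = src.convert("RGB")  # loads pixel data; drops alpha and unusual color modes
        scale = TARGET / min(img.size)
        img = img.resize((round(img.width * scale), round(img.height * scale)))
        img.save(path)  # overwrites in place; keep backups if unsure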

Conclusion

Captioning for LoRA training on SDXL might seem daunting at first, but with these simple steps, you can master the process. Remember, the goal is to create a diverse and accurate set of captions that help your AI model understand the world it is representing. Happy training!