Curating a Training Dataset

Curating a training dataset for a LoRA model, whether for a Character or Object model or for a general Style model, involves several key principles.

General Guidelines for All Models

1. Dataset Size and Scope:

Start with 5 to 15 images, and expand to 30 or more if needed. The dataset should be large enough for the model to learn the relevant patterns, but not so large that it becomes unwieldy.
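As a rough illustration of these ranges, the sketch below classifies a dataset by its image count. The function name and the wording of the messages are hypothetical, and the cut-offs simply mirror the numbers suggested above; they are guidance, not limits enforced by any tool.

```python
def assess_dataset_size(image_count: int) -> str:
    """Classify a dataset size against the rough ranges suggested above."""
    if image_count < 0:
        raise ValueError("image_count must be non-negative")
    if image_count < 5:
        # Below the suggested starting range of 5 to 15 images.
        return "too small: add images until you have at least 5"
    if image_count <= 15:
        return "good starting size"
    if image_count <= 30:
        # Assumption: anything up to 30 counts as a reasonable expansion.
        return "expanded: fine if the extra images add useful variety"
    return "large: check that every image still earns its place"


if __name__ == "__main__":
    for count in (3, 12, 24, 45):
        print(count, "->", assess_dataset_size(count))
```

A check like this is only a first filter; the quality and variety of the images matter more than the exact count.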

2. Consistency vs. Variety:

Maintain consistency in the aspects you wish to train, like subject or aesthetics, but introduce diversity for elements you don't want the model to specifically remember.

3. Image Quality:

High-resolution images are preferred. Square images work best and can be cropped and resized in the Scenario web app.
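If you prefer to square your images before uploading, the crop region can be worked out with simple arithmetic. The sketch below is one possible approach (a centered crop) and is not necessarily how the Scenario web app crops; the function name is hypothetical.

```python
def center_square_box(width: int, height: int) -> tuple[int, int, int, int]:
    """Return (left, top, right, bottom) for the largest centered square."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    side = min(width, height)       # the square is limited by the shorter edge
    left = (width - side) // 2      # trim equally from both sides
    top = (height - side) // 2      # trim equally from top and bottom
    return (left, top, left + side, top + side)


if __name__ == "__main__":
    # A 1920x1080 landscape image keeps its full height.
    print(center_square_box(1920, 1080))
    # An 800x1200 portrait image keeps its full width.
    print(center_square_box(800, 1200))
```

The returned tuple uses the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so it can be passed to that method directly if you use Pillow. A centered crop can cut off an off-center subject, so check each result by eye.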

4. Testing Different Datasets:

Experiment with various datasets to understand what works best for your specific needs.

5. Avoid Overfitting:

Avoid including too many near-identical images, which can make the model overly specialized.
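One way to spot images that are too similar is a simple perceptual hash. The sketch below uses an average hash compared by Hamming distance; this is a generic technique chosen for illustration, not a feature of the Scenario web app. It assumes each image has already been reduced to a small grayscale thumbnail (for example 8x8) and flattened into a list of brightness values of equal length. The function names and the 5-bit threshold are hypothetical.

```python
from itertools import combinations


def average_hash(pixels: list[int]) -> int:
    """Encode a thumbnail as bits: 1 where a pixel is brighter than the mean."""
    if not pixels:
        raise ValueError("pixels must not be empty")
    mean = sum(pixels) / len(pixels)
    bits = 0
    for value in pixels:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def find_near_duplicates(
    thumbnails: dict[str, list[int]], max_distance: int = 5
) -> list[tuple[str, str]]:
    """Return pairs of image names whose hashes differ by few bits."""
    hashes = {name: average_hash(px) for name, px in thumbnails.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(hashes), 2)
        if hamming_distance(hashes[a], hashes[b]) <= max_distance
    ]


if __name__ == "__main__":
    # Synthetic 8x8 thumbnails: two nearly identical, one inverted.
    thumbnails = {
        "hero_01.png": [0] * 32 + [255] * 32,
        "hero_02.png": [0] * 31 + [255] * 33,
        "hero_03.png": [255] * 32 + [0] * 32,
    }
    print(find_near_duplicates(thumbnails))
```

Flagged pairs are candidates for review rather than automatic removal: two images can hash alike and still differ in ways that matter for training.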

For Character or Object Models

1. Diverse Imagery:

Include images with various poses, lighting conditions, and body shots for flexibility and accuracy in outcomes. Focus on different angles, facial expressions, and body positions.

2. Contextual Variety:

Incorporate the character in multiple contexts and settings to help the model understand different scenarios.

For Style Models

1. Consistent Styling:

The dataset should uniformly represent the style you're training for. If it's an anime style, for example, use specific anime tagging systems for descriptions.

2. Focused Imagery:

Only include images that match the desired style. For instance, if you are training your own unique art style, only include images from your own body of work in that style.

3. Image Cleaning:

Remove unwanted elements like stray objects or signatures to keep images clean.

For more information on image formatting, follow this link.

  • Whichever aspects you want the model to retain, whether subject, composition, or aesthetics, should be consistent across the dataset. When you are just beginning, try to avoid aiming for too much variety in the output.
  • Any elements you do not want the AI to specifically remember should be represented as diversely as possible. For example, if you are training an illustration style and include too many images that contain elephants, the AI may assume that elephants are part of the desired output.
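The elephant example can be checked mechanically if you keep a short list of tags for each image. The sketch below reports incidental tags that appear in more than a chosen share of the dataset. The function name, the tag data, and the 40% threshold are all hypothetical; tags you deliberately keep consistent (such as the style itself) are passed in as `intended` and ignored.

```python
from collections import Counter


def overrepresented_tags(
    image_tags: dict[str, set[str]],
    intended: set[str],
    max_share: float = 0.4,
) -> dict[str, float]:
    """Return incidental tags present in more than max_share of the images."""
    if not image_tags:
        return {}
    total = len(image_tags)
    # Each image contributes at most once per tag because tags are sets.
    counts = Counter(tag for tags in image_tags.values() for tag in tags)
    return {
        tag: count / total
        for tag, count in counts.items()
        if tag not in intended and count / total > max_share
    }


if __name__ == "__main__":
    dataset = {
        "img_1.png": {"watercolor", "elephant"},
        "img_2.png": {"watercolor", "elephant"},
        "img_3.png": {"watercolor", "elephant", "tree"},
        "img_4.png": {"watercolor", "boat"},
        "img_5.png": {"watercolor", "tree"},
    }
    # "elephant" appears in 3 of 5 images, so it is flagged.
    print(overrepresented_tags(dataset, intended={"watercolor"}))
```

If a tag is flagged, either replace some of those images or add images with other subjects so the model does not tie that subject to the style.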

Building a good dataset is a craft, and it takes time to learn. We recommend checking out our tutorials, looking at the sample training sets we’ve included, and testing a few different datasets when you train a new generator.

By adhering to these guidelines, you can effectively curate a dataset that helps train a LoRA model tailored to your specific requirements, whether it's for specific characters or general styles. The process involves a balance between consistency in key elements and diversity in secondary aspects to ensure the model learns effectively without biases or narrow focus.