Composition Control for Advanced Users
The Functionality of Composition Control
Composition Control is built on a tool known within the generative AI community as ControlNet. ControlNet is a framework that can be incorporated into the image-generation pipeline to "direct" the output images by extracting structural data from input images.
What makes Composition Control uniquely effective is its ability to add new parameters to the image creation process. Each mode is designed to seek out specific visual data and comes equipped with a preprocessor that prepares any reference image for inference.
Simplifying Composition Control: A Video Game Analogy
Think of Composition Control as a powerful equipment upgrade in an adventure RPG. Without Composition Control, our game character (the user) can only interact with the game environment (output images) in basic ways, much like a character without any special abilities or items.
When a reference image is provided without Composition Control, the AI's interpretation varies with the influence you set for your image. Imagine it like adjusting the game's difficulty level: lowering the influence is like increasing the difficulty. The AI abstracts more, like a game that offers broader, less specific clues, and it takes more skill to communicate which details you want drawn from the reference image. At low influence, the AI works from general elements like the color palette and shapes in the image.
Raising the img2img influence is like lowering the game's difficulty level: the AI's output will closely resemble the image you've provided. This can be very powerful when you simply want to alter the style of an input image, but it can also be very limiting, depending on a user's goals.
Previously, this way of interpreting reference images made it challenging to reproduce specific aspects like poses, line work, shapes, and depth within an image. Using the earlier metaphor, it is much like how a basic game character build might lead to a player struggling with advanced challenges.
Leveling Up with Composition Control
But now, enter Composition Control, our game-changing equipment upgrade. It introduces preprocessors, which convert images into 'detectmaps'. A preprocessor detects specific features of an image (depth, edges, poses, etc.), and this information is turned into a detectmap that provides control over the output. In the game analogy, detectmaps are like different types of game maps, each revealing specific details of the game world.
This feature allows users to exert more precise control. Once the preprocessed image runs through the selected mode, which has been trained to identify very specific information from a particular kind of detectmap, the generator can produce far more nuanced outputs. It's like having an advanced power-up that lets your character interact with the game world in ways that were previously impossible, offering an enhanced and more controlled image generation experience.
These are examples of the preprocessed images that the model uses as reference in the image generation process and not a reflection of the final output.
To make the process of Composition Control easier to visualize, we’ve created two grids and run them through our basic modes. Our basic modes consist of Structure, Pose, Depth, Lines, and Segmentation. You will also see additional modes such as City and Interior; these are 'Advanced' modes that use a mix of models and preprocessors.
In this grid we’ve picked a varied range of images, to give better insight into a multitude of outputs.
In this grid we have only used images of real people, to show specifically how Character and Pose mode respond more accurately to images of humans. However, we have included this sample in all the modes we are testing, partly to suggest different ways to leverage a workflow.
Structure Mode is known in the community as a Canny map. Its preprocessor leverages a "Canny edge detector," a well-established edge-detection algorithm in the tech world. Structure Mode retains more detail from the original image than the other modes.
As you can see, Structure picks up a significant number of details.
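To make the edge-detection idea concrete, here is a simplified numpy-only sketch: a Sobel gradient magnitude followed by a threshold. The full Canny detector that Structure Mode's preprocessor uses adds Gaussian blurring, non-maximum suppression, and hysteresis thresholding on top of this, so treat this as an illustration of the principle, not the product's implementation.

```python
import numpy as np

def sobel_edges(img, threshold=0.25):
    """Simplified edge map: Sobel gradient magnitude + threshold.
    The real Canny detector adds Gaussian blur, non-maximum
    suppression, and hysteresis thresholding on top of this."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    # Convolve by summing shifted, weighted copies of the image
    for i in range(3):
        for j in range(3):
            patch = img[i:i + h - 2, j:j + w - 2]
            gx += kx[i, j] * patch
            gy += ky[i, j] * patch
    mag = np.hypot(gx, gy)
    # Keep pixels whose gradient is strong relative to the maximum
    return (mag / (mag.max() + 1e-8)) > threshold

# A toy image: white square on a black background
img = np.zeros((10, 10))
img[3:7, 3:7] = 1.0
edges = sobel_edges(img)
# Edge pixels appear only on the square's border, not its interior
```

This is why a detailed photograph yields a dense Structure map while a flat, low-contrast image yields a sparse one: only strong brightness transitions survive the threshold.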
The Depth Mode generates a depth map during the preprocessing stage. This is used to provide nuanced information which is then interpreted by the Depth Mode model in conjunction with a custom generator.
The sketch of a house is very obviously 2D, and as you can see, Depth Mode has not produced as much information for it.
Images of realistic looking people consistently get clear maps, as there is an obvious foreground and background.
As you can see from the images below, Pose Mode draws the most complete information from realistic images of people. However, that is only true for the input images: Pose Mode does understand those poses and can translate them onto stylized output characters. Note: Pose Mode typically does not rescale its poses, so for smaller characters it is recommended to use images of people with proportions similar to your ideal output.
In this case, there is an obvious lack of recognition of any poses, even in images of characters. This is because they are not full poses of real people. The only image of a real person is too zoomed in, and so the preprocessor has little context for a pose. Even so, it recognizes the human shape.
Alternatively, the grid with just images of real people is easily recognized by Pose Mode.
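Under the hood, a pose map is essentially a set of body keypoints joined into a stick figure, and Pose Mode keeps the reference's proportions as-is. One way to picture a rescaling workaround is to scale the keypoints yourself about an anchor joint before building a reference. The keypoint format and the `rescale_pose` helper below are purely illustrative assumptions, not part of the product:

```python
import numpy as np

def rescale_pose(keypoints, scale, anchor):
    """Scale pose keypoints about an anchor point (e.g. the hips).
    `keypoints` is an (N, 2) array of (x, y) positions. This helper
    is illustrative only -- Composition Control itself uses the
    reference's proportions unchanged."""
    pts = np.asarray(keypoints, dtype=float)
    anchor = np.asarray(anchor, dtype=float)
    return anchor + (pts - anchor) * scale

# Shrink a three-point "skeleton" to 50% around the hip at (50, 100)
skeleton = [(50, 40), (50, 100), (50, 160)]  # head, hip, feet
small = rescale_pose(skeleton, 0.5, (50, 100))
# The head-to-feet span halves while the hip stays put:
# [[50. 70.] [50. 100.] [50. 130.]]
```

In practice, the simpler route remains the one stated above: pick reference photos whose subjects already match your target proportions.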
The Lines model and preprocessor identify straight lines and corners in a given image. This means that images with very few straight lines will give very little information to Lines Mode. Lines is ideal for generating structures and other images where straight linework is of the utmost importance.
As you can see in this sample, the images with the most distinct straight lines and corners have the most complete mapping.
This is made even more obvious when you see that the preprocessor completely disregards the human subjects in this grid. Their lines are not straight enough, so it only picks up the background details that are relevant to its mapping.
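The intuition that only straight lines "register" can be sketched with a classic Hough transform: every point votes for the lines that could pass through it, and only genuinely collinear points pile their votes into one bin. (Line preprocessors in ControlNet-style pipelines are typically learned line-segment detectors such as M-LSD, so this is an illustration of the principle, not the product's implementation.)

```python
import numpy as np

def hough_lines(points, h, w, n_theta=180):
    """Minimal Hough transform: each point votes in (theta, rho)
    space; a strong peak means many points lie on one straight line."""
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    diag = int(np.hypot(h, w))
    acc = np.zeros((n_theta, 2 * diag + 1), dtype=int)
    for y, x in points:
        # rho = x*cos(theta) + y*sin(theta), quantized to integer bins
        rho = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[np.arange(n_theta), rho + diag] += 1
    return acc

# Twenty points along a horizontal line all vote into the same bin
line_pts = [(5, x) for x in range(20)]
peak = hough_lines(line_pts, 20, 20).max()
# peak == 20: every point agrees on one (theta, rho) line
```

Curved outlines, such as a human silhouette, spread their votes across many bins and never form a strong peak, which is exactly why the people in this grid vanish from the map.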
Segmentation Mode creates hard, distinct segments of color to identify the general shape of the main objects it can detect in a reference image. Because only the segmented shapes are retained from the reference, much of the output's detail will come from the style of the generator being used.
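The "hard, distinct segments of color" can be pictured with a toy sketch: snap every pixel to its nearest color in a small palette, collapsing noisy regions into flat blocks. Real segmentation preprocessors use a learned semantic model that assigns one color per object class, so this only illustrates the flat-color character of the resulting detectmap.

```python
import numpy as np

def segment_by_palette(img, palette):
    """Toy 'segmentation': snap every pixel to its nearest palette
    color, producing hard, flat regions. Real segmentation
    preprocessors assign colors by learned object class instead."""
    flat = img.reshape(-1, 3).astype(float)
    pal = np.asarray(palette, dtype=float)
    # Distance from every pixel to every palette color
    dists = np.linalg.norm(flat[:, None, :] - pal[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    return pal[labels].reshape(img.shape).astype(np.uint8)

# A noisy two-tone image collapses into exactly two flat segments
rng = np.random.default_rng(0)
img = np.zeros((4, 8, 3), dtype=int)
img[:, 4:] = 200
img = np.clip(img + rng.integers(-20, 20, img.shape), 0, 255).astype(np.uint8)
seg = segment_by_palette(img, [(0, 0, 0), (200, 200, 200)])
n_colors = len(np.unique(seg.reshape(-1, 3), axis=0))
# n_colors == 2: all texture is gone, only the segment shapes remain
```

All fine texture is discarded in the map, which is why the generator's own style fills in most of the final detail.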
Line Art Mode identifies the natural edges in an image, first converting them into a Line Art map and then coloring them in based on the style of the generator being used. It pays attention to shadow, color, and line art. This mode works both with images that are meant to look like traditional line art and with full-color images, as the preprocessor creates a map like the one you will see below.
Normal Map Mode
Normal Map Mode operates similarly to Depth Mode; the main difference is that Normal Map Mode also brings in additional surface and textural detail it perceives from the reference image. Users who work with normal maps can upload an image of their own normal map directly and turn off the mode's mapping step if they prefer.
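The relationship between the two modes can be sketched mathematically: a normal map can be derived from a depth map's gradient, which is why the two look related while the normal map carries extra surface-orientation detail. A minimal numpy sketch of that standard relation (illustrative, not the product's preprocessor):

```python
import numpy as np

def depth_to_normals(depth):
    """Derive per-pixel surface normals from a depth map.
    The normal at each pixel is (-dz/dx, -dz/dy, 1), normalized;
    flat regions face the camera, slopes tilt the normal."""
    dzdy, dzdx = np.gradient(depth.astype(float))
    normals = np.dstack((-dzdx + 0.0, -dzdy + 0.0,
                         np.ones_like(depth, dtype=float)))
    norm = np.linalg.norm(normals, axis=2, keepdims=True)
    return normals / norm

# A perfectly flat depth plane yields normals pointing at the camera
flat = np.full((4, 4), 5.0)
n = depth_to_normals(flat)
# Every normal's z-component is 1: the surface faces the viewer
```

A sloped or bumpy depth map would tilt the normals pixel by pixel, and that per-pixel orientation is exactly the "additional textural element" Normal Map Mode captures beyond plain depth.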
Scribble is exactly what it sounds like: it takes rough sketches and uses their visual information to create more complex outputs based on your generator's style. Scribble is best used on simple drawings; however, it is entirely possible to input any image into Scribble Mode, and it will be preprocessed with varying results.
Updated on: 23/05/2023