Researchers at Stanford University have developed ControlNet, a neural network structure that enhances Stable Diffusion’s image-to-image capabilities. The technology allows more precise control over AI image and video generation by telling the model which parts of an input image to preserve, resulting in more consistent outputs.
With Stable Diffusion’s traditional image-to-image pipeline, users have little control over which elements of the original image are kept and which are discarded. ControlNet addresses this by letting Stable Diffusion models take additional input conditions that tell the model how an image or video should be manipulated. By supplying a specific condition, such as an edge map or a human pose, users can steer the diffusion model toward outputs with specific characteristics.
ControlNet clones the weights of each Stable Diffusion block into a locked copy, which preserves the capabilities of the production-ready diffusion model, and a trainable copy, which is fine-tuned on small task-specific data sets. “No layer is trained from scratch. You are still fine-tuning. Your original model is safe,” the researchers explain. This makes training possible even on a GPU with only eight gigabytes of graphics memory.
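The locked/trainable split can be sketched with a toy numpy example. Dense matrices stand in for the real convolutional blocks, and all names here are illustrative, not the actual ControlNet code. The key idea is that the “zero convolutions” connecting the two copies are initialized to zero, so before any fine-tuning the combined block reproduces the frozen model exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for one Stable Diffusion encoder block:
# a frozen ("locked") weight matrix and its trainable copy.
W_locked = rng.normal(size=(16, 16))
W_trainable = W_locked.copy()       # initialized from the locked weights

# "Zero convolutions": projections initialized to all zeros.
zero_conv_in = np.zeros((16, 16))
zero_conv_out = np.zeros((16, 16))

def controlnet_block(x, condition):
    """Sketch of one locked/trainable block pair (dense layers stand in for convs)."""
    locked_out = W_locked @ x                           # frozen SD path
    ctrl_in = x + zero_conv_in @ condition              # inject the condition
    ctrl_out = zero_conv_out @ (W_trainable @ ctrl_in)  # trainable side branch
    return locked_out + ctrl_out

x = rng.normal(size=16)  # latent features
c = rng.normal(size=16)  # conditioning features (e.g. from an edge map)

# At initialization the zero convolutions output zeros, so the block
# behaves exactly like the original frozen model.
assert np.allclose(controlnet_block(x, c), W_locked @ x)
```

During fine-tuning, gradients flow into the trainable copy and the zero convolutions, gradually letting the condition influence the output, while the locked weights never change.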
The Stanford team has published a range of pre-trained ControlNet models that offer greater control over the image-to-image pipeline. These include models for edge and line detection, boundary detection, depth information, sketch processing, and human pose or semantic map detection. All ControlNet models can be used with Stable Diffusion, providing much better control over the generative AI.
One example of ControlNet’s practical use is the Canny Edge model, which runs an edge detection algorithm over an input image and uses the resulting Canny edge map to condition further diffusion-based image generation. The Scribble model, on the other hand, can turn quick doodles into photorealistic images. Another example is the Hough Line model, which uses a learning-based deep Hough transform to detect straight lines in an image and then generates new images that preserve that line structure.
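To illustrate the kind of conditioning image the Canny Edge model consumes, here is a minimal numpy sketch that extracts a binary edge map from a grayscale image. A simple Sobel gradient-magnitude detector stands in for the actual Canny algorithm (which additionally applies smoothing, non-maximum suppression, and hysteresis thresholding); the function name and threshold are illustrative:

```python
import numpy as np

def sobel_edge_map(img, threshold=0.25):
    """Approximate binary edge map from a grayscale image with values in [0, 1].

    A plain Sobel gradient-magnitude detector, used here as a simplified
    stand-in for the Canny detector that the Canny Edge ControlNet expects.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T  # vertical-gradient kernel
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):          # naive sliding-window convolution
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(patch * kx)
            gy[i, j] = np.sum(patch * ky)
    magnitude = np.hypot(gx, gy)
    # Keep pixels whose gradient exceeds a fraction of the strongest edge.
    return (magnitude > threshold * magnitude.max()).astype(np.uint8)

# Synthetic input: a bright square on a dark background.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
edges = sobel_edge_map(img)

# Edges appear along the square's border, not in its flat interior.
assert edges[7, 15] == 1 and edges[14, 14] == 0
```

In the real pipeline, an edge map like this is passed to the Canny Edge ControlNet alongside the text prompt, so the generated image keeps the contours of the original while the diffusion model fills in texture, color, and detail.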
ControlNet represents a major step forward in developing a highly configurable AI toolkit for creators, offering more precise control over image and video generation than ever before. With spatial consistency addressed, new advances in temporal consistency and AI cinema can be expected in the future.