On Wednesday, Stability AI released Stable Diffusion XL 1.0 (SDXL), its next-generation open weights AI image synthesis model. It can generate novel images from text descriptions and produces more detail and higher-resolution imagery than previous versions of Stable Diffusion.
As with Stable Diffusion 1.4, which made waves last August with an open source release, anyone with the proper hardware and technical know-how can download the SDXL files and run the model locally on their own machine for free.
Local operation means that there is no need to pay for access to the SDXL model, there are few censorship concerns, and the weights files (which contain the neural network data that makes the model function) can be fine-tuned to generate specific types of imagery by hobbyists in the future.
For example, with Stable Diffusion 1.5, the default model (trained on a scrape of images downloaded from the Internet) can generate a broad range of imagery, but it doesn’t perform as well with more niche subjects. To make up for that, hobbyists fine-tuned SD 1.5 into custom models (and later, LoRA models) that improved Stable Diffusion’s ability to generate certain aesthetics, including Disney-style art, anime art, landscapes, bespoke pornography, images of famous actors or characters, and more. Stability AI expects that community-driven development trend to continue with SDXL, allowing people to extend its rendering capabilities far beyond the base model.
Upgrades under the hood
Like other latent diffusion image generators, SDXL starts with random noise and “recognizes” images in the noise based on guidance from a text prompt, refining the image step by step. But SDXL utilizes a “three times larger UNet backbone,” according to Stability, with more model parameters to pull off its tricks than earlier Stable Diffusion models. In plain language, that means the SDXL architecture does more processing to get the resulting image.
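To make the "start with noise, refine step by step" idea concrete, here is a toy sketch in Python. It is not SDXL's sampler. In a real diffusion model, a neural network predicts what noise to remove at each step, guided by the text prompt; in this sketch a fixed `target` list stands in for that guidance, and the function name and update rule are assumptions chosen purely for illustration.

```python
import random


def toy_denoise(target, steps=30, seed=0):
    """Start from random noise and close part of the gap to `target` each step."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # pure noise to begin with
    history = []
    for step in range(steps):
        # Remove a share of the remaining gap. The share grows each step
        # and reaches 1.0 on the last one, so the loop ends on the target.
        share = 1.0 / (steps - step)
        x = [xi + share * (ti - xi) for xi, ti in zip(x, target)]
        # Track the largest remaining deviation from the target.
        history.append(max(abs(xi - ti) for xi, ti in zip(x, target)))
    return x, history


result, history = toy_denoise([0.2, -0.5, 0.9, 0.0])
print(f"largest error after first step: {history[0]:.4f}")
print(f"largest error after last step:  {history[-1]:.4f}")
```

The point of the toy is only the shape of the loop: many small corrections applied to noise, with the error shrinking as the steps proceed.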
To generate images, SDXL utilizes an “ensemble of experts” architecture that guides a latent diffusion process. Ensemble of experts refers to a methodology where an initial single model is trained and then split into specialized models that are specifically trained for different stages of the generation process, which improves image quality. In this case, there is a base SDXL model and an optional “refiner” model that can run after the initial generation to make images look better.
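For readers who want to try the base-plus-refiner handoff, the following is a minimal sketch that assumes the Hugging Face diffusers library (not mentioned by Stability in this announcement), the model IDs published on Hugging Face, and a CUDA-capable GPU. The `split_steps` helper and the 0.8 handoff fraction are hypothetical choices for illustration, and the step split it reports is only an approximation of what the library schedules internally.

```python
BASE_ID = "stabilityai/stable-diffusion-xl-base-1.0"
REFINER_ID = "stabilityai/stable-diffusion-xl-refiner-1.0"


def split_steps(total_steps, handoff):
    """Roughly how many denoising steps go to the base model vs. the refiner."""
    base_steps = round(total_steps * handoff)
    return base_steps, total_steps - base_steps


def generate(prompt, total_steps=40, handoff=0.8):
    # Heavy imports live inside the function so the helper above can be
    # used without a GPU or the model weights.
    import torch
    from diffusers import DiffusionPipeline

    base = DiffusionPipeline.from_pretrained(
        BASE_ID, torch_dtype=torch.float16, variant="fp16", use_safetensors=True
    ).to("cuda")
    # The refiner reuses the base model's second text encoder and VAE.
    refiner = DiffusionPipeline.from_pretrained(
        REFINER_ID,
        text_encoder_2=base.text_encoder_2,
        vae=base.vae,
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True,
    ).to("cuda")

    # The base model handles the early, high-noise part and returns latents.
    latents = base(
        prompt=prompt,
        num_inference_steps=total_steps,
        denoising_end=handoff,
        output_type="latent",
    ).images
    # The refiner picks up where the base stopped and finishes the image.
    return refiner(
        prompt=prompt,
        num_inference_steps=total_steps,
        denoising_start=handoff,
        image=latents,
    ).images[0]


print(split_steps(40, 0.8))
```

Calling `generate("a lighthouse at dusk")` would download several gigabytes of weights on first use. Because the refiner is optional, skipping the second stage and decoding the base model's output directly is also a valid way to run SDXL.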
Notably, SDXL also uses two different text encoders that make sense of the written prompt, helping to pinpoint associated imagery encoded in the model weights. Users can provide a different prompt to each encoder, resulting in novel, high-quality concept combinations. On Twitter, Xander Steenbrugge showed an example of an elephant combined with an octopus using this technique.
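As a rough sketch of what sending a different prompt to each encoder can look like in practice, the snippet below again assumes the Hugging Face diffusers library, whose SDXL pipeline accepts a `prompt_2` argument for the second text encoder. The `dual_prompt_kwargs` helper and the example prompts are hypothetical; they are not the prompts used in the Twitter example.

```python
def dual_prompt_kwargs(prompt_a, prompt_b, size=1024):
    """Build call arguments that give each SDXL text encoder its own prompt."""
    return {
        "prompt": prompt_a,    # sent to the first text encoder
        "prompt_2": prompt_b,  # sent to the second text encoder
        "width": size,
        "height": size,
    }


def generate(prompt_a, prompt_b):
    # Imported lazily so the helper above works without a GPU or weights.
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True,
    ).to("cuda")
    return pipe(**dual_prompt_kwargs(prompt_a, prompt_b)).images[0]


kwargs = dual_prompt_kwargs("a photo of an elephant", "a photo of an octopus")
print(kwargs["prompt"], "|", kwargs["prompt_2"])
```

If `prompt_2` is left out, the library sends the same prompt to both encoders, which matches ordinary single-prompt use.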
And then there are improvements in image detail and size. While Stable Diffusion 1.5 was trained on 512×512 pixel images (making that its optimal generation size, though one that lacks detail for small features), Stable Diffusion 2.x increased that to 768×768. Now, Stability AI recommends generating 1024×1024 pixel images with Stable Diffusion XL, resulting in greater detail than an image of similar size generated by SD 1.5.
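The jump in resolution is larger than the side lengths suggest, because pixel count grows with the square of the side. A short Python calculation using the three sizes mentioned above (the variable and function names are just for illustration):

```python
# Recommended square image sizes, in pixels per side, as cited in the article.
NATIVE_SIDE = {"SD 1.5": 512, "SD 2.x": 768, "SDXL 1.0": 1024}


def pixel_count(side):
    """Total pixels in a square image with the given side length."""
    return side * side


baseline = pixel_count(NATIVE_SIDE["SD 1.5"])
for name, side in NATIVE_SIDE.items():
    pixels = pixel_count(side)
    print(f"{name}: {side}x{side} = {pixels:,} pixels ({pixels / baseline:.2f}x SD 1.5)")
```

In other words, a 1024×1024 image holds four times as many pixels as a 512×512 image, and a 768×768 image holds 2.25 times as many.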