The big breakthrough behind the new models is in the way images get generated. The first version of DALL-E used an extension of the technology behind OpenAI’s language model GPT-3, producing images by predicting the next pixel in an image as if pixels were words in a sentence. This worked, but not well. “It was not a magical experience,” says Altman. “It’s amazing that it worked at all.”
Instead, DALL-E 2 uses something called a diffusion model. Diffusion models are neural networks trained to clean images up by removing pixelated noise that the training process adds. The process involves taking images and changing a few pixels in them at a time, over many steps, until the original images are erased and you’re left with nothing but random pixels. “If you do this a thousand times, eventually the image looks like you have plucked the antenna cable from your TV set—it’s just snow,” says Björn Ommer, who works on generative AI at the University of Munich in Germany and who helped build the diffusion model that now powers Stable Diffusion.
The neural network is then trained to reverse that process and predict what the less pixelated version of a given image would look like. The upshot is that if you give a diffusion model a mess of pixels, it will try to generate something a little cleaner. Plug the cleaned-up image back in, and the model will produce something cleaner still. Do this enough times and the model can take you all the way from TV snow to a high-resolution picture.
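To make the two halves of that loop concrete, here is a toy sketch in Python. It is an illustration under heavy assumptions, not how any real model is built: the "image" is a flat list of 64 brightness values, the noising step simply overwrites a few pixels with random values, and `toy_denoise_step` is a hypothetical stand-in for the trained neural network. It cheats by looking at the clean image, which a real model never sees; a real network has to learn that guess from training data.

```python
import random

def add_noise_step(pixels, n_changes, rng):
    """Return a copy of the image with a few randomly chosen pixels overwritten."""
    noisy = list(pixels)
    for _ in range(n_changes):
        i = rng.randrange(len(noisy))
        noisy[i] = rng.random()  # replace the pixel with a random brightness
    return noisy

def forward_process(pixels, steps, n_changes, rng):
    """Apply the noising step many times, keeping every intermediate image."""
    trajectory = [list(pixels)]
    for _ in range(steps):
        trajectory.append(add_noise_step(trajectory[-1], n_changes, rng))
    return trajectory

def toy_denoise_step(noisy, target, strength=0.1):
    """Hypothetical stand-in for the trained network.

    It nudges each pixel a little toward a known clean image. A real
    diffusion model has no access to `target` and must predict it.
    """
    return [p + strength * (t - p) for p, t in zip(noisy, target)]

def mean_abs_error(a, b):
    """Average per-pixel difference between two images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
clean = [i / 63 for i in range(64)]  # an 8x8 gradient, flattened

# Forward: a thousand small corruptions leave nothing but "TV snow".
snow = forward_process(clean, steps=1000, n_changes=4, rng=rng)[-1]

# Reverse: feed the slightly cleaner image back in, over and over.
restored = snow
for _ in range(50):
    restored = toy_denoise_step(restored, clean)

print("error of the snow image:    ", round(mean_abs_error(snow, clean), 4))
print("error after 50 clean-up steps:", round(mean_abs_error(restored, clean), 4))
```

The point of the sketch is the shape of the loop: each step makes only a small improvement, and the picture emerges from repeating it.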
AI art generators never work exactly how you want them to. They often produce hideous results that can resemble distorted stock art, at best. In my experience, the only way to really make the work look good is to add a descriptor at the end with a style that looks aesthetically pleasing.
~Erik Carter
The trick with text-to-image models is that this process is guided by a language model that tries to match the prompt to the images the diffusion model is producing. This pushes the diffusion model toward images that the language model considers a good match.
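The steering idea can be sketched in a few lines of Python, with the caveat that everything here is a simplification invented for illustration. The "prompt" is reduced to a single number (how bright the image should be), `match_score` is a hypothetical stand-in for the model that judges how well an image fits the prompt, and `propose` is a hypothetical stand-in for one denoising step. Production systems steer the denoiser in more sophisticated ways than the best-of-several selection used here.

```python
import random

def match_score(pixels, wanted_brightness):
    """Hypothetical matcher: higher means the image fits the 'prompt' better."""
    mean = sum(pixels) / len(pixels)
    return -abs(mean - wanted_brightness)

def propose(pixels, rng, scale=0.05):
    """Hypothetical stand-in for one denoising step: a small random change."""
    return [min(1.0, max(0.0, p + rng.uniform(-scale, scale))) for p in pixels]

def guided_step(pixels, wanted_brightness, rng, k=8):
    """Generate k candidate next images and keep the one the matcher prefers."""
    candidates = [propose(pixels, rng) for _ in range(k)]
    candidates.append(pixels)  # never accept a worse match than the current image
    return max(candidates, key=lambda c: match_score(c, wanted_brightness))

rng = random.Random(1)
image = [rng.random() for _ in range(64)]  # start from random pixels
start_score = match_score(image, 0.9)

for _ in range(200):
    image = guided_step(image, 0.9, rng)

end_score = match_score(image, 0.9)
print("match before guidance:", round(start_score, 4))
print("match after guidance: ", round(end_score, 4))
```

Without the matcher, the random changes would wander aimlessly; with it, each step drifts toward whatever the scoring model rewards.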
But the models aren’t pulling the links between text and images out of thin air. Most text-to-image models today are trained on a large data set called LAION, which contains billions of pairings of text and images scraped from the internet. This means that the images you get from a text-to-image model are a distillation of the world as it’s represented online, distorted by prejudice (and pornography).
One last thing: there’s a small but crucial difference between the two most popular models, DALL-E 2 and Stable Diffusion. DALL-E 2’s diffusion model works on full-size images. Stable Diffusion, on the other hand, uses a technique called latent diffusion, invented by Ommer and his colleagues. It works on compressed versions of images encoded within the neural network in what’s known as a latent space, where only the essential features of an image are retained.
This means Stable Diffusion requires less computing muscle to work. Unlike DALL-E 2, which runs on OpenAI’s powerful servers, Stable Diffusion can run on (good) personal computers. Much of the explosion of creativity and the rapid development of new apps is due to the fact that Stable Diffusion is both open source—programmers are free to change it, build on it, and make money from it—and lightweight enough for people to run at home.
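A back-of-the-envelope calculation shows why working in latent space is so much cheaper. The figures below are not from this article; they are the commonly cited ones for the first public release of Stable Diffusion, assumed here for illustration: a 512-by-512 color image is compressed to a 64-by-64 grid with four channels.

```python
def num_values(height, width, channels):
    """Count the numbers the diffusion model has to process at each step."""
    return height * width * channels

# Assumed figures (not from the article): 512x512 RGB image,
# compressed 8x along each side into a 4-channel latent.
pixel_space = num_values(512, 512, 3)
latent_space = num_values(512 // 8, 512 // 8, 4)
reduction = pixel_space / latent_space

print("values per image in pixel space: ", pixel_space)
print("values per image in latent space:", latent_space)
print("reduction factor:", reduction)
```

Under those assumptions the diffusion model handles roughly 48 times fewer values at every step, which goes a long way toward explaining why it fits on a home computer.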
Redefining creativity
For some, these models are a step toward artificial general intelligence, or AGI—an over-hyped buzzword referring to a future AI that has general-purpose or even human-like abilities. OpenAI has been explicit about its goal of achieving AGI. For that reason, Altman doesn’t care that DALL-E 2 now competes with a raft of similar tools, some of them free. “We’re here to make AGI, not image generators,” he says. “It will fit into a broader product road map. It’s one smallish element of what an AGI will do.”