We’ve trained a neural network called DALL·E that creates images from text captions for a wide range of concepts expressible in natural language.
We find that, just as a large transformer model trained on language can generate coherent text, the same kind of model trained on pixel sequences can generate coherent image completions and samples.
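The shared recipe here is autoregressive generation: the model repeatedly predicts a distribution over the next token (a word piece for text, a discretized pixel or image patch for images) given everything generated so far, and a token is sampled from that distribution. A minimal sketch of that loop is below; the `toy_predict_next` function is a hypothetical stand-in for the trained transformer, not part of any actual system.

```python
import random

def sample_completion(prefix, vocab_size, length, predict_next, seed=0):
    """Autoregressively extend a token sequence: at each step, ask the
    model for a distribution over the next token given all tokens so far,
    then sample from it and append the result."""
    rng = random.Random(seed)
    tokens = list(prefix)
    for _ in range(length):
        probs = predict_next(tokens, vocab_size)
        next_tok = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        tokens.append(next_tok)
    return tokens

def toy_predict_next(tokens, vocab_size):
    """Hypothetical stand-in 'model': slightly favors repeating the last
    token, loosely mimicking the local smoothness of natural images. A
    real system would use a transformer's predicted distribution here."""
    probs = [1.0] * vocab_size
    if tokens:
        probs[tokens[-1]] += vocab_size  # bias toward continuity
    total = sum(probs)
    return [p / total for p in probs]

# Complete a "partial image" given as a prefix of three pixel tokens.
completion = sample_completion(prefix=[3, 3, 3], vocab_size=8,
                               length=5, predict_next=toy_predict_next)
```

The same loop works for either modality because nothing in it is specific to text or pixels; only the tokenization and the learned predictor change.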