To quote Wikipedia (here): "Stable Diffusion is a deep learning, text-to-image model released in 2022. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and generating image-to-image translations guided by a text prompt." This of course sounds nice, but what makes it special and how can you use it?
What makes Stable Diffusion special?
The dataset
The Stable Diffusion model was trained on three subsets of LAION-5B: laion2B-en, laion-high-resolution, and laion-aesthetics v2 5+. These datasets have been scraped from the web and are available for download (here). This sets Stable Diffusion apart from, for example, DALL-E and Midjourney, whose training datasets are not publicly available.
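If you want to peek at the data yourself, the sketch below streams a few rows of the laion2B-en metadata (image URLs plus captions). It assumes the metadata is mirrored on the Hugging Face Hub as laion/laion2B-en and that the columns are called URL and TEXT; adjust these to whichever mirror you actually download from.

```python
# Minimal sketch: peek at a few rows of the laion2B-en metadata.
# Assumes the metadata is mirrored on the Hugging Face Hub as "laion/laion2B-en";
# the column names ("URL", "TEXT") are assumptions and may differ per mirror.
from datasets import load_dataset

laion = load_dataset("laion/laion2B-en", split="train", streaming=True)

for i, row in enumerate(laion):
    print(row.get("URL"), "->", row.get("TEXT"))
    if i >= 4:  # only show the first five rows
        break
```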
The model
The model is publicly available here and here, again in contrast to most of the competition. This means you can do things like train it on additional material (for example with Dreambooth) to change the context of an object. You can also disable the NSFW check and the hidden watermark that the Stable Diffusion software adds to generated images.
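Because the weights are just files on the Hugging Face Hub, you can download them yourself. A minimal sketch, assuming the runwayml/stable-diffusion-v1-5 repository (some repositories require you to accept a license and log in with huggingface-cli login first):

```python
# Minimal sketch: download the Stable Diffusion 1.5 weights for local use.
# Assumes the "runwayml/stable-diffusion-v1-5" repository on the Hugging Face Hub;
# some repositories require accepting a license first (huggingface-cli login).
from huggingface_hub import snapshot_download

local_dir = snapshot_download("runwayml/stable-diffusion-v1-5")
print("Model files downloaded to:", local_dir)
```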
Running locally
There is also a large community around Stable Diffusion which creates tools around the model, such as the Stable Diffusion WebUI or the previously mentioned Dreambooth. Since the model is publicly available, you can run it yourself on your laptop for free and you do not need to depend on a third party offering it as a SaaS solution.
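If you prefer plain Python over a web interface, the Hugging Face diffusers library can run the model locally in a few lines. A minimal sketch, assuming an Nvidia GPU with enough VRAM and the SD 1.5 weights mentioned above; pass safety_checker=None only if you consciously want to disable the NSFW check described earlier.

```python
# Minimal sketch: run Stable Diffusion 1.5 locally with the diffusers library.
# Assumes an Nvidia GPU; drop torch_dtype and "cuda" to run (slowly) on CPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    safety_checker=None,  # optional: disables the NSFW check, as discussed above
)
pipe = pipe.to("cuda")

image = pipe("a beautiful cute fluffy baby animal with a fantasy background").images[0]
image.save("baby_animal.png")
```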
Features
Text to image generation
You can use a prompt to indicate what you want to have created. You can give weights to specific words in the prompt, and the weight can also be negative for things you don't want to see in your output. The prompt usually contains things like the object or creature you want to see and the style. For example, I generated the image below for my 5-year-old daughter;
The prompt used for the above image was;
a beautiful cute fluffy baby animal with a fantasy background, style of kieran yanner, barret frymire, 8k resolution, dark fantasy concept art, by Greg Rutkowski, dynamic lighting, hyperdetailed, intricately detailed, trending on Artstation, deep color, volumetric lighting, Alphonse Mucha, Jordan Grimmer
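For reference, this is roughly what those knobs look like in code. The per-word weighting syntax is a feature of front-ends like the WebUI; in plain diffusers you mainly work with a negative prompt and a guidance scale. The negative prompt, step count, and seed below are just illustrative values.

```python
# Sketch: the prompt above, with a negative prompt and a fixed seed for reproducibility.
# Uses the SD 1.5 weights from the earlier sketch; all parameter values are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = (
    "a beautiful cute fluffy baby animal with a fantasy background, "
    "style of kieran yanner, barret frymire, 8k resolution, dark fantasy concept art, "
    "by Greg Rutkowski, dynamic lighting, hyperdetailed, intricately detailed, "
    "trending on Artstation, deep color, volumetric lighting, Alphonse Mucha, Jordan Grimmer"
)

image = pipe(
    prompt,
    negative_prompt="draft, blurry, low resolution, deformed",  # things you do not want to see
    num_inference_steps=30,     # more steps means more refinement, but slower generation
    guidance_scale=7.5,         # how strictly the model follows the prompt
    generator=torch.Generator("cuda").manual_seed(42),  # fixed seed for reproducible output
).images[0]
image.save("baby_animal_v2.png")
```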
Image to text / CLIP interrogation
You can ask the model what it sees in a picture so you can use this text to generate similar images. This is called CLIP interrogation and can be done for example here.
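If you want to do this locally instead, here is a minimal sketch, assuming the open source clip-interrogator package (pip install clip-interrogator); the input file name is just a placeholder.

```python
# Minimal sketch: ask a model what it "sees" in an image (CLIP interrogation).
# Assumes the open source clip-interrogator package; the file name is a placeholder.
from PIL import Image
from clip_interrogator import Config, Interrogator

ci = Interrogator(Config(clip_model_name="ViT-L-14/openai"))  # CLIP model used by SD 1.x
image = Image.open("dog_on_bench.jpg").convert("RGB")

prompt = ci.interrogate(image)
print(prompt)  # a prompt you can feed back into Stable Diffusion to get similar images
```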
Inpainting
You can replace a part of an image with something else. For example, in the image below I've replaced the dog on the bench with a cat (I prefer cats).
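In code this looks roughly like the sketch below, assuming the diffusers inpainting pipeline with the runwayml/stable-diffusion-inpainting weights. The image and mask file names are placeholders; the mask is white where the model should paint something new and black where the original should stay.

```python
# Minimal sketch: replace the masked area of an image (the dog on the bench) with a cat.
# Assumes the "runwayml/stable-diffusion-inpainting" weights; file names are placeholders.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("dog_on_bench.jpg").convert("RGB").resize((512, 512))
mask_image = Image.open("dog_mask.png").convert("RGB").resize((512, 512))  # white = repaint

result = pipe(
    prompt="a cat sitting on a bench",
    image=init_image,
    mask_image=mask_image,
).images[0]
result.save("cat_on_bench.png")
```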
Outpainting
You can ask the model to generate additional areas around an existing image. For example below is a picture of me. I asked Stable Diffusion to generate a body below my head.
There is even a complete web interface, Stable Diffusion Infinity, to help you do this on a canvas;
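As far as I know diffusers has no dedicated outpainting pipeline, but you can approximate it with the inpainting pipeline: paste the original image on a larger canvas and mask the empty area. A rough sketch under those assumptions (sizes and file names are placeholders):

```python
# Rough sketch: outpainting by inpainting the empty area of an enlarged canvas.
# Assumes the same inpainting weights as before; sizes and file names are placeholders.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

head = Image.open("portrait_head.jpg").convert("RGB").resize((512, 256))

# Place the original in the top half of a 512 x 512 canvas; the bottom half is new.
canvas = Image.new("RGB", (512, 512), "black")
canvas.paste(head, (0, 0))

# Mask: black (keep) for the original area, white (generate) for the new area.
mask = Image.new("RGB", (512, 512), "black")
mask.paste(Image.new("RGB", (512, 256), "white"), (0, 256))

result = pipe(prompt="a man in a suit, full body portrait", image=canvas, mask_image=mask).images[0]
result.save("portrait_outpainted.png")
```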
Upscaling
You can increase the resolution of an image, for example to turn a small 512 x 512 generation or a low resolution photo into a much larger version.
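A minimal sketch, assuming the Stable Diffusion x4 upscaler that is available through diffusers (stabilityai/stable-diffusion-x4-upscaler); the input file name is a placeholder.

```python
# Minimal sketch: upscale a low resolution image 4x with the Stable Diffusion upscaler.
# Assumes the "stabilityai/stable-diffusion-x4-upscaler" weights; the file name is a placeholder.
import torch
from PIL import Image
from diffusers import StableDiffusionUpscalePipeline

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = Image.open("thumbnail_128.png").convert("RGB")  # e.g. a 128 x 128 thumbnail

upscaled = pipe(prompt="a beautiful cute fluffy baby animal", image=low_res).images[0]
upscaled.save("upscaled_512.png")
```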
Concerns and limitations
- You can alter copyrighted material, remove watermarks, upscale thumbnails or low resolution photos, and make variations which are hard to trace back to the original. This allows a person to circumvent certain online protections of images.
- You can create fake news. For example create a photo of a large audience at Trump's inauguration.
- You can use the style of artists and their names without permission to create works of art and then compete with these same artists using these generated works. It is also currently not easy for artists to opt out of AI models in order to protect their work and style. You can imagine artists are not happy about this.
- It becomes easy to generate NSFW material (Google for example Unstable Diffusion). This can be abused by, for example, using someone's Facebook pictures as base material without their permission.
- Common things are easy, uncommon things are not. Less common poses (e.g. hands) and less common or highly detailed objects (e.g. a crossbow) are difficult to get right.
- Resolution. 512 x 512 is the default and the resolution the model (SD 1.5) works best at; it can also work with multiples of 64, e.g. 576, 640, 704. Stable Diffusion 2.1 works at a 768 x 768 resolution.
- Requires a good graphics card. E.g. an Nvidia card with 4 GB VRAM is the absolute minimum and 8 GB is preferable (or use the cloud, e.g. Google Colab). The sketch after this list shows a couple of options to reduce memory usage.
- Generation takes time and requires patience. It can take hours to generate (multiple variants of) images when running locally.
- Setting up your environment requires some knowledge.
- Tweaking your generation configuration is not straightforward and requires you to understand a bit of what is actually happening.
- Generating prompts which create nice images is not as straightforward as you might expect. For example, you need to know which artists work in the style you want to generate. There are also words that help, such as 'high resolution', and negative prompts, such as 'draft'. Knowing which words to use plays a major part in generating good images.
- Establishing a workflow is important. First generation, then inpainting, then upscaling is a general way to go about this. Especially the inpainting phase takes a lot of time.
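For the graphics card point above: diffusers has a couple of switches that trade speed for memory, which can make the difference on a smaller card. A minimal sketch, again assuming the SD 1.5 weights used earlier.

```python
# Minimal sketch: memory-saving options for smaller GPUs (see the graphics card point above).
# Half precision plus attention slicing trades some speed for a much lower VRAM footprint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # half precision halves memory use
).to("cuda")
pipe.enable_attention_slicing()  # compute attention in slices instead of all at once

image = pipe("a beautiful cute fluffy baby animal", num_inference_steps=25).images[0]
image.save("low_vram_test.png")
```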