I have recently tried out Stable Diffusion, a text-to-image deep learning model. This post contains various images that Stable Diffusion generated from my text and image inputs.
Input: a high tech dystopian city in the sky, details, high resolution
I personally like images 2 and 4, but to be honest none of them are particularly surprising. The input is very standard, and there are many similar AI-generated pictures on the internet.
Just out of curiosity, I added the word "slum" and upscaled the image to see what would happen. This was the result.
Input: religion parody
This one is quite funny. It probably produced a meme because of the keyword "parody". Although I have no idea what this meme could possibly mean, you can see a lot of Christian elements in it. The text just seems to be gibberish.
Image 1: Mix between the pope and Bernie Sanders?
Image 2: An orthodox priest
Image 3: A poor person or maybe a monk?
Image 4: An angel
Image 5: Maybe the white gown or the beard is supposed to look religious
Image 6: Ugly Mary?
Input 1: male brad pitt
Input 2: female angelina jolie
This is just me having some more fun. I was actually surprised by how well the faces turned out. As you will see in the next images, human anatomy is not always easy for the model to get right.
Input: old person plays on wooden grand piano, nostalgic
The most egregious flaw must be the hands, which look more like feet. There are several more issues, such as the piano keys and the legs, and it seems like the "nostalgic" keyword was ignored.
Input: pianist plays on brown grand piano, concert
My first attempt at getting a better result was to use words that are more closely related to the context. This image is already a lot closer to what I imagined, and although you can recognize the hands, they look very wrong. The next image was generated using the image-to-image method with only one input word, "hands".
Input: hands
The idea was to focus on only one thing. Now the right hand has something like a thumb, and the left hand has 3.5 fingers instead of 3. But to be honest, the left hand looks more wrong in this image.
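For anyone who wants to try the image-to-image step in code, here is a minimal sketch. It assumes the Hugging Face `diffusers` library and Pillow, which is not necessarily the tool used for the images in this post. The names `clamp_strength` and `image_to_image`, the `model_id` parameter, and the default strength of 0.5 are all made up for illustration.

```python
from pathlib import Path


def clamp_strength(strength: float) -> float:
    """Keep the img2img strength inside the valid range [0.0, 1.0]."""
    return max(0.0, min(1.0, strength))


def image_to_image(init_path: Path, prompt: str, model_id: str,
                   strength: float = 0.5):
    """Regenerate an existing image, guided by a short prompt.

    A low strength stays close to the input image, a high strength
    lets the model repaint most of it. The default of 0.5 is an
    arbitrary starting point, not a recommended value.
    """
    # Imported lazily so clamp_strength works without these packages.
    from diffusers import StableDiffusionImg2ImgPipeline
    from PIL import Image

    # model_id is whichever Stable Diffusion checkpoint you have available.
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(model_id)
    init_image = Image.open(init_path).convert("RGB")
    result = pipe(prompt=prompt, image=init_image,
                  strength=clamp_strength(strength))
    return result.images[0]
```

A call could look like `image_to_image(Path("pianist.png"), "hands", model_id=...)`, where the file name is a placeholder for the previously generated picture.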
Input: pianist, piano, hand, looking down, detailed, detailed skin, detailed face, relaxed, light, shadow, photo, symmetrical ears and face shape, symmetrical circular eyes reflecting the environment, natural color
This input has yielded the best result so far. It describes in detail what the image should include, and the negative prompt includes about as many specifics. But it is still far from perfect: the reflection of the keys is the wrong way around and the lid is deformed. I guess the more complex an activity is, the harder it is to get right. I also think it does not help that we are very familiar with human anatomy and its depiction, so any oddness immediately stands out.
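To show how a detailed prompt and a negative prompt fit together, here is a rough sketch that again assumes the Hugging Face `diffusers` library rather than whatever tool produced the images above. The helper names `join_keywords` and `text_to_image`, the `model_id` parameter, and the default of four images are illustrative assumptions. The negative keywords actually used for this image are not listed in the post, so the caller has to supply their own.

```python
def join_keywords(keywords):
    """Join keywords into one comma-separated prompt, skipping blanks."""
    return ", ".join(k.strip() for k in keywords if k.strip())


def text_to_image(keywords, negative_keywords, model_id, num_images=4):
    """Generate images from positive and negative keyword lists.

    The negative prompt names things the image should NOT contain.
    """
    # Imported lazily so join_keywords works without diffusers installed.
    from diffusers import StableDiffusionPipeline

    # model_id is whichever Stable Diffusion checkpoint you have available.
    pipe = StableDiffusionPipeline.from_pretrained(model_id)
    result = pipe(
        prompt=join_keywords(keywords),
        negative_prompt=join_keywords(negative_keywords),
        num_images_per_prompt=num_images,
    )
    return result.images
```

Keeping the keywords in lists makes it easy to add or drop a single word between runs and compare the results.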
Next I tried to generate something simpler as a comparison. Take a look at the following four images, which show just a hand and people waving, respectively. I am not going to include the input for each picture, but I have used this Reddit post as a reference.