It's been a month and I'm still consumed by Stable Diffusion and AI art.
The number of parameters you can apply to a prompt to impact the outcome is absolutely mind-blowing.
Sure, you could type "a picture of a monkey wearing lipstick" and get a result that is a monkey wearing lipstick, but there are more combinations than stars in the Milky Way that impact how that generated image ends up looking.
I'm still getting a grasp on prompts. AI art may end "traditional" artist careers, but there is a whole other set of skills involved in getting the desired generated image out of the AI. I bet "prompt artist" becomes a common term.
After a month, I've finally worked out a decent prompt for a photorealistic image. First tip: don't use "realistic" in the prompt.
Using the monkey wearing lipstick example, it would be:
Code: Select all
(8k uhd), RAW Photo, monkey wearing lipstick, detailed skin, quality, sharp focus, tack sharp, Fujifilm XT3, crystal clear
Focusing on camera-type images, because art/style is a whole other thing, you can use pretty much all the terms you'd use with regular photography and they impact the image. Example:
Film type (Kodak gold 200, Portra 400, fujifilm superia), DSLR, Camera model, Hasselblad, Film Format or Lens type (35mm, 70mm IMAX), (85mm, Telelens etc.), Film grain
Then there are examples of details and lighting effects:
accent lighting, ambient lighting, backlight, blacklight, blinding light, candlelight, concert lighting, crepuscular rays, direct sunlight, dusk, Edison bulb, electric arc, fire, fluorescent, glowing, glowing radioactively, glow-stick, lava glow, moonlight, natural lighting, neon lamp, nightclub lighting, nuclear waste glow, quantum dot display, spotlight, strobe, sunlight, ultraviolet, dramatic lighting, dark lighting, soft lighting, gloomy
highly detailed, grainy, realistic, unreal engine, octane render, bokeh, vray, houdini render, quixel megascans, depth of field (or dof), arnold render, 8k uhd, raytracing, cgi, lumen reflections, cgsociety, ultra realistic, volumetric fog, overglaze, analog photo, polaroid, 100mm, film photography, dslr, cinema4d, studio quality
And camera view etc:
ultra wide-angle, wide-angle, aerial view, massive scale, street level view, landscape, panoramic, bokeh, fisheye, dutch angle, low angle, extreme long-shot, long shot, close-up, extreme close-up, highly detailed, depth of field (or dof), 4k, 8k uhd, ultra realistic, studio quality, octane render
These settings actually matter.
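Side note: since these are all just comma-separated terms, I've been playing with mixing them together in a little Python script before pasting into the prompt box. This is just my own throwaway sketch (the category lists and the function are nothing official, just an illustration), but it gives the idea:

Code: Select all
import random

# Small pools of terms pulled from the lists above -- add your own.
camera = ["Fujifilm XT3", "Hasselblad", "Kodak Portra 400", "85mm", "35mm"]
lighting = ["soft lighting", "backlight", "crepuscular rays", "candlelight"]
detail = ["detailed skin", "sharp focus", "film grain", "depth of field"]
view = ["close-up", "low angle", "wide-angle", "extreme long-shot"]

def build_prompt(subject: str) -> str:
    """Assemble a photo-style prompt: the subject plus one random term per category."""
    parts = [
        "(8k uhd)", "RAW Photo", subject,
        random.choice(detail),
        random.choice(lighting),
        random.choice(view),
        random.choice(camera),
    ]
    return ", ".join(parts)

print(build_prompt("monkey wearing lipstick"))
# e.g. "(8k uhd), RAW Photo, monkey wearing lipstick, sharp focus, backlight, close-up, Fujifilm XT3"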
Still focusing just on generating photo-type images, all of those prompt terms (just a sampling, there are many, many more) influence the crazy number of other things that impact the end result:
Model: hundreds of them now, maybe thousands. This is the base that all the images are trained on. There are models focused on realism, anime-type models, artistic types, etc. But there are also some really good models that can manage all types with good output.
CFG - Classifier Free Guidance: the lower the number, the more freedom the AI has to render the image; the higher the setting, the stricter the AI adheres to the prompt you input. Typical settings run from 3-15, and every setting in between produces a different outcome.
Steps: the higher the number, the more time the AI spends diffusing the image. To a point, the more steps, the higher the quality of the output image (depending on the sampling method used).
Clip Skip: another setting that totally changes the image. Clip skip 1 is in general best for more realistic images, clip skip 2 is better for art/animated.
Sampling Method: there's a long technical explanation for it, but it impacts the style, colors, etc. of the end image.
Seed: determines the randomness of the generated image. Usually you set it to random (-1), but if you find an image you like and want to make more like it with only small changes, you keep the seed #, which will keep the general layout of the image. (See the sketch just below for how these settings map onto actual generation parameters.)
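If you want to see how those knobs line up with real code, here's a rough sketch using the Hugging Face diffusers library instead of the Automatic1111 UI (the model name and the numbers are just placeholders I picked, not recommendations):

Code: Select all
import torch
from diffusers import StableDiffusionPipeline

# Load a base model (placeholder checkpoint -- swap in whatever model you use).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Fixing the seed reproduces the same general layout every run.
generator = torch.Generator("cuda").manual_seed(1234567890)

image = pipe(
    "(8k uhd), RAW Photo, monkey wearing lipstick, detailed skin, sharp focus, Fujifilm XT3",
    guidance_scale=7,          # CFG: lower = more freedom, higher = stricter adherence
    num_inference_steps=30,    # Steps: more diffusion passes, up to a point
    generator=generator,       # Seed: -1/random in the UI, fixed here
).images[0]

image.save("monkey.png")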
Text to image (txt2img): this is where you put in your initial prompt and create something.
Image to Image: after you find something you like, you send it to img2img and then start generating smaller changes until:
Inpaint: once you have something you really like but maybe the person has six toes, or you want to add or take out something, you use this. It's done with a masking technique; there are various ways, but basically if a hand looks screwed up (extra digits, etc.) you mask the hand and then regenerate only that part of the image until you get what you want. The rest of the image stays the same and the AI blends in the new result.
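For what it's worth, the same idea exists outside the UI too. Here's a rough diffusers sketch of inpainting, where only the masked region gets regenerated (the file names and mask are made up for the example):

Code: Select all
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Placeholder inpainting checkpoint -- use whichever inpaint model you like.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("monkey.png").convert("RGB")
mask = Image.open("hand_mask.png").convert("RGB")  # white = area to regenerate

# Only the masked area is re-diffused; the rest of the image is kept and blended.
fixed = pipe(
    prompt="a normal hand, five fingers, detailed skin, sharp focus",
    image=init_image,
    mask_image=mask,
).images[0]

fixed.save("monkey_fixed.png")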
After all of that, you still have LoRAs, embeddings, textual inversions, wildcards, ControlNet, etc. that you can use.
LoRAs are small files that are trained on something specific, like a person, an object, a suit of armor, a style, etc.
Say you do everything above, all those bazillion parameters, but you want to have the same person in the image: you use a LoRA with that person's trained features and add it to the prompt. <lora:chevy_chase:1> would give your image Chevy Chase's face (and maybe body, etc., depending on how it was trained). The 1 at the end is the weight; 0.5 would give the model being used more flexibility in integrating your LoRA into the scene, 1.5 would force it or make it more prominent.
Now with LoRAs, the first thing you think of is training a celebrity's face or anyone's face, but you can train much more detailed things. One example I just saw on CivitAI was "bags under the eyes". If the normal "tired, sleepy," etc. prompts aren't enough, someone trained a bunch of pictures of people with dark bags under their eyes; if I add that LoRA along with the Chevy Chase LoRA, he would have baggy eyes, and then you can adjust the amount of bagginess with that same 0.1-1.5 scale.
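Sticking with my earlier photo prompt, stacking a couple of LoRAs in Automatic1111 just looks like this (the LoRA file names here are made up; you'd use whatever you actually downloaded):

Code: Select all
(8k uhd), RAW Photo, chevy chase wearing lipstick, detailed skin, sharp focus, Fujifilm XT3, <lora:chevy_chase:1>, <lora:eye_bags:0.6>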
Obviously sexual positions are all LoRAs, but you can also use ControlNet to specifically position your bodies, etc.
TLDR to this point: Stable Diffusion (with Automatic1111) is amazing. You can keep it simple and get decent results, but there is also incredible depth to it.
I've been having fun with dynamic prompting/wildcards. (The X/Y/Z Script is also really cool.)
Since there are an insane number of ways you can affect the end result, X/Y/Z lets you create grids: enter a prompt, then ask for a grid using various CFG settings, or steps, clip skip, etc., and also across these X number of models, and you see the same prompt with different outcomes and can compare them. But you can also have it swap out part of the prompt itself (Prompt S/R), so for example, if you wanted to see how the different films (Kodak, Fuji, etc.) impact the prompt, you could use the same prompt with only the film changing to create a grid of sample results.
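Conceptually the script is just looping over whatever values you pick for each axis. Something like this toy diffusers sketch gives the flavor (it's not the actual A1111 code, and the values are just examples I made up):

Code: Select all
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "(8k uhd), RAW Photo, monkey wearing lipstick, sharp focus, Fujifilm XT3"
cfg_values = [3, 7, 11, 15]     # X axis of the grid
step_values = [20, 30, 40]      # Y axis of the grid

for steps in step_values:
    for cfg in cfg_values:
        # Same seed for every cell, so only CFG/steps change between images.
        generator = torch.Generator("cuda").manual_seed(1234567890)
        img = pipe(prompt, guidance_scale=cfg, num_inference_steps=steps,
                   generator=generator).images[0]
        img.save(f"grid_cfg{cfg}_steps{steps}.png")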
Another way to use wildcards: say you want to see what your favorite prompt looks like painted by 100 different artists. You create a text file with those 100 artists, use that list as the wildcard, then generate 100 images, each one rendered based on an individual artist.
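With the Dynamic Prompts extension the wildcard is just the file name wrapped in double underscores. So assuming a file called artists.txt in the wildcards folder with one name per line (these are only example names):

Code: Select all
Ansel Adams
Annie Leibovitz
Dorothea Lange

the prompt then references it like:

Code: Select all
monkey wearing lipstick, photograph by __artists__

and each generated image picks a different line from the file.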
Another nice thing is Automatic1111 saves all the images from txt2img, img2img, and also the batches (say you generate 12 images at one time, it creates a thumbnail contact sheet of them). Every one of those images has embedded in it all the settings from the prompt used as well as the model, CFG, seed, steps, etc., so if/when you go back and look at them, you can drag that image into Automatic1111's PNG Info tab and it will extract the info and you can send it straight to img2img, txt2img, etc. to work with. Basically, once you have something you like, or something close to what you want from someone else, you can use that image as a starting point and then start tweaking.
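And if you'd rather pull that embedded info out with a script instead of the PNG Info tab, here's a quick Pillow sketch (assuming the default save format, where the settings land in a PNG text chunk called "parameters"):

Code: Select all
from PIL import Image

# Read the generation settings Automatic1111 embeds in its saved PNGs.
img = Image.open("monkey.png")

# Prompt, negative prompt, steps, sampler, CFG, seed, model, etc. all live
# in the "parameters" text chunk if the image came from A1111.
params = img.info.get("parameters", "no embedded parameters found")
print(params)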