Stable Diffusion and AI stuff

Support, Discussion, Reviews

Re: Stable Diffusion and AI stuff

Post by Winnow »



AI is still progressing at breakneck speed. At least five new models came out recently, ranging across video, image, audio, and text, and all of them are nice advancements. There's been huge progress in voice/emotion for local AI.

Just a little while ago I was gushing over OpenAI's image generator that allowed you to edit images using text commands and keep the image mostly the same.

Fast forward to yesterday: Black Forest Labs released FLUX Kontext.

Basically, it allows you to do all sorts of things with an image while keeping the context of the image. The video above is worth a watch. Right now it's available only in the Pro/Ultimate models, but a local FLUX Kontext DEV model (like FLUX DEV before it) will be released, allowing this to be done locally.

It can modify a scene, extract things from an image, and change text in an image while keeping the same style but with new text. If you check out the video, it can do things like "take the pattern off the store window and make it a tattoo on a man's back."

It will also allow you to expand and fill an image (inpainting/outpainting).
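
When the local DEV weights drop, this kind of edit should be scriptable. Here's a minimal sketch, assuming the diffusers FluxKontextPipeline and a "black-forest-labs/FLUX.1-Kontext-dev" model ID; treat both names as assumptions and check the actual release before copying:

```python
# Sketch of text-driven image editing with a local Kontext DEV model via diffusers.
# Assumes a diffusers build that includes FluxKontextPipeline and downloadable DEV weights.
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
).to("cuda")

source = load_image("store_window.png")  # the image you want to edit
edited = pipe(
    image=source,
    prompt="Take the pattern off the store window and make it a tattoo on a man's back",
    guidance_scale=2.5,
).images[0]
edited.save("edited.png")
```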

According to the charts near the end of the video, even the local FLUX Kontext DEV version is better than the commercial competition like ChatGPT in some areas. It's also fast: it looks like around 10 seconds for image2image using the Kontext DEV version and less than 3 seconds for Pro (vs. almost a minute for gpt-image).

The local version won't be the "best" version but what matters is that it can run locally.

Re: Stable Diffusion and AI stuff

Post by Winnow »

Quick AI update before going traveling.

AI is amazing and hasn't stopped progressing at all. Capabilities keep increasing at the same rate or faster.

As for video, the developments have come in making the same videos faster with less VRAM. I can now make quality 7-second videos in about 90 seconds, which means those with less VRAM can make them too, just with longer processing times. Image to video is almost like magic. It still amazes me how the models figure out the movement of everything: hair, cloth, wind-blown stuff, etc. The key here is the model's ability to retain the facial features from a single still image and create a consistent video from it...and it's getting really good at that.

I won't post any guides. It still takes some effort and isn't worth trying to show how to do it here for now.

I've been focusing on audio AI. Chatterbox is amazing. This extended version in particular:

https://github.com/petermg/Chatterbox-TTS-Extended

You can use any 8-20 seconds of any voice and it will reproduce that voice exactly, along with a setting for how much emotion you want.

It's really fucking good. This is "zero-shot" meaning you don't need to train it. Just give it the 8-20 second audio sample and you're off to the races.
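
If you want to script it outside the UI, the base Chatterbox package is only a few lines. A minimal sketch, assuming the chatterbox-tts package's ChatterboxTTS.from_pretrained / generate API (parameter names from memory, so double-check the repo README):

```python
# Zero-shot voice cloning sketch with chatterbox-tts (pip install chatterbox-tts).
# The 8-20 second reference clip goes in audio_prompt_path; exaggeration controls emotion.
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")
wav = model.generate(
    "Just give it a short audio sample and you're off to the races.",
    audio_prompt_path="voice_sample.wav",  # your 8-20 second reference clip
    exaggeration=0.7,                      # how much emotion/intensity you want
    cfg_weight=0.5,                        # pacing; lower tends to be slower and more deliberate
)
ta.save("cloned_voice.wav", wav, model.sr)
```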

Here are some examples of the same voice and then the zero-shot cloned voices with various levels of exaggeration:

https://resemble-ai.github.io/chatterbox_demopage/

The extended version (more capabilities) allows you to stitch together segments. I've been making around 15 minute stories but I think you can go to around 30 minutes without issue.
------------------------------

Another really cool feature of the extended version is voice conversion. It allows you to take an audio file and convert the voice to the sample voice you provide. It changes the voice but not the accent etc.

For example, I took the first chapter of the first Harry Potter audiobook and provided a sample of a Japanese anime actress. I converted the entire first chapter in about 90 seconds (RTX 4090). The result was the actress's voice, but she sounded British and used the exact tone/inflection of the original male narrator. Same high quality as the "professional" narrator.
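
The same package exposes the voice-conversion side as well. A sketch of how that call looks, assuming the ChatterboxVC class and a generate(audio, target_voice_path=...) signature (verify against the repo, this is from memory):

```python
# Speech-to-speech conversion sketch: keep the original narrator's pacing and inflection,
# swap in the timbre of the reference voice.
import torchaudio as ta
from chatterbox.vc import ChatterboxVC

vc = ChatterboxVC.from_pretrained(device="cuda")
converted = vc.generate(
    "audiobook_chapter1.wav",              # original narration to convert
    target_voice_path="voice_sample.wav",  # the voice you want to hear instead
)
ta.save("chapter1_converted.wav", converted, vc.sr)
```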

What this means is that people like Spang can take their favorite trans voice actor and convert any audio he hears into that special trans actor's voice that he holds dear. For the rest of us, it means that if you have an audiobook and can't stand the voice of that "professional" voice actor, you can convert the entire book into a voice you do like, with zero loss in quality or voice emphasis/emotion.

If you also don't like their tone/emotion, you can skip conversion entirely and generate from just the text, setting the temperature/CFG, etc. for the voice of your choice to create an entirely new narration.

For quality control, you can have it generate each sentence multiple times and have Whisper check each line for accuracy, regenerating the line if it doesn't come out right. That takes a lot more time, but if you really want high quality, it's capable of it.
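
That check is simple enough to roll yourself if your workflow doesn't have it built in. A sketch of the idea, using the openai-whisper package; the model variable is the ChatterboxTTS instance from the earlier sketch:

```python
# QC loop sketch: transcribe each generated line with Whisper and regenerate it
# until the transcript roughly matches the intended text (capped retries).
import whisper
import torchaudio as ta
from difflib import SequenceMatcher

asr = whisper.load_model("base.en")

def sounds_right(wav_path: str, target: str, threshold: float = 0.9) -> bool:
    heard = asr.transcribe(wav_path)["text"].strip().lower()
    return SequenceMatcher(None, heard, target.strip().lower()).ratio() >= threshold

line = "It still amazes me how the models figure out the movement of everything."
for attempt in range(3):
    wav = model.generate(line, audio_prompt_path="voice_sample.wav")
    ta.save("line.wav", wav, model.sr)
    if sounds_right("line.wav", line):
        break
```
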
---------------------

This is a huge step. Voice AI is now able to reproduce emotion, and it's on the cusp of making it easy to break down a book/script into individual voices and have the AI use as many voices as you want for the book...way better than those poor, soon-to-be-out-of-work voice actors can do. How many times have you (Spang/trans stuff excepted) been disgusted by a voice actor trying to emulate the opposite sex? No more. You will very soon be able to have high-quality voices with adjustable emotion for the narration and every character in the book. Badass. The tech is already there, but it will be another month or two before it's as easy as Chatterbox above. They're working on the model already.

I attached a zip file with an MP3 of this post in Graham Hancock's voice (can't attach MP3s directly). It's always more interesting if Graham Hancock is saying it!

Note: I upped the exaggeration just a tad so Graham is speaking a little fast : )

It's worth a listen! Listen to it as you read along so you can evaluate its quality. I did change "Badass" to "It's badass bitches!" in the audio version, and I also left in the sentence "It still takes some effort and isn't worth trying to show how to do it here for now," so that's on me, not the AI, for repeating what I said.

Re: Stable Diffusion and AI stuff

Post by Winnow »

I've been messing around with the latest local video generator, WAN2.2. It's really good.

Spang photobombs EQ Cover Lady
eqcover.png
Starting from this image, I used Kontext to make it realistic, then used WAN2.2 to create these. Note: I didn't directly use the first image. I could have kept her pose and clothes more accurate. Her side stance actually looks kinda strange to me. Sony probably didn't pay for a good artist!
eqphotob2.gif
click image to see gif animation
eqphotob1.gif
Spang in costume
A sorceress. A man leans in from the side and makes the peace sign. The sorceress hits the man with her staff knocking him out of the scene
Unfortunately, these are high-quality videos converted to .gif and then optimized to be postable on this forum. The original quality is really good.

It does text too; I didn't put effort into this, so I didn't bother correcting Spang's t-shirt.
----------------------
Time:

Less than 60 seconds to make the EQ cover girl photorealistic and remove the red tail at her feet using FLUX Kontext.

Less than 2 minutes to create each video.
----------------------------
These are bad examples of how good WAN2.2 is, but they were quick tests and they made me laugh.

The key thing with WAN2.2 vs WAN2.1 is that it's a dual-model system: the first model sets up the scene and motion, and the second model refines it. This allows for higher-quality videos without requiring more VRAM. I still have to use all the tricks available to make it process quickly with 24GB of VRAM, but if you're willing to wait around 30 minutes, you can generate on as little as 8GB. Two minutes = fun, but if you really have something you want to animate or make a video of, at least you can with lower VRAM and some hoop jumping.

WAN2.2 is getting really good at following prompts. It also makes very few clipping errors (like the one in the second gif where her staff shows through her cloak).

TL;DR the EQ Cover lady doesn't like to be photobombed

Re: Stable Diffusion and AI stuff

Post by Winnow »

After a week or so of WAN2.2 being refined in workflows etc, it's safe to say this model is amazing.

I understand our forum is running on World War II or early-2000s tech, so I can't upload short MP4 videos as examples; I'll just describe what makes it great.

WAN2.2 is incredibly good at understanding real world physics. When you start a video with a still image, it understands shadows, wind physics, body physics extremely well.

It can also figure out how to make image-to-image videos (first frame/last frame provided) flow.

The best example is a photo shoot, because it's easy to imagine. You have a subject doing various poses. If you take one picture of a pose and another of a different pose, even an extremely different one (standing/sitting/holding a prop, etc.), WAN2.2 will smoothly transition the video between the two images to make it seamless.

As an example, say a woman's hair is down in one image but it's in a ponytail in the next. WAN2.2 figures all this out and has the woman pull her hair up into a ponytail, and it all looks natural.

A more extreme example: say her hair is dry in one picture and wet in the next. WAN2.2 will figure out a way for that hair to get wet, maybe a random person pours a bucket of water on her head, etc. (unless you actually prompt how you want it to get wet, WAN2.2 will make something up).

Keeping with the photo shoot (for smut's sake, imagine it's a Playboy centerfold pictorial): say there are 10 photos. You can set first/last frames continuously between those 10 photos with, say, 6-second video steps. Run your workflow and you end up with a 60-second, seamless video of that photo shoot with the model smoothly moving around and posing in all 10 positions.

If the first/last frames are too extreme (say, on a beach in the first image and on a snow-covered mountaintop in the second), WAN2.2 will still make a smooth fade transition between the two, but of course it can't accomplish that kind of change in 6 seconds.
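
The chaining logic behind that photo-shoot trick is trivial to script around whatever WAN2.2 workflow you run. A hypothetical sketch, where generate_flf_clip() is a stand-in for your actual first/last-frame workflow (ComfyUI API call, diffusers pipeline, whatever) and is assumed to return a list of frames:

```python
# Hypothetical sketch: stitch a 10-photo shoot into one continuous video by generating
# a first/last-frame clip between each consecutive pair of stills.
# generate_flf_clip() is NOT a real API here; it stands in for your WAN2.2 workflow.

def stitch_photo_shoot(photos, seconds_per_step=6):
    frames = []
    for first, last in zip(photos, photos[1:]):
        clip = generate_flf_clip(
            first_frame=first,
            last_frame=last,
            prompt="the model moves naturally into the next pose",
            seconds=seconds_per_step,
        )
        # Drop the duplicated boundary frame on every clip after the first.
        frames.extend(clip if not frames else clip[1:])
    return frames

video = stitch_photo_shoot([f"pose_{i:02d}.png" for i in range(1, 11)])
```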

That was a first/last frame example. You can also create an endless loop where you make multiple steps with different prompts.

Example:

Prompt 1: A woman wearing a Viking cosplay outfit waves at the viewer

Prompt 2: The woman walks over to a bar and picks up a drink

Prompt 3: The woman rips off her clothes and a donkey mounts her from behind (lora required)

Prompt 4: Zoom in on donkey's face smoking a cigarette

So, in the above scenario, you don't need to say "wearing a Viking cosplay outfit" in the second prompt because the last frame of the first video is used to start the second prompt. In the third prompt, WAN2.2 doesn't understand the concept of a donkey fucking a woman, so you would add a LoRA to that step so WAN2.2 can accomplish the task.

The above scenario would be 24 seconds long. Say you wanted to make the donkey scene longer: you would insert a prompt, "a donkey is having sex with a woman," after prompt 3, and that scene would play out for 12 seconds. You wouldn't repeat prompt 3 because her clothes are already ripped off and she's already mounted.

Obviously the above is hypothetical since there's no giraffe involved but you get the point.
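
Mechanically, this is the same trick as the photo-shoot sketch above, except each step starts from the last frame of the previous clip instead of interpolating toward a known end frame. Another hypothetical sketch, with generate_i2v_clip() standing in for your real WAN2.2 image-to-video workflow and LoRA loading left out:

```python
# Hypothetical sketch: chain prompts so each 6-second step starts from the last frame
# of the previous clip. generate_i2v_clip() stands in for your actual i2v workflow.
prompts = [
    "A woman wearing a Viking cosplay outfit waves at the viewer",
    "The woman walks over to a bar and picks up a drink",
    # ...remaining steps, applying any required LoRA only on the step that needs it
]

start_frame = "start_image.png"
video = []
for prompt in prompts:
    clip = generate_i2v_clip(first_frame=start_frame, prompt=prompt, seconds=6)
    video.extend(clip)
    start_frame = clip[-1]  # the last frame seeds the next step, so context carries over
```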

On a more fun side of things, you can take frames from movies and change the scene.



These aren't the best examples (the Star Wars one is pretty funny), but they give you an idea of how good WAN2.2 looks.

Since WAN2.2 is two models (high noise/low noise), it takes a little longer to make good videos. That 24-second video would take me about 8 minutes with 24GB of VRAM, and way longer with less. Do not buy anything at this point with less than 24GB of VRAM: guaranteed regret. At the very worst, 24GB of VRAM will eventually be the low point that solutions are made for. The 5090 with 32GB is iffy; I'm still on the fence. I will probably wait for a 48GB VRAM option, but if the 5090 drops to within $200 or so of retail I might buy it.



Another amateur example of animating still images. From the comments: 16GB VRAM, Q5 GGUF model, each 5-second clip takes around 9-10 minutes.

It's not a continuous-loop example like I described above, so there are cuts. Plus it's slow motion, which is kind of annoying.

On a 4090 that would take about 2 minutes per 5-second clip. WAN2.2 will keep the style (realistic or whatever art style you throw at it) without needing a LoRA.



Example of text to video with no initial image.

WAN2.2 is impressive. We are getting closer. You can now start to mess around with scenes of movies etc.

Re: Stable Diffusion and AI stuff

Post by Aslanna »

Note: I didn't directly use the first image. I could have kept her pose and clothes more accurate. Her side stance actually looks kinda strange to me. Sony probably didn't pay for a good artist!
You could have spent 5 seconds to check because now you look like a dumbo. It was done by Keith Parkinson. Most would consider him to be a "good artist."

Interesting trivia from that wikipedia article was that Brad McQuaid gave Keith's eulogy.
Have You Hugged An Iksar Today?

--

Re: Stable Diffusion and AI stuff

Post by Winnow »

Aslanna wrote: August 16, 2025, 8:58 pm
Note: I didn't directly use the first image. I could have kept her pose and clothes more accurate. Her side stance actually looks kinda strange to me. Sony probably didn't pay for a good artist!
You could have spent 5 seconds to check because now you look like a dumbo. It was done by Keith Parkinson. Most would consider him to be a "good artist."

Interesting trivia from that wikipedia article was that Brad McQuaid gave Keith's eulogy.
Not to besmirch the name of the artist, because everything art-related comes down to personal preference, but the EQ cover is not good art (to me). It's not horrible, but it's not good. The hands! If it were AI you'd be complaining about the hands, but because it's a "human," I guess they get a pass, along with many, many other bad-hand artists. (Refer to my post about comic book artists and hands, with plenty of examples.)

Re: Stable Diffusion and AI stuff

Post by Aslanna »

The comment you made, and what I was responding to, was about the artist ("Sony probably didn't pay for a good artist!") and not the art. I wasn't commenting on that specific piece of art since art appreciation, as you state, is subjective. You may like it.. You may hate it.. But that doesn't necessarily make the artist bad. It's possible to separate the two.
Have You Hugged An Iksar Today?

--

Re: Stable Diffusion and AI stuff

Post by Winnow »

I still liked the original loading screen for EQ. You can certainly pay for art and have the result be bad. I agree you can separate Sony being cheap from making bad decisions.

Sony (or whatever division owned EQ) wasn't good at advertising. They got jack-stomped by WoW when it came out, even though EQ was the better game. Well, it was before some of the later expansions. Part of that was whiny people not liking more challenging games. EQ probably should have adapted, though, and figured out why it was losing people to the easier game. I liked the raids but fell asleep multiple times due to their length.