Stable Diffusion Studio -- Comprehensive Image Generation
Overview
A complete skill for AI image generation using Stable Diffusion with the HuggingFace Diffusers library. Covers text-to-image, image-to-image, inpainting, ControlNet conditioning, LoRA adapters, SDXL, and SD 3.0 pipelines. Stable Diffusion operates in latent space using a three-component architecture -- a text encoder (CLIP/T5) for understanding prompts, a UNet or Transformer for iterative denoising, and a VAE for encoding/decoding between pixel and latent space. This skill provides production-ready patterns for generating, transforming, and controlling image outputs.
When to Use
- Generating images from text descriptions with full control over the process
- Performing image-to-image style transfer, enhancement, or transformation
- Inpainting masked regions with context-aware content generation
- Using ControlNet for structure-preserving generation (edges, poses, depth maps)
- Applying LoRA adapters for domain-specific styles or subjects
- Building custom image generation pipelines or creative tools
- Running local, self-hosted image generation without API dependencies
Quick Start
```bash
# Install core dependencies
pip install diffusers transformers accelerate torch

# Optional: memory-efficient attention
pip install xformers
```
```python
from diffusers import DiffusionPipeline
import torch

# Load Stable Diffusion pipeline
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# Generate an image
image = pipe(
    "A serene mountain landscape at golden hour, highly detailed, 8k",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("landscape.png")
```
Core Concepts
Pipeline Architecture
```
Text Prompt ──► Text Encoder (CLIP/T5) ──► Text Embeddings
                                                │
                                                ▼
Random Gaussian Noise ──► Denoising Loop ◄── Scheduler (step control)
                                │
                                ▼
                     Denoised Latent Tensor
                                │
                                ▼
                 VAE Decoder ──► Final Image (pixel space)
```
The scheduler controls how noise is progressively removed over N steps. Different schedulers trade off speed, quality, and determinism.
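The loop structure is simple to sketch. Below is a toy numpy version (an illustrative assumption, not the real algorithm): the real pipeline replaces the exact "noise prediction" with a UNet forward pass, and the per-step update with a scheduler-specific rule, but the shape of the iteration is the same.

```python
import numpy as np

def toy_denoise(noisy_latent, clean_latent, num_steps):
    """Toy denoising loop. Each step 'predicts' the remaining noise
    (exactly, since we know the target here) and removes a
    scheduler-chosen fraction of it. The real loop replaces the
    prediction with a learned UNet forward pass."""
    x = noisy_latent.astype(np.float64)
    for step in range(num_steps):
        predicted_noise = x - clean_latent             # stand-in for the UNet
        x = x - predicted_noise / (num_steps - step)   # scheduler step size
    return x

rng = np.random.default_rng(0)
noise = rng.normal(size=(4, 64, 64))   # latent-shaped Gaussian noise
target = np.zeros((4, 64, 64))         # pretend "clean" latent
result = toy_denoise(noise, target, num_steps=20)
```

The per-step fraction here grows as steps run out, which is why a scheduler with a better step-size rule can reach comparable quality in fewer iterations.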
Available Pipelines
| Pipeline Class | Model | Resolution | Use Case |
|---|---|---|---|
| `StableDiffusionPipeline` | SD 1.5 | 512x512 | Fast, broad ecosystem |
| `StableDiffusionXLPipeline` | SDXL | 1024x1024 | Higher quality, detail |
| `StableDiffusion3Pipeline` | SD 3.0 | 1024x1024 | Latest architecture |
| `FluxPipeline` | Flux | 512-1024 | Flow matching models |
| `StableDiffusionImg2ImgPipeline` | SD 1.5 | Variable | Transform images |
| `StableDiffusionInpaintPipeline` | SD 1.5 | Variable | Fill masked regions |
| `StableDiffusionControlNetPipeline` | SD 1.5 | 512x512 | Structure control |
SDXL -- Higher Quality Generation
```python
from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
)
# Reduce VRAM usage; this manages device placement itself,
# so do not also call pipe.to("cuda")
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="Cyberpunk cityscape at night, neon lights reflecting on wet streets",
    negative_prompt="blurry, low quality, distorted, ugly",
    height=1024,
    width=1024,
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
```
Schedulers -- Speed vs Quality
Schedulers control the denoising algorithm. Swapping schedulers requires zero retraining:
```python
from diffusers import DPMSolverMultistepScheduler

# Replace the default scheduler with a faster one
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Now generate with fewer steps for similar quality
image = pipe("A portrait of a wizard", num_inference_steps=20).images[0]
```
| Scheduler | Typical Steps | Quality | Speed | Notes |
|---|---|---|---|---|
| `EulerDiscreteScheduler` | 20-50 | Good | Medium | Solid default |
| `EulerAncestralDiscreteScheduler` | 20-50 | Good | Medium | More variation |
| `DPMSolverMultistepScheduler` | 15-25 | Excellent | Fast | Recommended |
| `DDIMScheduler` | 50-100 | Good | Slow | Deterministic |
| `LCMScheduler` | 4-8 | Good | Very Fast | Latent consistency |
| `UniPCMultistepScheduler` | 15-25 | Excellent | Fast | Predictor-corrector |
Generation Parameters
| Parameter | Default | Range | Description |
|---|---|---|---|
| `prompt` | Required | -- | Text description of the desired image |
| `negative_prompt` | None | -- | What to avoid (artifacts, styles) |
| `num_inference_steps` | 50 | 4-100 | Denoising iterations; more = better but slower |
| `guidance_scale` | 7.5 | 1-20 | Prompt adherence; 7-12 is typical |
| `height` | 512/1024 | Multiple of 8 | Output height in pixels |
| `width` | 512/1024 | Multiple of 8 | Output width in pixels |
| `num_images_per_prompt` | 1 | 1-8 | Batch size per generation |
| `generator` | None | -- | Torch generator for reproducibility |
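What `guidance_scale` actually does: at each denoising step the model runs twice, once with the prompt and once unconditioned, and the two noise predictions are blended with the classifier-free guidance formula. A minimal numpy sketch of that blend:

```python
import numpy as np

def cfg_combine(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: push the prediction away from the
    unconditional output and toward the prompt-conditioned one.
    guidance_scale=1.0 returns the conditional prediction unchanged."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

uncond = np.array([0.2, -0.1])
cond = np.array([0.6, 0.3])
guided = cfg_combine(uncond, cond, guidance_scale=7.5)
```

Values below 1 drift toward the unconditional prediction, which is one reason very low `guidance_scale` yields generic images, while very high values over-amplify the difference and cause the saturation artifacts noted in Troubleshooting.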
Reproducible Generation
```python
import torch

generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    prompt="A cat wearing a top hat, oil painting style",
    generator=generator,
    num_inference_steps=50,
).images[0]
# Same seed + same parameters = identical output
```
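When sweeping many prompts it helps to derive each seed deterministically rather than hard-coding them. A small stdlib-only helper (a hypothetical pattern, not a diffusers API) that hashes a base seed and prompt into a stable 32-bit seed for `manual_seed`:

```python
import hashlib

def derive_seed(base_seed, prompt):
    """Derive a stable per-prompt seed, so reruns of the same
    (base_seed, prompt) pair reproduce the exact same image."""
    digest = hashlib.sha256(f"{base_seed}:{prompt}".encode()).digest()
    return int.from_bytes(digest[:4], "big")  # 32-bit, valid for manual_seed

seed = derive_seed(42, "A cat wearing a top hat, oil painting style")
# Then: generator = torch.Generator(device="cuda").manual_seed(seed)
```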
Image-to-Image Transformation
```python
from diffusers import AutoPipelineForImage2Image
from PIL import Image
import torch

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("photo.jpg").resize((512, 512))

image = pipe(
    prompt="A watercolor painting of the scene, artistic, vibrant",
    image=init_image,
    strength=0.75,  # 0.0 = no change, 1.0 = complete regeneration
    num_inference_steps=50,
).images[0]
```
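The `strength` parameter works by skipping the early part of the denoising schedule: the input image is noised to the level implied by `strength`, and only the remaining steps run. A sketch of that arithmetic (mirroring the library's behavior as I understand it, not copied from its source):

```python
def effective_steps(num_inference_steps, strength):
    """Denoising steps actually executed in image-to-image.
    strength=1.0 regenerates fully; strength=0.0 runs no steps."""
    return min(int(num_inference_steps * strength), num_inference_steps)

print(effective_steps(50, 0.75))  # → 37
```

So at `strength=0.75` with 50 requested steps, only about 37 denoising steps run, which is why lower strengths both preserve more of the input and finish faster.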
Inpainting
```python
from diffusers import AutoPipelineForInpainting
from PIL import Image
import torch

pipe = AutoPipelineForInpainting.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("room.jpg")
mask = Image.open("mask.png")  # White pixels = regions to repaint

result = pipe(
    prompt="A modern leather sofa, interior design photography",
    image=image,
    mask_image=mask,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
```
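Inpainting masks should be strictly binary; gray or feathered pixels let content bleed past the boundary (see Troubleshooting). A small numpy helper for cleaning a mask before passing it in -- the 128 threshold is a reasonable assumption, not a diffusers requirement:

```python
import numpy as np

def binarize_mask(mask_array, threshold=128):
    """Force a grayscale mask to pure black/white.
    Pixels >= threshold become 255 (repaint); the rest become 0 (keep)."""
    return np.where(mask_array >= threshold, 255, 0).astype(np.uint8)

gray = np.array([[0, 100, 127], [128, 200, 255]], dtype=np.uint8)
clean = binarize_mask(gray)
# Convert back to a PIL mask with: Image.fromarray(clean, mode="L")
```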
ControlNet -- Structure-Preserving Generation
```python
import cv2
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Load ControlNet for Canny edge conditioning
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16,
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Prepare control image (Canny edges)
input_img = cv2.imread("building.jpg")
edges = cv2.Canny(input_img, 100, 200)
control_image = Image.fromarray(edges)

image = pipe(
    prompt="A futuristic glass building, photorealistic, 4k",
    image=control_image,
    num_inference_steps=30,
).images[0]
```
ControlNet Models
| Model | Conditioning | Use Case |
|---|---|---|
| `control_v11p_sd15_canny` | Edge maps | Preserve structural outlines |
| `control_v11p_sd15_openpose` | Pose skeletons | Maintain human poses |
| `control_v11f1p_sd15_depth` | Depth maps | 3D-aware generation |
| `control_v11p_sd15_normalbae` | Normal maps | Surface detail control |
| `control_v11p_sd15_mlsd` | Line segments | Architectural geometry |
| `control_v11p_sd15_scribble` | Rough sketches | Sketch-to-image conversion |
LoRA Adapters
```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Load LoRA weights (small adapter files, typically 2-50 MB)
pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")
image = pipe("A portrait in the trained style").images[0]

# Control LoRA influence strength
pipe.fuse_lora(lora_scale=0.8)  # 0.0 = no effect, 1.0 = full effect

# Unload LoRA when no longer needed
pipe.unfuse_lora()
pipe.unload_lora_weights()
```
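Under the hood a LoRA adapter stores two small matrices whose product is a low-rank update to a frozen weight, and `lora_scale` linearly scales that update. A numpy sketch of the fused-weight arithmetic (shapes chosen for illustration):

```python
import numpy as np

def fuse_lora_weight(W, A, B, lora_scale=1.0):
    """Fused weight = W + scale * (B @ A).
    W: (out, in) frozen base weight.
    A: (rank, in) and B: (out, rank) -- the small LoRA factors."""
    return W + lora_scale * (B @ A)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
A = rng.normal(size=(2, 8))   # rank-2 adapter
B = rng.normal(size=(8, 2))
W_fused = fuse_lora_weight(W, A, B, lora_scale=0.8)
```

The adapter only needs (out + in) x rank parameters instead of out x in, which is why LoRA files are a few MB, and a scale of 0.0 leaves the base weight untouched.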
Memory Optimization
```python
# Option 1: Model-level CPU offloading (low VRAM, ~3 GB)
pipe.enable_model_cpu_offload()

# Option 2: Sequential CPU offloading (even lower VRAM but much slower)
pipe.enable_sequential_cpu_offload()

# Option 3: Attention slicing (reduces peak memory)
pipe.enable_attention_slicing()

# Option 4: VAE slicing (decodes batch images one at a time)
pipe.enable_vae_slicing()

# Option 5: xFormers memory-efficient attention
pipe.enable_xformers_memory_efficient_attention()

# Option 6: VAE tiling for very large images
pipe.enable_vae_tiling()

# Combine for maximum savings
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
```
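Which options to enable depends on the VRAM budget. A hypothetical helper that maps a budget to method names to call on the pipeline -- the thresholds are rough rules of thumb, not measured values:

```python
def pick_memory_optimizations(vram_gb):
    """Suggest pipeline optimizations for a given VRAM budget (heuristic)."""
    opts = []
    if vram_gb < 16:
        opts.append("enable_attention_slicing")
    if vram_gb < 10:
        opts.append("enable_model_cpu_offload")
        opts.append("enable_vae_slicing")
    if vram_gb < 6:
        # Sequential offload is slower but replaces model-level offload
        opts.remove("enable_model_cpu_offload")
        opts.append("enable_sequential_cpu_offload")
    return opts

for name in pick_memory_optimizations(8):
    print(name)
# Apply with: getattr(pipe, name)()
```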
Best Practices
- Start with `DPMSolverMultistepScheduler` -- It delivers excellent quality in 15-25 steps, often 2-3x faster than the default scheduler with comparable output.
- Use negative prompts consistently -- Always include quality-oriented negatives like `"blurry, low quality, distorted, bad anatomy, ugly"` to steer outputs away from common artifacts.
- Match resolution to model training -- SD 1.5 was trained at 512x512, SDXL at 1024x1024. Generating at other resolutions can produce artifacts or duplicated subjects.
- Guidance scale sweet spot is 7-10 -- Below 5 produces generic results; above 12 causes over-saturation and artifacts. The 7-10 range balances creativity with prompt adherence.
- Seed-lock for iteration -- Fix the random seed when iterating on prompts so that visual changes are only due to prompt edits, not random noise differences.
- Enable `enable_model_cpu_offload` on consumer GPUs -- This keeps only the active pipeline stage on GPU, reducing VRAM from 10+ GB to ~3 GB with minimal speed penalty.
- Batch generation with `num_images_per_prompt` -- Generating 4 images per call is more efficient than 4 separate calls because the text encoder runs only once.
- Use ControlNet for structural consistency -- When you need the output to follow a specific layout, pose, or depth map, ControlNet provides far more control than prompt engineering alone.
- Validate LoRA compatibility -- LoRA adapters trained on SD 1.5 do not work with SDXL and vice versa. Always check the base model version before loading adapters.
- Pre-compute and cache image embeddings -- For image-to-image pipelines processing the same source image with different prompts, encode the image once and reuse it.
Troubleshooting
Generated images have duplicated subjects or limbs: This typically occurs when generating at a resolution the model was not trained for. Use 512x512 for SD 1.5 or 1024x1024 for SDXL.
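A quick helper to catch this before generating -- it snaps a requested size to the nearest multiple of 8 and flags sizes far from the model's native resolution (the 1.5x cutoff is a hypothetical heuristic, not a documented limit):

```python
def check_resolution(width, height, native=512):
    """Snap dimensions to multiples of 8 and flag requests that stray
    far enough from the native resolution to risk duplicated subjects."""
    snapped = (round(width / 8) * 8, round(height / 8) * 8)
    longest = max(snapped)
    risky = longest > native * 1.5 or longest < native / 1.5
    return snapped, risky

size, risky = check_resolution(513, 768)          # SD 1.5 default native
size_xl, risky_xl = check_resolution(1024, 1024, native=1024)
```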
CUDA out of memory errors: Enable `pipe.enable_model_cpu_offload()` and `pipe.enable_attention_slicing()`. If it still fails, switch to a smaller model variant or reduce the batch size to 1.
Images look over-saturated or burned: Lower `guidance_scale` from 7.5 to 5.0-6.0. High guidance values push the model too hard toward the prompt, causing color clipping.
LoRA weights produce no visible effect: Verify the LoRA was trained for the same base model. Check that `weight_name` matches the actual filename. Try increasing `lora_scale` to 1.0 to confirm the weights loaded.
Inpainting bleeds outside the mask boundary: Ensure the mask is a clean binary image (pure white for inpaint regions, pure black for preserve). Feathered or gray mask edges cause bleeding artifacts.