
Multimodal Stable Diffusion Studio

All-in-one skill covering state, text, and image generation. Includes structured workflows, validation checks, and reusable patterns for AI research.

Skill · Cliptics · ai research · v1.0.0 · MIT

Stable Diffusion Studio -- Comprehensive Image Generation

Overview

A complete skill for AI image generation using Stable Diffusion with the HuggingFace Diffusers library. Covers text-to-image, image-to-image, inpainting, ControlNet conditioning, LoRA adapters, SDXL, and SD 3.0 pipelines. Stable Diffusion operates in latent space using a three-component architecture -- a text encoder (CLIP/T5) for understanding prompts, a UNet or Transformer for iterative denoising, and a VAE for encoding/decoding between pixel and latent space. This skill provides production-ready patterns for generating, transforming, and controlling image outputs.
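The latent-space point above has practical consequences for memory and resolution planning: the SD 1.5/SDXL VAE downsamples each spatial dimension by a factor of 8, and the latent tensor has 4 channels, so the UNet never sees full-resolution pixels. A minimal sketch of that arithmetic (`latent_shape` is a hypothetical helper, not a Diffusers API):

```python
# Latent-space bookkeeping for SD 1.5 / SDXL: the VAE downsamples by 8x
# per spatial dimension and latents carry 4 channels, so a 512x512 RGB
# image becomes a 4x64x64 tensor inside the denoising loop.
def latent_shape(height: int, width: int, channels: int = 4, factor: int = 8):
    """Return the (C, H, W) latent shape for a given pixel resolution."""
    if height % factor or width % factor:
        raise ValueError("height and width must be multiples of 8")
    return (channels, height // factor, width // factor)

print(latent_shape(512, 512))    # (4, 64, 64)   -- SD 1.5 native
print(latent_shape(1024, 1024))  # (4, 128, 128) -- SDXL native
```

This is why generation parameters require dimensions divisible by 8, and why SDXL's 1024x1024 output costs roughly 4x the latent area of SD 1.5's 512x512.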

When to Use

  • Generating images from text descriptions with full control over the process
  • Performing image-to-image style transfer, enhancement, or transformation
  • Inpainting masked regions with context-aware content generation
  • Using ControlNet for structure-preserving generation (edges, poses, depth maps)
  • Applying LoRA adapters for domain-specific styles or subjects
  • Building custom image generation pipelines or creative tools
  • Running local, self-hosted image generation without API dependencies

Quick Start

```bash
# Install core dependencies
pip install diffusers transformers accelerate torch

# Optional: memory-efficient attention
pip install xformers
```
```python
from diffusers import DiffusionPipeline
import torch

# Load Stable Diffusion pipeline
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# Generate an image
image = pipe(
    "A serene mountain landscape at golden hour, highly detailed, 8k",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("landscape.png")
```

Core Concepts

Pipeline Architecture

```
Text Prompt ──► Text Encoder (CLIP/T5) ──► Text Embeddings
                                                  │
Random Gaussian Noise ──► Denoising Loop ◄── Scheduler (step control)
                              │
                     Denoised Latent Tensor
                              │
                     VAE Decoder ──► Final Image (pixel space)
```

The scheduler controls how noise is progressively removed over N steps. Different schedulers trade off speed, quality, and determinism.
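The step-count trade-off can be sketched in plain Python. This is a toy illustration, assuming DDIM-style uniform spacing over a 1000-step training schedule (`spaced_timesteps` is a hypothetical name; real schedulers use more sophisticated spacings and noise handling):

```python
# Toy illustration of scheduler timestep spacing: with fewer inference
# steps, each iteration jumps further through the ~1000-step schedule the
# model was trained on, trading quality for speed.
def spaced_timesteps(num_inference_steps: int, train_steps: int = 1000):
    """Uniformly-strided descending timesteps, DDIM-style."""
    stride = train_steps // num_inference_steps
    return list(range(train_steps - 1, -1, -stride))[:num_inference_steps]

print(spaced_timesteps(4))   # [999, 749, 499, 249]
print(spaced_timesteps(50)[:3])  # [999, 979, 959]
```

Ancestral schedulers (the "Ancestral" variants in the table below) additionally inject fresh noise at each step, which is why they produce more variation and are not deterministic.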

Available Pipelines

| Pipeline Class | Model | Resolution | Use Case |
| --- | --- | --- | --- |
| StableDiffusionPipeline | SD 1.5 | 512x512 | Fast, broad ecosystem |
| StableDiffusionXLPipeline | SDXL | 1024x1024 | Higher quality, detail |
| StableDiffusion3Pipeline | SD 3.0 | 1024x1024 | Latest architecture |
| FluxPipeline | Flux | 512-1024 | Flow matching models |
| StableDiffusionImg2ImgPipeline | SD 1.5 | Variable | Transform images |
| StableDiffusionInpaintPipeline | SD 1.5 | Variable | Fill masked regions |
| StableDiffusionControlNetPipeline | SD 1.5 | 512x512 | Structure control |

SDXL -- Higher Quality Generation

```python
from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
)
# Reduce VRAM usage; offloading manages device placement itself,
# so do not also call pipe.to("cuda")
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="Cyberpunk cityscape at night, neon lights reflecting on wet streets",
    negative_prompt="blurry, low quality, distorted, ugly",
    height=1024,
    width=1024,
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
```

Schedulers -- Speed vs Quality

Schedulers control the denoising algorithm. Swapping schedulers requires zero retraining:

```python
from diffusers import DPMSolverMultistepScheduler

# Replace the default scheduler with a faster one
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Now generate with fewer steps for similar quality
image = pipe("A portrait of a wizard", num_inference_steps=20).images[0]
```
| Scheduler | Typical Steps | Quality | Speed | Notes |
| --- | --- | --- | --- | --- |
| EulerDiscreteScheduler | 20-50 | Good | Medium | Solid default |
| EulerAncestralDiscreteScheduler | 20-50 | Good | Medium | More variation |
| DPMSolverMultistepScheduler | 15-25 | Excellent | Fast | Recommended |
| DDIMScheduler | 50-100 | Good | Slow | Deterministic |
| LCMScheduler | 4-8 | Good | Very Fast | Latent consistency |
| UniPCMultistepScheduler | 15-25 | Excellent | Fast | Predictor-corrector |

Generation Parameters

| Parameter | Default | Range | Description |
| --- | --- | --- | --- |
| prompt | Required | -- | Text description of the desired image |
| negative_prompt | None | -- | What to avoid (artifacts, styles) |
| num_inference_steps | 50 | 4-100 | Denoising iterations; more = better but slower |
| guidance_scale | 7.5 | 1-20 | Prompt adherence; 7-12 is typical |
| height | 512/1024 | Multiple of 8 | Output height in pixels |
| width | 512/1024 | Multiple of 8 | Output width in pixels |
| num_images_per_prompt | 1 | 1-8 | Batch size per generation |
| generator | None | -- | Torch generator for reproducibility |
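The guidance_scale parameter implements classifier-free guidance: at each denoising step, the pipeline runs the model with and without the text conditioning and extrapolates from the unconditional prediction toward the conditioned one. A scalar toy sketch of that formula (`cfg` is a hypothetical name; real pipelines apply it to latent noise tensors):

```python
# Classifier-free guidance, scalar toy version:
#   prediction = uncond + scale * (cond - uncond)
# scale = 1 just returns the conditioned prediction; larger scales
# extrapolate past it, which is why very high values over-saturate.
def cfg(uncond: float, cond: float, guidance_scale: float) -> float:
    return uncond + guidance_scale * (cond - uncond)

print(cfg(0.0, 1.0, 1.0))  # 1.0 -- no extrapolation
print(cfg(0.0, 1.0, 7.5))  # 7.5 -- pushed well beyond the conditioned value
```

This makes the over-saturation failure mode in Troubleshooting concrete: guidance_scale is a linear extrapolation factor, so pushing it past ~12 amplifies the conditioned signal far beyond what the model produced.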

Reproducible Generation

```python
import torch

generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    prompt="A cat wearing a top hat, oil painting style",
    generator=generator,
    num_inference_steps=50,
).images[0]
# Same seed + same parameters = identical output
```

Image-to-Image Transformation

```python
from diffusers import AutoPipelineForImage2Image
from PIL import Image
import torch

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("photo.jpg").resize((512, 512))

image = pipe(
    prompt="A watercolor painting of the scene, artistic, vibrant",
    image=init_image,
    strength=0.75,  # 0.0 = no change, 1.0 = complete regeneration
    num_inference_steps=50,
).images[0]
```

Inpainting

```python
from diffusers import AutoPipelineForInpainting
from PIL import Image
import torch

pipe = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("room.jpg")
mask = Image.open("mask.png")  # White pixels = regions to repaint

result = pipe(
    prompt="A modern leather sofa, interior design photography",
    image=image,
    mask_image=mask,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
```

ControlNet -- Structure-Preserving Generation

```python
import cv2
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Load ControlNet for Canny edge conditioning
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16,
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Prepare control image (Canny edges)
input_img = cv2.imread("building.jpg")
edges = cv2.Canny(input_img, 100, 200)
control_image = Image.fromarray(edges)

image = pipe(
    prompt="A futuristic glass building, photorealistic, 4k",
    image=control_image,
    num_inference_steps=30,
).images[0]
```

ControlNet Models

| Model | Conditioning | Use Case |
| --- | --- | --- |
| control_v11p_sd15_canny | Edge maps | Preserve structural outlines |
| control_v11p_sd15_openpose | Pose skeletons | Maintain human poses |
| control_v11f1p_sd15_depth | Depth maps | 3D-aware generation |
| control_v11p_sd15_normalbae | Normal maps | Surface detail control |
| control_v11p_sd15_mlsd | Line segments | Architectural geometry |
| control_v11p_sd15_scribble | Rough sketches | Sketch-to-image conversion |

LoRA Adapters

```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Load LoRA weights (small adapter files, typically 2-50 MB)
pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")
image = pipe("A portrait in the trained style").images[0]

# Control LoRA influence strength
pipe.fuse_lora(lora_scale=0.8)  # 0.0 = no effect, 1.0 = full effect

# Unload LoRA when no longer needed
pipe.unfuse_lora()
pipe.unload_lora_weights()
```

Memory Optimization

```python
# Option 1: CPU offloading (lowest VRAM, ~3 GB)
pipe.enable_model_cpu_offload()

# Option 2: Sequential CPU offloading (even lower VRAM but slower)
pipe.enable_sequential_cpu_offload()

# Option 3: Attention slicing (reduces peak memory)
pipe.enable_attention_slicing()

# Option 4: VAE slicing for batched decoding
pipe.enable_vae_slicing()

# Option 5: xFormers memory-efficient attention
pipe.enable_xformers_memory_efficient_attention()

# Option 6: VAE tiling for very large images
pipe.enable_vae_tiling()

# Combine for maximum savings
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
```

Best Practices

  1. Start with DPMSolverMultistepScheduler -- It delivers excellent quality in 15-25 steps, often 2-3x faster than the default scheduler with comparable output.
  2. Use negative prompts consistently -- Always include quality-oriented negatives like "blurry, low quality, distorted, bad anatomy, ugly" to steer outputs away from common artifacts.
  3. Match resolution to model training -- SD 1.5 was trained at 512x512, SDXL at 1024x1024. Generating at other resolutions can produce artifacts or duplicated subjects.
  4. Guidance scale sweet spot is 7-10 -- Below 5 produces generic results; above 12 causes over-saturation and artifacts. The 7-10 range balances creativity with prompt adherence.
  5. Seed-lock for iteration -- Fix the random seed when iterating on prompts so that visual changes are only due to prompt edits, not random noise differences.
  6. Enable model_cpu_offload on consumer GPUs -- This keeps only the active pipeline stage on GPU, reducing VRAM from 10+ GB to ~3 GB with minimal speed penalty.
  7. Batch generation with num_images_per_prompt -- Generating 4 images per call is more efficient than 4 separate calls because the text encoder runs only once.
  8. Use ControlNet for structural consistency -- When you need the output to follow a specific layout, pose, or depth map, ControlNet provides far more control than prompt engineering alone.
  9. Validate LoRA compatibility -- LoRA adapters trained on SD 1.5 do not work with SDXL and vice versa. Always check the base model version before loading adapters.
  10. Pre-compute and cache image embeddings -- For image-to-image pipelines processing the same source image with different prompts, encode the image once and reuse.
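Several of the practices above can be bundled into one reusable call builder. A hedged sketch (`generation_kwargs` is a hypothetical helper, not a Diffusers API) applying the recommended negative prompt, a DPM-friendly step count, guidance in the sweet spot, and batched output:

```python
# Hypothetical helper bundling the best practices above into the keyword
# arguments a Diffusers pipeline call accepts.
def generation_kwargs(prompt: str, steps: int = 20, guidance: float = 7.5) -> dict:
    return {
        "prompt": prompt,
        # Practice 2: consistent quality-oriented negatives
        "negative_prompt": "blurry, low quality, distorted, bad anatomy, ugly",
        # Practice 1: DPM-solver-friendly step count
        "num_inference_steps": steps,
        # Practice 4: guidance in the 7-10 sweet spot
        "guidance_scale": guidance,
        # Practice 7: batch so the text encoder runs only once
        "num_images_per_prompt": 4,
    }

# Usage, with a pipeline loaded as in Quick Start (practice 5: seed-lock):
#   generator = torch.Generator("cuda").manual_seed(42)
#   images = pipe(generator=generator, **generation_kwargs("A castle at dawn")).images
print(generation_kwargs("A castle at dawn")["num_inference_steps"])  # 20
```

Keeping the defaults in one place makes iteration runs comparable: only the prompt (and, when you choose, the seed) changes between calls.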

Troubleshooting

Generated images have duplicated subjects or limbs: This typically occurs when generating at a resolution the model was not trained for. Use 512x512 for SD 1.5 or 1024x1024 for SDXL.

CUDA out of memory errors: Enable pipe.enable_model_cpu_offload() and pipe.enable_attention_slicing(). If still failing, switch to a smaller model variant or reduce batch size to 1.

Images look over-saturated or burned: Lower the guidance_scale from 7.5 to 5.0-6.0. High guidance values push the model too hard toward the prompt, causing color clipping.

LoRA weights produce no visible effect: Verify the LoRA was trained for the same base model. Check that weight_name matches the actual filename. Try increasing lora_scale to 1.0 to confirm the weights loaded.

Inpainting bleeds outside the mask boundary: Ensure the mask is a clean binary image (pure white for inpaint regions, pure black for preserve). Feathered or gray mask edges cause bleeding artifacts.
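The mask fix above can be automated by snapping every grayscale value to pure black or white before passing the mask to the pipeline. A minimal sketch on raw pixel values (`binarize` is a hypothetical helper; the threshold of 128 is an assumption):

```python
# Feathered (gray) mask edges cause inpainting bleed; snap each 0-255
# grayscale value to pure black (preserve) or pure white (repaint).
def binarize(values, threshold=128):
    """Threshold a flat list of 0-255 grayscale values to {0, 255}."""
    return [255 if v >= threshold else 0 for v in values]

print(binarize([0, 60, 128, 200, 255]))  # [0, 0, 255, 255, 255]
```

With Pillow, the same thresholding can be applied directly to a mask image via `mask.convert("L").point(lambda v: 255 if v >= 128 else 0)` before handing it to `mask_image=`.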
