Stable Diffusion Studio -- Comprehensive Image Generation
Overview
A complete skill for AI image generation using Stable Diffusion with the HuggingFace Diffusers library. Covers text-to-image, image-to-image, inpainting, ControlNet conditioning, LoRA adapters, SDXL, and SD 3.0 pipelines. Stable Diffusion operates in latent space using a three-component architecture -- a text encoder (CLIP/T5) for understanding prompts, a UNet or Transformer for iterative denoising, and a VAE for encoding/decoding between pixel and latent space. This skill provides production-ready patterns for generating, transforming, and controlling image outputs.
When to Use
- Generating images from text descriptions with full control over the process
- Performing image-to-image style transfer, enhancement, or transformation
- Inpainting masked regions with context-aware content generation
- Using ControlNet for structure-preserving generation (edges, poses, depth maps)
- Applying LoRA adapters for domain-specific styles or subjects
- Building custom image generation pipelines or creative tools
- Running local, self-hosted image generation without API dependencies
Quick Start
```bash
# Install core dependencies
pip install diffusers transformers accelerate torch

# Optional: memory-efficient attention
pip install xformers
```
```python
from diffusers import DiffusionPipeline
import torch

# Load Stable Diffusion pipeline
pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

# Generate an image
image = pipe(
    "A serene mountain landscape at golden hour, highly detailed, 8k",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
image.save("landscape.png")
```
Core Concepts
Pipeline Architecture
```
Text Prompt ──► Text Encoder (CLIP/T5) ──► Text Embeddings
                                                │
                                                ▼
Random Gaussian Noise ──► Denoising Loop ◄── Scheduler (step control)
                                │
                                ▼
                     Denoised Latent Tensor
                                │
                                ▼
                 VAE Decoder ──► Final Image (pixel space)
```
The scheduler controls how noise is progressively removed over N steps. Different schedulers trade off speed, quality, and determinism.
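The loop structure is simple to sketch. Below is a toy numpy version (an illustrative assumption, not the real algorithm): the real pipeline replaces the exact "noise prediction" with a UNet forward pass, and the per-step update with a scheduler-specific rule, but the shape of the iteration is the same.

```python
import numpy as np

def toy_denoise(noisy_latent, clean_latent, num_steps):
    """Toy denoising loop. Each step 'predicts' the remaining noise
    (exactly, since we know the target here) and removes a
    scheduler-chosen fraction of it. The real loop replaces the
    prediction with a learned UNet forward pass."""
    x = noisy_latent.astype(np.float64)
    for step in range(num_steps):
        predicted_noise = x - clean_latent             # stand-in for the UNet
        x = x - predicted_noise / (num_steps - step)   # scheduler step size
    return x

rng = np.random.default_rng(0)
noise = rng.normal(size=(4, 64, 64))   # latent-shaped Gaussian noise
target = np.zeros((4, 64, 64))         # pretend "clean" latent
result = toy_denoise(noise, target, num_steps=20)
```

The per-step fraction here grows as steps run out, which is why a scheduler with a better step-size rule can reach comparable quality in fewer iterations.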
Available Pipelines
| Pipeline Class | Model | Resolution | Use Case |
|---|---|---|---|
| `StableDiffusionPipeline` | SD 1.5 | 512x512 | Fast, broad ecosystem |
| `StableDiffusionXLPipeline` | SDXL | 1024x1024 | Higher quality, detail |
| `StableDiffusion3Pipeline` | SD 3.0 | 1024x1024 | Latest architecture |
| `FluxPipeline` | Flux | 512-1024 | Flow matching models |
| `StableDiffusionImg2ImgPipeline` | SD 1.5 | Variable | Transform images |
| `StableDiffusionInpaintPipeline` | SD 1.5 | Variable | Fill masked regions |
| `StableDiffusionControlNetPipeline` | SD 1.5 | 512x512 | Structure control |
SDXL -- Higher Quality Generation
```python
from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
)
# Reduce VRAM usage; this manages device placement itself,
# so do not also call pipe.to("cuda")
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="Cyberpunk cityscape at night, neon lights reflecting on wet streets",
    negative_prompt="blurry, low quality, distorted, ugly",
    height=1024,
    width=1024,
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
```
Schedulers -- Speed vs Quality
Schedulers control the denoising algorithm. Swapping schedulers requires zero retraining:
```python
from diffusers import DPMSolverMultistepScheduler

# Replace the default scheduler with a faster one
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Now generate with fewer steps for similar quality
image = pipe("A portrait of a wizard", num_inference_steps=20).images[0]
```
| Scheduler | Typical Steps | Quality | Speed | Notes |
|---|---|---|---|---|
| `EulerDiscreteScheduler` | 20-50 | Good | Medium | Solid default |
| `EulerAncestralDiscreteScheduler` | 20-50 | Good | Medium | More variation |
| `DPMSolverMultistepScheduler` | 15-25 | Excellent | Fast | Recommended |
| `DDIMScheduler` | 50-100 | Good | Slow | Deterministic |
| `LCMScheduler` | 4-8 | Good | Very Fast | Latent consistency |
| `UniPCMultistepScheduler` | 15-25 | Excellent | Fast | Predictor-corrector |
Generation Parameters
| Parameter | Default | Range | Description |
|---|---|---|---|
| `prompt` | Required | -- | Text description of the desired image |
| `negative_prompt` | None | -- | What to avoid (artifacts, styles) |
| `num_inference_steps` | 50 | 4-100 | Denoising iterations; more = better but slower |
| `guidance_scale` | 7.5 | 1-20 | Prompt adherence; 7-12 is typical |
| `height` | 512/1024 | Multiple of 8 | Output height in pixels |
| `width` | 512/1024 | Multiple of 8 | Output width in pixels |
| `num_images_per_prompt` | 1 | 1-8 | Batch size per generation |
| `generator` | None | -- | Torch generator for reproducibility |
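What `guidance_scale` actually does: at each denoising step the model runs twice, once with the prompt and once unconditioned, and the two noise predictions are blended with the classifier-free guidance formula. A minimal numpy sketch of that blend:

```python
import numpy as np

def cfg_combine(noise_uncond, noise_cond, guidance_scale):
    """Classifier-free guidance: push the prediction away from the
    unconditional output and toward the prompt-conditioned one.
    guidance_scale=1.0 returns the conditional prediction unchanged."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

uncond = np.array([0.2, -0.1])
cond = np.array([0.6, 0.3])
guided = cfg_combine(uncond, cond, guidance_scale=7.5)
```

Values below 1 drift toward the unconditional prediction, which is one reason very low `guidance_scale` yields generic images, while very high values over-amplify the difference and cause the saturation artifacts noted in Troubleshooting.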
Reproducible Generation
```python
import torch

generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    prompt="A cat wearing a top hat, oil painting style",
    generator=generator,
    num_inference_steps=50,
).images[0]
# Same seed + same parameters = identical output
```
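When sweeping many prompts it helps to derive each seed deterministically rather than hard-coding them. A small stdlib-only helper (a hypothetical pattern, not a diffusers API) that hashes a base seed and prompt into a stable 32-bit seed for `manual_seed`:

```python
import hashlib

def derive_seed(base_seed, prompt):
    """Derive a stable per-prompt seed, so reruns of the same
    (base_seed, prompt) pair reproduce the exact same image."""
    digest = hashlib.sha256(f"{base_seed}:{prompt}".encode()).digest()
    return int.from_bytes(digest[:4], "big")  # 32-bit, valid for manual_seed

seed = derive_seed(42, "A cat wearing a top hat, oil painting style")
# Then: generator = torch.Generator(device="cuda").manual_seed(seed)
```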
Image-to-Image Transformation
```python
from diffusers import AutoPipelineForImage2Image
from PIL import Image
import torch

pipe = AutoPipelineForImage2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("photo.jpg").resize((512, 512))

image = pipe(
    prompt="A watercolor painting of the scene, artistic, vibrant",
    image=init_image,
    strength=0.75,  # 0.0 = no change, 1.0 = complete regeneration
    num_inference_steps=50,
).images[0]
```
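The `strength` parameter works by skipping the early part of the denoising schedule: the input image is noised to the level implied by `strength`, and only the remaining steps run. A sketch of that arithmetic (mirroring the library's behavior as I understand it, not copied from its source):

```python
def effective_steps(num_inference_steps, strength):
    """Denoising steps actually executed in image-to-image.
    strength=1.0 regenerates fully; strength=0.0 runs no steps."""
    return min(int(num_inference_steps * strength), num_inference_steps)

print(effective_steps(50, 0.75))  # → 37
```

So at `strength=0.75` with 50 requested steps, only about 37 denoising steps run, which is why lower strengths both preserve more of the input and finish faster.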
Inpainting
```python
from diffusers import AutoPipelineForInpainting
from PIL import Image
import torch

pipe = AutoPipelineForInpainting.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("room.jpg")
mask = Image.open("mask.png")  # White pixels = regions to repaint

result = pipe(
    prompt="A modern leather sofa, interior design photography",
    image=image,
    mask_image=mask,
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]
```
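Inpainting masks should be strictly binary; gray or feathered pixels let content bleed past the boundary (see Troubleshooting). A small numpy helper for cleaning a mask before passing it in -- the 128 threshold is a reasonable assumption, not a diffusers requirement:

```python
import numpy as np

def binarize_mask(mask_array, threshold=128):
    """Force a grayscale mask to pure black/white.
    Pixels >= threshold become 255 (repaint); the rest become 0 (keep)."""
    return np.where(mask_array >= threshold, 255, 0).astype(np.uint8)

gray = np.array([[0, 100, 127], [128, 200, 255]], dtype=np.uint8)
clean = binarize_mask(gray)
# Convert back to a PIL mask with: Image.fromarray(clean, mode="L")
```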
ControlNet -- Structure-Preserving Generation
```python
import cv2
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

# Load ControlNet for Canny edge conditioning
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny",
    torch_dtype=torch.float16,
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Prepare control image (Canny edges)
input_img = cv2.imread("building.jpg")
edges = cv2.Canny(input_img, 100, 200)
control_image = Image.fromarray(edges)

image = pipe(
    prompt="A futuristic glass building, photorealistic, 4k",
    image=control_image,
    num_inference_steps=30,
).images[0]
```
ControlNet Models
| Model | Conditioning | Use Case |
|---|---|---|
| `control_v11p_sd15_canny` | Edge maps | Preserve structural outlines |
| `control_v11p_sd15_openpose` | Pose skeletons | Maintain human poses |
| `control_v11f1p_sd15_depth` | Depth maps | 3D-aware generation |
| `control_v11p_sd15_normalbae` | Normal maps | Surface detail control |
| `control_v11p_sd15_mlsd` | Line segments | Architectural geometry |
| `control_v11p_sd15_scribble` | Rough sketches | Sketch-to-image conversion |
LoRA Adapters
```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Load LoRA weights (small adapter files, typically 2-50 MB)
pipe.load_lora_weights("path/to/lora", weight_name="style.safetensors")
image = pipe("A portrait in the trained style").images[0]

# Control LoRA influence strength
pipe.fuse_lora(lora_scale=0.8)  # 0.0 = no effect, 1.0 = full effect

# Unload LoRA when no longer needed
pipe.unfuse_lora()
pipe.unload_lora_weights()
```
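Under the hood a LoRA adapter stores two small matrices whose product is a low-rank update to a frozen weight, and `lora_scale` linearly scales that update. A numpy sketch of the fused-weight arithmetic (shapes chosen for illustration):

```python
import numpy as np

def fuse_lora_weight(W, A, B, lora_scale=1.0):
    """Fused weight = W + scale * (B @ A).
    W: (out, in) frozen base weight.
    A: (rank, in) and B: (out, rank) -- the small LoRA factors."""
    return W + lora_scale * (B @ A)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
A = rng.normal(size=(2, 8))   # rank-2 adapter
B = rng.normal(size=(8, 2))
W_fused = fuse_lora_weight(W, A, B, lora_scale=0.8)
```

The adapter only needs (out + in) x rank parameters instead of out x in, which is why LoRA files are a few MB, and a scale of 0.0 leaves the base weight untouched.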
Memory Optimization
```python
# Option 1: Model-level CPU offloading (low VRAM, ~3 GB)
pipe.enable_model_cpu_offload()

# Option 2: Sequential CPU offloading (even lower VRAM but much slower)
pipe.enable_sequential_cpu_offload()

# Option 3: Attention slicing (reduces peak memory)
pipe.enable_attention_slicing()

# Option 4: VAE slicing (decodes batch images one at a time)
pipe.enable_vae_slicing()

# Option 5: xFormers memory-efficient attention
pipe.enable_xformers_memory_efficient_attention()

# Option 6: VAE tiling for very large images
pipe.enable_vae_tiling()

# Combine for maximum savings
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
```
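Which options to enable depends on the VRAM budget. A hypothetical helper that maps a budget to method names to call on the pipeline -- the thresholds are rough rules of thumb, not measured values:

```python
def pick_memory_optimizations(vram_gb):
    """Suggest pipeline optimizations for a given VRAM budget (heuristic)."""
    opts = []
    if vram_gb < 16:
        opts.append("enable_attention_slicing")
    if vram_gb < 10:
        opts.append("enable_model_cpu_offload")
        opts.append("enable_vae_slicing")
    if vram_gb < 6:
        # Sequential offload is slower but replaces model-level offload
        opts.remove("enable_model_cpu_offload")
        opts.append("enable_sequential_cpu_offload")
    return opts

for name in pick_memory_optimizations(8):
    print(name)
# Apply with: getattr(pipe, name)()
```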
Best Practices
- Start with `DPMSolverMultistepScheduler` -- It delivers excellent quality in 15-25 steps, often 2-3x faster than the default scheduler with comparable output.
- Use negative prompts consistently -- Always include quality-oriented negatives like `"blurry, low quality, distorted, bad anatomy, ugly"` to steer outputs away from common artifacts.
- Match resolution to model training -- SD 1.5 was trained at 512x512, SDXL at 1024x1024. Generating at other resolutions can produce artifacts or duplicated subjects.
- Guidance scale sweet spot is 7-10 -- Below 5 produces generic results; above 12 causes over-saturation and artifacts. The 7-10 range balances creativity with prompt adherence.
- Seed-lock for iteration -- Fix the random seed when iterating on prompts so that visual changes are only due to prompt edits, not random noise differences.
- Enable `enable_model_cpu_offload` on consumer GPUs -- This keeps only the active pipeline stage on GPU, reducing VRAM from 10+ GB to ~3 GB with minimal speed penalty.
- Batch generation with `num_images_per_prompt` -- Generating 4 images per call is more efficient than 4 separate calls because the text encoder runs only once.
- Use ControlNet for structural consistency -- When you need the output to follow a specific layout, pose, or depth map, ControlNet provides far more control than prompt engineering alone.
- Validate LoRA compatibility -- LoRA adapters trained on SD 1.5 do not work with SDXL and vice versa. Always check the base model version before loading adapters.
- Pre-compute and cache image embeddings -- For image-to-image pipelines processing the same source image with different prompts, encode the image once and reuse it.
Troubleshooting
Generated images have duplicated subjects or limbs: This typically occurs when generating at a resolution the model was not trained for. Use 512x512 for SD 1.5 or 1024x1024 for SDXL.
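A quick helper to catch this before generating -- it snaps a requested size to the nearest multiple of 8 and flags sizes far from the model's native resolution (the 1.5x cutoff is a hypothetical heuristic, not a documented limit):

```python
def check_resolution(width, height, native=512):
    """Snap dimensions to multiples of 8 and flag requests that stray
    far enough from the native resolution to risk duplicated subjects."""
    snapped = (round(width / 8) * 8, round(height / 8) * 8)
    longest = max(snapped)
    risky = longest > native * 1.5 or longest < native / 1.5
    return snapped, risky

size, risky = check_resolution(513, 768)          # SD 1.5 default native
size_xl, risky_xl = check_resolution(1024, 1024, native=1024)
```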
CUDA out of memory errors: Enable `pipe.enable_model_cpu_offload()` and `pipe.enable_attention_slicing()`. If it still fails, switch to a smaller model variant or reduce the batch size to 1.
Images look over-saturated or burned: Lower `guidance_scale` from 7.5 to 5.0-6.0. High guidance values push the model too hard toward the prompt, causing color clipping.
LoRA weights produce no visible effect: Verify the LoRA was trained for the same base model. Check that `weight_name` matches the actual filename. Try increasing `lora_scale` to 1.0 to confirm the weights loaded.
Inpainting bleeds outside the mask boundary: Ensure the mask is a clean binary image (pure white for inpaint regions, pure black for preserve). Feathered or gray mask edges cause bleeding artifacts.