Stable Diffusion and Open Source Image Gen
Open source image generation changes the cost structure completely. No per-image fees, no content filters you didn't add yourself, no rate limits you didn't impose. The tradeoff is infrastructure. This lesson is about when that tradeoff is worth making.
Open source image generation is a different category from API providers. Not better or worse, just different constraints, different advantages, different use cases. When you self-host Stable Diffusion, you pay for GPU compute rather than per image, and at sufficient volume that flips the economics.
The operators who have internalized when to use open source and when to stay with API providers make better architecture decisions. This lesson covers the Stable Diffusion ecosystem, the LoRA fine-tuning system that enables brand-consistent output at scale, and the infrastructure decisions that determine whether self-hosting is worth the operational overhead.
Why Open Source Matters for Image Generation
API providers impose three constraints that open source eliminates: per-image cost, content filters, and rate limits. For many legitimate use cases, at least one of these is the binding constraint on what you can build.
Per-image cost. At 1,000 images per day, $0.008 per image (Leonardo) costs $240 per month. At 10,000 images per day, it costs $2,400 per month. Self-hosted Stable Diffusion on a rented RTX 4090 at ~$0.50/hour can generate several images per second. The economics flip dramatically at scale.
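The arithmetic above can be sketched as a quick sanity check. The figures (Leonardo at $0.008 per image, an RTX 4090 at ~$0.50/hour generating ~3 images/second) come from this lesson; the 30-day month is an assumption for round numbers.

```python
def monthly_api_cost(images_per_day: float, price_per_image: float,
                     days: int = 30) -> float:
    """Monthly spend with a per-image API provider."""
    return images_per_day * price_per_image * days

def monthly_gpu_cost(images_per_day: float, images_per_second: float,
                     gpu_price_per_hour: float, days: int = 30) -> float:
    """Monthly rented-GPU cost, billed only for the seconds spent generating."""
    seconds_per_day = images_per_day / images_per_second
    return (seconds_per_day / 3600) * gpu_price_per_hour * days

print(monthly_api_cost(10_000, 0.008))                  # 2400.0
print(round(monthly_gpu_cost(10_000, 3, 0.50), 2))      # 13.89
```

At 10,000 images per day the gap is roughly $2,400/month versus $14/month of raw GPU time, which is the "economics flip" in concrete numbers. Note the GPU figure excludes idle time, storage, and operational overhead.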
Content filters. API providers apply content policies that are calibrated to their risk tolerance, not yours. Legitimate use cases in medical imaging, art education, creative writing, and certain legal contexts encounter false positives regularly. Self-hosted SD removes filters you did not add and lets you manage the content policy appropriate to your context. This is not a license for harmful content — it is recognition that your legitimate use case may not match the API provider's definition of acceptable.
Rate limits. Self-hosting has no provider-imposed rate limits; your throughput is bounded only by your GPU capacity. For pipelines that need to burst to high generation volume, eliminating rate-limit handling from the architecture is significant.
Model Versions — SD 1.5, SDXL, SD3
SD 1.5
The original stable release. 512×512 native resolution, fastest inference, lowest VRAM requirements (4GB is sufficient). The LoRA ecosystem for SD 1.5 is enormous — thousands of community fine-tunes available on Civitai and Hugging Face covering every style, character, and domain.
SD 1.5 is the right choice when you need maximum speed, minimum infrastructure cost, and access to the largest model ecosystem. Image quality is lower than SDXL or SD3 but more than sufficient for social thumbnails, blog images, and content generation at scale.
SDXL 1.0
Native 1024×1024 resolution and significantly better prompt adherence than SD 1.5. The dual text-encoder architecture improves accuracy on complex scene descriptions. Requires ~8GB VRAM for stable inference at base resolution.
SDXL is the current practical production standard for operators who need quality above SD 1.5 but are not ready to commit to SD3's hardware requirements.
SD3 and SD3.5
SD3's multimodal diffusion transformer architecture produces the best text rendering of any Stable Diffusion release, historically a weak point for diffusion models. It handles multi-subject scenes significantly better than previous versions, and image quality approaches SDXL with better prompt adherence.
The tradeoff is compute: SD3 Medium needs ~16GB VRAM. SD3.5 Large needs 24GB+. That narrows the self-hosting hardware options but remains feasible on A100 or H100 class GPUs, which are available for rental.
LoRA Fine-Tuning — Brand Consistency at Scale
LoRA (Low-Rank Adaptation) is the system that makes Stable Diffusion practically useful for brand content at scale. A LoRA is a small weight adapter file — typically 50-200MB — that injects custom style, character, or subject knowledge into any base model without full retraining.
The training process:
- Collect 20-30 reference images that capture the target style or subject
- Caption each image with descriptive text (automated or manual)
- Train a LoRA on a base model (SD 1.5 or SDXL) — ~30 minutes on an RTX 4090
- Apply the LoRA at inference time with a trigger word in the prompt
The output: image generation that reliably produces your brand aesthetic, product style, or character appearance without specifying it in every prompt. You trigger the LoRA and the visual consistency is baked in.
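The reason a LoRA fits in a 50-200MB file is the low-rank factorization itself: instead of updating a full d×k weight matrix, training learns two small factors of shapes d×r and r×k, where the rank r is tiny relative to the layer dimensions. A minimal sketch of the parameter math (the layer size below is illustrative, not taken from any specific SD checkpoint):

```python
def lora_params(d: int, k: int, rank: int) -> tuple[int, int]:
    """Parameter counts: full fine-tune of a d x k weight vs. a rank-r LoRA adapter."""
    full = d * k                        # every weight in the matrix is updated
    lora = (d * rank) + (rank * k)      # two low-rank factors: A (d x r), B (r x k)
    return full, lora

full, lora = lora_params(d=4096, k=4096, rank=16)
print(full, lora, f"{lora / full:.1%}")  # 16777216 131072 0.8%
```

Under 1% of the parameters per adapted layer, which is why the adapter trains in ~30 minutes and ships as a small file you can stack on any compatible base model.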
Interfaces — AUTOMATIC1111 vs ComfyUI
AUTOMATIC1111
The established web UI for Stable Diffusion. Feature-rich, with an extensive extension ecosystem, established workflows for img2img and inpainting, and a large community knowledge base. The learning curve is manageable and the documentation is thorough.
The production limitation: AUTOMATIC1111 is designed for human-in-the-loop workflows. Its API is functional but the interface-first design makes automation feel like a workaround. For pipelines that require programmatic control, this creates friction.
ComfyUI
Node-based workflow editor. Every generation step — model loading, encoding, sampling, decoding — is an explicit node you wire together. The visual complexity is higher initially, but the payoff is complete control and a fundamentally different automation story.
ComfyUI workflows export as JSON. That JSON is directly executable via the ComfyUI API. Building a pipeline becomes: design the workflow in the UI, export JSON, call the API with the workflow JSON, handle the output. This is the cleanest automation path in the SD ecosystem.
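The design-export-call loop above can be sketched with nothing but the standard library. This assumes a ComfyUI server running at its default address (127.0.0.1:8188) and a workflow exported in API format; the endpoint and payload shape follow ComfyUI's HTTP API.

```python
import json
import urllib.request

def build_prompt_payload(workflow: dict, client_id: str = "pipeline") -> bytes:
    """Wrap an exported ComfyUI workflow in the JSON body that POST /prompt expects."""
    return json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")

def queue_workflow(workflow: dict, host: str = "http://127.0.0.1:8188") -> dict:
    """Queue the workflow on a running ComfyUI server and return its response."""
    req = urllib.request.Request(
        f"{host}/prompt",
        data=build_prompt_payload(workflow),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (with a ComfyUI server running):
#   workflow = json.load(open("workflow_api.json"))  # exported from the UI
#   print(queue_workflow(workflow))
```

From here, varying prompts, seeds, or models is a matter of editing keys in the workflow dict before queuing it.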
For operators building production pipelines, ComfyUI is the correct choice. The initial complexity investment pays off in programmability.
Self-Hosting vs API (Replicate / RunPod)
Not every operator needs to manage their own GPU infrastructure. Replicate and RunPod offer SD models as API endpoints with GPU-backed inference at per-second billing.
The math:
- RunPod RTX 4090: ~$0.50/hour, ~3 images/second SDXL → $0.046 per 1000 images
- Leonardo AI: $0.008 per image → $8 per 1000 images
At 1,000 images per month: API wins on simplicity. At 100,000 images per month: self-hosting wins on per-image cost by a factor of roughly 170.
The crossover point depends on your image volume and operational tolerance. For most content pipelines under 500 images per day, the API simplicity is worth the cost premium. Above that, the math forces the conversation.
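One way to make the crossover concrete is to model self-hosting's hidden cost as a fixed monthly overhead (maintenance, monitoring, your own time) on top of the per-image GPU cost, then solve for the volume where the two options break even. The $500/month overhead below is a hypothetical allowance, not a figure from this lesson; the per-image prices are from the table above.

```python
def breakeven_images_per_month(api_price: float, selfhost_price: float,
                               fixed_ops_cost: float) -> float:
    """Monthly volume where self-hosting (per-image GPU cost plus a fixed ops
    overhead) matches a pure per-image API. Prices are in $/image."""
    # api_price * v = selfhost_price * v + fixed_ops_cost  =>  solve for v
    return fixed_ops_cost / (api_price - selfhost_price)

v = breakeven_images_per_month(0.008, 0.000046, 500)
print(round(v))  # roughly 63,000 images/month under these assumed numbers
```

Swap in your own overhead estimate: the higher your operational tolerance (i.e. the lower that fixed cost), the earlier self-hosting wins.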
Content Policy and Legitimate Use
Self-hosting does not remove legal constraints — it removes the API provider's policy overlay. Your own content policy applies. Your jurisdiction's laws apply. Self-hosting is appropriate for use cases where API content policies produce false positives on legitimate work. It is not appropriate for circumventing laws that govern harmful content.
That distinction matters for both ethics and legal exposure.
Lesson 100 Drill
Set up ComfyUI locally or on a RunPod instance. Load SDXL 1.0. Build a basic text-to-image workflow using the default node set. Export it as JSON. Write a Python script that calls the ComfyUI API with your workflow JSON and saves the output image. Extend the script to vary the prompt parameter. You now have a programmable SD generation pipeline.
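For the last step of the drill, varying the prompt means editing the exported workflow JSON before each submission. In API-format workflows, nodes are keyed by id with a `class_type` and an `inputs` dict; the node id `"6"` below is a hypothetical stand-in for whatever id your exported graph assigns to its positive-prompt CLIPTextEncode node.

```python
import copy

def with_prompt(workflow: dict, node_id: str, text: str) -> dict:
    """Return a copy of the workflow with one CLIPTextEncode node's text replaced."""
    wf = copy.deepcopy(workflow)         # leave the original workflow untouched
    wf[node_id]["inputs"]["text"] = text  # node id depends on your exported graph
    return wf

base = {"6": {"class_type": "CLIPTextEncode", "inputs": {"text": "a cat"}}}
variant = with_prompt(base, "6", "a dog in the rain")
print(variant["6"]["inputs"]["text"])  # a dog in the rain
```

Loop `with_prompt` over a list of prompts and queue each variant to batch-generate from one exported workflow.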
Bottom Line
Stable Diffusion is the open source escape valve from API provider constraints. Zero marginal cost at scale, content sovereignty, and LoRA fine-tuning for brand consistency. The tradeoff is infrastructure: you manage model weights, GPU capacity, and operational reliability. ComfyUI is the correct interface for pipeline use. LoRA training is worth doing when visual consistency matters more than the training investment. Know the crossover point in your cost model before committing to the infrastructure investment.