The Art of Visual Prompting
Text prompts and image prompts are different languages. The operators who get cinematic output from AI image models have internalized five descriptors — Subject, Style, Lighting, Composition, Mood — and they stack them in order, every time, without exception.
The operators who get great text from AI models and the operators who get great images from AI models are using different skillsets. The overlap is smaller than you think. Text prompting is about role, context, task, and format. Image prompting is about visual language — the five-descriptor stack that tells the model what to render, how to render it, where to light it, how to frame it, and what emotional register to hit.
Most people collapse these two skillsets and wonder why their image prompts underperform. The answer is not the model. The answer is the wrong mental model for what an image prompt is.
Why Text and Image Prompts Are Different Languages
In text prompting, ambiguity is partially recoverable. The model infers from context, fills in gaps with domain knowledge, and can produce usable output even when the instruction is underspecified. You can course-correct in conversation.
In image prompting, every ambiguity produces a default — and model defaults are almost never what you actually want. "A sunset" will generate some sunset. It will not generate your sunset: the specific cloud formation, the foreground element, the composition, the color temperature of the light and of the shadows. Those variables exist in the model's possibility space. Without specification, the model picks from them at random.
The gap between a mediocre image prompt and a professional image prompt is specificity across five dimensions. That is it. Not model size, not API tier, not platform. Specificity.
Descriptor 1: Subject
The Subject is the primary focus of the image. It answers the question: who or what is this about?
Bad: "a woman"
Better: "a woman in her late 30s, wearing a charcoal linen blazer, seated at a dark wood desk, looking directly into camera with calm authority"
The difference is not length. The difference is elimination of model choices. "A woman" gives the model complete latitude on age, attire, expression, pose, build, ethnicity, and relationship to the background. The model picks randomly. You get a random person.
Subject specificity is about progressively eliminating model defaults. Start with species and rough demographic. Add attire if it matters. Add pose and relationship to camera. Add emotional state. Each addition removes a variable the model would otherwise randomize.
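The layering described above can be sketched as a small helper. The function name and structure are illustrative, not any platform's API:

```python
def with_layers(subject: str, layers: list[str]) -> str:
    """Join a base subject with progressively more specific layers.

    Each layer pins down a variable (age, attire, pose, expression)
    that the model would otherwise fill with a random default.
    """
    return ", ".join([subject, *layers])

# Rebuild the "Better" example from the base subject upward.
prompt_subject = with_layers(
    "a woman",
    [
        "in her late 30s",
        "wearing a charcoal linen blazer",
        "seated at a dark wood desk",
        "looking directly into camera with calm authority",
    ],
)
```

Dropping any layer widens the model's latitude on exactly that variable, which makes this a useful way to audit what a prompt still leaves to chance.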
Descriptor 2: Style
Style defines the visual language or medium. It tells the model what artistic tradition or technical approach to render in.
Common style anchors: cinematic photography, oil painting, watercolor, architectural visualization, anime, film noir, photorealism, concept art, editorial illustration, street photography. Each of these activates a different aesthetic mode in the model.
Artist references are style accelerators. "Inspired by the lighting of Roger Deakins" or "in the painterly style of Craig Mullins" pulls specific visual signatures from the model's training data. Use artist references as modifiers, not replacements — combine them with medium and intent descriptors.
Descriptor 3: Lighting
Lighting is the single most controllable lever for mood in image generation. The same subject, same style, same composition — change the lighting and you have a fundamentally different image.
The operators who produce consistently cinematic output specify lighting on every prompt, every time. They do not leave it to model defaults. Model default lighting is competent and generic. Specified lighting is intentional.
Lighting vocabulary:
- Temperature: warm 3200K tungsten vs cool 5600K daylight vs neon mixed
- Direction: front-lit (flat, even), side-lit (dramatic shadows, texture), back-lit (rim light, silhouette)
- Quality: hard (direct sun, sharp shadows), soft (overcast, diffuse)
- Time: golden hour (first/last hour of daylight), blue hour (just after sunset), midday (harsh, overhead)
- Source type: practical (visible light source in frame), ambient, studio strobe, LED panel
Stack two or three of these: "overcast diffuse light, cool blue-white color temperature, subtle directional fill from camera left" is a complete lighting description that leaves almost no room for model guessing.
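One way to keep lighting stacks consistent is a small vocabulary table keyed by axis. The axis names and phrasings below are assumptions drawn from the list above, not a standard:

```python
# Illustrative lighting vocabulary, one dictionary per axis.
LIGHTING = {
    "temperature": {
        "warm": "warm 3200K tungsten",
        "cool": "cool 5600K daylight",
    },
    "direction": {
        "side": "side-lit with dramatic shadows",
        "back": "back-lit rim light",
        "fill": "subtle directional fill from camera left",
    },
    "quality": {
        "hard": "hard light with sharp shadows",
        "soft": "overcast diffuse light",
    },
}

def lighting_stack(**choices: str) -> str:
    """Pick one term per axis and join them into a lighting description.

    Keyword order is preserved, so the caller controls phrase order.
    """
    return ", ".join(LIGHTING[axis][key] for axis, key in choices.items())
```

For example, `lighting_stack(quality="soft", temperature="cool", direction="fill")` reproduces the complete lighting description above.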
Descriptor 4: Composition
Composition controls the spatial relationship between subjects, camera angle, and framing.
Key composition vocabulary:
- Framing: extreme close-up (ECU), close-up (CU), medium, wide, establishing
- Angle: bird's eye, worm's eye, eye level, Dutch angle (tilted for unease)
- Rule of thirds: subject positioned at one-third intersection, not center
- Depth: foreground elements framing the subject create layered depth
- Symmetry: centered, mirrored compositions convey stability or tension
Composition matters most when you are generating images for specific containers. A 16:9 hero image for a blog header needs a wide establishing composition with subject placement that leaves text overlay space. A 1:1 thumbnail needs the subject centered and dominant. Specify composition based on the output use case, not aesthetic preference alone.
Descriptor 5: Mood
Mood is the emotional register of the image. It synthesizes everything above and adds the affective layer — what the viewer should feel when they look at this.
Mood vocabulary: cinematic tension, quiet solitude, oppressive dread, nostalgic warmth, playful energy, corporate confidence, ethereal calm, gritty realism, aspirational optimism. Each of these activates a cluster of visual choices in the model — color palette, contrast, texture, depth of field, subject expression.
Mood is the last descriptor because it is the synthesizing layer. If your Subject, Style, Lighting, and Composition are precisely specified, Mood fine-tunes the interpretation. If they are vague, Mood carries too much weight and the output drifts.
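Assuming the fixed five-slot order described throughout this lesson, the full stack can be modeled as a small dataclass. The class and field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ImagePrompt:
    """The five-descriptor stack, rendered in fixed order."""
    subject: str
    style: str
    lighting: str
    composition: str
    mood: str

    def render(self) -> str:
        # Fixed order: Subject, Style, Lighting, Composition, Mood.
        return ", ".join(
            [self.subject, self.style, self.lighting, self.composition, self.mood]
        )

prompt = ImagePrompt(
    subject="a woman in her late 30s at a dark wood desk",
    style="cinematic photography",
    lighting="overcast diffuse light, cool color temperature",
    composition="medium shot, rule of thirds",
    mood="corporate confidence",
).render()
```

The point of the structure is that every field is required: a missing descriptor fails loudly instead of silently falling back to a model default.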
Negative Prompts
Negative prompts tell the model what to exclude. Many platforms (Stable Diffusion, Leonardo, among others) accept them as a field distinct from the positive prompt; support varies, so check your platform's documentation.
A baseline negative prompt for quality control: "blurry, low quality, watermark, text overlay, jpeg artifacts, oversaturated, deformed hands, extra limbs, low resolution"
Negative prompts are not fixes for a bad positive prompt. They are quality insurance on top of a good positive prompt. If your subject specification is weak, no amount of negative prompting will produce consistent output.
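On platforms that take a separate negative field, the baseline can ship as a constant. The payload keys below are a common convention but not universal, so treat the shape as an assumption:

```python
# Baseline quality-control negative prompt from above.
BASELINE_NEGATIVE = (
    "blurry, low quality, watermark, text overlay, jpeg artifacts, "
    "oversaturated, deformed hands, extra limbs, low resolution"
)

def build_payload(positive: str, extra_negative: str = "") -> dict:
    """Assemble a request payload with the baseline negative prompt.

    The keys 'prompt' and 'negative_prompt' follow a common API
    convention; real field names vary by platform.
    """
    negative = BASELINE_NEGATIVE
    if extra_negative:
        negative = f"{negative}, {extra_negative}"
    return {"prompt": positive, "negative_prompt": negative}
```

Image-specific exclusions append to the baseline rather than replacing it, so the quality floor travels with every request.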
Aspect Ratio and Resolution
Specify these based on output destination, not model defaults. Model defaults are 1:1 or whatever the provider considers standard. That is rarely what your pipeline actually needs.
Match aspect ratio to container: 16:9 for landscape web hero images, 9:16 for vertical social media, 1:1 for thumbnails and social squares, 3:2 for editorial photography, 4:5 for Instagram portrait.
Resolution specification varies by platform — some use pixel dimensions, some use named presets. The production rule: always specify both aspect ratio and the highest resolution tier available for your use case. Upscaling after the fact costs quality.
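A small mapping keeps aspect ratios tied to destinations rather than model defaults. The container names and the pixel-dimension helper are illustrative:

```python
# Container -> (width_ratio, height_ratio), from the pairings above.
ASPECT_RATIOS = {
    "web_hero": (16, 9),
    "vertical_social": (9, 16),
    "thumbnail": (1, 1),
    "editorial": (3, 2),
    "instagram_portrait": (4, 5),
}

def dimensions(container: str, width: int) -> tuple[int, int]:
    """Compute (width, height) in pixels for a container at a given width."""
    w, h = ASPECT_RATIOS[container]
    return width, round(width * h / w)
```

For example, `dimensions("web_hero", 1920)` yields 1920x1080, and requesting that size up front avoids the quality cost of upscaling later.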
Lesson 97 Drill
Take an image you have tried to generate with a weak prompt. Rewrite it using all five descriptors — Subject, Style, Lighting, Composition, Mood — plus a baseline negative prompt. Add an explicit aspect ratio. Generate with the rewritten prompt and compare outputs. Document which visual variables changed and which stayed ambiguous.
Bottom Line
Visual prompting is not prompting with pictures. It is a distinct craft with its own vocabulary. The five-descriptor stack — Subject, Style, Lighting, Composition, Mood — eliminates the randomness that turns model defaults into mediocre output. Stack all five, specify them precisely, add negative prompt quality insurance, and specify your output dimensions. That is the craft. The rest of Track 12 applies it to specific platforms.