Veo 3 Prompting Guide: The First AI Video Model That Hears

Veo 3 generates video AND synchronized audio from one prompt — dialogue, ambient sound, music, and foley. Here's how to direct both picture and sound.

Veo 3 from Google DeepMind generates synchronized audio-visual content from a single prompt. Dialogue, ambient sound, foley, and music are all part of the output — not layered on afterward. This changes how you write prompts.

Developer: Google DeepMind
Available on Splice: Yes, at splice.film.fun (as Google Veo 3 Fast)
Resolution: 1080p with audio
Aspect ratios: 16:9 (landscape), 9:16 (vertical)
Duration: 8 seconds
Features: Reference Images, Advanced settings, native audio generation (dialogue, sound, music)

Prompt Structure

Veo 3 prompts are production instructions covering both picture and sound. Include visual direction AND audio direction in every prompt.

The 7-Element Framework

Element	What to Include	Example
1. Subject	Who/what, with physical details	"A woman in her 30s with short dark hair in a linen shirt"
2. Context	Where, when, conditions	"Cobblestone café terrace, late afternoon"
3. Action	What happens	"She sets her coffee down and leans forward"
4. Style	Visual aesthetic	"Warm indie film tone, shallow depth of field"
5. Camera	Shot type, movement, composition	"Medium shot, slow push-in"
6. Ambiance	Mood, lighting	"Golden hour backlight, muted earth tones"
7. Audio	Sound, dialogue, music	Ambient sound, dialogue after a colon (not in quotes), music description
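If you write many prompts, the framework can be treated as a fill-in template. The minimal Python sketch below assumes nothing about Veo or Splice: `PromptSpec` and `build_prompt` are hypothetical names, and the code only checks that all seven elements are filled in (audio included) and joins them in the camera-first order this guide's examples use.

```python
from dataclasses import dataclass, fields


@dataclass
class PromptSpec:
    """Hypothetical container for the seven framework elements (not a Veo API type)."""
    subject: str
    context: str
    action: str
    style: str
    camera: str
    ambiance: str
    audio: str


def build_prompt(spec: PromptSpec) -> str:
    """Join the seven elements into one prompt string, camera direction first."""
    # Refuse blank elements; a blank audio field would leave Veo to guess the soundscape.
    missing = [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]
    if missing:
        raise ValueError("missing elements: " + ", ".join(missing))
    # Camera first, then subject, setting, action, look, and finally sound.
    ordered = [spec.camera, spec.subject, spec.context, spec.action,
               spec.style, spec.ambiance, spec.audio]
    # Drop any trailing period so every element ends with exactly one.
    return " ".join(part.strip().rstrip(".") + "." for part in ordered)


# Example: the table's sample values plus an illustrative audio line.
example = PromptSpec(
    subject="A woman in her 30s with short dark hair in a linen shirt",
    context="Cobblestone café terrace, late afternoon",
    action="She sets her coffee down and leans forward",
    style="Warm indie film tone, shallow depth of field",
    camera="Medium shot, slow push-in",
    ambiance="Golden hour backlight, muted earth tones",
    audio="Cups clinking, quiet street chatter. No music",
)
print(build_prompt(example))
```

The camera-first order is a convention borrowed from the examples in this guide, not something the model is known to require.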

The Golden Rules

Always prompt audio — If you skip it, Veo guesses (and often guesses wrong — unwanted studio audience laughter is common)
Dialogue after a colon, not in quotes. Writing He says: My name is Ben instead of He says "My name is Ben" reduces subtitle generation
Keep dialogue short — Under 10 words per line. More = unnaturally fast speech
"No music" is valid — Pure environmental sound is often more powerful
Be specific about style — "Documentary realism" and "commercial" produce very different results
Change prompts for variety — Unlike other models, same prompt = very similar result across seeds

Prompt Examples

Example 1: Dialogue Scene

Medium shot, cozy kitchen. A mother and daughter sit at a breakfast 
table. Morning sunlight streams through gauze curtains. The mother 
pours coffee and says: You're going to be great today. The daughter 
smiles and replies: Thanks, mom. Clinking dishes, birds outside. 
Warm indie film tone. (no subtitles)

Example 2: Action with Sound Design

Low angle tracking shot. A motorcycle roars down a rain-soaked 
highway at night. Tires hiss on wet asphalt. Engine growl builds 
as it accelerates. Red taillights blur in the rain. Thunder rumbles 
in the distance. Dark, moody, cinematic.

Example 3: Atmospheric Nature

Wide aerial shot slowly descending over a misty forest at dawn. 
Fog threads between redwood trees. A river catches the first 
golden light. Wind through canopy, distant waterfall, single bird 
call echoing. No music. Documentary realism.

Example 4: Selfie Video

A selfie video of a travel blogger exploring a bustling Tokyo 
street market. She's wearing a vintage denim jacket, excitement 
in her eyes. Afternoon sun creates shadows between vendor stalls. 
She samples street food while talking, occasionally glancing at 
camera then turning to point at stalls. Slightly grainy, film-like. 
She says: Okay, you have to try this place when you visit Tokyo. 
The takoyaki here is absolutely incredible. (no subtitles)

Selfie tip: Start with "A selfie video of..." and make the arm visible for authenticity.

Example 5: Musical Performance

Close-up of a street musician's fingers on guitar strings. 
Flamenco style, fast rhythmic strumming. Camera slowly pulls back 
to reveal him on a stone step in a Spanish courtyard. Afternoon 
light, long shadows. Guitar music fills the space, echoing off 
stone walls. Passersby pause to listen.

Example 6: Commercial Product Shot

Slow motion close-up of coffee being poured into a white cup. 
Steam rises in golden morning light. Rich dark liquid swirls. 
Sound of pouring, soft ceramic clink as cup settles on saucer. 
Warm, premium. Shallow depth of field, macro quality.

Voice and Dialogue

Writing Dialogue

Use colon format: He says: My name is Ben (not quotes — reduces subtitle generation)
Keep lines short — What can be said in 8 seconds. Too many words = unnaturally fast
Too few words = AI gibberish — Give enough for the model to fill the time naturally
Implicit works too: "A guy introduces himself" — Veo decides the words
Spell names phonetically: "foh-fur" not "fofr" for correct pronunciation
Specify who speaks: "The woman in pink says: ..." / "The man with glasses replies: ..."
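The dialogue rules above are mechanical enough to check in code. This minimal Python sketch (the `dialogue_line` helper and its `verb` parameter are hypothetical, not a Splice feature) strips quotation marks, enforces the under-10-words rule of thumb, and emits the colon format with a named speaker.

```python
MAX_WORDS = 10  # this guide's rule of thumb: keep each spoken line under 10 words


def dialogue_line(speaker: str, line: str, verb: str = "says") -> str:
    """Format one spoken line in colon format (hypothetical helper)."""
    # Strip surrounding straight or curly quotes; quoted dialogue tends to trigger subtitles.
    text = line.strip().strip('"\u201c\u201d').strip()
    word_count = len(text.split())
    if word_count == 0:
        raise ValueError("empty line; give the model words to fill the time")
    if word_count >= MAX_WORDS:
        raise ValueError(f"{word_count} words; keep lines under {MAX_WORDS} to avoid rushed speech")
    return f"{speaker} {verb}: {text}"


print(dialogue_line("The mother", '"You\'re going to be great today."'))
print(dialogue_line("The daughter", "Thanks, mom.", verb="replies"))
```

The word limit is a guideline from this guide, not a hard model limit, so treat the error as a prompt to rewrite rather than a rule.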

Avoiding Subtitles

Veo often bakes in subtitles. Three fixes:

Use colon format for speech (not quotes)
Add (no subtitles) to the prompt
Repeat if persistent: No subtitles. No subtitles!
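The three fixes can also be applied to a finished prompt. A sketch, assuming a short, non-exhaustive list of speech verbs: `suppress_subtitles` is a hypothetical helper that rewrites quoted speech into colon format, appends (no subtitles) once, and optionally adds the repeated reminder.

```python
import re

# Speech verbs to look for; an assumed, non-exhaustive list.
SPEECH_VERB = r"\b(says|replies|asks|whispers|shouts)"
# A speech verb, an optional colon, then a line wrapped in straight or curly double quotes.
QUOTED_SPEECH = re.compile(SPEECH_VERB + r'\s*:?\s*["\u201c]([^"\u201d]+)["\u201d]')


def suppress_subtitles(prompt: str, extra_reminder: bool = False) -> str:
    """Apply the three subtitle fixes to a prompt string (hypothetical helper)."""
    # Fix 1: says "My name is Ben"  ->  says: My name is Ben
    fixed = QUOTED_SPEECH.sub(r"\1: \2", prompt)
    # Fix 2: add the (no subtitles) tag once.
    if "(no subtitles)" not in fixed.lower():
        fixed = fixed.rstrip() + " (no subtitles)"
    # Fix 3: repeat the instruction when subtitles persist.
    if extra_reminder:
        fixed += " No subtitles. No subtitles!"
    return fixed


print(suppress_subtitles('He says "My name is Ben"'))
```

A regex rewrite like this only catches the verbs in its list, so read the result before generating.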

The Unwanted Studio Audience

Veo can hallucinate live studio audience laughter when you don't specify ambient audio. Always describe the soundscape you want:

❌ "A standup comic tells a joke at a festival"
✅ "A standup comic tells a joke at a festival. Sounds of 
   distant bands, noisy crowd, ambient background of a busy 
   festival field. (no studio audience)"

Audio Prompting

✅ Do

Tie sounds to visible actions: "She sets the glass down with a clink"
Use spatial cues: "Distant thunder," "footsteps from behind camera"
Specify absence: "No music, only natural sound"
Name instruments: "Solo cello" beats "music plays"
Describe mood: "Ominous low drone," "playful piano melody"

❌ Don't

Describe a full soundtrack — sounds will compete
Layer more than 3-4 audio elements, or they muddy the mix
Use song titles or artist names — won't work
Skip audio direction — you'll get random ambient noise
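The Do and Don't rules for audio translate into a small builder. This Python sketch is illustrative: `audio_direction` and its four-layer cap are assumptions drawn from the 3-4 element guideline above, and leaving `music` empty writes an explicit "No music." instead of leaving the model to guess.

```python
from typing import Optional, Sequence

MAX_LAYERS = 4  # this guide's rule of thumb: more than 3-4 audio elements muddy the mix


def audio_direction(sounds: Sequence[str], music: Optional[str] = None) -> str:
    """Build one audio sentence from a few named sounds (hypothetical helper).

    Leaving music as None states its absence explicitly with "No music."
    """
    layers = [s.strip() for s in sounds if s.strip()]
    if music:
        layers.append(music.strip())
    if not layers:
        raise ValueError("no audio direction; Veo will guess the soundscape")
    if len(layers) > MAX_LAYERS:
        raise ValueError(f"{len(layers)} audio layers; keep it to {MAX_LAYERS} or fewer")
    sentence = ", ".join(layers)
    sentence = sentence[0].upper() + sentence[1:] + "."
    if not music:
        sentence += " No music."
    return sentence


print(audio_direction(["wind through canopy", "distant waterfall", "single bird call echoing"]))
print(audio_direction(["rain tapping on a window"], music="solo cello"))
```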

Audio Techniques

Silence as a tool:

A crowded restaurant full of chatter. Everything goes quiet. 
A single glass falls and shatters.

Off-screen audio:

Footsteps approaching from behind the camera.

Sync points:

A blacksmith hammers red-hot metal. Each strike sends sparks. 
Clang of metal rings with each impact.

Reference Images

Veo 3 on Splice supports Reference Images — upload images to guide the generation.

Style Preservation

Feed any image (cartoon, painting, photograph) and Veo 3 maintains the visual style:

Keep the style the same

That's often enough. For more control:

The man runs through wild shrubbery. He says to his microphone: 
This is Echo 1, I'm being pursued. Camera swivels to reveal 
jungle terrain. Maintain the animation style of the original 
image. (no subtitles)

Image-to-Video Strategy

Generate your perfect still with an image model, then animate with Veo 3. This offloads style decisions to the image step:

Make him run!

Simple motion prompts work when the reference image carries the style.

Selective Animation

Animate only part of the image:

Rotate the shoe, keep everything else still.

This creates a cinemagraph effect: one element moves while the rest stays frozen.

Character Consistency (Without Reference Images)

Veo 3 is unusually consistent across seeds — same prompt often gives identical clothing, earrings, even room layout. Leverage this:

Create character description sheets with exact wording
Reuse the description verbatim across prompts
The more unique the description, the better the consistency

John, a man in his 40s with short brown hair, wearing a blue 
jacket and glasses, looking thoughtful

Use this exact string in every prompt featuring John.

Note: Different seeds with the same prompt give similar (not varied) results. Change the prompt for variety.
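A character sheet is easiest to keep verbatim when prompts never retype it. In this sketch, `CHARACTERS` and `with_characters` are hypothetical names: each prompt uses a {John} placeholder, and Python's built-in `str.format_map` substitutes the exact same description string every time.

```python
from typing import Mapping

# Hypothetical character sheet: exactly one description string per character.
CHARACTERS = {
    "John": ("John, a man in his 40s with short brown hair, wearing a blue "
             "jacket and glasses, looking thoughtful"),
}


def with_characters(template: str, sheet: Mapping[str, str] = CHARACTERS) -> str:
    """Replace {Name} placeholders with the verbatim description from the sheet.

    Unknown names raise KeyError, so a typo cannot silently change the wording.
    Literal braces in the prompt text would need to be doubled ({{ and }}).
    """
    return template.format_map(sheet)


print(with_characters("Medium shot, a quiet library. {John}. He closes a book."))
```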

Style Transfer

Veo 3 knows many visual styles. Start the prompt with In the style of [style]: and then describe the scene.

Proven styles: LEGO, Claymation, South Park, Pixar animation, 8-bit retro, Graphic novel, Origami, Simpsons, Blueprint, Anime, Marble

Style affects motion too — claymation characters move jerkily, Pixar characters move smoothly.
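To keep the prefix consistent, a tiny helper can add it and flag styles outside the proven list. This is a sketch with hypothetical names (`in_style`, `PROVEN_STYLES`); a warning only means the style is untested in this guide, not that Veo 3 rejects it.

```python
import warnings

# Styles this guide lists as proven; others may work but are untested here.
PROVEN_STYLES = {
    "lego", "claymation", "south park", "pixar animation", "8-bit retro",
    "graphic novel", "origami", "simpsons", "blueprint", "anime", "marble",
}


def in_style(style: str, scene: str) -> str:
    """Prefix a scene description with 'In the style of <style>:' (hypothetical helper)."""
    if style.strip().lower() not in PROVEN_STYLES:
        warnings.warn(f"'{style}' is not on the guide's proven-styles list", stacklevel=2)
    return f"In the style of {style.strip()}: {scene.strip()}"


print(in_style("Claymation", "A fox steals a pie from a windowsill."))
```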

What Veo 3 Excels At

Strength	Details
Native audio-visual sync	Dialogue, foley, ambient, music — all synchronized to the visual
Style transfer	Many visual styles that transform motion as well as look
Character consistency	Same prompt = remarkably consistent character across seeds
Selfie videos	Surprisingly realistic first-person handheld footage
Reference image preservation	Maintains artistic style, color grading, and visual identity from input images

What to Avoid

Avoid	Why	Do This Instead
Skipping audio direction	Random ambient noise, unwanted laughter	Always describe the soundscape
Long dialogue	More than ~10 words = too fast	Keep lines short, under 10 words
Same prompt for variety	Veo 3 gives very similar results per prompt	Change the prompt itself
Dialogue in quotes	Triggers subtitle generation	Use colon format (says: ...)
Monologues	Can't fit in 8 seconds	1-2 short exchanges maximum
No style specified	Defaults to generic "well-produced live action"	Name the style explicitly

Using Veo 3 on Splice

On Splice, Veo 3 is available as Google Veo 3 Fast with these settings:

Setting	Options
Resolution	1080p with audio
Aspect ratio	16:9 (landscape), 9:16 (vertical)
Duration	8 seconds
Reference Images	Toggle on to upload reference images
Advanced	Additional generation settings

Choosing Your Aspect Ratio

Ratio	Use Case
16:9	Cinematic widescreen — films, YouTube, presentations, most content
9:16	Vertical — TikTok, Instagram Reels, Stories, selfie videos

Working with 8 Seconds

8 seconds is your canvas. Plan for it:

One scene, one moment — Don't try to fit a whole story
1-2 dialogue exchanges maximum — More gets rushed
One camera movement — Dolly in OR pan, not both
Audio fills the time — Even when visual action is minimal, ambient sound keeps it alive
Build longer sequences by generating multiple 8s clips and editing them together in Splice
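Planning a longer sequence is a ceiling division: a 30-second edit needs 30 / 8 = 3.75, rounded up to 4 clips, with 2 seconds of surplus to trim. The sketch below (`plan_clips` is a hypothetical helper) assumes every clip is usable end to end; in practice you may trim heads and tails, so budget extra clips.

```python
import math
from typing import Tuple

CLIP_SECONDS = 8  # clip length for Google Veo 3 Fast on Splice


def plan_clips(target_seconds: float) -> Tuple[int, float]:
    """Return (clips to generate, seconds of surplus footage to trim in the edit)."""
    if target_seconds <= 0:
        raise ValueError("target length must be positive")
    clips = math.ceil(target_seconds / CLIP_SECONDS)
    surplus = clips * CLIP_SECONDS - target_seconds
    return clips, surplus


print(plan_clips(30))
```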

Common Mistakes

❌ Ignoring audio entirely

Bad: "A dog runs through flowers."
Good: "A golden retriever bounds through wildflowers. Panting, 
paws rustling grass. Distant birdsong. Gentle breeze. Joyful."

❌ Running the same prompt for variety

Unlike other models, Veo 3 produces very similar results across seeds. Change the prompt itself for different outputs.

❌ Dialogue in quotation marks

Bad: He says "My name is Ben"
Good: He says: My name is Ben

Colon format significantly reduces unwanted subtitle generation.

❌ No ambient audio specified

Bad: "A comedian tells jokes on stage"
Good: "A comedian tells jokes on a festival stage. Distant 
music from other stages, crowd murmur, outdoor breeze. 
(no studio audience)"

Pro Tips

Write for picture AND sound — Every prompt needs audio direction
Colon format for dialogue. Write says: rather than says "..." to cut down on subtitles
"No music" is powerful — Pure environmental sound often beats a score
Specify ambient audio — Or risk hallucinated studio audience laughter
Reference images for style control — Generate perfect stills, then animate
Selective animation creates cinemagraphs — "Rotate the shoe, keep everything else still"
Character sheets for consistency — Same exact description string across prompts
Change prompts for variety — Rerolling same prompt won't give different results
Spell names phonetically — For correct pronunciation in dialogue
Selfie videos work — "A selfie video of..." with visible arm unlocks the format

Ready to put these techniques into practice? Try Splice, film.fun's AI Creator Studio. Generate video, edit in the browser, and bring your stories to life.