Veo 3 generates video AND synchronized audio from one prompt — dialogue, ambient sound, music, and foley. Here's how to direct both picture and sound.
Veo 3 from Google DeepMind generates synchronized audio-visual content from a single prompt. Dialogue, ambient sound, foley, and music are all part of the output — not layered on afterward. This changes how you write prompts.
- Developer: Google DeepMind
- Available on Splice: Yes — splice.film.fun (as Google Veo 3 Fast)
- Resolution: 1080p with audio
- Aspect ratios: 16:9 (landscape), 9:16 (vertical)
- Duration: 8 seconds
- Features: Reference Images, Advanced settings, native audio generation (dialogue, sound, music)
Prompt Structure
Veo 3 prompts are production instructions covering both picture and sound. Include visual direction AND audio direction in every prompt.
The 7-Element Framework
| Element | What to Include | Example |
|---|---|---|
| 1. Subject | Who/what, with physical details | "A woman in her 30s with short dark hair in a linen shirt" |
| 2. Context | Where, when, conditions | "Cobblestone café terrace, late afternoon" |
| 3. Action | What happens | "She sets her coffee down and leans forward" |
| 4. Style | Visual aesthetic | "Warm indie film tone, shallow depth of field" |
| 5. Camera | Shot type, movement, composition | "Medium shot, slow push-in" |
| 6. Ambiance | Mood, lighting | "Golden hour backlight, muted earth tones" |
| 7. Audio | Sound, dialogue, music | Ambient sound, dialogue after a colon (not in quotes), music description |
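The seven elements can be kept as separate fields and joined into one prompt string. The following is a minimal Python sketch, not an official tool: the `VeoPrompt` class, its `render` method, and the camera-first sentence order are assumptions made for illustration, and the audio value is an invented example. It only builds text and does not call any Veo or Splice API.

```python
from dataclasses import dataclass


@dataclass
class VeoPrompt:
    """The seven framework elements, one field each (hypothetical helper)."""
    subject: str
    context: str
    action: str
    style: str
    camera: str
    ambiance: str
    audio: str

    def render(self) -> str:
        # Assumed order: camera first, then what is seen, then look and sound.
        parts = [self.camera, self.context, self.subject, self.action,
                 self.ambiance, self.audio, self.style]
        # Normalize trailing periods so each element becomes one sentence.
        sentences = [p.strip().rstrip(".") + "." for p in parts if p.strip()]
        return " ".join(sentences)


prompt = VeoPrompt(
    subject="A woman in her 30s with short dark hair in a linen shirt",
    context="Cobblestone café terrace, late afternoon",
    action="She sets her coffee down and leans forward",
    style="Warm indie film tone, shallow depth of field",
    camera="Medium shot, slow push-in",
    ambiance="Golden hour backlight, muted earth tones",
    audio="Quiet street chatter, a cup clinking on a saucer, no music",  # illustrative
)
print(prompt.render())
```

Keeping the elements as named fields makes it easy to spot an empty audio field before the prompt is submitted.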
The Golden Rules
- Always prompt audio — If you skip it, Veo guesses (and often guesses wrong — unwanted studio audience laughter is common)
- Dialogue after a colon, not in quotes — `He says: My name is Ben` reduces subtitle generation
- Keep dialogue short — Under 10 words per line. More = unnaturally fast speech
- "No music" is valid — Pure environmental sound is often more powerful
- Be specific about style — "Documentary realism" and "commercial" produce very different results
- Change prompts for variety — Unlike other models, same prompt = very similar result across seeds
Prompt Examples
Example 1: Dialogue Scene
Medium shot, cozy kitchen. A mother and daughter sit at a breakfast
table. Morning sunlight streams through gauze curtains. The mother
pours coffee and says: You're going to be great today. The daughter
smiles and replies: Thanks, mom. Clinking dishes, birds outside.
Warm indie film tone. (no subtitles)
Example 2: Action with Sound Design
Low angle tracking shot. A motorcycle roars down a rain-soaked
highway at night. Tires hiss on wet asphalt. Engine growl builds
as it accelerates. Red taillights blur in the rain. Thunder rumbles
in the distance. Dark, moody, cinematic.
Example 3: Atmospheric Nature
Wide aerial shot slowly descending over a misty forest at dawn.
Fog threads between redwood trees. A river catches the first
golden light. Wind through canopy, distant waterfall, single bird
call echoing. No music. Documentary realism.
Example 4: Selfie Video
A selfie video of a travel blogger exploring a bustling Tokyo
street market. She's wearing a vintage denim jacket, excitement
in her eyes. Afternoon sun creates shadows between vendor stalls.
She samples street food while talking, occasionally glancing at
camera then turning to point at stalls. Slightly grainy, film-like.
She says: Okay, you have to try this place when you visit Tokyo.
The takoyaki here is absolutely incredible. (no subtitles)
Selfie tip: Start with "A selfie video of..." and make the arm visible for authenticity.
Example 5: Musical Performance
Close-up of a street musician's fingers on guitar strings.
Flamenco style, fast rhythmic strumming. Camera slowly pulls back
to reveal him on a stone step in a Spanish courtyard. Afternoon
light, long shadows. Guitar music fills the space, echoing off
stone walls. Passersby pause to listen.
Example 6: Commercial Product Shot
Slow motion close-up of coffee being poured into a white cup.
Steam rises in golden morning light. Rich dark liquid swirls.
Sound of pouring, soft ceramic clink as cup settles on saucer.
Warm, premium. Shallow depth of field, macro quality.
Voice and Dialogue
Writing Dialogue
- Use colon format: `He says: My name is Ben` (not quotes — reduces subtitle generation)
- Keep lines short — What can be said in 8 seconds. Too many words = unnaturally fast
- Too few words = AI gibberish — Give enough for the model to fill the time naturally
- Implicit works too: "A guy introduces himself" — Veo decides the words
- Spell names phonetically: "foh-fur" not "fofr" for correct pronunciation
- Specify who speaks: "The woman in pink says: ..." / "The man with glasses replies: ..."
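The dialogue rules above can be applied with a small formatting helper. This is a hedged Python sketch: `dialogue_line` is a hypothetical function, and the default limit simply encodes this guide's "under 10 words" guideline rather than any documented model constraint.

```python
def dialogue_line(speaker: str, line: str, verb: str = "says",
                  word_limit: int = 10) -> str:
    """Format one spoken line as '<speaker> <verb>: <line>' (colon, no quotes)."""
    # Drop any quotation marks wrapped around the line.
    spoken = line.strip().strip('"“”').strip()
    if not spoken:
        raise ValueError("dialogue line is empty")
    word_count = len(spoken.split())
    # The guide recommends fewer than 10 words per line.
    if word_count >= word_limit:
        raise ValueError(f"{word_count} words; keep each line under {word_limit}")
    return f"{speaker} {verb}: {spoken}"


print(dialogue_line("The woman in pink", '"You\'re going to be great today."'))
print(dialogue_line("The man with glasses", "Thanks, mom.", verb="replies"))
```

Naming the speaker in the first argument keeps the "specify who speaks" rule from being forgotten when a scene has more than one character.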
Avoiding Subtitles
Veo often bakes in subtitles. Three fixes:
- Use colon format for speech (not quotes)
- Add `(no subtitles)` to the prompt
- Repeat if persistent: `No subtitles. No subtitles!`
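The three fixes can also be applied mechanically to an existing prompt. The sketch below is an illustration only: `suppress_subtitles` and the `QUOTED_SPEECH` pattern are hypothetical names, and the list of speech verbs it recognizes (says, replies, asks) is an assumption, so prompts using other verbs would pass through unchanged.

```python
import re

# Matches a speech verb followed by a quoted line, e.g. says "Hello".
# The verb list is an assumption for this sketch.
QUOTED_SPEECH = re.compile(r'\b(says|replies|asks)\s*,?\s*["“]([^"”]*)["”]')


def suppress_subtitles(prompt: str, repeat: bool = False) -> str:
    """Apply the three subtitle fixes to a prompt string."""
    # Fix 1: rewrite quoted speech into colon format.
    fixed = QUOTED_SPEECH.sub(r"\1: \2", prompt).rstrip()
    # Fix 2: add the "(no subtitles)" tag if it is not already present.
    if "(no subtitles)" not in fixed.lower():
        fixed += " (no subtitles)"
    # Fix 3: optionally repeat the instruction for persistent cases.
    if repeat:
        fixed += " No subtitles. No subtitles!"
    return fixed


print(suppress_subtitles('He says "My name is Ben"'))
```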
The Unwanted Studio Audience
Veo hallucinates live studio audience laughter if you don't specify ambient audio. Always describe the soundscape you want:
❌ "A standup comic tells a joke at a festival"
✅ "A standup comic tells a joke at a festival. Sounds of
distant bands, noisy crowd, ambient background of a busy
festival field. (no studio audience)"
Audio Prompting
✅ Do
- Tie sounds to visible actions: "She sets the glass down with a clink"
- Use spatial cues: "Distant thunder," "footsteps from behind camera"
- Specify absence: "No music, only natural sound"
- Name instruments: "Solo cello" beats "music plays"
- Describe mood: "Ominous low drone," "playful piano melody"
❌ Don't
- Describe a full soundtrack — sounds will compete
- Layer more than 3-4 audio elements — they muddy
- Use song titles or artist names — won't work
- Skip audio direction — you'll get random ambient noise
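One way to respect the layering limit is to build the audio clause from a short list and refuse to go past four cues. This is a minimal sketch with a hypothetical `audio_direction` helper; the cap of four comes from the "3-4 audio elements" guideline above, not from a documented model limit.

```python
def audio_direction(elements: list[str], max_layers: int = 4) -> str:
    """Join a few audio cues into one audio clause for the prompt."""
    cues = [e.strip().rstrip(".") for e in elements if e.strip()]
    if not cues:
        # Skipping audio entirely is the mistake this guide warns about.
        raise ValueError("give at least one cue, e.g. 'No music, only natural sound'")
    if len(cues) > max_layers:
        raise ValueError(f"{len(cues)} audio layers; keep it to {max_layers} or fewer")
    return ". ".join(cues) + "."


print(audio_direction([
    "She sets the glass down with a clink",
    "Distant thunder",
    "Solo cello, ominous low drone",
]))
```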
Audio Techniques
Silence as a tool:
A crowded restaurant full of chatter. Everything goes quiet.
A single glass falls and shatters.
Off-screen audio:
Footsteps approaching from behind the camera.
Sync points:
A blacksmith hammers red-hot metal. Each strike sends sparks.
Clang of metal rings with each impact.
Reference Images
Veo 3 on Splice supports Reference Images — upload images to guide the generation.
Style Preservation
Feed any image (cartoon, painting, photograph) and Veo 3 maintains the visual style:
Keep the style the same
That's often enough. For more control:
The man runs through wild shrubbery. He says to his microphone:
This is Echo 1, I'm being pursued. Camera swivels to reveal
jungle terrain. Maintain the animation style of the original
image. (no subtitles)
Image-to-Video Strategy
Generate your perfect still with an image model, then animate with Veo 3. This offloads style decisions to the image step:
Make him run!
Simple motion prompts work when the reference image carries the style.
Selective Animation
Animate only part of the image:
Rotate the shoe, keep everything else still.
Creates cinematic cinemagraph effects — one element moves, rest stays frozen.
Character Consistency (Without Reference Images)
Veo 3 is unusually consistent across seeds — same prompt often gives identical clothing, earrings, even room layout. Leverage this:
- Create character description sheets with exact wording
- Reuse the description verbatim across prompts
- The more unique the description, the better consistency
John, a man in his 40s with short brown hair, wearing a blue
jacket and glasses, looking thoughtful
Use this exact string in every prompt featuring John.
Note: Different seeds with the same prompt give similar (not varied) results. Change the prompt for variety.
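A character sheet can be as simple as a dictionary whose values are pasted into every prompt unchanged. The sketch below assumes a hypothetical `CHARACTERS` mapping and `scene_prompt` helper; the two scene descriptions are invented for illustration. Templates passed to `scene_prompt` must not contain other curly braces, since it relies on `str.format`.

```python
# Character sheet: one fixed description per character, reused verbatim.
CHARACTERS = {
    "john": (
        "John, a man in his 40s with short brown hair, wearing a blue "
        "jacket and glasses, looking thoughtful"
    ),
}


def scene_prompt(template: str) -> str:
    """Fill {name} placeholders with the exact character-sheet wording."""
    return template.format(**CHARACTERS)


# Two hypothetical scenes; only the text around the description changes.
office = scene_prompt(
    "{john}, sits at a desk by a rain-streaked window. "
    "Keyboard clicks, soft rain. No music."
)
street = scene_prompt(
    "{john}, walks along a busy street at dusk. "
    "Traffic hum, distant sirens. No music."
)
print(office)
print(street)
```

Storing the description once removes the risk of small wording drift between prompts, which is what the verbatim-reuse advice is guarding against.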
Style Transfer
Veo 3 knows many visual styles. Prefix the prompt with `In the style of [style]:`
Proven styles: LEGO, Claymation, South Park, Pixar animation, 8-bit retro, Graphic novel, Origami, Simpsons, Blueprint, Anime, Marble
Style affects motion too — claymation characters move jerkily, Pixar characters move smoothly.
What Veo 3 Excels At
| Strength | Details |
|---|---|
| Native audio-visual sync | Dialogue, foley, ambient, music — all synchronized to the visual |
| Style transfer | 12+ visual styles that transform motion as well as look |
| Character consistency | Same prompt = remarkably consistent character across seeds |
| Selfie videos | Surprisingly realistic first-person handheld footage |
| Reference image preservation | Maintains artistic style, color grading, and visual identity from input images |
What to Avoid
| Avoid | Why | Do This Instead |
|---|---|---|
| Skipping audio direction | Random ambient noise, unwanted laughter | Always describe the soundscape |
| Long dialogue | More than ~10 words = too fast | Keep lines short, under 10 words |
| Same prompt for variety | Veo 3 gives very similar results per prompt | Change the prompt itself |
| Dialogue in quotes | Triggers subtitle generation | Use colon format: `says:` |
| Monologues | Can't fit in 8 seconds | 1-2 short exchanges maximum |
| No style specified | Defaults to generic "well-produced live action" | Name the style explicitly |
Using Veo 3 on Splice
On Splice, Veo 3 is available as Google Veo 3 Fast with these settings:
| Setting | Options |
|---|---|
| Resolution | 1080p with audio |
| Aspect ratio | 16:9 (landscape), 9:16 (vertical) |
| Duration | 8 seconds |
| Reference Images | Toggle on to upload reference images |
| Advanced | Additional generation settings |
Choosing Your Aspect Ratio
| Ratio | Use Case |
|---|---|
| 16:9 | Cinematic widescreen — films, YouTube, presentations, most content |
| 9:16 | Vertical — TikTok, Instagram Reels, Stories, selfie videos |
Working with 8 Seconds
8 seconds is your canvas. Plan for it:
- One scene, one moment — Don't try to fit a whole story
- 1-2 dialogue exchanges maximum — More gets rushed
- One camera movement — Dolly in OR pan, not both
- Audio fills the time — Even when visual action is minimal, ambient sound keeps it alive
- Build longer sequences by generating multiple 8s clips and editing them together in Splice
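Planning a longer sequence is mostly arithmetic on the fixed 8-second clip length. The following Python sketch uses hypothetical helpers (`clips_needed`, `plan_sequence`) and an invented three-beat shot list; it assumes one story beat per clip, in line with the "one scene, one moment" advice above.

```python
import math

CLIP_SECONDS = 8  # fixed clip length listed in this guide


def clips_needed(target_seconds: float) -> int:
    """Number of 8-second clips needed to cover a target running time."""
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    return math.ceil(target_seconds / CLIP_SECONDS)


def plan_sequence(beats: list[str]) -> list[tuple[int, int, str]]:
    """Give each story beat its own clip: (start_second, end_second, beat)."""
    return [(i * CLIP_SECONDS, (i + 1) * CLIP_SECONDS, beat)
            for i, beat in enumerate(beats)]


print(clips_needed(30))  # 30 / 8 = 3.75, rounded up to 4 clips
for start, end, beat in plan_sequence(
        ["Establishing shot", "Dialogue exchange", "Reaction close-up"]):
    print(f"{start:>2}s-{end:>2}s  {beat}")
```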
Common Mistakes
❌ Ignoring audio entirely
Bad: "A dog runs through flowers."
Good: "A golden retriever bounds through wildflowers. Panting,
paws rustling grass. Distant birdsong. Gentle breeze. Joyful."
❌ Running the same prompt for variety
Unlike other models, Veo 3 produces very similar results across seeds. Change the prompt itself for different outputs.
❌ Dialogue in quotation marks
Bad: He says "My name is Ben"
Good: He says: My name is Ben
Colon format significantly reduces unwanted subtitle generation.
❌ No ambient audio specified
Bad: "A comedian tells jokes on stage"
Good: "A comedian tells jokes on a festival stage. Distant
music from other stages, crowd murmur, outdoor breeze.
(no studio audience)"
Pro Tips
- Write for picture AND sound — Every prompt needs audio direction
- Colon format for dialogue — `says:` not `says "..."` — kills subtitles
- "No music" is powerful — Pure environmental sound often beats a score
- Specify ambient audio — Or risk hallucinated studio audience laughter
- Reference images for style control — Generate perfect stills, then animate
- Selective animation creates cinemagraphs — "Rotate the shoe, keep everything else still"
- Character sheets for consistency — Same exact description string across prompts
- Change prompts for variety — Rerolling same prompt won't give different results
- Spell names phonetically — For correct pronunciation in dialogue
- Selfie videos work — "A selfie video of..." with visible arm unlocks the format
Ready to put these techniques into practice? Try Splice — film.fun's AI Creator Studio. Generate video, edit in the browser, and bring your stories to life.



