How to Use Reference Images in Image-to-Video (2026): Lock Identity, Style, and Composition


TL;DR (3 steps)
- Choose the right reference type: identity reference for faces/characters, style reference for look and lighting, first-frame lock for composition and continuity.
- Structure your prompt clearly: separate subject, action, camera, and style while explicitly anchoring the reference.
- Iterate with small changes: adjust strength, frame consistency, and motion prompts instead of rewriting everything.
Intro
Reference images in image-to-video are the difference between videos that feel intentional and videos that feel random. Instead of relying entirely on text prompts, you give the model a visual anchor. That anchor helps preserve what actually matters: who the character is, how the scene looks, and how the frame is composed.
The challenge is that most modern tools handle references differently. Seedance 2.0, Kling 3.0, Veo 3, Sora, Runway, Pika, Luma, PixVerse, and Magic Hour all support some form of image conditioning, but they don’t interpret references in the same way. Some are better at locking identity, others at preserving style, and others at maintaining composition through first-frame control. If you don’t understand which type you’re using, you’ll run into common issues like facial drift, inconsistent lighting, or unstable framing.
In this guide, I’ll break down how to use reference images in a practical way. The focus is not just on what each setting does, but when to use identity reference vs style reference vs first-frame lock, how to structure prompts around them, and how to troubleshoot when things go wrong. The goal is simple: get predictable, repeatable results instead of guessing each time you generate a clip.
What you need (inputs/specs)

To get consistent results with reference images in image-to-video, you need more than a good picture. The quality of your inputs determines how well models like Seedance 2.0, Kling 3.0, Veo 3, and the other tools above can preserve identity and style across frames.
Start with a high-quality reference image. This should be at least 1024px on the shorter side, well-lit, and free of compression artifacts. For faces or characters, front-facing or three-quarter angles work best. Avoid heavy filters or motion blur in your base image because these artifacts often get amplified during generation.
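If you prepare references in bulk, it is worth automating the basics. Here is a minimal pre-flight check using Pillow; the 1024px threshold simply mirrors the guideline above, and you can raise it for tools that accept larger inputs.

```python
from PIL import Image  # pip install Pillow

def check_reference(path: str, min_short_side: int = 1024) -> list[str]:
    """Flag basic problems with a reference image before spending generation credits."""
    warnings = []
    with Image.open(path) as img:
        if min(img.size) < min_short_side:
            warnings.append(
                f"shorter side is {min(img.size)}px; aim for {min_short_side}px or more"
            )
        if img.mode not in ("RGB", "RGBA"):
            warnings.append(f"unusual color mode {img.mode}; convert to RGB first")
    return warnings

for warning in check_reference("reference.png"):
    print("warning:", warning)
```

A check like this will not catch motion blur or heavy filtering, so a quick visual inspection is still worth doing.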
Next, define your intent clearly. Are you trying to keep a character consistent across scenes, or are you trying to maintain a visual style across different subjects? These are different problems, and they require different reference strategies. Identity consistency relies on stable facial features and proportions, while style consistency depends on color grading, lighting, and texture cues.
You also need a tool that supports reference-based workflows. Most modern models support this in different ways. Some allow direct image conditioning, others use first-frame locking, and some combine both. If you want a flexible starting point, you can use the Magic Hour AI video generator or its image-to-video workflow.
Finally, prepare a structured prompt. This is where most users fail. Instead of writing a long descriptive paragraph, break your prompt into components: subject, action, camera, environment, and style. This makes it easier for the model to align your reference image with the generated motion.
Step-by-step: how to use reference images in image-to-video

Step 1: Decide which type of reference you actually need
There are three main types of references in image-to-video workflows, and choosing the wrong one is the fastest way to get inconsistent results.
Identity reference is used when you want the same character or face to persist across frames or scenes. This is critical for storytelling, UGC ads, or any content where the viewer needs to recognize the same person. Models like Runway, Pika, and Magic Hour handle this by anchoring facial features and proportions.
Style reference is used when you want a consistent visual language. This includes color palette, lighting, texture, and rendering style. For example, if you want a cinematic, low-key lighting look across multiple clips, a style reference ensures continuity even when subjects change.
First-frame lock is used when composition matters. This is especially useful for product shots, hero visuals, or scenes where framing should not drift. Models like Kling 3.0 and Veo 3 often support this by treating the first frame as a fixed anchor.
A simple rule:
- Use identity reference for people and characters
- Use style reference for visual consistency
- Use first-frame lock for composition and framing
If your use case combines all three, start with identity, then layer style, and only use first-frame lock when composition must not change.
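If it helps to make that rule mechanical, here is the same decision logic as a small Python helper. The function and its flags are an illustration of the rule above, not part of any tool.

```python
def pick_reference_types(has_character: bool,
                         needs_consistent_look: bool,
                         fixed_composition: bool) -> list[str]:
    """Apply the rule above: identity first, then style, first-frame lock last."""
    choices = []
    if has_character:
        choices.append("identity reference")
    if needs_consistent_look:
        choices.append("style reference")
    if fixed_composition:
        choices.append("first-frame lock")
    return choices or ["no reference needed; plain text-to-video may be enough"]

# A UGC ad with a recurring presenter and a consistent brand look:
print(pick_reference_types(has_character=True,
                           needs_consistent_look=True,
                           fixed_composition=False))
# -> ['identity reference', 'style reference']
```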
Step 2: Prepare your reference image correctly
Even strong models like Sora or Luma will struggle if your reference image is weak. The goal is to give the model a clean, unambiguous signal.
For identity references, crop tightly around the subject. Remove distracting backgrounds if possible. The model should clearly understand what to preserve. If you include too many elements, it may “average out” details and lose identity.
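If you crop identity references by hand, any editor works. For batches, a padded crop like the sketch below does the job; the bounding box is assumed to come from your own face or subject detector, since Pillow does not detect subjects itself.

```python
from PIL import Image

def tight_crop(path: str,
               box: tuple[int, int, int, int],
               margin: float = 0.15) -> Image.Image:
    """Crop around a subject box (left, top, right, bottom), keeping a small margin."""
    img = Image.open(path)
    left, top, right, bottom = box
    pad_w = int((right - left) * margin)
    pad_h = int((bottom - top) * margin)
    return img.crop((
        max(0, left - pad_w),
        max(0, top - pad_h),
        min(img.width, right + pad_w),
        min(img.height, bottom + pad_h),
    ))

# Box coordinates here are placeholders from a detector or manual selection:
tight_crop("full_scene.png", box=(400, 120, 900, 760)).save("identity_ref.png")
```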
For style references, include enough context. A close-up face won’t carry lighting or environment information well. Instead, use a wider shot that shows color grading, shadows, and texture.
For first-frame lock, treat your image as the starting frame of a video. Think about camera angle, subject placement, and depth. If the composition is off, every generated frame will inherit that problem.
A practical tip: run a quick still-image test first. If the model cannot reproduce your reference accurately in a single frame, it will not do better in motion.
Step 3: Use a structured prompt pattern
Most users rely too much on the reference image and neglect the prompt. In reality, the best results come from combining both.
A reliable prompt structure looks like this:
- Subject: who or what is in the scene
- Action: what they are doing
- Camera: shot type, movement, lens
- Environment: setting and background
- Style: lighting, mood, rendering
Example:
A young woman with short black hair (matching reference image), walking through a neon-lit street at night, medium tracking shot, slight handheld motion, reflections on wet pavement, cinematic lighting, high contrast, shallow depth of field
The key is explicitly referencing the image. Phrases like “matching reference image” or “same character as reference” help reinforce identity anchoring in models like PixVerse and Runway.
If you are using Magic Hour text-to-video, you can combine this structure with image input for stronger control.
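One way to keep the five components from blurring together is to store them as separate fields and only join them when you submit. This is a convenience pattern, not tied to any particular tool's API; the comma-joined output matches the flat prompt format most tools accept.

```python
from dataclasses import dataclass

@dataclass
class ShotPrompt:
    subject: str
    action: str
    camera: str
    environment: str
    style: str

    def render(self) -> str:
        # Join in a fixed order so edits to one component never touch the others.
        return ", ".join([self.subject, self.action, self.camera,
                          self.environment, self.style])

prompt = ShotPrompt(
    subject="a young woman with short black hair (matching reference image)",
    action="walking through a neon-lit street at night",
    camera="medium tracking shot, slight handheld motion",
    environment="reflections on wet pavement",
    style="cinematic lighting, high contrast, shallow depth of field",
)
print(prompt.render())
```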
Step 4: Adjust reference strength and motion balance
Most tools expose a parameter that controls how strongly the reference image influences the output. This is often called “image strength,” “guidance,” or “reference weight.”
If the output drifts too far from your reference, increase the strength. If the motion looks stiff or unnatural, reduce it slightly. There is always a trade-off between fidelity and motion.
For identity consistency, keep the strength relatively high. For style references, moderate strength often works better because it allows variation while preserving the overall look.
Motion prompts also matter. If you describe complex actions without adjusting reference strength, the model may prioritize motion over identity. In that case, simplify the action or break it into shorter clips.
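Because every tool names this parameter differently, here is only a generic sketch of what such a request tends to look like. The endpoint, field names, and value ranges are placeholders, not any vendor's real API; check your tool's documentation for the actual names.

```python
import requests

# Hypothetical endpoint and field names; substitute your tool's real API.
payload = {
    "prompt": "same character as reference, turning slowly toward the camera",
    "reference_image": "identity_ref.png",  # many APIs expect an upload or base64 instead
    "reference_strength": 0.85,  # keep high for identity; ~0.5 often suits style references
    "motion_intensity": 0.4,     # reduce motion when identity must hold
}
response = requests.post("https://api.example.com/v1/image-to-video", json=payload)
print(response.status_code)
```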
Step 5: Use first-frame lock when composition matters
First-frame lock is underused but extremely powerful. It ensures that the initial composition remains stable, which is critical for product videos and branded content.
When using this technique, write your prompt as if you are describing what happens after the first frame. Avoid redefining the subject or camera too much, because the model already has that information from the locked frame.
Example:
The product remains centered as the camera slowly pushes in, soft lighting shifts from left to right, subtle reflections on the surface
This works well in tools like Kling 3.0 and Veo 3, and also in Magic Hour’s video-to-video workflow.
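Under the same placeholder API as in Step 4, a first-frame request typically differs in two ways: the image is flagged as the starting frame rather than a loose reference, and the prompt describes only what changes. The field names below are again assumptions.

```python
import requests

# Placeholder fields; tools label this "first frame", "start frame", or similar.
payload = {
    "first_frame": "product_hero.png",  # treated as frame 0, not a loose style hint
    "prompt": ("the product remains centered as the camera slowly pushes in, "
               "soft lighting shifts from left to right, "
               "subtle reflections on the surface"),
    # No subject or framing description: the locked frame already defines those.
}
requests.post("https://api.example.com/v1/image-to-video", json=payload)
```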
Step 6: Iterate with small, controlled changes
The biggest mistake is rewriting the entire prompt after a bad result. This makes it impossible to understand what went wrong.
Instead, change one variable at a time. Adjust reference strength, tweak the action, or simplify the camera movement. This approach works consistently across tools like Runway, Pika, and Luma.
If identity breaks, increase reference strength or simplify motion. If style drifts, reinforce style keywords. If composition shifts, switch to first-frame lock.
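A lightweight way to enforce one-variable-at-a-time iteration is to sweep a single parameter and log every run. In this sketch, generate_clip is a stand-in for whatever tool call you actually make (see the placeholder request in Step 4).

```python
# Stand-in for your real tool call; see the placeholder API sketch in Step 4.
def generate_clip(prompt: str, reference_strength: float) -> str:
    return f"clip(strength={reference_strength})"

base_prompt = "same character as reference, walking through a neon-lit street at night"
runs = []
for strength in (0.6, 0.7, 0.8, 0.9):  # the ONLY variable that changes between runs
    result = generate_clip(base_prompt, reference_strength=strength)
    runs.append({"strength": strength, "result": result, "notes": ""})

# Review each clip, fill in notes, and pick a winner before touching the prompt itself.
for run in runs:
    print(run)
```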
Common mistakes and how to fix them
One of the most common issues is identity drift. The character looks correct in the first few frames, but slowly changes as the video progresses. This usually happens because the model prioritizes motion over consistency when the prompt becomes more complex. If you describe too many actions or camera movements at once, the model starts “rebuilding” the subject instead of preserving it. The fix is to increase reference strength, simplify the action, or split the sequence into shorter clips and stitch them together later.
Another frequent problem is over-constrained output. This happens when both the reference image and the prompt are too rigid. For example, if your reference already defines lighting, pose, and framing, and your prompt repeats or adds strict instructions on top of that, the result often looks stiff or unnatural. In this case, reduce either the prompt complexity or the reference strength. Let one source of control dominate instead of forcing both.
Style inconsistency is also easy to run into, especially when the reference image doesn’t carry enough visual information. A close-up face, for example, doesn’t communicate environment, color grading, or lighting direction clearly. As a result, the generated video may shift in tone across frames. The fix is to use a stronger style reference that includes background, lighting, and color context, or to reinforce style explicitly in the prompt.
Composition drift is another subtle but important issue. Even if the subject looks correct, the framing may shift slightly over time, which becomes very noticeable in product videos or ads. This usually happens when the model is free to reinterpret the scene. The most reliable fix is to use first-frame lock or reduce camera-related instructions so the model has less freedom to change the layout.
A less obvious mistake is using low-quality or ambiguous reference images. If the input image is blurry, heavily filtered, or contains multiple competing subjects, the model has to “guess” what to preserve. That guess often leads to inconsistent results. Always start with a clean, high-resolution image that clearly defines the subject or style you want.
Finally, many people iterate the wrong way. When a result looks off, they rewrite the entire prompt and change multiple variables at once. This makes it impossible to diagnose what actually caused the issue. A better approach is controlled iteration: adjust one variable at a time, whether it’s reference strength, motion complexity, or camera movement. This way, you can quickly converge on a stable setup.
“Good result” checklist
Before you finalize your output, check for these signals:
- The subject remains recognizable across all frames
- Lighting and color stay consistent
- Camera movement matches the prompt
- No sudden changes in proportions or facial features
- Background elements remain stable
If any of these fail, go back and adjust one variable at a time.
Variations you should try

Once you understand the basic workflow, there are a few variations that can significantly improve quality and consistency.
One effective approach is combining identity and style references. Instead of relying on a single image, you use one reference to lock the character and another to define the visual look. This works particularly well for branded content, where the subject must stay consistent but the environment or mood may change. The key is balancing their influence so one does not override the other.
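In tools that accept multiple references, the balance usually comes down to two weights. The structure below is a guess at how such a request might be shaped, with placeholder field names; the point is only that one weight should stay clearly dominant.

```python
# Placeholder structure for a multi-reference request; real field names vary by tool.
payload = {
    "prompt": "same character as the identity reference, "
              "lit and graded like the style reference",
    "references": [
        {"image": "character.png",  "type": "identity", "weight": 0.8},
        {"image": "mood_board.png", "type": "style",    "weight": 0.5},
    ],
    # Equal weights tend to let the two references fight; keep identity dominant here.
}
```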
Another variation is staged generation. Instead of generating a full, complex video in one pass, you break it into stages. First, generate a short clip with strong identity locking and minimal motion. Then, use video-to-video or extension tools to add movement, transitions, or effects. This reduces the risk of drift and gives you more control over each part of the process.
You can also experiment with prompt layering. Start with a simple prompt focused on identity and composition. Once you get a stable result, gradually introduce more detail, such as camera movement or environmental elements. This incremental approach is more reliable than trying to define everything upfront.
Scene chaining is another practical technique. Instead of forcing a single long generation, create multiple shorter clips using the same reference setup, then edit them together. This is often more stable across tools like Runway, Pika, and Luma, and it gives you more flexibility in pacing and storytelling.
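Once the short clips exist, stitching them together is mechanical. Here is a minimal sketch using ffmpeg's concat demuxer via Python; it assumes ffmpeg is installed and that all clips share the same codec and resolution, since "-c copy" skips re-encoding.

```python
import subprocess

clips = ["scene_01.mp4", "scene_02.mp4", "scene_03.mp4"]  # same reference setup per clip

# ffmpeg's concat demuxer reads a plain text file that lists the inputs in order.
with open("clips.txt", "w") as f:
    for clip in clips:
        f.write(f"file '{clip}'\n")

# -c copy avoids re-encoding but requires matching codecs and resolutions.
subprocess.run(["ffmpeg", "-f", "concat", "-safe", "0",
                "-i", "clips.txt", "-c", "copy", "sequence.mp4"], check=True)
```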
Finally, try first-frame anchoring combined with subtle motion prompts. This works especially well for product shots or UI-style visuals. By locking the first frame and only introducing minimal movement, you get clean, controlled outputs that feel intentional rather than generated.
Each of these variations is about the same idea: reduce uncertainty. The more you guide the model with clear structure and controlled inputs, the more consistent and usable your results will be.
Decision guide: which reference method to use
If your priority is the character or face, use identity reference. This is the go-to for storytelling, UGC ads, or any content where the same person needs to appear across frames. Keep the reference image clean and increase reference strength if you notice facial drift.
If your priority is the overall look, use style reference. This works best for brand visuals, cinematic scenes, or mood-driven content where lighting, color, and texture matter more than the exact subject. Make sure your reference image includes enough context, not just a close-up.
If your priority is framing, use first-frame lock. This is especially useful for product shots or structured scenes where composition should not change. Instead of relying on prompts, you anchor the layout directly from the first frame.
In practice, you’ll often combine them. Start with identity if there’s a person involved, add style if you need a consistent visual tone, and only use first-frame lock when composition becomes unstable.
FAQs
What are reference images in image-to-video?
They are input images that guide the model to preserve certain attributes, such as identity, style, or composition, while generating motion.
Which tools support reference-based workflows?
Most modern tools do, including Seedance 2.0, Kling 3.0, Veo 3, Sora, Runway, Pika, Luma, PixVerse, and Magic Hour.
When should I use first-frame lock instead of identity reference?
Use first-frame lock when composition and framing must remain fixed. Identity reference is better for maintaining consistent characters.
Why does my character change during the video?
This usually happens because of low reference strength or overly complex motion prompts. Increasing strength and simplifying motion helps.
Can I use multiple reference images at once?
Yes, some tools support multi-reference inputs. This is useful for combining identity and style, but it requires careful balancing.
How do I get more cinematic results?
Focus on lighting and camera prompts. Style references with strong lighting cues often produce better cinematic outputs.
Is image-to-video better than text-to-video for consistency?
For identity and style consistency, image-to-video is generally more reliable because it provides a concrete visual anchor.