Soft Reference Control and Keyframe Sampling in InfiniteTalk

How Infinite Talk AI keeps faces stable without freezing motion.

1. The problem with "hard" control

When you condition a generative video model on reference frames, you have to decide how strongly those references should influence each generated frame.

Two extremes:

  • Too weak control

    • The model ignores the reference over time.
    • Identity drifts, backgrounds morph, and the character stops looking like the source actor.
  • Too strong control

    • The model copies the reference pose literally.
    • Head and body movements are locked to the reference frame instead of following the audio.
    • Performances look stiff and out of sync with speech.

Classic approaches tend to fall into one of these extremes:

  • Plain audio-driven I2V: weak control → free motion, but identity drift.
  • First–Last-frame-constrained video: hard control → stable identity, but rigid, pose-copying motion.

Sparse-frame video dubbing demands something more subtle:

We want references to lock identity and style, but still let the model move the whole body in sync with the dubbed audio.

This is where soft reference control comes in.

2. What is soft reference control?

In InfiniteTalk, "soft reference control" means:

  • Reference frames are used as soft anchors, not hard templates.
  • The model learns to adapt control strength based on:

    • How similar the current context is to the reference, and
    • Where in the sequence the reference is placed.

Practically, this results in:

  • High identity and background consistency

    The actor keeps looking like themselves, even over long sequences.

  • Flexible head and body motion

    Head turns, gestures, and posture are free to follow the audio.

  • Fewer "pose copy" artifacts

    The model doesn't just freeze the character at the exact reference pose.

Instead of hand-tuning dozens of weights, InfiniteTalk learns this behavior through how reference frames are sampled during training.
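
As a rough mental model, the reference is just one more conditioning stream alongside context frames and audio. The sketch below is a hypothetical Python interface (names such as `generate_chunk` and `model.sample` are illustrative assumptions, not InfiniteTalk's actual API) showing where a soft reference fits into chunk-wise generation:

```python
# Hypothetical sketch of chunk-wise conditioning; names are illustrative,
# not the real InfiniteTalk API.
import numpy as np

def generate_chunk(model, context_frames, reference_frame, audio_features, chunk_len=16):
    """Generate one chunk of video from three conditioning signals.

    - context_frames: recent generated frames (motion continuity)
    - reference_frame: a sparsely sampled keyframe (soft identity/background anchor)
    - audio_features: per-frame audio embeddings (drive lips, head, and body)
    """
    conditioning = {
        "context": np.stack(context_frames),  # e.g. the last few frames
        "reference": reference_frame,         # a single keyframe, not a pose template
        "audio": audio_features[:chunk_len],  # aligned with the chunk being generated
    }
    # The model decides internally how strongly to follow the reference;
    # there is no explicit control-strength weight in this interface.
    return model.sample(conditioning, num_frames=chunk_len)
```

The key point is that nothing in this interface says "match the reference pose": how much the reference constrains each generated frame is learned, not configured.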

3. Four sampling strategies: M0–M3

The InfiniteTalk paper explores four different strategies (M0–M3) for selecting reference frames during training. These strategies control where in time the reference comes from, relative to the chunk being generated.

Figure: a visual comparison between the training reference positioning strategies. All video chunks are generated using the same context frames and the same reference frame shown below.

3.1 M0 — Random-in-chunk reference (too strong and misaligned)

In M0, the reference frame is sampled uniformly from within the same chunk that the model is trying to generate.

Pros:

  • The reference is always temporally close and visually similar.

Cons:

  • Control becomes too strong and too local:

    • The model tends to copy the reference pose even when it doesn't match the audio at that exact moment.
    • You can get situations where the character suddenly performs a "big gesture" at the wrong beat because it was in the reference frame.

Effectively, M0 encourages pose copying instead of audio-driven acting.

3.2 M1 — First/last frame reference (hard boundary locking)

In M1, the reference is always taken from the first or last frame of each chunk.

Pros:

  • Strongly stabilizes boundaries between chunks.
  • Reduces identity drift at chunk edges.

Cons:

  • This leads to hard locking at the chunk boundaries:

    • The model feels compelled to match the boundary pose very precisely.
    • Motion at the start and end of each chunk can look stiff or jerk back towards the reference frame.

M1 behaves like a "soft FL2V": better than pure FL2V, but still too rigid around chunk boundaries.

3.3 M2 — Distant-chunk reference (too weak)

In M2, reference frames are sampled from chunks that are far away in time (e.g., several seconds apart).

Pros:

  • Control is very soft; motion follows audio freely.

Cons:

  • Control is often too weak:

    • Identity and background consistency degrade over long sequences.
    • The model has little reason to stay close to the original look when the reference is temporally far.

M2 behaves a lot like plain audio-driven I2V: you get freedom, but you pay with drift.

3.4 M3 — Neighboring-chunk reference (best balance)

In M3, reference frames are sampled from neighboring chunks—close in time, but not always from the same chunk.

  • The reference remains visually similar to the current context.
  • It is not fixed to the exact boundary frame, so the model isn't forced to copy a specific pose.
  • Control strength becomes adaptive:

    • Strong enough to prevent drift.
    • Soft enough to let head and body motion follow the audio naturally.

In experiments, M3 yields the best overall trade-off across:

  • Lip-sync metrics (e.g., Sync-C / Sync-D).
  • Temporal consistency (FVD).
  • Identity similarity (CSIM).

This makes M3-like sampling the default behavior behind InfiniteTalk's soft reference control.
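
To make the four strategies concrete, here is a minimal Python sketch of how a reference frame index could be drawn under each of them, assuming the video is split into fixed-length chunks. The exact sampling windows (e.g. "four chunks away" for M2) are assumptions for illustration; the paper defines the precise ranges.

```python
import random

def sample_reference_index(strategy, chunk_start, chunk_len, num_frames, rng=random):
    """Pick a reference frame index for the chunk [chunk_start, chunk_start + chunk_len)."""
    chunk_end = chunk_start + chunk_len
    if strategy == "M0":
        # Random frame inside the current chunk: temporally very close,
        # so control is too strong and encourages pose copying.
        return rng.randrange(chunk_start, chunk_end)
    if strategy == "M1":
        # First or last frame of the current chunk: hard locking at boundaries.
        return rng.choice([chunk_start, chunk_end - 1])
    if strategy == "M2":
        # A frame from a distant chunk (here, four chunks away): too weak,
        # identity and background drift over long sequences.
        far = rng.choice([chunk_start - 4 * chunk_len, chunk_end + 4 * chunk_len])
        return min(max(far, 0), num_frames - 1)
    if strategy == "M3":
        # A frame from a neighboring chunk (previous or next): close enough to
        # anchor appearance, but not tied to the exact boundary pose.
        prev_frames = range(max(chunk_start - chunk_len, 0), chunk_start)
        next_frames = range(chunk_end, min(chunk_end + chunk_len, num_frames))
        candidates = list(prev_frames) + list(next_frames)
        return rng.choice(candidates) if candidates else chunk_start
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, `sample_reference_index("M3", chunk_start=80, chunk_len=16, num_frames=400)` returns an index from frames 64–79 or 96–111: nearby enough to anchor identity, but never the exact boundary frame of the current chunk.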

4. How soft reference control works at training time

At training time, InfiniteTalk learns soft reference control through its flow-matching / diffusion training objective combined with the reference sampling strategy:

  1. For each training video:

    • A target chunk is selected.
    • Context frames (recent history) are selected.
    • A reference frame is sampled using one of the M0–M3 strategies.
    • Audio features for the current time span are extracted.
  2. The model is trained to reconstruct the target chunk:

    • Using the audio to drive lips, face, head, and body.
    • Using context frames to maintain motion continuity.
    • Using the reference frame as a soft constraint on appearance and camera.
  3. Because M3 tends to produce the best performance:

    • The model learns that references from neighboring chunks usually provide a good "anchor" without over-constraining the pose.
    • Over many examples, it discovers how to balance audio-driven motion with reference-driven stability.

There is no explicit "control strength knob" in the architecture; instead, control strength emerges from how references are sampled in time.
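
A hedged sketch of what one training step could look like under this scheme is shown below. The tensor shapes, the `model(...)` call signature, and the helper names are assumptions used to illustrate how context, reference, and audio conditioning combine with a flow-matching objective; this is not the paper's actual implementation.

```python
import torch

def build_training_example(video, audio_feats, chunk_len, sample_ref_idx):
    """video: (T, C, H, W) frames; audio_feats: (T, D) per-frame audio features."""
    num_frames = video.shape[0]
    # 1. Pick a target chunk and the context frames that precede it.
    chunk_start = torch.randint(chunk_len, num_frames - chunk_len, (1,)).item()
    target = video[chunk_start : chunk_start + chunk_len]
    context = video[chunk_start - chunk_len : chunk_start]
    # 2. Sample a reference frame, e.g. with the M3 (neighboring-chunk) strategy.
    ref_idx = sample_ref_idx("M3", chunk_start, chunk_len, num_frames)
    reference = video[ref_idx]
    # 3. Slice the audio features aligned with the target chunk.
    audio = audio_feats[chunk_start : chunk_start + chunk_len]
    return target, context, reference, audio

def flow_matching_loss(model, target, context, reference, audio):
    # Conditional flow matching: regress the velocity that carries noise x0
    # toward the clean target x1 along a straight path, at a random time t.
    x1, x0 = target, torch.randn_like(target)
    t = torch.rand(1, device=target.device)
    x_t = (1 - t) * x0 + t * x1          # point on the interpolation path
    v_target = x1 - x0                   # constant velocity of that path
    v_pred = model(x_t, t, context=context, reference=reference, audio=audio)
    return torch.mean((v_pred - v_target) ** 2)
```

Note that the reference enters only as conditioning: whether the model treats it as a gentle anchor or a pose template is determined by where `ref_idx` falls in time, which is exactly what the M0–M3 ablation varies.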

5. What the ablation study shows

The paper includes an ablation comparing M0–M3 on long-form dubbing benchmarks. You don't need every number here, but the trends are:

  • M0 (same-chunk reference)

    Over-constrained → good identity stability, but more pose-copy artifacts and worse sync metrics.

  • M1 (first/last-frame reference)

    Strong boundary locking → less drift, but visible stiffness at chunk edges.

  • M2 (distant reference)

    Under-constrained → freer motion, but identity and background stability degrade over time.

  • M3 (neighboring-chunk reference)

    Best overall balance → strong identity & background stability, good lip and body sync, smooth motion across chunks.

| Strategy | Reference position | Control strength | Typical issue |
| --- | --- | --- | --- |
| M0 | Inside current chunk | Too strong / local | Pose copying, wrong beats |
| M1 | First / last frame of chunk | Hard at boundaries | Stiff motion at chunk edges |
| M2 | Distant chunks | Too weak | Identity / background drift |
| M3 | Neighboring chunks | Balanced (soft) | Best overall sync & stability |

For detailed numerical scores (FID, FVD, Sync-C/D, CSIM), see: /lib/benchmarks

6. What this means for creators

From a user's perspective, soft reference control has very concrete benefits:

  • Your character stays recognizable, even in long videos.
  • Their body and head can still act:

    • Leaning in, looking away, nodding, and gesturing, without being locked into whatever pose was in the reference image.
  • Chunk boundaries disappear:

    • You can render long clips in segments without obvious seams at the joins.

This is especially important when:

  • You dub entire episodes, courses, or multi-part interviews.
  • The same character appears across dozens of clips and languages.
  • You want performances that feel alive, not like rigid puppets.

7. Summary and next steps

  • Hard reference control either drifts (too weak) or freezes motion (too strong).
  • InfiniteTalk uses soft reference control:

    • Context frames for continuity.
    • Sparsely sampled reference frames for stability.
    • Carefully chosen sampling strategies (like M3) to balance both.

This design lets Infinite Talk AI:

  • Maintain identity and scene consistency over long sequences.
  • Still let full-frame motion follow the dubbed audio.