Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity

1 AGI Lab, Westlake University    2 University of California at Merced    3 StepFun
*Corresponding author

Abstract

Autoregressive video diffusion models support real-time synthesis but suffer from error accumulation and context loss over long horizons. We discover that attention heads in AR video diffusion transformers serve functionally distinct roles as local heads for detail refinement, anchor heads for structural stabilization, and memory heads for long-range context aggregation, yet existing methods treat them uniformly, leading to suboptimal KV cache allocation. We propose Head Forcing, a training-free framework that assigns each head type a tailored KV cache strategy: local and anchor heads retain only essential tokens, while memory heads employ a hierarchical memory system with dynamic episodic updates for long-range consistency. A head-wise RoPE re-encoding scheme further ensures positional encodings remain within the pretrained range. Without additional training, Head Forcing extends generation from 5 seconds to minute-level duration, supports multi-prompt interactive synthesis, and consistently outperforms existing baselines.

Key Discovery: Attention Head Heterogeneity

Attention heads in AR video diffusion transformers play functionally distinct roles, yet existing methods treat them uniformly, which leads to suboptimal KV cache allocation. Profiling the heads reveals three types:

Local Heads
Focus on the current block and its immediate neighborhood for detail refinement and short-range motion continuity.
Anchor Heads
Exhibit elevated first-frame attention, using the initial frame as a structural anchor to prevent visual collapse.
Memory Heads
Attend broadly across the full context to capture narrative elements and sustain long-range memory.

Representative attention patterns for different attention heads.


Attention proportion for each head shows clear clustering into local, anchor, and memory heads.
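
The clustering above can be illustrated with a minimal sketch. It assumes each head is summarized by how its attention mass splits between the first frame, the most recent frames, and everything in between. The function names, the frame-level granularity, and the thresholds (`local_thresh`, `anchor_thresh`) are hypothetical choices for illustration; the source does not state the profiling procedure or any cut-off values.

```python
from typing import List, Tuple


def profile_head(frame_attn: List[float], n_first: int = 1,
                 n_local: int = 2) -> Tuple[float, float, float]:
    """Return (first, local, middle) shares of one head's attention mass.

    frame_attn holds the attention mass per key frame, oldest first.
    The last n_local frames stand in for the current block and its neighbor.
    """
    if len(frame_attn) < n_first + n_local:
        raise ValueError("need at least n_first + n_local frames")
    total = sum(frame_attn)
    first = sum(frame_attn[:n_first]) / total
    local = sum(frame_attn[-n_local:]) / total
    return first, local, 1.0 - first - local


def classify_head(frame_attn: List[float], n_first: int = 1, n_local: int = 2,
                  local_thresh: float = 0.6, anchor_thresh: float = 0.3) -> str:
    """Label a head as 'local', 'anchor', or 'memory' (thresholds are assumed)."""
    first, local, _ = profile_head(frame_attn, n_first, n_local)
    if local >= local_thresh:
        return "local"      # mass concentrated on the newest frames
    if first >= anchor_thresh:
        return "anchor"     # elevated attention on the first frame
    return "memory"         # mass spread across the full context


if __name__ == "__main__":
    print(classify_head([0.02, 0.02, 0.02, 0.02, 0.02, 0.1, 0.4, 0.4]))
    print(classify_head([0.5, 0.05, 0.05, 0.05, 0.05, 0.05, 0.1, 0.15]))
    print(classify_head([0.125] * 8))
```

In practice the proportions would be averaged over many queries, layers, and prompts before any labels are assigned; the single-vector input here only keeps the example short.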

Method

Based on our head profiling analysis, we propose Head Forcing, a training-free, head-wise framework built on pretrained AR video DiTs. It assigns each attention head type a tailored KV cache strategy, enabling robust minute-level video generation as well as prompt-guided interactive generation that stays consistent over long horizons.

Head Forcing Pipeline
Local Heads retain only the nearest neighboring frame and the current block for detail synthesis and motion continuity, freeing cache budget for memory heads.
Anchor Heads preserve the first few latent frames alongside local tokens to stabilize generation, leveraging the distinctive statistics of the initial image.
Memory Heads employ a hierarchical memory system with fast memory for immediate temporal neighborhood and episodic memory storing representative KV cache from distinct scenes with dynamic novelty-based updates.
Head-wise RoPE Re-encoding assigns contiguous frame indices to each head's assembled key sequence, ensuring all temporal relative positions remain within the pretrained range regardless of generation length.
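
The per-head strategies above can be sketched as a frame-selection step followed by contiguous re-indexing. This is a simplified illustration, not the authors' implementation: it works on whole frame indices instead of KV tensors, and `select_frames`, `reencode_positions`, `n_anchor`, `fast_window`, and `episodic` are hypothetical names and defaults. The actual number of retained frames per head type is not given in the source.

```python
from typing import Iterable, List


def select_frames(head_type: str, current: int, n_anchor: int = 3,
                  fast_window: int = 4,
                  episodic: Iterable[int] = ()) -> List[int]:
    """Return the absolute indices of frames one head keeps, oldest first.

    `current` is the index of the frame being generated.
    """
    if head_type == "local":
        # Nearest neighboring frame plus the current one.
        keep = {current - 1, current}
    elif head_type == "anchor":
        # First few frames plus the local tokens.
        keep = set(range(n_anchor)) | {current - 1, current}
    elif head_type == "memory":
        # Episodic entries plus a fast memory of the most recent frames.
        keep = set(episodic) | set(range(current - fast_window, current + 1))
    else:
        raise ValueError(f"unknown head type: {head_type}")
    return sorted(i for i in keep if 0 <= i <= current)


def reencode_positions(kept: List[int]) -> List[int]:
    """Assign contiguous temporal indices to the assembled key sequence."""
    return list(range(len(kept)))


if __name__ == "__main__":
    for head_type in ("local", "anchor", "memory"):
        kept = select_frames(head_type, current=100, episodic=(10, 40))
        print(head_type, kept, reencode_positions(kept))
```

At frame 100, the anchor head in this sketch keeps frames 0, 1, 2, 99, and 100, which are re-indexed as 0 through 4. The largest temporal offset the head sees is therefore 4 instead of 100, regardless of how long generation has run.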

Generation Results

Head Forcing extends autoregressive video generation from 5 seconds to minute-level duration without noticeable quality degradation. Below we showcase single-prompt long video generation, multi-prompt interactive video generation, and ultra-long 5-minute generation results.

Single-Prompt Long Video Generation (60s)


💬 A close-up 3D animated scene of a short, fluffy monster kneeling beside a melting red candle, gazing at the flame with wonder and curiosity, soft fluffy fur lit by warm dramatic lighting in a cozy fireplace-lit room.
💬 A gritty, realistic photo of a middle-aged firefighter giving a thumbs-up in front of a partially collapsed burning building, weathered face showing determination amid flames, smoke, and emergency vehicles.
💬 A dynamic urban alleyway scene capturing a cyclone of broken glass swirling through the narrow space, glass pieces twirling and scattering in a dimly lit, graffiti-covered alley with motion blur.
💬 A nature photo of a family of orangutans along the Kinabatangan River in Borneo: a mother holding her baby tightly, a larger father nearby, all set against lush tropical rainforest reflected in the water.

Multi-Prompt Interactive Generation (60s)

Each video is produced from six time-stamped sub-prompts that progressively guide the narrative.

0–10s: A young wizard with a pointed hat and sparkling robe practices magic in an ancient stone-walled chamber, levitating objects with his wooden wand.
10–20s: A small green-winged dragon flies in through a high window and lands on his shoulder. The wizard looks surprised but pleased.
20–30s: The wizard points his wand at a dusty book while the dragon puffs smoke, causing the book to flip open.
30–40s: The dragon sneezes a shower of harmless sparks onto the floating objects; the wizard pats its head affectionately.
40–50s: The wizard chants while the dragon flaps its wings, forming a swirling ball of light between them.
50–60s: The ball of light floats to the ceiling and bursts into magical stars; they share a triumphant smile.

0–10s: A female chef in a white toque meticulously plates a dish in a busy, stainless-steel restaurant kitchen, placing a microgreen with tweezers.
10–20s: A famous food critic with a stern expression and a notepad walks in from the dining room. The chef looks up, anxious.
20–30s: She explains the dish while he listens intently, scrutinizing every detail on the plate.
30–40s: The critic takes a small bite, chews thoughtfully, his expression unreadable.
40–50s: His expression softens into a smile; he makes a note and gives her a respectful nod of approval.
50–60s: They shake hands; he says complimentary words before turning to leave and she beams with pride.

0–10s: A cartoon mouse with large ears and overalls is stacking cheese blocks to build a castle on a brightly colored kitchen floor.
10–20s: A sleek cartoon cat with a mischievous grin peeks around the corner of the kitchen counter, eyeing the castle.
20–30s: The mouse stands defensively in front of his castle as the cat saunters closer, licking its lips.
30–40s: The mouse bravely squeaks at the cat, who simply smirks and raises a paw.
40–50s: The cat gently places a final tiny block of cheese atop the castle, completing it. The mouse opens its eyes in shock.
50–60s: Mouse and cat sit side-by-side admiring the castle; the cat winks and the mouse offers it a piece of cheese.

0–10s: An explorer in khaki clothing hacks through thick jungle vines with a machete; ancient stone ruins are partially visible.
10–20s: He uncovers a hidden temple entrance; a large moss-covered statue of a jaguar guardian stands beside it.
20–30s: The statue's eyes begin to glow green and it slowly turns its head toward the explorer.
30–40s: The explorer cautiously approaches; the statue gestures toward the temple, as if granting permission.
40–50s: The explorer gives a respectful nod and bravely steps into the dark entrance of the temple.
50–60s: The statue returns to its original pose, eyes no longer glowing; once again a silent moss-covered sentinel.
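
A time-stamped prompt schedule like the ones above can be represented as a sorted list of (start time, prompt) pairs. The sketch below only shows how the active sub-prompt might be looked up for a given timestamp. `active_prompt` and the shortened prompt strings are hypothetical, and how the model handles the switch internally is not described here.

```python
from bisect import bisect_right
from typing import List, Tuple

# Start time in seconds and an abbreviated sub-prompt (placeholders).
Schedule = List[Tuple[float, str]]

WIZARD_SCHEDULE: Schedule = [
    (0, "wizard practices magic"),
    (10, "dragon lands on his shoulder"),
    (20, "book flips open"),
    (30, "dragon sneezes sparks"),
    (40, "ball of light forms"),
    (50, "light bursts into stars"),
]


def active_prompt(schedule: Schedule, t: float) -> str:
    """Return the sub-prompt whose window contains time t (in seconds)."""
    starts = [start for start, _ in schedule]
    if t < starts[0]:
        raise ValueError("t precedes the first sub-prompt")
    # bisect_right finds the first start strictly greater than t.
    return schedule[bisect_right(starts, t) - 1][1]


if __name__ == "__main__":
    for t in (0, 9.9, 10, 35, 59):
        print(t, active_prompt(WIZARD_SCHEDULE, t))
```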

Ultra-Long Generation (5 Minutes)

Head Forcing scales beyond 60s and produces coherent 5-minute generations from a single prompt with stable subject identity and consistent style.

Prompt: A vibrant and whimsical digital illustration in a cartoon style, depicting a giant humanoid figure composed of fluffy blue cotton candy. The humanoid stomps its feet on the ground and roars toward the clear blue sky, with expressive eyes and a mischievous smile, arms and legs made of swirling cotton candy — a dynamic, full-body shot from a slightly elevated angle.

Comparisons

We compare against training-based methods (Rolling Forcing, LongLive, CausVid, Self Forcing) and training-free methods (Deep Forcing, Infinity-RoPE). Head Forcing scores highest among the training-free methods on nearly every VBench metric and is competitive with the training-based ones, while maintaining comparable throughput. We provide both qualitative side-by-side comparisons and quantitative VBench evaluations below.

Qualitative Comparison · Single-Prompt (60s)

Each example is a side-by-side comparison of seven methods: CausVid, Self Forcing, Rolling Forcing, Deep Forcing, Infinity-RoPE, LongLive, and Head Forcing (Ours).

Example 1
Prompt: A vibrant concert stage scene featuring a woman in the spotlight, singing passionately. She stands confidently with a microphone, wearing a stylish form-fitting black dress with silver embroidery. Bright backlighting creates a dramatic silhouette, with colorful stage lights and blurred event banners behind her. Dynamic medium shot from a slightly elevated angle.

Example 2
Prompt: A photorealistic close-up of two pirate ships battling each other as they sail inside a steaming cup of coffee. The ships are intricately detailed with wooden planks, flapping sails and cannons; pirate crews brandish swords and pistols. Coffee foam creates a frothy turbulent sea with realistic ripples. The camera is slightly elevated, capturing the intense action from above.

Example 3
Prompt: A vibrant and whimsical digital illustration in a cartoon style, depicting a giant humanoid figure composed of fluffy blue cotton candy.

Example 4
Prompt: A stunning mid-afternoon landscape photograph with a low camera angle, showcasing several giant wooly mammoths treading through a snowy meadow. Their long wooly fur billows in the brisk wind. Snow-covered trees and dramatic snow-capped mountains loom in the distance; wispy clouds and a high sun cast a warm glow over the scene.

Qualitative Comparison · Multi-Prompt (60s)

Each example is driven by six time-stamped sub-prompts. We compare against the strongest baselines.

Methods compared in each example: Self Forcing, Infinity-RoPE, LongLive, and Head Forcing (Ours).

Example 1
0–10s: A painter in a white apron works on a large, abstract canvas in a bright, sunlit art studio; the canvas is a swirl of blue and green.
10–20s: A renowned elderly art critic with a cane enters the studio and begins to inspect the work silently.
20–30s: The painter explains her vision while the critic listens without expression.
30–40s: The critic leans in close to the canvas to examine a single brushstroke; the studio is silent.
40–50s: He breaks into a rare smile and declares the work a masterpiece of modern expression.
50–60s: The painter and critic share a glass of wine; he offers to sponsor her next exhibition.

Example 2
0–10s: A male player in his late 30s in a fitted navy blazer sits at a well-lit poker table, gripping his hole cards with a tense expression.
10–20s: He flicks his cards onto the felt and leans back with arms spread wide in celebration.
20–30s: He reveals the winning hand; a nearby patron claps and cheers, amplifying the festive atmosphere.
30–40s: He sits upright and methodically arranges the stacks of multicolored chips in front of him.
40–50s: He glances over his chips and breaks into a proud, self-assured smile.
50–60s: He shares a celebratory high-five with a nearby patron as laughter and cheers ripple around the table.

Example 3
0–10s: A tired programmer codes intensely on a laptop in a dark, messy apartment late at night; multiple empty coffee mugs sit on the desk.
10–20s: A friendly bin-shaped robot assistant glides into the room holding a fresh, steaming cup of coffee.
20–30s: The programmer takes the coffee with a grateful smile; the robot's digital eyes form a happy expression.
30–40s: The programmer sips the coffee while the robot gently tidies up the empty mugs on the desk.
40–50s: The two share a silent moment of companionship; the programmer returns to coding, re-energized.
50–60s: The programmer finishes the code and leans back; the robot gives an encouraging pat on the shoulder.

Example 4
0–10s: A little boy in a blue T-shirt stands on a lush green lawn, arms relaxed at his sides, smiling at the camera.
10–20s: He starts to step forward, lightly running with arms swinging naturally; the grass and blue sky remain unchanged.
20–30s: He runs faster, leaning slightly forward, feet lightly off the ground.
30–40s: He jumps into the air with knees bent, arms raised in a light jump.
40–50s: He lands and continues running forward, arms swinging naturally, smiling.
50–60s: He stops, hands on hips with a smile, body leaning slightly forward in sunlight on the grass.

Quantitative Results on Long Video Generation

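| Duration | Type | Model | Dynamic Degree ↑ | Motion Smoothness ↑ | Temporal Flickering ↑ | Imaging Quality ↑ | Aesthetic Quality ↑ | Subject Consistency ↑ | Background Consistency ↑ |
|---|---|---|---|---|---|---|---|---|---|
| 30 seconds | Training-based | Rolling Forcing | 33.92 | 98.80 | 98.62 | 70.21 | 61.26 | 98.12 | 97.01 |
| 30 seconds | Training-based | LongLive | 40.72 | 98.83 | 98.80 | 69.22 | 61.44 | 97.97 | 97.15 |
| 30 seconds | Training-based | CausVid | 32.21 | 98.21 | 98.47 | 65.78 | 59.87 | 97.64 | 96.84 |
| 30 seconds | Training-based | Self Forcing | 35.22 | 98.42 | 98.58 | 68.24 | 60.16 | 97.62 | 96.77 |
| 30 seconds | Training-free | Deep Forcing | 36.42 | 98.31 | 98.59 | 69.33 | 60.70 | 97.85 | 96.95 |
| 30 seconds | Training-free | Infinity-RoPE | 38.27 | 98.64 | 98.79 | 69.16 | 60.82 | 97.77 | 97.06 |
| 30 seconds | Training-free | Head Forcing (Ours) | 42.14 | 98.76 | 98.78 | 70.30 | 61.68 | 98.07 | 97.08 |
| 60 seconds | Training-based | Rolling Forcing | 32.86 | 98.69 | 98.57 | 70.09 | 60.97 | 97.87 | 96.87 |
| 60 seconds | Training-based | LongLive | 40.29 | 98.75 | 98.70 | 69.08 | 61.11 | 97.72 | 96.92 |
| 60 seconds | Training-based | CausVid | 31.08 | 98.26 | 98.49 | 65.36 | 59.32 | 97.53 | 96.66 |
| 60 seconds | Training-based | Self Forcing | 31.92 | 98.21 | 98.56 | 67.33 | 57.17 | 97.32 | 96.53 |
| 60 seconds | Training-free | Deep Forcing | 35.75 | 98.38 | 98.63 | 68.93 | 60.39 | 97.70 | 96.82 |
| 60 seconds | Training-free | Infinity-RoPE | 37.19 | 98.49 | 98.61 | 68.68 | 59.65 | 97.73 | 96.88 |
| 60 seconds | Training-free | Head Forcing (Ours) | 41.37 | 98.67 | 98.72 | 69.27 | 61.36 | 97.81 | 96.90 |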

Ablation Studies

We conduct ablation studies on 60s generation to evaluate each component's contribution. Head-wise KV cache allocation (HW) prunes irrelevant context from local and anchor heads while giving memory heads larger context windows. Hierarchical memory (HM) replaces sliding windows with fast + episodic memory for long-range consistency. RoPE re-encoding keeps positional encodings within the pretrained range for temporal stability.
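
The novelty-based episodic update behind the hierarchical memory (HM) component could look roughly like the sketch below. It assumes each candidate frame is summarized by a small descriptor vector and is admitted only if it is dissimilar from every stored entry. `EpisodicMemory`, the cosine-similarity test, the `novelty_threshold` value, and the oldest-first eviction are all illustrative assumptions; the source does not specify the novelty measure or the eviction rule.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class EpisodicMemory:
    """Keeps a bounded set of mutually dissimilar frame descriptors."""

    def __init__(self, capacity: int = 4, novelty_threshold: float = 0.9):
        self.capacity = capacity
        self.novelty_threshold = novelty_threshold  # assumed value
        self.entries: List[Tuple[int, List[float]]] = []

    def maybe_add(self, frame_index: int, descriptor: Sequence[float]) -> bool:
        """Store the frame if it is novel; return True when it was stored."""
        max_sim = max(
            (cosine_similarity(descriptor, d) for _, d in self.entries),
            default=-1.0,
        )
        if max_sim >= self.novelty_threshold:
            return False  # an existing entry already represents this scene
        if len(self.entries) >= self.capacity:
            self.entries.pop(0)  # evict the oldest entry (assumed policy)
        self.entries.append((frame_index, list(descriptor)))
        return True

    def frame_indices(self) -> List[int]:
        return [index for index, _ in self.entries]
```

In a real system the stored payload would be the representative KV cache for the scene, with the descriptor serving only as the key for the novelty test.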

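| Configuration | HW | HM | RoPE | Dynamic Degree ↑ | Motion Smoothness ↑ | Subject Consistency ↑ | Imaging Quality ↑ |
|---|---|---|---|---|---|---|---|
| (A) Baseline | | | | 31.92 | 98.42 | 97.62 | 67.33 |
| (B) + HW | ✓ | | | 37.22 | 98.49 | 97.65 | 68.14 |
| (C) + HM | ✓ | ✓ | | 39.09 | 98.54 | 97.73 | 68.93 |
| (D) Full (Ours) | ✓ | ✓ | ✓ | 41.37 | 98.69 | 97.81 | 69.27 |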
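
To read the ablation as per-component gains, the step-to-step differences can be computed directly from the reported numbers. The short script below is only an arithmetic aid over the table values; `stepwise_gains` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

# Values copied from the ablation table (60s generation).
ABLATION: Dict[str, Dict[str, float]] = {
    "(A) Baseline": {"Dynamic Degree": 31.92, "Motion Smoothness": 98.42,
                     "Subject Consistency": 97.62, "Imaging Quality": 67.33},
    "(B) + HW": {"Dynamic Degree": 37.22, "Motion Smoothness": 98.49,
                 "Subject Consistency": 97.65, "Imaging Quality": 68.14},
    "(C) + HM": {"Dynamic Degree": 39.09, "Motion Smoothness": 98.54,
                 "Subject Consistency": 97.73, "Imaging Quality": 68.93},
    "(D) Full (Ours)": {"Dynamic Degree": 41.37, "Motion Smoothness": 98.69,
                        "Subject Consistency": 97.81, "Imaging Quality": 69.27},
}


def stepwise_gains(table: Dict[str, Dict[str, float]],
                   metric: str) -> List[Tuple[str, float]]:
    """Gain of each configuration over the previous row, rounded to 2 places."""
    names = list(table)  # dicts preserve insertion order
    return [
        (names[i], round(table[names[i]][metric] - table[names[i - 1]][metric], 2))
        for i in range(1, len(names))
    ]


if __name__ == "__main__":
    for metric in ABLATION["(A) Baseline"]:
        print(metric, stepwise_gains(ABLATION, metric))
```

For Dynamic Degree, the gains are 5.30 points from HW, 1.87 from HM, and 2.28 from RoPE re-encoding, for a total of 9.45 points from (A) to (D).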

BibTeX

@inproceedings{tian2026headforcing,
  title     = {Head Forcing: Long Autoregressive Video Generation
               via Head Heterogeneity},
  author    = {Jiahao Tian and Yiwei Wang and Gang Yu and Chi Zhang},
  booktitle = {arXiv preprint},
  year      = {2026}
}