Autoregressive (AR) video diffusion models support real-time synthesis but suffer from error accumulation and context loss over long horizons. We discover that attention heads in AR video diffusion transformers serve functionally distinct roles: local heads refine fine-grained detail, anchor heads stabilize global structure, and memory heads aggregate long-range context. Existing methods, however, treat all heads uniformly, leading to suboptimal KV cache allocation. We propose Head Forcing, a training-free framework that assigns each head type a tailored KV cache strategy: local and anchor heads retain only their essential tokens, while memory heads employ a hierarchical memory system with dynamic episodic updates for long-range consistency. A head-wise RoPE re-encoding scheme further keeps positional encodings within the pretrained range. Without any additional training, Head Forcing extends generation from 5 seconds to minute-level duration, supports multi-prompt interactive synthesis, and consistently outperforms existing baselines.
We discover that attention heads in AR video diffusion transformers serve functionally distinct roles, yet existing methods treat them uniformly, leading to suboptimal KV cache allocation. Profiling the attention patterns reveals three distinct head types (a minimal profiling sketch follows the figures below):

- **Local heads** concentrate on nearby tokens and refine fine-grained detail.
- **Anchor heads** attend to early reference tokens and stabilize global structure.
- **Memory heads** spread attention over distant context and aggregate long-range information.
Representative attention patterns for different attention heads.
Attention proportion for each head shows clear clustering into local, anchor, and memory heads.
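To make the profiling concrete, here is a minimal sketch of how heads could be classified from their attention maps. The window sizes, thresholds, and the `profile_heads` helper are illustrative assumptions, not the paper's exact procedure or values.

```python
# Hypothetical head-profiling sketch; window sizes and thresholds are
# illustrative assumptions, not values from the paper.
import torch

def profile_heads(attn, local_window=256, anchor_len=128,
                  local_thr=0.6, anchor_thr=0.4):
    """Label each head by where its attention mass concentrates.

    attn: [num_heads, q_len, k_len] softmax attention weights
    returns: one of 'local' / 'anchor' / 'memory' per head
    """
    num_heads, q_len, k_len = attn.shape
    q_pos = torch.arange(q_len).unsqueeze(1)                  # [q_len, 1]
    k_pos = torch.arange(k_len).unsqueeze(0)                  # [1, k_len]
    local_mask = (q_pos - k_pos).abs() <= local_window        # near-diagonal band
    anchor_mask = (k_pos < anchor_len).expand(q_len, k_len)   # earliest tokens

    labels = []
    for h in range(num_heads):
        local_mass = attn[h][local_mask].sum() / q_len        # avg mass per query
        anchor_mass = attn[h][anchor_mask].sum() / q_len
        if local_mass > local_thr:
            labels.append("local")    # detail refinement near the diagonal
        elif anchor_mass > anchor_thr:
            labels.append("anchor")   # mass pinned on early structural tokens
        else:
            labels.append("memory")   # diffuse long-range aggregation
    return labels
```

Heads falling in neither band spread their attention broadly over distant context, matching the clustering shown in the attention-proportion figure above.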
Based on our head profiling analysis, we propose Head Forcing, a training-free, head-wise framework built on pretrained AR video DiTs. It assigns each attention head type a tailored KV cache strategy, enabling robust minute-level video generation and long-term consistent prompt-guided interactive generation.
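As a rough illustration of the per-head-type cache policy, the sketch below shows one way the allocation could be realized. All buffer names and sizes (`sink`, `window`, `fast`, `episodic`) are hypothetical, and the uniform striding used for the episodic tier is a stand-in for the dynamic episodic updates described above.

```python
# Hedged sketch of per-head-type KV eviction; names and sizes are
# hypothetical, not the released implementation.
import torch

def evict_head_cache(keys, values, head_type, sink=128, window=1024,
                     fast=512, episodic=256):
    """Select which cached (key, value) rows a single head retains.

    keys, values: [seq_len, head_dim] cache entries for one head
    head_type: 'local' | 'anchor' | 'memory'
    """
    seq_len = keys.shape[0]
    if head_type == "local":
        # Local heads: keep only a recent window of tokens.
        keep = torch.arange(max(0, seq_len - window), seq_len)
    elif head_type == "anchor":
        # Anchor heads: keep the earliest sink tokens plus a short recent tail.
        head = torch.arange(min(sink, seq_len))
        tail = torch.arange(max(sink, seq_len - window // 4), seq_len)
        keep = torch.cat([head, tail])
    else:
        # Memory heads: a fast buffer of recent tokens plus an episodic tier
        # that subsamples older context (uniform stride here as a placeholder
        # for content-based dynamic episodic updates).
        recent = torch.arange(max(0, seq_len - fast), seq_len)
        older = torch.arange(max(0, seq_len - fast))
        if older.numel() > episodic:
            idx = torch.linspace(0, older.numel() - 1, episodic).long()
            older = older[idx]
        keep = torch.cat([older, recent])
    return keys[keep], values[keep]
```

The design point is that local and anchor heads stay cheap (small, fixed budgets), so the memory budget freed up can be spent on the few heads that actually carry long-range context.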
Head Forcing extends autoregressive video generation from 5 seconds to minute-level duration without noticeable quality degradation. Below we showcase single-prompt long video generation, multi-prompt interactive video generation, and ultra-long 5-minute generation results.
Each video is produced from six time-stamped sub-prompts that progressively guide the narrative.
Head Forcing scales beyond 60s and produces coherent 5-minute generations from a single prompt with stable subject identity and consistent style.
Compared to both training-based methods (Rolling Forcing, LongLive, CausVid, Self Forcing) and training-free methods (Deep Forcing, Infinity-RoPE), Head Forcing consistently outperforms across nearly all VBench metrics while maintaining comparable throughput. We provide both qualitative side-by-side comparisons and quantitative VBench evaluations below.
Each example is driven by six time-stamped sub-prompts. We compare against the strongest baselines.
| Model | Dynamic Degree ↑ | Motion Smoothness ↑ | Temporal Flickering ↑ | Imaging Quality ↑ | Aesthetic Quality ↑ | Subject Consistency ↑ | Background Consistency ↑ |
|---|---|---|---|---|---|---|---|
| Training-based · 30 seconds | |||||||
| Rolling Forcing | 33.92 | 98.80 | 98.62 | 70.21 | 61.26 | 98.12 | 97.01 |
| LongLive | 40.72 | 98.83 | 98.80 | 69.22 | 61.44 | 97.97 | 97.15 |
| CausVid | 32.21 | 98.21 | 98.47 | 65.78 | 59.87 | 97.64 | 96.84 |
| Self Forcing | 35.22 | 98.42 | 98.58 | 68.24 | 60.16 | 97.62 | 96.77 |
| Training-free · 30 seconds | |||||||
| Deep Forcing | 36.42 | 98.31 | 98.59 | 69.33 | 60.70 | 97.85 | 96.95 |
| Infinity-RoPE | 38.27 | 98.64 | 98.79 | 69.16 | 60.82 | 97.77 | 97.06 |
| Head Forcing (Ours) | 42.14 | 98.76 | 98.78 | 70.30 | 61.68 | 98.07 | 97.08 |
| Training-based · 60 seconds | |||||||
| Rolling Forcing | 32.86 | 98.69 | 98.57 | 70.09 | 60.97 | 97.87 | 96.87 |
| LongLive | 40.29 | 98.75 | 98.70 | 69.08 | 61.11 | 97.72 | 96.92 |
| CausVid | 31.08 | 98.26 | 98.49 | 65.36 | 59.32 | 97.53 | 96.66 |
| Self Forcing | 31.92 | 98.21 | 98.56 | 67.33 | 57.17 | 97.32 | 96.53 |
| Training-free · 60 seconds | |||||||
| Deep Forcing | 35.75 | 98.38 | 98.63 | 68.93 | 60.39 | 97.70 | 96.82 |
| Infinity-RoPE | 37.19 | 98.49 | 98.61 | 68.68 | 59.65 | 97.73 | 96.88 |
| Head Forcing (Ours) | 41.37 | 98.67 | 98.72 | 69.27 | 61.36 | 97.81 | 96.90 |
We conduct ablation studies on 60-second generation to evaluate each component's contribution. Head-wise KV cache allocation (HW) prunes irrelevant context from local and anchor heads while giving memory heads larger context windows. Hierarchical memory (HM) replaces the sliding window with a fast buffer plus episodic memory for long-range consistency. RoPE re-encoding keeps positional encodings within the pretrained range for temporal stability (a simplified sketch follows the table below).
| Configuration | HW | HM | RoPE | Dynamic Degree ↑ | Motion Smoothness ↑ | Subject Consistency ↑ | Imaging Quality ↑ |
|---|---|---|---|---|---|---|---|
| (A) Baseline | ✗ | ✗ | ✗ | 31.92 | 98.42 | 97.62 | 67.33 |
| (B) + HW | ✓ | ✗ | ✗ | 37.22 | 98.49 | 97.65 | 68.14 |
| (C) + HM | ✓ | ✓ | ✗ | 39.09 | 98.54 | 97.73 | 68.93 |
| (D) Full (Ours) | ✓ | ✓ | ✓ | 41.37 | 98.69 | 97.81 | 69.27 |
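To illustrate the RoPE re-encoding component, here is a simplified 1-D sketch. It assumes un-rotated keys are cached so that survivors can be re-embedded at compacted positions; the actual model uses 3-D RoPE over (t, h, w), and all function names here are hypothetical.

```python
# Simplified 1-D sketch of RoPE re-encoding after cache eviction.
# Assumption: un-rotated keys are cached; real video DiTs use 3-D RoPE.
import torch

def rope_angles(positions, dim, base=10000.0):
    """Rotation angle for each (position, frequency) pair."""
    freqs = base ** (-torch.arange(0, dim, 2).float() / dim)   # [dim/2]
    return positions.float().unsqueeze(-1) * freqs             # [n, dim/2]

def apply_rope(x, angles):
    """Rotate consecutive feature pairs of x ([n, dim]) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def reencode_kept_keys(raw_keys, kept_positions):
    """Re-embed surviving keys at compacted positions 0..n-1.

    raw_keys: [n, dim] cached keys *before* rotary embedding
    kept_positions: original absolute positions of the survivors
    """
    order = kept_positions.argsort()             # restore chronological order
    new_pos = torch.arange(len(kept_positions))  # compact, contiguous positions
    return apply_rope(raw_keys[order], rope_angles(new_pos, raw_keys.shape[-1]))
```

Because surviving tokens are renumbered 0..n-1, the maximum query-key offset is bounded by the cache size rather than the total generated length, which is how the positional encodings stay inside the range seen during pretraining.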
@article{tian2026headforcing,
  title   = {Head Forcing: Long Autoregressive Video Generation via Head Heterogeneity},
  author  = {Jiahao Tian and Yiwei Wang and Gang Yu and Chi Zhang},
  journal = {arXiv preprint},
  year    = {2026}
}