AnyFlow: Continuous Mapping in Vision
- AnyFlow is the name shared by two distinct frameworks that apply continuous mapping techniques: one for arbitrary-scale optical flow estimation and one for any-step video diffusion.
- The optical flow variant incorporates an implicit neural representation, multi-scale feature warping, and dynamic correlation lookup to capture fine motion details under varied resolutions.
- The video diffusion system distills flow-map transitions over arbitrary time intervals, reducing discretization error and ensuring consistent generative sampling.
“AnyFlow” denotes two distinct research systems that share a name but address different technical problems. In optical flow estimation, AnyFlow is a network that “for the first time, treats the flow field itself as a continuous function of image coordinates,” enabling accurate motion prediction at arbitrary spatial scales from low-resolution inputs (Jung et al., 2023). In video generation, AnyFlow is “the first any-step video diffusion distillation framework based on flow maps,” designed to support arbitrary inference budgets while preserving the test-time scaling behavior of probability-flow ODE sampling (Gu et al., 13 May 2026). The shared term therefore refers not to a single unified framework, but to two separate lines of work centered on continuity across scales or time intervals.
1. Name and scope
The 2023 AnyFlow paper addresses optical flow under input resizing, especially the degradation of small-object and boundary accuracy when images are downsized for efficiency (Jung et al., 2023). Its central claim is that optical flow can be represented as a continuous coordinate-based function and queried at arbitrary output resolutions. The method is built on a RAFT-style iterative refinement backbone augmented with an implicit neural representation, multi-scale feature warping, and dynamic correlation lookup.
The 2026 AnyFlow paper addresses few-step and any-step video diffusion distillation (Gu et al., 13 May 2026). Its starting point is the observation that consistency-distilled video models often degrade when more sampling steps are used at test time. AnyFlow replaces endpoint-only consistency with flow-map transition learning over arbitrary time intervals, together with on-policy flow map distillation.
This suggests that the common semantic thread behind the two systems is not task overlap but a methodological emphasis on continuous mappings: spatial continuity in the optical-flow variant and temporal-transition continuity in the diffusion variant.
2. AnyFlow for arbitrary-scale optical flow
In the optical-flow formulation, the flow field is modeled as a continuous function of image coordinates that can be sampled at an output resolution set by any positive scale factor $s$, with $s < 1$ for downsizing and $s > 1$ for upsampling (Jung et al., 2023). The method takes a low-resolution image pair as input and can directly produce flow at the original resolution, a down-sampled resolution, or a super-resolved resolution without relying on naive interpolation.
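As a rough illustration of what querying at a scale factor means, the sketch below maps every pixel center of a rescaled output grid back to continuous coordinates in the low-resolution input grid. The function name `query_coords`, the pixel-center convention, and the rounding of the output size are assumptions made here for illustration, not details taken from the paper.

```python
def query_coords(h, w, s):
    """Continuous input-grid coordinates for each pixel of an s-scaled output grid.

    Assumes pixel centers sit at integer + 0.5 (a common convention, hypothetical here).
    """
    out_h, out_w = max(1, round(h * s)), max(1, round(w * s))
    coords = []
    for i in range(out_h):
        y = (i + 0.5) * h / out_h - 0.5  # row position expressed in the input grid
        row = []
        for j in range(out_w):
            x = (j + 0.5) * w / out_w - 0.5  # column position in the input grid
            row.append((x, y))
        coords.append(row)
    return coords

grid = query_coords(4, 6, 1.5)  # non-integer upscaling of a 4x6 grid
print(len(grid), len(grid[0]))
print(grid[0][0])
```

Because the query coordinates are real-valued, the same routine covers downsizing, identity, and super-resolution; only the scale argument changes.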
After the encoder and GRU update inherited from RAFT produce a hidden state and an accumulated low-resolution flow, AnyFlow upsamples that flow to an arbitrary output resolution via a small MLP (Jung et al., 2023). For a continuous query coordinate in the output domain, the network locates its nine nearest integer neighbors in the coarse flow grid, identifies the nearest feature and the integer coordinate at which it sits, and feeds a tuple built from these quantities into the MLP, using the positional encoding from LIIF. The output is interpreted as convex weights on the local neighbors of the query, together with an arrangement that simultaneously produces a patch of higher-resolution flow (Jung et al., 2023). By sampling a grid of query points and assembling the resulting patches, the method reconstructs the flow at the target resolution in one shot.
A key consequence is that the output scale may be chosen arbitrarily, including non-integer scalings, without changing the model (Jung et al., 2023). The paper presents this as a departure from fixed-grid optical flow prediction.
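The convex-weight idea can be sketched for a single query point as follows. This is a minimal illustration, assuming a softmax over a 3x3 neighborhood as in RAFT-style convex upsampling; the `logits` argument stands in for the output of the learned MLP, and all function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax, giving non-negative weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def convex_upsample_at(coarse_flow, cx, cy, logits):
    """Blend the 3x3 coarse-flow neighborhood around (cx, cy) with convex weights.

    coarse_flow: 2-D list of (u, v) tuples; logits: 9 raw scores that a learned
    MLP would supply (placeholder values are used below).
    """
    h, w = len(coarse_flow), len(coarse_flow[0])
    weights = softmax(logits)
    u = v = 0.0
    k = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            y = min(max(cy + dy, 0), h - 1)  # clamp at the image border
            x = min(max(cx + dx, 0), w - 1)
            fu, fv = coarse_flow[y][x]
            u += weights[k] * fu
            v += weights[k] * fv
            k += 1
    return u, v

flow = [[(float(x), float(y)) for x in range(4)] for y in range(4)]
print(convex_upsample_at(flow, 1, 1, [0.0] * 9))  # uniform weights
```

Since the weights are convex, the upsampled vector always stays within the range spanned by the neighboring coarse vectors, which keeps the output well behaved near motion boundaries.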
3. Optical-flow architecture, training, and inference
AnyFlow extends RAFT’s iterative design with multi-scale feature warping (Jung et al., 2023). Whereas RAFT works at a single coarse feature scale, AnyFlow also extracts features at finer scales in addition to that coarse scale. After each GRU iteration, the current flow estimate is upsampled via the implicit upsampler to those finer scales, and second-frame features are warped by the upsampled flow. These warped features are concatenated with first-frame features, processed by a convolution, and then brought back to the coarse grid with PixelShuffle.
According to the paper, this injects high-frequency spatial cues into the update and improves boundary localization and tiny-object capture (Jung et al., 2023).
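The warping step can be illustrated with a small backward-warp routine. This is a sketch on scalar feature maps stored as nested lists, assuming bilinear sampling with border clamping; a real implementation would operate on multi-channel tensors, and the helper names are hypothetical.

```python
def bilinear_sample(feat, x, y):
    """Sample a 2-D list of floats at continuous (x, y), clamping to the border."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feat[y0][x0] + ax * feat[y0][x1]
    bottom = (1 - ax) * feat[y1][x0] + ax * feat[y1][x1]
    return (1 - ay) * top + ay * bottom

def warp(feat2, flow):
    """Backward-warp feat2 toward frame 1: out[y][x] = feat2 at (x + u, y + v)."""
    return [[bilinear_sample(feat2, x + flow[y][x][0], y + flow[y][x][1])
             for x in range(len(feat2[0]))] for y in range(len(feat2))]

feat2 = [[float(10 * y + x) for x in range(4)] for y in range(3)]
flow = [[(1.0, 0.0)] * 4 for _ in range(3)]  # every pixel moves one step right
print(warp(feat2, flow)[0])
```

When the flow is accurate, the warped second-frame features line up with the first-frame features, so their concatenation exposes residual misalignment to the update block.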
The method also replaces RAFT’s fixed-radius correlation lookup with a dynamic lookup strategy (Jung et al., 2023). At each iteration and for each pixel, a residual radius is predicted and accumulated into the current lookup radius. The local search window therefore grows or shrinks per pixel depending on motion magnitude, while the number of sampled points remains fixed. To address blind spots when the radius becomes large, the model defines nine auxiliary sub-pixel offsets around each sample location, collects their correlation values, and feeds them, along with a further conditioning input, into a small MLP. The paper terms this “region encoding.”
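A minimal sketch of the dynamic lookup geometry is given below: a grid with a fixed number of sample points is stretched to cover a per-pixel radius. The default `half_width=4` mirrors the lookup radius of the original RAFT, and the lower clamp `r_min` is a hypothetical safeguard; neither is taken from the AnyFlow paper.

```python
def lookup_offsets(radius, half_width=4):
    """Offsets of a (2*half_width+1)^2 sampling grid stretched to span [-radius, radius].

    half_width=4 mirrors RAFT's default lookup radius and is an assumption here.
    """
    step = radius / half_width
    ticks = [k * step for k in range(-half_width, half_width + 1)]
    return [(dx, dy) for dy in ticks for dx in ticks]

def update_radius(radius, residual, r_min=1.0):
    """Accumulate a predicted residual; the lower clamp r_min is hypothetical."""
    return max(r_min, radius + residual)

r = 4.0
for residual in (2.0, -3.0):  # stand-ins for per-pixel network predictions
    r = update_radius(r, residual)
    offsets = lookup_offsets(r)
    print(r, len(offsets), offsets[0], offsets[-1])
```

Once the radius exceeds `half_width`, neighboring samples sit more than one pixel apart, which is the kind of blind spot that the auxiliary sub-pixel offsets are meant to cover.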
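Training follows the original RAFT loss, an exponentially weighted sum of $\ell_1$ distances between the ground-truth flow $\mathbf{f}_{gt}$ and the prediction $\mathbf{f}_i$ after each of the $N$ updates:
$$\mathcal{L} = \sum_{i=1}^{N} \gamma^{N-i} \lVert \mathbf{f}_{gt} - \mathbf{f}_i \rVert_1$$
In all experiments, the number of updates $N$ during training and the decay factor $\gamma$ are held fixed, and no additional photometric or smoothness terms are used (Jung et al., 2023). To encourage robustness to arbitrary scales, the input image pair is downsampled with a fixed probability by an independently and uniformly chosen scale, typically down to 50%–90% of the original, while the model is required to recover the flow at the original resolution.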
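The RAFT-style sequence loss can be sketched in a few lines. This is an illustration on flat lists of flow vectors, assuming a per-pixel L1 norm averaged over pixels; the default `gamma=0.8` is the value used in the original RAFT and is only an assumption for AnyFlow.

```python
def sequence_loss(preds, gt, gamma=0.8):
    """Exponentially weighted L1 loss over a list of iterative flow predictions.

    preds: list of N flow fields, each a list of (u, v) tuples; gt: same layout.
    Later iterations receive larger weights (gamma ** 0 for the last one).
    """
    n = len(preds)
    total = 0.0
    for i, pred in enumerate(preds, start=1):
        l1 = sum(abs(pu - gu) + abs(pv - gv)
                 for (pu, pv), (gu, gv) in zip(pred, gt)) / len(gt)
        total += gamma ** (n - i) * l1
    return total

gt = [(1.0, 0.0), (0.0, 1.0)]
preds = [
    [(0.0, 0.0), (0.0, 0.0)],  # first iteration: far from the target
    [(1.0, 0.0), (0.0, 0.5)],  # second iteration: closer
]
loss = sequence_loss(preds, gt)
print(loss)
```

Weighting later iterations more heavily pushes the refinement toward an accurate final estimate while still supervising the early updates.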
At inference, the procedure is explicit: optionally downsample the original frames; encode multi-scale features; build all-pairs correlations at the coarse scale; initialize the iterative state; iterate GRU updates and dynamic correlation sampling; upsample each intermediate flow to the arbitrary target scale via the implicit upsampler; and return the final upsampled flow (Jung et al., 2023). The implicit upsampler and region-encoding module add parameters on top of RAFT’s, and the runtime overhead is described as small.
4. Empirical profile of the optical-flow model
The optical-flow AnyFlow is reported to establish “a new state-of-the-art performance of cross-dataset generalization on the KITTI dataset,” while achieving comparable benchmark accuracy to other SOTA methods (Jung et al., 2023). In cross-dataset generalization experiments trained on FlyingChairs+Things only, AnyFlow (dynamic) reduces the Sintel-clean EPE relative to RAFT, and region encoding reduces it further (Jung et al., 2023). On KITTI-train, the paper reports F1-epe and F1-all for RAFT, for AnyFlow (dynamic), and for the region-encoding variant.
After fine-tuning, the paper reports Sintel test-set EPE for AnyFlow (dynamic) on both the “clean” and “final” passes, compared against RAFT’s scores on the same passes (Jung et al., 2023). On KITTI-test, it compares the F1-all of AnyFlow (dynamic) with that of RAFT. The paper characterizes this as second best overall and top among non-Transformer, non-ImageNet-pretrained methods.
A major emphasis is robustness to input downsampling (Jung et al., 2023). When methods are fed inputs downsampled to 50%–90% of their original size and evaluated after re-upsampling flow to original size, the Sintel-clean EPE of both RAFT and AnyFlow rises as the input shrinks from 100% to 50%, and the same holds on KITTI. The paper interprets AnyFlow’s stability under this degradation as evidence for the advantage of continuous flow representation.
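The evaluation protocol described above depends on one detail that is easy to miss: when a flow field is resized, its vectors must be rescaled by the same factor before the end-point error is computed. The sketch below illustrates this with nearest-neighbor resizing, which is a stand-in chosen here for brevity rather than the interpolation used in the paper; the helper names are hypothetical.

```python
import math

def resize_flow_nearest(flow, out_h, out_w):
    """Nearest-neighbor resize of a flow field, rescaling the vectors too.

    A flow vector measured in pixels of an (h, w) grid must be multiplied by
    (out_w / w, out_h / h) to stay valid on the (out_h, out_w) grid.
    """
    h, w = len(flow), len(flow[0])
    sx, sy = out_w / w, out_h / h
    out = []
    for i in range(out_h):
        src_y = min(int(i / sy), h - 1)
        row = []
        for j in range(out_w):
            src_x = min(int(j / sx), w - 1)
            row.append((flow[src_y][src_x][0] * sx, flow[src_y][src_x][1] * sy))
        out.append(row)
    return out

def epe(pred, gt):
    """Mean Euclidean end-point error between two flow fields of equal size."""
    errs = [math.hypot(pu - gu, pv - gv)
            for prow, grow in zip(pred, gt)
            for (pu, pv), (gu, gv) in zip(prow, grow)]
    return sum(errs) / len(errs)

gt = [[(4.0, 2.0)] * 4 for _ in range(4)]   # full-resolution ground truth
low = [[(2.0, 1.0)] * 2 for _ in range(2)]  # flow estimated at 50% size
print(epe(resize_flow_nearest(low, 4, 4), gt))
```

Skipping the vector rescaling would leave every displacement at half its true length in this example, inflating the error regardless of how good the low-resolution estimate was.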
Qualitatively, the reported comparisons indicate that when inputs are halved or lower, RAFT, GMA, and GMFlow lose fine boundaries around wheels, limbs, or thin rods, and small motions vanish entirely, whereas AnyFlow retains crisp object outlines and captures tiny displacements (Jung et al., 2023). A plausible implication is that the model is especially relevant in resource-constrained settings where aggressive input resizing is operationally necessary.
5. AnyFlow for any-step video diffusion
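In the 2026 work, AnyFlow denotes a video diffusion distillation framework based on flow maps rather than endpoint consistency (Gu et al., 13 May 2026). A flow map model with learned parameters implements a two-time transition: given a noisy sample at one time, it predicts the state of the same trajectory at another time, for any admissible pair of times. Endpoint consistency, which always maps to the clean-data end of the trajectory, is treated as the special case in which the target time is fixed at that endpoint.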
The underlying teacher model is described by a probability-flow ODE whose drift is determined by the teacher score network and the noise schedule (Gu et al., 13 May 2026). The distilled model is trained to approximate the exact solution mapping of this ODE between any two times, so that one call to the flow map reproduces the corresponding segment of the teacher trajectory.
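One hedged way to write these objects down is given below. The notation is chosen here for illustration and is not the paper's: $v_\phi$ denotes the drift induced by the teacher network and noise schedule, $\Phi_{t \to s}$ the exact ODE solution map from time $t$ to time $s$, and $f_\theta$ the distilled flow map.

```latex
\begin{aligned}
\frac{\mathrm{d}x_t}{\mathrm{d}t} &= v_\phi(x_t, t), \\
\Phi_{t \to s}(x_t) &= x_t + \int_t^s v_\phi(x_\tau, \tau)\,\mathrm{d}\tau, \qquad \Phi_{t \to t}(x_t) = x_t, \\
f_\theta(x_t, t, s) &\approx \Phi_{t \to s}(x_t), \\
\Phi_{r \to s} \circ \Phi_{t \to r} &= \Phi_{t \to s}.
\end{aligned}
```

The last line is the composition property of ODE solution maps: two consecutive transitions equal one longer transition. Endpoint consistency corresponds to fixing $s$ at the clean-data end of the trajectory, so it only ever learns one slice of the two-time map.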
The paper’s central claim is that distilling full ODE transitions rather than only endpoints preserves the desirable property that increasing the number of solver steps at test time monotonically improves approximation of the ODE trajectory (Gu et al., 13 May 2026). Test-time sampling can therefore use an Euler scheme with any chosen number of steps, advancing the sample along a time grid one transition at a time.
The framework is presented as the first any-step video diffusion distillation method built around this two-time transition formulation.
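The step-scaling behavior can be illustrated on a toy problem. The sketch below integrates a one-dimensional linear ODE with Euler steps and compares the result with the closed-form solution; the ODE, the time convention (from 1 down to 0), and the function names are assumptions chosen for illustration, with `velocity` standing in for a real teacher or distilled network.

```python
import math

def velocity(x, t):
    """Toy drift dx/dt = x (a stand-in for a real network prediction)."""
    return x

def exact_flow_map(x, t, s):
    """Closed-form solution of the toy ODE from time t to time s."""
    return x * math.exp(s - t)

def euler_sample(x, n_steps, t_start=1.0, t_end=0.0):
    """Integrate the toy ODE from t_start to t_end with n_steps Euler steps."""
    ts = [t_start + (t_end - t_start) * k / n_steps for k in range(n_steps + 1)]
    for t, s in zip(ts[:-1], ts[1:]):
        x = x + (s - t) * velocity(x, t)
    return x

target = exact_flow_map(1.0, 1.0, 0.0)
errors = [abs(euler_sample(1.0, n) - target) for n in (1, 2, 4, 8, 32)]
print(errors)
```

On this toy problem the error shrinks as the step count grows, which is the behavior the any-step formulation aims to retain after distillation.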
6. Distillation, backward simulation, and empirical results in video generation
After an initial forward flow-map training stage, the diffusion AnyFlow refines the flow-map model via an on-policy Distribution-Matching Distillation loss (Gu et al., 13 May 2026). For a sampled interval derived from a target step budget, the model produces a three-segment rollout. It then re-noises the rollout output back to an intermediate noise level via the known forward process and minimizes the KL divergence between teacher and student predicted distributions at that noise level. The expectation is over the target step budget, the sampled split, initial noise, and prompt context (Gu et al., 13 May 2026).
The paper further introduces Flow Map Backward Simulation, which decomposes a full Euler rollout into shortcut flow-map transitions (Gu et al., 13 May 2026). In practice, shortcut flow-map calls replace long multistep Euler rollouts and thereby reduce training cost. The method is motivated by two failure modes: discretization error in few-step sampling and exposure bias in causal generation. By training over all time intervals and using on-policy rollouts, AnyFlow is reported to correct coarse-step ODE errors and reduce drift in causal models.
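The reason shortcut calls can replace a long rollout is the composition property of flow maps: chaining transitions over a fine time grid lands at the same point as a few long transitions. The sketch below demonstrates this with an exact flow map of a toy linear ODE, which stands in for a learned model; the names and the time grid are hypothetical.

```python
import math

def flow_map(x, t, s):
    """Stand-in for a learned flow map: exact solution of the toy ODE dx/dt = x."""
    return x * math.exp(s - t)

def multistep_rollout(x, times):
    """Chain flow-map calls along a time grid, one call per consecutive pair."""
    calls = 0
    for t, s in zip(times[:-1], times[1:]):
        x = flow_map(x, t, s)
        calls += 1
    return x, calls

fine_grid = [1.0 - k / 16 for k in range(17)]  # 16 small steps from t=1 to t=0
x_long, n_long = multistep_rollout(2.0, fine_grid)
x_short, n_short = multistep_rollout(2.0, [1.0, 0.5, 0.0])  # two shortcut calls
print(n_long, n_short, abs(x_long - x_short))
```

A learned flow map only satisfies this property approximately, so the saving in network calls comes with an approximation error that training over many intervals is meant to keep small.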
The implementation spans both bidirectional and causal architectures (Gu et al., 13 May 2026). The bidirectional variant uses Wan2.1, a diffusion transformer backbone, distilled via the AnyFlow pipeline. The causal variant uses FAR context compression with three “full-token” chunks at patch size 2, the remainder at patch size 4, and a first chunk of size 1 for precise first-frame conditioning. KV caches are reused between flow-map calls to speed up backward simulation. Reported parameter scales are 1.3 B for Wan2.1-1.3B and 14 B for Wan2.1-14B.
The paper also specifies forward training tricks and hyperparameters (Gu et al., 13 May 2026). These include interpolated timestep conditioning, guidance-fused training, adaptive loss reweighting anchored at boundary cases, and a time sampler that is uniform over the admissible time pairs with an optional timestep reweight function. The optimizer is AdamW with weight decay, separate learning rates and iteration counts for stage 1 and stage 2, a fixed per-GPU batch size across multiple GPUs, and a guidance scale matched to the pretrained teacher.
Selected VBench totals are reported for the 14 B models (Gu et al., 13 May 2026). For bidirectional text-to-video, the comparison covers rCM at 4 NFE and AnyFlow at 4 and 32 NFE. For causal text-to-video, it covers Krea-Realtime at 4 NFE and AnyFlow-FAR at 4 and 32 NFE. For image-to-video, it covers Wan2.1-I2V at 100 NFE and AnyFlow-FAR at 4 NFE. The paper interprets these results as evidence that AnyFlow matches or surpasses consistency-based counterparts in the few-step regime while scaling with sampling step budgets.
7. Relationship, distinctions, and limitations
The two AnyFlow systems are unrelated at the task level. One is an optical-flow estimator for dense motion between image pairs; the other is a video diffusion distillation framework for generative sampling (Jung et al., 2023, Gu et al., 13 May 2026). The shared name can therefore be misleading if treated as designating a single method family.
Their methodological parallel lies in representing outputs through continuous mappings rather than fixed discrete endpoints. In the optical-flow case, the continuity is spatial and coordinate-based, allowing arbitrary-scale querying of flow fields (Jung et al., 2023). In the diffusion case, the continuity is temporal and interval-based, allowing arbitrary transition learning between any two times and supporting any-step inference (Gu et al., 13 May 2026). This suggests that “AnyFlow” functions as a label for models that seek robustness under variable resolution or variable solver budget by learning over continuous domains.
The limitations discussed in the two papers are likewise domain-specific. For optical flow, the iterative GRU backbone remains the dominant compute cost; the region encoding and dynamic-radius MLPs introduce hyperparameters such as the initial lookup radius and auxiliary patch size; and the authors identify unsupervised or self-supervised photometric losses and integration with Transformer-style global matching as future directions (Jung et al., 2023). For video diffusion, the paper frames its contribution against the limitations of consistency distillation and positions flow-map learning as a remedy for degraded performance at higher sampling budgets, but its concrete claims are primarily about on-policy distillation, ODE-consistent scaling, and broad applicability across bidirectional and causal architectures rather than about unresolved weaknesses (Gu et al., 13 May 2026).
Taken together, the term “AnyFlow” currently refers to two separate contributions that use continuity as a design principle to overcome rigidity in standard formulations: fixed-grid prediction in optical flow, and fixed-step or endpoint-only distillation in video diffusion.