SeFA-Policy: Fast and Accurate Visuomotor Policy Learning with Selective Flow Alignment

University of Southern California

SeFA-Policy is a visual imitation learning algorithm that couples rectified flow with selective alignment, achieving strong performance across diverse simulated and real-world tasks while significantly accelerating inference.

Real Robot Demonstrations

Rice Pouring

Knob Pull

Coffee Bean Sweeping

Drawer Close

Moving Object Picking

Putting an Apple in the Bowl


Comparison with Baseline

Flower Insertion (ours)

Flower Insertion (baseline)

Rice Pouring (ours)

Rice Pouring (baseline)

Drawer Close (ours)

Drawer Close (baseline)

Abstract

Imitation learning relies on accurate action prediction and fast inference to perform complicated real-world tasks. Generative modeling techniques such as Diffusion Policy [1] have achieved strong performance in complex manipulation tasks, but their reliance on multi-step iterative denoising makes them computationally expensive and unsuitable for real-time control. Flow-based models have emerged as a promising alternative: by transporting samples along nearly straight paths from noise to the action space, they enable few-step action generation and rectification [2], significantly reducing inference latency. Despite this efficiency, few-step sampling introduces discretization error, and rectification introduces inconsistency between observations and actions during distillation. In rectification, the reflow policy [2] is trained on noise-action pairs generated by a well-trained base policy, but these generated actions do not match the ground-truth actions; as in diffusion-based models, the predicted actions deviate from the actions implied by the visual condition. Such inconsistency may be tolerable in image generation, where perceptual similarity suffices, but in robotic control even minor mismatches between observations and actions can accumulate and lead to task failures. This distillation-induced inconsistency therefore represents a fundamental barrier to deploying flow-based policies for effective real-time visuomotor control.
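To make the reflow setup concrete, the following is a minimal PyTorch sketch of rectified-flow training and reflow pair generation. It assumes a hypothetical velocity model policy(x_t, t, obs) conditioned on the observation and flat action vectors of dimension action_dim; these names are illustrative and do not come from the released implementation.

    import torch

    def rectified_flow_loss(policy, obs, action):
        # Conditional rectified-flow objective (cf. [2]): regress the constant
        # velocity (action - noise) along the straight path between a noise
        # sample and the demonstrated action.
        noise = torch.randn_like(action)                       # x_0 ~ N(0, I)
        t = torch.rand(action.shape[0], 1, device=action.device)
        x_t = (1.0 - t) * noise + t * action                   # linear interpolation
        v_pred = policy(x_t, t, obs)
        return ((v_pred - (action - noise)) ** 2).mean()

    @torch.no_grad()
    def make_reflow_pairs(policy, obs, action_dim, steps=10):
        # Build (noise, generated-action) pairs for reflow distillation. The
        # generated action comes from the base policy's ODE rollout, not from
        # a demonstration, so it can drift from the action the observation
        # actually implies: the observation-action inconsistency discussed above.
        noise = torch.randn(obs.shape[0], action_dim)
        x = noise.clone()
        for i in range(steps):                                 # Euler integration
            t = torch.full((obs.shape[0], 1), i / steps)
            x = x + policy(x, t, obs) / steps
        return noise, x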

To overcome this limitation, we propose Selective Flow Alignment (SeFA), a flow-based visuomotor policy with a selective alignment strategy. The straight paths in flow-based models are computationally efficient because they can be sampled in a few, or even a single, step. However, the straightening process [2] accumulates errors from the base model, which leads to observation–action inconsistency. SeFA leverages expert demonstrations to align the sampling paths with observations while maintaining the straightness of the paths. Crucially, this alignment is applied selectively, preserving action diversity and multimodality while eliminating harmful mismatches. By combining the efficiency of straight sampling paths with observation-consistent alignment, SeFA-Policy enables one-step action synthesis that is both fast and reliable for real-time visuomotor control.
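The text above does not spell out the selection criterion, so the sketch below is only one plausible instantiation under the same illustrative conventions: a generated action is snapped to its nearest expert action only when the two are already close, which removes observation-action mismatch without collapsing distinct action modes. The function name selective_align, the threshold tau, and the nearest-neighbor rule are all assumptions, not the paper's stated method.

    import torch

    @torch.no_grad()
    def selective_align(gen_actions, expert_actions, tau=0.1):
        # gen_actions: (B, D) actions from the base policy's rollout.
        # expert_actions: (N, D) demonstrated actions for matching observations.
        d = torch.cdist(gen_actions, expert_actions)   # (B, N) pairwise distances
        min_d, idx = d.min(dim=1)                      # nearest expert match
        aligned = expert_actions[idx]                  # candidate aligned targets
        keep = (min_d > tau).unsqueeze(-1)             # too far: likely another mode
        # Keep the generated action where alignment would cross modes;
        # otherwise use the expert action as the reflow target.
        return torch.where(keep, gen_actions, aligned)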

Leveraging nearly straight flows, SeFA-Policy achieves high accuracy with just a single denoising step. To evaluate its effectiveness, we conducted extensive experiments across both simulated and real-world tasks. Results show that our method matches or surpasses state-of-the-art diffusion-based methods while offering greater simplicity and computational efficiency. Compared to Diffusion Policy, which requires numerous iterative steps and incurs significant computational overhead, our approach offers a streamlined and scalable solution for real-time visuomotor policy learning.
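Under the same illustrative interface, one-step inference reduces to a single Euler step of the learned flow; this is the generic rectified-flow sampler, sketched under the assumptions above rather than taken from the released code.

    import torch

    @torch.no_grad()
    def one_step_action(policy, obs, action_dim):
        # With a well-straightened velocity field, integrating from t = 0 to
        # t = 1 in a single step recovers the action: a = z + v_theta(z, 0, obs).
        z = torch.randn(obs.shape[0], action_dim)      # noise sample
        t = torch.zeros(obs.shape[0], 1)               # start of the path
        return z + policy(z, t, obs)                   # one Euler step, dt = 1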


We train a visuomotor policy iteratively to transport samples along straight paths between the noise distribution and the target action space, enabling lightning-fast one-step sampling at inference time. The action flow is selectively aligned with observations, reducing the error accumulated over multiple reflow iterations.

Cite

If you find SeFA-Policy useful for your work, please consider citing it: TBD

References


      [1] Chi, C., Feng, S., Du, Y., et al. "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." Robotics: Science and Systems (RSS), 2023.
      [2] Liu, X., Gong, C., and Liu, Q. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." arXiv preprint, arXiv:2209.03003, 2022.