The framework introduces several key technical components to handle the difficult case of misaligned or partially visible references. During training, One-to-All Animation reformulates the task as a self-supervised outpainting problem: the model learns to map reference inputs with diverse layouts into a unified occluded-input representation and then generate the full character conditioned on the driving poses. A dedicated reference extractor captures comprehensive identity features from incomplete or occluded reference regions, and these features are injected progressively through a hybrid reference fusion attention mechanism that flexibly accommodates variable resolutions and dynamic sequence lengths in videos.[web:1]
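
To make the fusion idea concrete, the sketch below shows one way such a step could look in PyTorch: video latent tokens first self-attend, then cross-attend to reference-identity tokens. The module name RefFusionAttention, the dimensions, and the single-layer structure are illustrative assumptions, not the authors' implementation; the point is that attention is length-agnostic, so the same layer handles references of different resolutions and clips of different lengths once both are flattened into token sequences.

```python
# Minimal sketch of a reference-fusion attention layer (hypothetical names/shapes).
import torch
import torch.nn as nn


class RefFusionAttention(nn.Module):
    """Cross-attend video latent tokens to reference-identity tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, ref_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N_video, dim) -- flattened spatio-temporal latents
        # ref_tokens:   (B, N_ref, dim)   -- identity features from the reference extractor
        x = self.norm1(video_tokens)
        x = video_tokens + self.self_attn(x, x, x, need_weights=False)[0]
        y = self.norm2(x)
        # Inject reference identity via cross-attention; N_ref may vary freely.
        x = x + self.cross_attn(y, ref_tokens, ref_tokens, need_weights=False)[0]
        return x


# A reference image and a driving clip yield different token counts,
# yet the same layer processes both without any resizing logic.
layer = RefFusionAttention(dim=320)
video_tokens = torch.randn(1, 16 * 32 * 32, 320)  # 16 frames of 32x32 latents
ref_tokens = torch.randn(1, 24 * 40, 320)         # reference latent of a different size
out = layer(video_tokens, ref_tokens)
print(out.shape)  # torch.Size([1, 16384, 320])
```

In practice such a block would be repeated at several depths of the denoising network, which is what "injected progressively" refers to in the description above.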
From a control and quality perspective, One-to-All Animation adopts an identity-robust pose control strategy that decouples appearance from skeletal structure, alleviating pose overfitting and reducing artifacts when the driving motion deviates strongly from the reference body configuration. For long video generation, a token replace strategy maintains temporal consistency and avoids identity drift over extended sequences. Extensive experiments reported by the authors indicate that the approach outperforms existing pose-driven animation baselines on cross-scale video animation, cross-scale image pose transfer, and long-form video generation, allowing a single character reference to be animated convincingly by multiple motions at different spatial scales.[web:1]
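
As a rough illustration of how a token-replace scheme can stabilize chunked long-video generation, the sketch below pins the leading latent frames of each new chunk to the clean tail of the previously generated chunk at every denoising step, so each chunk stays anchored to what came before. The function names, overlap length, and simplified denoiser interface are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical token-replace loop for chunked long-video latents.
import torch


def generate_long_video(denoise_step, num_chunks: int, chunk_len: int,
                        overlap: int, latent_shape, num_steps: int = 30):
    """Generate `num_chunks` latent chunks, each sharing `overlap` frames
    with the previous chunk via token replacement."""
    chunks = []
    prev_tail = None  # clean latents of the last `overlap` frames generated so far
    for _ in range(num_chunks):
        # Start each chunk from fresh Gaussian noise.
        latents = torch.randn(1, chunk_len, *latent_shape)
        for step in range(num_steps):
            if prev_tail is not None:
                # Token replace: pin the leading frames to the previous chunk's tail.
                # (A full diffusion loop would re-noise this clean tail to the
                # current noise level before substituting it.)
                latents[:, :overlap] = prev_tail
            latents = denoise_step(latents, step)
        if prev_tail is not None:
            latents[:, :overlap] = prev_tail  # keep the overlap exactly consistent
        chunks.append(latents if prev_tail is None else latents[:, overlap:])
        prev_tail = latents[:, -overlap:].clone()
    return torch.cat(chunks, dim=1)


# Toy denoiser that just decays the latents toward zero, standing in for the
# real pose-conditioned video diffusion model.
toy_denoiser = lambda z, step: z * 0.97
video_latents = generate_long_video(toy_denoiser, num_chunks=3, chunk_len=16,
                                    overlap=4, latent_shape=(4, 32, 32))
print(video_latents.shape)  # torch.Size([1, 40, 4, 32, 32])
```

Because the overlapping frames are copied rather than re-sampled, the character's identity and motion at each chunk boundary remain fixed, which is the property the token replace strategy exploits to avoid drift over long sequences.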

