CoZ addresses the scalability bottleneck of modern single-image super-resolution (SISR) models, which deliver photo-realistic results at the scale factors they were trained on but collapse when asked to magnify far beyond that regime. By using a vision-language model (VLM) to generate multi-scale-aware text prompts, CoZ overcomes the sparsity of the original input signal and produces more realistic images. The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning its text guidance with human preference.
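As a rough illustration of how a fixed-scale SR model might be chained to reach magnifications beyond its training regime, the following minimal sketch repeatedly crops the current image, asks a prompt extractor for text guidance, and upscales the crop. Everything here is an assumption made for illustration: `center_crop`, `nearest_upscale`, `dummy_prompt`, and `chain_of_zoom` are hypothetical names, nearest-neighbour upscaling stands in for a real SR backbone, and a string-formatting function stands in for the VLM. It is not the authors' implementation.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]  # grayscale image stored as rows of pixel values


def center_crop(img: Image, factor: int) -> Image:
    """Keep the central 1/factor of the image along each axis."""
    h, w = len(img), len(img[0])
    ch, cw = h // factor, w // factor
    top, left = (h - ch) // 2, (w - cw) // 2
    return [row[left:left + cw] for row in img[top:top + ch]]


def nearest_upscale(img: Image, factor: int, prompt: str) -> Image:
    """Stand-in for a fixed-scale SR model (assumption); ignores the prompt."""
    return [
        [px for px in row for _ in range(factor)]
        for row in img
        for _ in range(factor)
    ]


def dummy_prompt(crop: Image, previous: Image) -> str:
    """Stand-in for a VLM prompt extractor (assumption).

    It receives the current crop and the previous, coarser state, so the
    returned text can in principle depend on more than one scale.
    """
    return (
        f"crop {len(crop)}x{len(crop[0])} "
        f"taken from {len(previous)}x{len(previous[0])}"
    )


def chain_of_zoom(
    img: Image,
    sr_model: Callable[[Image, int, str], Image],
    prompt_extractor: Callable[[Image, Image], str],
    factor: int,
    steps: int,
) -> Tuple[List[Image], List[str]]:
    """Apply a fixed-scale SR model `steps` times on successive crops.

    The overall magnification of the final state is factor ** steps,
    although the SR model is only ever used at its native scale `factor`.
    """
    states = [img]
    prompts: List[str] = []
    for _ in range(steps):
        previous = states[-1]
        crop = center_crop(previous, factor)
        prompt = prompt_extractor(crop, previous)
        prompts.append(prompt)
        states.append(sr_model(crop, factor, prompt))
    return states, prompts
```

With an 8x8 input, `factor=2`, and `steps=3`, each intermediate state stays 8x8 while the effective magnification of the final state is 2 ** 3 = 8. In a real system the two stand-ins would be replaced by an actual SR backbone and a fine-tuned VLM.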
Experiments show that CoZ achieves high-quality super-resolution at extreme scales, outperforming conventional SR methods as well as CoZ variants that use different text prompts. GRPO fine-tuning of the VLM improves alignment with human preference, as validated by mean-opinion-score (MOS) tests for both human-preferred image generation and human-preferred text generation. CoZ could be applied to image and video enhancement across a range of fields.