This paper proposes a plug-and-play Visual Self-Refinement module for autoregressive models, aimed at vision-language tasks. Applied as a post-pretraining step, the module jointly refines all generated tokens at once, improving spatial-correspondence modeling and mitigating the error accumulation of token-by-token decoding by leveraging global context across the full sequence.
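The core idea of refining a completed draft with global context can be illustrated with a toy sketch. The snippet below is a minimal, hypothetical stand-in (not the paper's actual module): it takes token embeddings produced by a causal decoder and applies one bidirectional, non-causal self-attention pass with a residual connection, so each token's update can draw on the entire sequence. All function and weight names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def refine_tokens(tokens, W_q, W_k, W_v):
    """Jointly refine all draft token embeddings with a single
    bidirectional (non-causal) self-attention pass, so every token
    sees global context -- a toy stand-in for a self-refinement step."""
    Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # full, uncausal mixing
    return tokens + attn @ V  # residual keeps the refined output near the draft

rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(5, d))  # 5 draft tokens from an AR decoder (dummy data)
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
refined = refine_tokens(tokens, W_q, W_k, W_v)
print(refined.shape)  # (5, 8): same shape, but every token updated jointly
```

The contrast with plain autoregressive decoding is the attention mask: a causal decoder only lets token *t* attend to tokens before it, whereas this refinement pass mixes information in both directions over the already-generated sequence.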
This leads to more coherent and accurate images and videos from autoregressive models, improving the quality of AI-generated content for creative and practical applications.