Abstract: Latent generative models have shown remarkable progress in high-fidelity
image synthesis, typically via a two-stage training process in which a learned
tokenizer first compresses images into latent embeddings. The quality of
generation strongly depends on how expressive and
well-optimized these latent embeddings are. While various methods have been
proposed to learn effective latent representations, generated images often lack
realism, particularly in textured regions with sharp transitions, due to loss
of fine details governed by high frequencies. We conduct a detailed frequency
decomposition of existing state-of-the-art (SOTA) latent tokenizers and show
that conventional training objectives inherently prioritize low-frequency
reconstruction at the expense of high-frequency fidelity: during optimization,
these tokenizers develop a bias toward low-frequency information, yielding
over-smoothed outputs and visual artifacts that diminish perceptual quality. To
address this, we propose a wavelet-based,
frequency-aware variational autoencoder (FA-VAE) framework that explicitly
decouples the optimization of low- and high-frequency components. This
decoupling enables improved reconstruction of fine textures while preserving
global structure. Moreover, we integrate our frequency-preserving latent
embeddings into a SOTA latent diffusion model, resulting in sharper and more
realistic image generation. Our approach bridges the fidelity gap in current
latent tokenizers and emphasizes the importance of frequency-aware optimization
for realistic image synthesis, with broader implications for applications in
content creation, neural rendering, and medical imaging.
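The core idea of decoupling low- and high-frequency reconstruction can be illustrated with a minimal sketch. The snippet below uses a single-level 2D Haar wavelet transform to split an image into one low-frequency (LL) and three high-frequency (LH, HL, HH) subbands, then computes a reconstruction loss with a separate weight on the high-frequency terms. This is an illustrative assumption about how such an objective could look, not the paper's actual FA-VAE loss; the function names and the `hf_weight` parameter are hypothetical.

```python
import numpy as np

def haar2d(x):
    """Single-level 2D Haar transform of a 2D array with even dimensions.

    Returns the four subbands (LL, LH, HL, HH): LL carries the
    low-frequency content, the other three carry high-frequency detail.
    """
    a = (x[0::2, :] + x[1::2, :]) / 2.0  # vertical average (low-pass)
    d = (x[0::2, :] - x[1::2, :]) / 2.0  # vertical difference (high-pass)
    LL = (a[:, 0::2] + a[:, 1::2]) / 2.0
    LH = (a[:, 0::2] - a[:, 1::2]) / 2.0
    HL = (d[:, 0::2] + d[:, 1::2]) / 2.0
    HH = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return LL, LH, HL, HH

def frequency_split_loss(x, x_hat, hf_weight=2.0):
    """MSE reconstruction loss with decoupled low-/high-frequency terms.

    `hf_weight` (hypothetical) up-weights the high-frequency subbands so the
    objective no longer implicitly favors low-frequency reconstruction.
    """
    LL, LH, HL, HH = haar2d(x)
    LLh, LHh, HLh, HHh = haar2d(x_hat)
    low = np.mean((LL - LLh) ** 2)
    high = (np.mean((LH - LHh) ** 2)
            + np.mean((HL - HLh) ** 2)
            + np.mean((HH - HHh) ** 2))
    return low + hf_weight * high
```

In practice the subband losses would be computed on batches of decoder outputs inside the VAE training loop; the sketch only shows how the frequency decomposition lets the two terms be weighted independently.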