Abstract: While diffusion models excel at generating high-quality images from text
prompts, they struggle with visual consistency when generating image sequences.
Existing methods generate each image independently, leading to disjointed
narratives, a challenge further exacerbated in non-linear storytelling, where
scenes must connect beyond adjacent images. We introduce BeamDiffusion, a
novel beam search strategy for latent space exploration that enables
conditional generation of full image sequences. In contrast to earlier methods that
rely on fixed latent priors, our method dynamically samples past latents to
search for an optimal sequence of latent representations, ensuring coherent
visual transitions. As the latent denoising space is explored, the beam search
graph is pruned with a cross-attention mechanism that efficiently scores search
paths, prioritizing alignment with both textual prompts and visual context.
Human and automatic evaluations confirm that BeamDiffusion outperforms
baseline methods, producing full sequences with superior coherence, visual
continuity, and textual alignment.
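To make the search procedure concrete, the sketch below illustrates the kind of beam search over denoising latents the abstract describes. It is an assumption-laden illustration, not the paper's implementation: `sample_latent` is a hypothetical stand-in for the diffusion model's denoising step conditioned on past latents, `cross_attention_score` is a hypothetical stand-in for the cross-attention scorer, and the latent dimension, beam width, and candidate count are arbitrary placeholders.

```python
# Minimal sketch of beam search over diffusion latents (illustrative only).
# All function bodies and hyperparameters are hypothetical stand-ins.

import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 64   # hypothetical latent size
BEAM_WIDTH = 3    # beams kept after pruning
CANDIDATES = 6    # latents sampled per beam at each step


def sample_latent(prompt, past_latents):
    """Stand-in for the diffusion model: sample a denoised latent for the
    current prompt, conditioned on latents from earlier sequence steps.
    (A real model would condition on the text prompt; omitted here.)"""
    context = np.mean(past_latents, axis=0) if past_latents else 0.0
    return context + rng.standard_normal(LATENT_DIM)


def cross_attention_score(latent, prompt, past_latents):
    """Stand-in for the cross-attention scorer: rate how well a candidate
    latent aligns with the textual prompt and the visual context."""
    coherence = 0.0
    if past_latents:
        past = np.stack(past_latents)
        coherence = float(np.mean(past @ latent))  # similarity to context
    return coherence - 0.01 * float(latent @ latent)  # toy regularizer


def beam_search_sequence(prompts):
    """Search for a coherent sequence of latents, one per prompt."""
    beams = [([], 0.0)]  # (latent sequence, cumulative score)
    for prompt in prompts:
        expanded = []
        for latents, score in beams:
            for _ in range(CANDIDATES):
                z = sample_latent(prompt, latents)
                s = cross_attention_score(z, prompt, latents)
                expanded.append((latents + [z], score + s))
        # Prune the search graph: keep only the top-scoring paths.
        expanded.sort(key=lambda b: b[1], reverse=True)
        beams = expanded[:BEAM_WIDTH]
    return beams[0]  # best-scoring full sequence


best_latents, best_score = beam_search_sequence(
    ["step 1: whisk the eggs", "step 2: pour into the pan", "step 3: fold"]
)
print(f"sequence length={len(best_latents)}, score={best_score:.3f}")
```

The key design point the sketch captures is that each new image's latent is sampled conditioned on latents chosen earlier in the beam, rather than drawn from a fixed prior, so pruning by the scorer steers the whole sequence toward mutually coherent latents.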