📄 Abstract
Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose JEPA-T, a unified multimodal framework that encodes images and captions into discrete visual and textual tokens, processed by a joint-embedding predictive Transformer. To enhance fusion, we incorporate cross-attention after the feature predictor for conditional denoising while maintaining a task-agnostic backbone. Additionally, raw text embeddings are injected prior to the flow matching loss to improve alignment during training. During inference, the same network performs both class-conditional and free-text image generation by iteratively denoising visual tokens conditioned on text. Evaluations on ImageNet-1K demonstrate that JEPA-T achieves strong data efficiency and open-vocabulary generalization, and that it consistently outperforms non-fusion and late-fusion baselines. Our approach shows that late architectural fusion combined with objective-level alignment offers an effective balance between conditioning strength and backbone generality in token-based T2I. The code is available at: https://github.com/justin-herry/JEPA-T.git
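The training recipe described above, late cross-attention fusion after the feature predictor plus injection of raw text embeddings ahead of a flow matching objective, can be pictured with a minimal PyTorch sketch. Everything here (module sizes, vocabulary sizes, the mean-pooled injection, the linear flow-matching path, and the names `JEPATSketch` and `flow_matching_loss`) is an illustrative assumption, not the released implementation from the repository linked above.

```python
import torch
import torch.nn as nn


class JEPATSketch(nn.Module):
    """Toy stand-in for a joint-embedding predictive Transformer with late fusion."""

    def __init__(self, dim=768, n_heads=12, depth=4,
                 visual_vocab=8192, text_vocab=32000):
        super().__init__()
        # Discrete visual and textual tokens embedded into a shared width.
        self.vis_embed = nn.Embedding(visual_vocab, dim)
        self.txt_embed = nn.Embedding(text_vocab, dim)
        # Task-agnostic backbone: the feature predictor.
        layer = nn.TransformerEncoderLayer(dim, n_heads, 4 * dim,
                                           batch_first=True)
        self.predictor = nn.TransformerEncoder(layer, depth)
        # Late architectural fusion: cross-attention *after* the predictor,
        # used for conditional denoising of visual token features.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, dim)

    def forward(self, noisy_vis_feats, txt_tokens):
        txt = self.txt_embed(txt_tokens)             # (B, Lt, D)
        pred = self.predictor(noisy_vis_feats)       # (B, Lv, D)
        # Predicted visual features attend to the text features.
        fused, _ = self.cross_attn(pred, txt, txt)
        # Objective-level alignment: raw text embeddings injected before the
        # flow matching loss (mean-pooled addition is an assumed mechanism).
        fused = fused + txt.mean(dim=1, keepdim=True)
        return self.head(fused)


def flow_matching_loss(model, vis_tokens, txt_tokens):
    """Conditional flow matching on visual token embeddings (sketch)."""
    x1 = model.vis_embed(vis_tokens)                       # clean features
    x0 = torch.randn_like(x1)                              # Gaussian noise
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)     # time in [0, 1]
    xt = (1 - t) * x0 + t * x1                             # linear path
    v_target = x1 - x0                                     # target velocity
    v_pred = model(xt, txt_tokens)
    return ((v_pred - v_target) ** 2).mean()
```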
Key Contributions
Introduces JEPA-T, a unified multimodal framework for T2I generation that effectively fuses text and visual tokens using a joint-embedding predictive Transformer. It enhances fusion with cross-attention for conditional denoising and injects raw text embeddings before the flow matching loss, achieving strong data efficiency and open-vocabulary generalization.
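As a companion to the training sketch above, the sampling loop below illustrates what "iteratively denoising visual tokens conditioned on text" could look like: simple Euler integration of the predicted velocity field, followed by nearest-codebook quantization. The step count, the Euler solver, and the `generate` helper are assumptions for illustration; the actual sampling and decoding pipeline is defined in the authors' code.

```python
import torch


@torch.no_grad()
def generate(model, txt_tokens, n_vis_tokens=256, dim=768, steps=50):
    """Sample discrete visual tokens from a JEPATSketch-style model (sketch).

    `dim` must match the model's embedding width.
    """
    B = txt_tokens.size(0)
    device = txt_tokens.device
    # Start from pure noise in the visual feature space.
    x = torch.randn(B, n_vis_tokens, dim, device=device)
    dt = 1.0 / steps
    for _ in range(steps):
        v = model(x, txt_tokens)        # text-conditioned velocity prediction
        x = x + dt * v                  # Euler step from noise toward data
    # Nearest-neighbour lookup in the visual codebook yields discrete tokens;
    # a separate tokenizer/decoder (not shown) would map them to pixels.
    codebook = model.vis_embed.weight                        # (V, D)
    dists = torch.cdist(x, codebook.unsqueeze(0).repeat(B, 1, 1))
    return dists.argmin(dim=-1)                              # (B, n_vis_tokens)
```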
Business Value
Enables creation of highly customized and diverse visual content from textual descriptions, powering applications in marketing, design, entertainment, and personalized content generation.