📄 Abstract
Modern Text-to-Image (T2I) generation increasingly relies on token-centric architectures trained with self-supervision, yet effectively fusing text with visual tokens remains a challenge. We propose JEPA-T, a unified multimodal framework that encodes images and captions into discrete visual and textual tokens, processed by a joint-embedding predictive Transformer. To enhance fusion, we incorporate cross-attention after the feature predictor for conditional denoising while maintaining a task-agnostic backbone. Additionally, raw text embeddings are injected prior to the flow matching loss to improve alignment during training. During inference, the same network performs both class-conditional and free-text image generation by iteratively denoising visual tokens conditioned on text. Evaluations on ImageNet-1K demonstrate that JEPA-T achieves strong data efficiency and open-vocabulary generalization, and that it consistently outperforms non-fusion and late-fusion baselines. Our approach shows that late architectural fusion combined with objective-level alignment offers an effective balance between conditioning strength and backbone generality in token-based T2I. The code is available at: https://github.com/justin-herry/JEPA-T.git
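The training recipe described above, late cross-attention fusion after the feature predictor plus injection of raw text embeddings ahead of a flow matching objective, can be pictured with a minimal PyTorch sketch. Everything here (module sizes, vocabulary sizes, the mean-pooled injection, the linear flow-matching path, and the names `JEPATSketch` and `flow_matching_loss`) is an illustrative assumption, not the released implementation from the repository linked above.

```python
import torch
import torch.nn as nn


class JEPATSketch(nn.Module):
    """Toy stand-in for a joint-embedding predictive Transformer with late fusion."""

    def __init__(self, dim=768, n_heads=12, depth=4,
                 visual_vocab=8192, text_vocab=32000):
        super().__init__()
        # Discrete visual and textual tokens embedded into a shared width.
        self.vis_embed = nn.Embedding(visual_vocab, dim)
        self.txt_embed = nn.Embedding(text_vocab, dim)
        # Task-agnostic backbone: the feature predictor.
        layer = nn.TransformerEncoderLayer(dim, n_heads, 4 * dim,
                                           batch_first=True)
        self.predictor = nn.TransformerEncoder(layer, depth)
        # Late architectural fusion: cross-attention *after* the predictor,
        # used for conditional denoising of visual token features.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.head = nn.Linear(dim, dim)

    def forward(self, noisy_vis_feats, txt_tokens):
        txt = self.txt_embed(txt_tokens)             # (B, Lt, D)
        pred = self.predictor(noisy_vis_feats)       # (B, Lv, D)
        # Predicted visual features attend to the text features.
        fused, _ = self.cross_attn(pred, txt, txt)
        # Objective-level alignment: raw text embeddings injected before the
        # flow matching loss (mean-pooled addition is an assumed mechanism).
        fused = fused + txt.mean(dim=1, keepdim=True)
        return self.head(fused)


def flow_matching_loss(model, vis_tokens, txt_tokens):
    """Conditional flow matching on visual token embeddings (sketch)."""
    x1 = model.vis_embed(vis_tokens)                       # clean features
    x0 = torch.randn_like(x1)                              # Gaussian noise
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)     # time in [0, 1]
    xt = (1 - t) * x0 + t * x1                             # linear path
    v_target = x1 - x0                                     # target velocity
    v_pred = model(xt, txt_tokens)
    return ((v_pred - v_target) ** 2).mean()
```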
Key Contributions
Introduces JEPA-T, a unified multimodal framework for T2I generation that effectively fuses text and visual tokens using a joint-embedding predictive Transformer. It enhances fusion with cross-attention for conditional denoising and injects raw text embeddings before the flow matching loss, achieving strong data efficiency and open-vocabulary generalization.
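As a companion to the training sketch above, the sampling loop below illustrates what "iteratively denoising visual tokens conditioned on text" could look like: simple Euler integration of the predicted velocity field, followed by nearest-codebook quantization. The step count, the Euler solver, and the `generate` helper are assumptions for illustration; the actual sampling and decoding pipeline is defined in the authors' code.

```python
import torch


@torch.no_grad()
def generate(model, txt_tokens, n_vis_tokens=256, dim=768, steps=50):
    """Sample discrete visual tokens from a JEPATSketch-style model (sketch).

    `dim` must match the model's embedding width.
    """
    B = txt_tokens.size(0)
    device = txt_tokens.device
    # Start from pure noise in the visual feature space.
    x = torch.randn(B, n_vis_tokens, dim, device=device)
    dt = 1.0 / steps
    for _ in range(steps):
        v = model(x, txt_tokens)        # text-conditioned velocity prediction
        x = x + dt * v                  # Euler step from noise toward data
    # Nearest-neighbour lookup in the visual codebook yields discrete tokens;
    # a separate tokenizer/decoder (not shown) would map them to pixels.
    codebook = model.vis_embed.weight                        # (V, D)
    dists = torch.cdist(x, codebook.unsqueeze(0).repeat(B, 1, 1))
    return dists.argmin(dim=-1)                              # (B, n_vis_tokens)
```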
Business Value
Enables creation of highly customized and diverse visual content from textual descriptions, powering applications in marketing, design, entertainment, and personalized content generation.