Research Paper (arxiv_cv). Audience: AI researchers in generative models, 3D artists, game developers, AR/VR content creators

ORIGEN: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation


📄 Abstract

We introduce ORIGEN, the first zero-shot method for 3D orientation grounding in text-to-image generation across multiple objects and diverse categories. While previous work on spatial grounding in image generation has mainly focused on 2D positioning, it lacks control over 3D orientation. To address this, we propose a reward-guided sampling approach using a pretrained discriminative model for 3D orientation estimation and a one-step text-to-image generative flow model. While gradient-ascent-based optimization is a natural choice for reward-based guidance, it struggles to maintain image realism. Instead, we adopt a sampling-based approach using Langevin dynamics, which extends gradient ascent by simply injecting random noise, requiring just a single additional line of code. Additionally, we introduce adaptive time rescaling based on the reward function to accelerate convergence. Our experiments show that ORIGEN outperforms both training-based and test-time guidance methods across quantitative metrics and user studies.
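The abstract's remark that Langevin dynamics is gradient ascent plus one line of injected noise can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the 2D latent, the quadratic stand-in reward, the temperature ALPHA, the point TARGET, the step size, and the choice of a standard-Gaussian prior reweighted by exp(reward / ALPHA) as the sampling target. No text-to-image model or orientation estimator is involved.

```python
import numpy as np

ALPHA = 1.0                      # hypothetical reward temperature
TARGET = np.array([2.0, -1.0])   # hypothetical point where the toy reward peaks


def grad_log_target(x):
    """Gradient of log N(x; 0, I) + r(x) / ALPHA for the toy reward
    r(x) = -0.5 * ||x - TARGET||^2 (a stand-in, not an orientation reward)."""
    return -x + (TARGET - x) / ALPHA


def run(x0, steps, step_size, rng=None):
    """Plain gradient ascent when rng is None, Langevin dynamics otherwise."""
    x = x0.copy()
    for _ in range(steps):
        # Shared update: move uphill on the log of the target density.
        x = x + step_size * grad_log_target(x)
        if rng is not None:
            # The one extra line: Gaussian noise scaled by sqrt(2 * step_size).
            x = x + np.sqrt(2.0 * step_size) * rng.standard_normal(x.shape)
    return x


rng = np.random.default_rng(0)
x0 = rng.standard_normal((2000, 2))   # 2000 independent starting latents
ascent = run(x0, steps=500, step_size=0.01)
langevin = run(x0, steps=500, step_size=0.01, rng=rng)

print("gradient ascent spread:", ascent.std(axis=0))
print("Langevin spread:       ", langevin.std(axis=0))
```

In this toy setting, gradient ascent drives every starting point to the same mode, so the spread across samples collapses toward zero. The Langevin variant keeps a spread close to that of the reward-weighted Gaussian target. This is only an intuition for why a sampling-based update may preserve diversity and realism better than pure optimization, not a reproduction of ORIGEN's results.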

Authors (5)

Yunhong Min

Daehyeon Choi

Kyeongmin Yeo

Jihyun Lee

Minhyuk Sung

Submitted

March 28, 2025

arXiv Category

cs.CV


Key Contributions

Introduces ORIGEN, the first zero-shot method for 3D orientation grounding in text-to-image generation. It uses reward-guided sampling with Langevin dynamics and adaptive time rescaling to control object orientation without explicit training for each object/category.

Business Value

Enables more precise and controllable generation of images in which objects face a specified 3D orientation. This could benefit game development, VR/AR content creation, and product design by reducing the manual iteration needed to obtain correctly oriented visuals.

Paper Metadata

Innovation Type

novel method/framework

Deployment Feasibility

Moderate. Requires integration with existing text-to-image models and potentially significant computational resources for sampling.

Limitations Addressed

Lack of 3D orientation control in existing text-to-image generation methods; challenges in maintaining image realism with gradient-ascent optimization; inefficiency of training-based methods for diverse objects.

Performance Gains

Outperforms both training-based and test-time guidance methods on quantitative metrics and in user studies.

Technical Tags

zero-shot learning, 3D orientation grounding, text-to-image generation, reward-guided sampling, Langevin dynamics, generative flow model, adaptive time rescaling, object orientation, spatial grounding, diffusion models

Research Topics

Generative AI, Text-to-Image Synthesis, 3D Computer Vision, Zero-Shot Learning, Diffusion Models

Methods & Architectures

reward-guided sampling, Langevin dynamics, one-step text-to-image generative flow model, adaptive time rescaling, discriminative model for 3D orientation estimation, ORIGEN

Applications & Tasks

Applications: computer graphics, virtual reality, augmented reality, content creation, 3D modeling
Problems addressed: lack of control over 3D orientation in text-to-image generation; difficulty maintaining image realism with gradient-ascent optimization; inefficiency of training-based methods for diverse objects/categories
Tasks: 3D orientation grounding, text-to-image generation with orientation control, zero-shot spatial control

Datasets & Benchmarks

Benchmarks

outperforms both training-based and test-time guidance methods

Related Fields

Generative AI, Computer Graphics, 3D Vision, Natural Language Processing, Diffusion Models

Keywords

text-to-image, 3D orientation, zero-shot, generative models, diffusion models, Langevin dynamics, spatial grounding, content creation, computer graphics, AR, VR, object control

Academic Context

#Generative AI, #Text-to-Image Synthesis, #3D Computer Vision, #Zero-Shot Learning, #Diffusion Models

Commercial Potential

Potential Products

3D asset generation tools for games and VR; AI-powered 3D modeling software; plugins for existing 3D design software

Target Industries

Gaming, Virtual Reality, Augmented Reality, Film and Animation, Product Design

Use Case Examples

Generating 3D models of furniture with specific orientations for interior design; creating game assets with precise 3D placement and rotation; populating virtual worlds with objects oriented as described in text

Competitive Edge

First zero-shot method for 3D orientation grounding, offering control over object pose in text-to-image generation, which is a significant advancement over methods focused solely on 2D positioning.

Production Readiness

Maturity Level

Research
