arxiv_cl · Benchmark Paper · AI Researchers, Machine Learning Engineers, Developers of AI Agents, Computer Vision Experts

VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation


📄 Abstract

Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored. Inspired by how humans reason over sketches, we advocate SVG code as a compact, interpretable, and executable visual representation. We introduce VCode, a benchmark that reframes multimodal understanding as code generation: given an image, a model must produce SVG that preserves symbolic meaning for downstream reasoning. VCode covers three domains: general commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel evaluation protocol in which a policy model answers questions over rendered SVGs; correct answers indicate faithful symbolic preservation. Empirically, frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap between language-centric and visual-centric coding. To close this gap, we introduce VCoder, an agentic framework that augments VLMs along two axes: (i) Thinking with Revision, which iteratively analyzes discrepancies and refines SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply structured cues such as objects, shapes, and text beyond the model's intrinsic capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities score well overall yet remain limited in professional knowledge and 3D reasoning. VCoder delivers a 12.3-point overall gain over the top-performing Claude-4-Opus. Human studies show that both humans and VLMs perform worse on rendered SVGs, yet their consistency reveals the promise of symbolic visual representation. The benchmark and code are available at https://github.com/CSU-JPG/VCode.
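The CodeVQA protocol described in the abstract can be pictured as a simple scoring loop: generate SVG for each image, let a policy model answer the question from that SVG, and count correct answers. The sketch below is only an illustration under stated assumptions. The names `Example`, `codevqa_accuracy`, `toy_coder`, and `toy_policy` are hypothetical, rendering is folded into the policy callable, and exact-match scoring stands in for whatever scoring the benchmark actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Example:
    image_id: str  # identifier of the source image (hypothetical field)
    question: str  # question originally asked about the image
    answer: str    # ground-truth answer


def codevqa_accuracy(
    examples: Iterable[Example],
    generate_svg: Callable[[str], str],
    answer_from_svg: Callable[[str, str], str],
) -> float:
    """Fraction of questions answered correctly from generated SVG alone.

    generate_svg: the model under test, mapping an image to SVG code.
    answer_from_svg: the policy model; assumed to render the SVG and
    answer the question without seeing the original image.
    """
    examples = list(examples)
    if not examples:
        return 0.0
    correct = 0
    for ex in examples:
        svg = generate_svg(ex.image_id)
        predicted = answer_from_svg(svg, ex.question)
        # Assumption: case-insensitive exact match as the correctness test.
        correct += int(predicted.strip().lower() == ex.answer.strip().lower())
    return correct / len(examples)


# Toy stand-ins so the sketch runs without any real model.
def toy_coder(image_id: str) -> str:
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
        '<circle cx="50" cy="50" r="20" fill="red"/></svg>'
    )


def toy_policy(svg: str, question: str) -> str:
    # Only "understands" fill colors; any other question gets a wrong answer.
    return "red" if 'fill="red"' in svg else "unknown"


examples = [
    Example("img-1", "What color is the circle?", "red"),
    Example("img-2", "How many squares are there?", "2"),
]
score = codevqa_accuracy(examples, toy_coder, toy_policy)
print(score)
```

Because the policy model never sees the original image, a low score in this setup points to information lost in the SVG rather than to a weakness in question answering itself.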

Key Contributions

Introduces VCode, a benchmark that reframes multimodal understanding as SVG code generation, treating SVG as a symbolic visual representation. Proposes the CodeVQA evaluation protocol to assess symbolic fidelity, and shows that current VLMs struggle to generate faithful SVGs, revealing a gap in visual-centric coding. Also introduces VCoder, an agentic framework that combines iterative revision with external visual tools to narrow that gap.
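The "Thinking with Revision" axis mentioned in the abstract (iteratively analyzing discrepancies and refining the SVG) follows a generic critique-and-refine pattern. The sketch below shows that pattern only; it is not the authors' implementation. The function names, the `max_rounds` parameter, and the toy critique that checks for missing element types are all assumptions made for illustration.

```python
from typing import Callable, List


def revise_svg(
    initial_svg: str,
    critique: Callable[[str], List[str]],
    refine: Callable[[str, List[str]], str],
    max_rounds: int = 3,
) -> str:
    """Repeatedly critique and refine SVG code until no issues remain.

    critique: returns a list of discrepancies (empty when satisfied).
    refine: returns new SVG code that tries to address those discrepancies.
    """
    svg = initial_svg
    for _ in range(max_rounds):
        issues = critique(svg)
        if not issues:
            break  # nothing left to fix
        svg = refine(svg, issues)
    return svg


# Toy stand-ins: the "reference" scene needs a circle and a text label.
REQUIRED = {
    "circle": '<circle cx="50" cy="50" r="20"/>',
    "text": '<text x="10" y="90">cat</text>',
}


def toy_critique(svg: str) -> List[str]:
    # Report every required element type that the SVG does not contain.
    return [tag for tag in REQUIRED if f"<{tag}" not in svg]


def toy_refine(svg: str, issues: List[str]) -> str:
    # Insert the missing elements just before the closing root tag.
    additions = "".join(REQUIRED[tag] for tag in issues)
    return svg.replace("</svg>", additions + "</svg>")


draft = '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100"></svg>'
final = revise_svg(draft, toy_critique, toy_refine)
print(final)
```

In a real system both callables would presumably be model calls that compare the rendered SVG with the source image; the loop bound keeps cost predictable when the critique never comes back empty.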

Business Value

Facilitates the development of more capable AI agents that can understand and generate visual code, leading to advancements in areas like automated UI design, visual content creation, and more intuitive human-AI interaction.

Paper Metadata

Innovation Type

New benchmark and evaluation protocol for visual-centric coding

Deployment Feasibility

Medium; requires development of new models or significant adaptation of existing VLMs to handle SVG generation and symbolic reasoning effectively.

Limitations Addressed

Underexplored visual-centric coding tasks; lack of benchmarks for evaluating symbolic meaning preservation in multimodal models; difficulty of VLMs in generating faithful SVG code.

Performance Gains

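Frontier VLMs show clear gaps in generating faithful SVGs, particularly in professional knowledge and 3D reasoning. The VCoder framework delivers a 12.3-point overall gain over the top-performing Claude-4-Opus.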

Technical Tags

multimodal coding, SVG, symbolic visual representation, image to code, visual reasoning, benchmark, code generation, agent era, visual-centric coding, VCode

Research Topics

Multimodal AI, Vision-Language Models, Code Generation, Benchmarking, Symbolic Reasoning

Methods & Architectures

SVG as symbolic representation, Image to SVG code generation, Multimodal understanding benchmark, CodeVQA evaluation protocol, Vision-Language Models (VLMs), Large Language Models (LLMs)
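To make "SVG as symbolic representation" concrete, the sketch below turns structured cues of the kind a detector might supply (a label plus a bounding box) into SVG primitives. It is a minimal illustration, not the paper's pipeline: the `Detection` type, the `detections_to_svg` function, and the choice to draw each object as a labelled rectangle are all assumptions.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    label: str     # object name reported by a (hypothetical) detector
    x: float       # left edge of the bounding box, in pixels
    y: float       # top edge of the bounding box, in pixels
    width: float
    height: float


def detections_to_svg(
    detections: List[Detection], canvas_width: int, canvas_height: int
) -> str:
    """Encode each detection as a group holding a rectangle and a text label."""
    svg = ET.Element(
        "svg",
        {
            "xmlns": "http://www.w3.org/2000/svg",
            "width": str(canvas_width),
            "height": str(canvas_height),
            "viewBox": f"0 0 {canvas_width} {canvas_height}",
        },
    )
    for index, det in enumerate(detections):
        # One group per object keeps the symbolic structure explicit.
        group = ET.SubElement(svg, "g", {"id": f"{det.label}-{index}"})
        ET.SubElement(
            group,
            "rect",
            {
                "x": f"{det.x:g}",
                "y": f"{det.y:g}",
                "width": f"{det.width:g}",
                "height": f"{det.height:g}",
                "fill": "none",
                "stroke": "black",
            },
        )
        label = ET.SubElement(group, "text", {"x": f"{det.x:g}", "y": f"{det.y:g}"})
        label.text = det.label
    return ET.tostring(svg, encoding="unicode")


svg_code = detections_to_svg([Detection("cat", 10, 20, 30, 40)], 100, 100)
print(svg_code)
```

The output is ordinary SVG text, so it can be rendered by any SVG viewer and also inspected or edited as code, which is the property the benchmark relies on.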

Applications & Tasks

Computer Graphics, Web Development, AI Agents, Human-Computer Interaction

Generating executable visual representations from images; Evaluating symbolic fidelity in multimodal models; Bridging visual perception and symbolic reasoning

Generate SVG code from images; Perform visual question answering over rendered SVGs; Evaluate multimodal models on visual-centric coding tasks

Datasets & Benchmarks

Datasets

MM-Vet, MMMU, CV-Bench

Symbolic fidelity, Code generation accuracy, CodeVQA scores

Related Fields

Computer Vision, Natural Language Processing, Generative AI, Human-Computer Interaction, AI Agents

Keywords

VCode, SVG, Multimodal AI, Vision-Language Models, Code Generation, Symbolic Reasoning, Benchmark, Image to Code, Visual Understanding, AI Agents, Computer Graphics, Web Development, MM-Vet, MMMU, CV-Bench

Academic Context

#Multimodal AI, #Vision-Language Models, #Code Generation, #Benchmarking, #Symbolic Reasoning

Commercial Potential

Potential Products

Automated UI/UX design tools, Visual programming assistants, AI-powered graphic design software

Target Industries

Web Development, Software Engineering, Graphic Design, AI Development

Use Case Examples

Generating website layouts from sketches; Creating interactive visualizations from data; Building AI agents that can manipulate visual interfaces

Competitive Edge

Establishes a new benchmark for visual-centric coding, pushing the state-of-the-art beyond language-centric tasks.

Market Opportunity

Growing market for AI-powered creative tools and agent development platforms.

Revenue Models

Licensing of models, API access, specialized software tools.

Resource Requirements

Compute Needs

Significant compute for training and evaluating large multimodal models.

Data Requirements

Requires diverse image datasets and corresponding SVG code representations.

Deployment Constraints

Model complexity and computational cost for real-time SVG generation.

Scalability

Scalability depends on the underlying VLM architecture and the complexity of SVG generation.

Production Readiness

Maturity Level

Research/Benchmark

Time to Market

Medium to long, as it requires significant model development.
