📄 Abstract
Abstract: Code has emerged as a precise and executable medium for reasoning and action
in the agent era. Yet, progress has largely focused on language-centric tasks
such as program synthesis and debugging, leaving visual-centric coding
underexplored. Inspired by how humans reason over sketches, we advocate SVG
code as a compact, interpretable, and executable visual representation. We
introduce VCode, a benchmark that reframes multimodal understanding as code
generation: given an image, a model must produce SVG that preserves symbolic
meaning for downstream reasoning. VCode covers three domains - general
commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric
perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel
evaluation protocol in which a policy model answers questions over rendered
SVGs; correct answers indicate faithful symbolic preservation. Empirically,
frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap
between language-centric and visual-centric coding. To close this gap, we
introduce VCoder, an agentic framework that augments VLMs along two axes: (i)
Thinking with Revision, which iteratively analyzes discrepancies and refines
SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply
structured cues such as objects, shapes, and text beyond the model's intrinsic
capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities
score well overall yet remain limited in professional knowledge and 3D
reasoning. VCoder delivers a 12.3-point overall gain over the top-performing
Claude-4-Opus. Human studies show that both humans and VLMs perform worse on
rendered SVGs, their consistency reveals the promise of symbolic visual
representation. The benchmark and code are available at
https://github.com/CSU-JPG/VCode.
Key Contributions
Introduces VCode, a novel benchmark reframing multimodal understanding as SVG code generation, using SVG as a symbolic visual representation. It proposes the CodeVQA evaluation protocol to assess symbolic fidelity and highlights the struggle of current VLMs in generating faithful SVGs, revealing a gap in visual-centric coding capabilities.
Business Value
Facilitates the development of more capable AI agents that can understand and generate visual code, leading to advancements in areas like automated UI design, visual content creation, and more intuitive human-AI interaction.