📄 Abstract
Recent agentic language models increasingly need to interact with real-world
environments that contain tightly intertwined visual and textual information,
often through raw camera pixels rather than separately processed images and
tokenized text. This shift highlights the need for a unified perception
paradigm. To investigate this idea, we explore Perceive Everything as Pixels
(PEAP) and introduce PixelWorld, a benchmark that renders natural-language,
tabular, mathematical, and diagrammatic inputs into a shared pixel space.
Experiments across multiple benchmarks show that PEAP achieves comparable
performance to token-based approaches on semantic understanding tasks,
suggesting that vision transformers can partially capture global textual
semantics without explicit tokenization. In contrast, reasoning-intensive tasks
such as mathematics and code show notable performance degradation, although
Chain-of-Thought prompting helps mitigate this gap by compensating for missing
symbolic structure. We further find that when visual and textual information
are closely integrated, representing everything as pixels simplifies
preprocessing and avoids cross-modal misalignment. PixelWorld thus provides a
systematic and practical framework for evaluating unified vision-language
models and facilitates further exploration of pixel-based multimodal learning.
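To make the PEAP idea concrete, the sketch below rasterizes a text prompt into a pixel grid, so that a vision model could consume it like any other image. This is an illustrative stand-in using Pillow, not the benchmark's actual rendering pipeline; the function name and canvas dimensions are hypothetical choices.

```python
# Illustrative sketch of "Perceive Everything as Pixels" (PEAP):
# render a text input into a shared pixel space instead of tokenizing it.
# NOTE: hypothetical helper, not the PixelWorld implementation.
from PIL import Image, ImageDraw


def render_text_as_pixels(text: str, width: int = 512, height: int = 64) -> Image.Image:
    """Rasterize a text string onto a white canvas using Pillow's default font."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    draw.text((4, 4), text, fill="black")  # built-in bitmap font
    return img


if __name__ == "__main__":
    img = render_text_as_pixels("What is 12 + 30?")
    print(img.size)
```

In a PEAP-style evaluation, the resulting image (rather than a token sequence) would be fed to the vision encoder, which is what lets tables, equations, and diagrams share one input pathway with plain text.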
Authors (3)
Zhiheng Lyu
Xueguang Ma
Wenhu Chen
Submitted
January 31, 2025
Key Contributions
Introduces PixelWorld, a benchmark that renders natural-language, tabular, mathematical, and diagrammatic inputs into a shared pixel space for evaluation under the Perceive Everything as Pixels (PEAP) paradigm. It shows that Vision Transformers can partially capture textual semantics directly from pixels, while reasoning-intensive tasks degrade without explicit tokenization.
Business Value
Paves the way for more versatile AI agents that can understand and interact with a wider range of real-world information sources, crucial for robotics, augmented reality, and complex data analysis.