DeepSeek-OCR: Contexts Optical Compression

📄 Abstract

We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times that of vision tokens (i.e., a compression ratio < 10x), the model can achieve decoding (OCR) precision of 97%. Even at a compression ratio of 20x, the OCR accuracy still remains at about 60%. This shows considerable promise for research areas such as historical long-context compression and memory forgetting mechanisms in LLMs. Beyond this, DeepSeek-OCR also demonstrates high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) while utilizing fewer than 800 vision tokens. In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (on a single A100-40G). Code and model weights are publicly accessible at http://github.com/deepseek-ai/DeepSeek-OCR.
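The compression ratio in the abstract is the number of text tokens divided by the number of vision tokens used to represent them. The sketch below is illustrative only: the function names, the regime labels, and the sample token counts are assumptions, not part of the DeepSeek-OCR code base. It computes the ratio and maps it onto the two operating points the abstract reports (about 97% precision below 10x, about 60% accuracy at 20x).

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Text tokens represented per vision token."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


def reported_regime(ratio: float) -> str:
    """Place a ratio relative to the two points quoted in the abstract.

    The labels are hypothetical names chosen for this sketch.
    """
    if ratio < 10:
        return "under_10x"    # abstract: ~97% decoding precision
    if ratio <= 20:
        return "10x_to_20x"   # abstract: ~60% accuracy at the 20x end
    return "over_20x"         # no figure given in the abstract


# Hypothetical page: 900 text tokens rendered into 100 vision tokens.
ratio = compression_ratio(900, 100)
print(ratio, reported_regime(ratio))
```

The abstract gives no accuracy figure for ratios strictly between 10x and 20x or beyond 20x, so the sketch only labels those ranges rather than interpolating a number.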

Authors (3)

Haoran Wei

Yaofeng Sun

Yukun Li

Submitted

October 21, 2025

arXiv Category

cs.CV

Key Contributions

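Introduces DeepSeek-OCR, an initial investigation into compressing long contexts via optical 2D mapping, with OCR as the decoding task. DeepEncoder keeps activations low under high-resolution input while reaching high compression ratios, yielding 97% OCR precision when text tokens number fewer than 10 times the vision tokens (<10x compression) and about 60% accuracy at 20x. The results show promise for historical long-context compression and memory forgetting mechanisms in LLMs.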

Business Value

Enables efficient processing and analysis of vast amounts of textual data, particularly historical documents or lengthy reports, unlocking insights and improving accessibility for research and business intelligence.

Paper Metadata

Innovation Type

Algorithmic Framework and Model Architecture

Deployment Feasibility

Moderate. Requires specialized hardware for high-resolution processing and efficient implementation of the DeepEncoder and MoE decoder. Practicality depends on achieving desired accuracy-compression trade-offs.
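For rough capacity planning, the abstract's throughput figure (200k+ pages per day on a single A100-40G) can be converted into per-page latency and GPU counts. This is a back-of-the-envelope sketch: it treats 200,000 as a lower bound, assumes throughput scales linearly with the number of GPUs, and uses hypothetical function names and corpus sizes. Actual throughput will depend on page content and resolution.

```python
import math

# Lower bound taken from the abstract: "200k+ pages per day (a single A100-40G)".
PAGES_PER_GPU_DAY = 200_000
SECONDS_PER_DAY = 24 * 60 * 60


def seconds_per_page(pages_per_day: int = PAGES_PER_GPU_DAY) -> float:
    """Average wall-clock seconds spent per page on one GPU."""
    return SECONDS_PER_DAY / pages_per_day


def gpu_days(corpus_pages: int, pages_per_day: int = PAGES_PER_GPU_DAY) -> float:
    """GPU-days needed to process a corpus at the quoted rate."""
    return corpus_pages / pages_per_day


def gpus_needed(corpus_pages: int, deadline_days: int,
                pages_per_day: int = PAGES_PER_GPU_DAY) -> int:
    """Whole GPUs required to finish within a deadline (linear scaling assumed)."""
    return math.ceil(corpus_pages / (pages_per_day * deadline_days))


print(seconds_per_page())          # average seconds per page
print(gpu_days(1_000_000))         # GPU-days for a hypothetical 1M-page corpus
print(gpus_needed(10_000_000, 7))  # GPUs for a hypothetical 10M pages in a week
```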

Limitations Addressed

Challenges in processing and compressing very long contexts in LLMs; limitations of traditional OCR methods on diverse document types; computational cost of handling high-resolution inputs.

Performance Gains

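On OmniDocBench, surpasses GOT-OCR2.0 (256 tokens/page) using only 100 vision tokens, and outperforms MinerU2.0 (6000+ tokens per page on average) using fewer than 800 vision tokens.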
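The token budgets quoted in the abstract imply the following reduction factors. The helper name is hypothetical. Because the abstract gives MinerU2.0 as "6000+" tokens and DeepSeek-OCR as "fewer than 800", the second factor is only a lower bound.

```python
def token_reduction(baseline_tokens: int, model_tokens: int) -> float:
    """How many times fewer vision tokens the model uses than the baseline."""
    if model_tokens <= 0:
        raise ValueError("model_tokens must be positive")
    return baseline_tokens / model_tokens


# GOT-OCR2.0: 256 tokens/page vs. 100 vision tokens for DeepSeek-OCR.
vs_got_ocr = token_reduction(256, 100)

# MinerU2.0: 6000+ tokens/page vs. fewer than 800, so this is a lower bound.
vs_mineru_lower_bound = token_reduction(6000, 800)

print(f"vs GOT-OCR2.0: {vs_got_ocr:.2f}x fewer tokens")
print(f"vs MinerU2.0: more than {vs_mineru_lower_bound:.1f}x fewer tokens")
```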

Technical Tags

optical character recognition (OCR) • long context compression • vision tokens • deep encoder • MoE • compression ratio • decoding precision • historical documents • memory mechanisms • LLMs

Research Topics

Efficient LLM Architectures • Multimodal Learning • Context Compression • Optical Character Recognition • Long Document Processing

Methods & Architectures

2D mapping • optical compression • encoder-decoder architecture • Mixture-of-Experts (MoE) • DeepEncoder • DeepSeek3B-MoE-A570M

Applications & Tasks

Natural Language Processing • Document Analysis • Information Retrieval • Artificial Intelligence • Long Context Handling • Information Overload • Computational Efficiency • OCR Accuracy • Compressing long contexts • Optical Character Recognition • Processing historical documents

Datasets & Benchmarks

Datasets

OmniDocBench

Benchmarks

OCR precision: 97% at <10x compression • OCR accuracy: ~60% at 20x compression

OCR precision • OCR accuracy • compression ratio

Related Fields

Natural Language Processing • Computer Vision • Machine Learning • Information Retrieval

Keywords

OCR • long context • compression • LLM • multimodal • DeepSeek-OCR • DeepEncoder • MoE • document analysis • historical documents • vision tokens • AI

Academic Context

#Efficient LLM Architectures #Multimodal Learning #Context Compression #Optical Character Recognition #Long Document Processing

Companies & Organizations

Companies Mentioned

DeepSeek

Startup Context

DeepSeek (GitHub organization deepseek-ai) is the organization that publishes the code and model weights.

Commercial Potential

Potential Products

OCR software for historical archives • Tools for summarizing and analyzing long documents • Efficient context handling modules for LLMs

Target Industries

Publishing • Archives • Legal • Finance • Research

Use Case Examples

Digitizing and analyzing vast collections of historical texts. Extracting information from lengthy legal or financial reports. Improving the ability of LLMs to recall information from long conversations or documents.

Competitive Edge

Offers a novel approach to long context compression specifically tailored for OCR tasks, achieving high precision at significant compression ratios, potentially outperforming existing methods in specific scenarios.

Resource Requirements

Compute Needs

High, especially for training the DeepEncoder and MoE decoder on high-resolution inputs.

Data Requirements

Requires large datasets of documents with corresponding ground truth OCR, potentially including historical texts.

Deployment Constraints

Computational resources for high-resolution image processing; accuracy trade-offs at higher compression ratios.

Scalability

The MoE architecture suggests potential for scalability, but high-resolution input processing remains a bottleneck.

Regulatory Considerations

None directly, but applications involving sensitive documents might have data privacy implications.

Production Readiness

Maturity Level

Research/Development

Licensing

Depends on the specific implementation and underlying models (e.g., DeepSeek3B-MoE-A570M).

Patent Potential

Moderate, for the DeepEncoder architecture and the optical compression technique.
