Why One Visual Token Beats Ten Text Tokens: Information Theory Lessons from DeepSeek-OCR
Is Text Really the Best Way to Compress Information?
This deceptively simple question cuts deep, and DeepSeek-OCR finally provides a data-driven answer: visual tokens can carry information more efficiently than text tokens.
Understanding Visual Compression Through Information Theory
A highly upvoted Hacker News comment captured the key insight:
Text tokens = discrete lookup table:
- Small integer (token ID) → lookup → vector
- Limited token space: typically ~100K possible tokens
- Each token maps to a few UTF-8 bytes
- Most tokenizers don't create cross-word-boundary tokens
Visual tokens = continuous value vectors:
- No lookup table—direct image-to-vector encoding
- Massive token space: high-dimensional float vectors with many possible values per dimension
- Can convey more bits per token
This asymmetry is what makes DeepSeek-OCR's ~10× compression possible.
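To make the asymmetry concrete, here is a back-of-envelope sketch. The vocabulary size, embedding width, and effective bits per dimension are illustrative assumptions, not DeepSeek-OCR's actual values; the point is only the order-of-magnitude gap.

```python
import math

# Capacity of a text token: at best log2(vocab) bits, since it is one
# choice from a finite lookup table.
VOCAB_SIZE = 100_000                # assumed tokenizer vocabulary
text_bits = math.log2(VOCAB_SIZE)   # ~16.6 bits per text token

# Capacity of a visual token: a d-dimensional float vector. Even granting
# only a couple of effective bits per dimension (training noise limits
# usable precision), the capacity dwarfs a table lookup.
DIM = 1024                          # assumed embedding width
EFFECTIVE_BITS_PER_DIM = 2          # deliberately conservative assumption
visual_bits = DIM * EFFECTIVE_BITS_PER_DIM

print(f"text token:   ~{text_bits:.1f} bits (upper bound)")
print(f"visual token: ~{visual_bits} bits (loose estimate)")
print(f"ratio:        ~{visual_bits / text_bits:.0f}x")
```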
DeepEncoder: Elegant Three-Stage Architecture
DeepSeek-OCR's core is the DeepEncoder architecture: just 380M parameters, but meticulously designed.
Stage 1: Low-Activation Local Processing
- 80M SAM-base + windowed attention
- 1024×1024 images → 4096 patch tokens
- Controlled activation memory
Stage 2: 16× Compression
- 2-layer convolutional module, 16× downsampling
- 4096 tokens → 256 tokens
- Drastically reduces computation before global attention
Stage 3: Global Semantic Understanding
- 300M CLIP-large + global attention
- Deep understanding of 256 compressed tokens
- Acceptable compute thanks to reduced input size
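The token flow is easiest to see in code. Below is a conceptual PyTorch sketch, not DeepSeek's implementation: the SAM and CLIP stages are reduced to comments, and only the Stage 2 compressor is fleshed out, with kernel sizes assumed and shapes chosen to match the token counts above.

```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Stage 2 sketch: two stride-2 convs give 4x downsampling per axis,
    i.e. 16x fewer tokens (4096 -> 256). Kernel sizes are assumptions."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),  # 64x64 -> 32x32
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),  # 32x32 -> 16x16
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        b, n, d = tokens.shape                  # (B, 4096, D)
        h = w = int(n ** 0.5)                   # 64x64 patch grid
        x = tokens.transpose(1, 2).reshape(b, d, h, w)
        x = self.net(x)                         # (B, D, 16, 16)
        return x.flatten(2).transpose(1, 2)     # (B, 256, D)

dim = 768
# Stage 1 (SAM-base, windowed attention): 1024x1024 image -> 4096 patch tokens.
patch_tokens = torch.randn(1, 4096, dim)
# Stage 2: 16x token compression before any global attention runs.
compressed = TokenCompressor(dim)(patch_tokens)
print(compressed.shape)  # torch.Size([1, 256, 768])
# Stage 3 (CLIP-large, global attention) now attends over 256 tokens instead
# of 4096, cutting the quadratic attention cost by ~(4096/256)^2 = 256x.
```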
The brilliance lies in efficiency:
- Flagship dense VLMs: 72B-76B parameters, all activated on every forward pass
- DeepSeek-OCR decoder: 3B parameters, only ~570M activated
- The MoE architecture routes each token through a small subset of experts, as the arithmetic below shows
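A quick parameter count shows how sparse routing gets from ~3B stored to ~570M active. Every number below is an assumption chosen to reproduce the headline figures, not DeepSeek's published breakdown.

```python
# Illustrative MoE accounting; all values are hypothetical assumptions.
n_experts   = 64      # routed experts stored in the decoder (assumed)
top_k       = 6       # routed experts active per token (assumed)
n_shared    = 2       # always-active shared experts (assumed)
expert_size = 40e6    # parameters per expert (hypothetical)
dense_size  = 250e6   # attention, embeddings, etc. (hypothetical)

stored = dense_size + n_experts * expert_size            # ~2.8B on disk
active = dense_size + (top_k + n_shared) * expert_size   # ~0.57B per token
print(f"stored ~{stored / 1e9:.1f}B, activated ~{active / 1e9:.2f}B")
```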
Multi-Resolution Support: Tiny to Gundam
Six modes support various document types:
| Mode | Resolution | Tokens | Use Case |
|---|---|---|---|
| Tiny | 512×512 | 64 | Simple slides, docs |
| Small | 640×640 | 100 | General documents |
| Base | 1024×1024 | 256 | Complex documents |
| Large | 1280×1280 | 400 | High-quality docs |
| Gundam | Dynamic | 800+ | Newspapers, ultra-high-res |
One model adapts "compression strength" to document complexity.
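The fixed modes all follow one arithmetic pattern: 16×16-pixel patches followed by the 16× token compressor, which works out to tokens = (side / 64)². The patch size is my assumption (the standard ViT choice), but it reproduces every row of the table:

```python
def vision_tokens(side_px: int, patch: int = 16, downsample: int = 4) -> int:
    """Token count for a square input: patchify, then 4x4 spatial compression."""
    grid = side_px // patch // downsample
    return grid * grid

for name, side in [("Tiny", 512), ("Small", 640), ("Base", 1024), ("Large", 1280)]:
    print(f"{name:>5}: {side}x{side} -> {vision_tokens(side)} tokens")
# Tiny: 64, Small: 100, Base: 256, Large: 400 -- matching the table.
# Gundam mode instead tiles a dynamic-resolution page into multiple crops,
# so its token count (800+) grows with page size rather than a fixed side.
```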
Compression-Accuracy Trade-offs
Fox benchmark data reveals where compression breaks down:
- 10× compression: ~97% accuracy, the sweet spot for most documents
- 20× compression: ~60% accuracy, with degradation from complex layouts and low-resolution blur

The blur effect naturally mimics a "forgetting mechanism", foreshadowing the long-context idea discussed below.
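One practical way to use these two data points is as a mode-selection heuristic: count the text tokens a page would need, pick a vision-token budget, and check which regime the ratio lands in. The thresholds below come from the Fox numbers above; the three-regime split is my simplification, not a curve from the paper.

```python
def compression_regime(text_tokens: int, vision_tokens: int) -> str:
    """Classify a page by compression ratio, anchored to the Fox benchmark points."""
    ratio = text_tokens / vision_tokens
    if ratio <= 10:
        return f"{ratio:.1f}x: near-lossless (~97% accuracy)"
    if ratio <= 20:
        return f"{ratio:.1f}x: degrading (between ~97% and ~60%)"
    return f"{ratio:.1f}x: lossy (<60% accuracy)"

# A dense page of ~2,500 text tokens rendered into Base mode (256 tokens):
print(compression_regime(text_tokens=2500, vision_tokens=256))  # ~9.8x: near-lossless
```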
Beyond OCR: Deep Document Parsing
DeepSeek-OCR goes far beyond text recognition:
- Chart conversion: Financial reports → structured data (bar/line/pie charts)
- Chemical formulas: Molecular diagrams → SMILES format (critical for research)
- Geometric figures: Educational applications
- 100+ languages: Exceptional performance on both common and rare languages
The Most Imaginative Part: Memory Forgetting
The paper's coolest proposal: simulate human memory decay through resolution reduction.
Human memory fades with time. DeepSeek-OCR can mimic this:
- 1 hour ago: Crystal clear → Gundam mode (800+ tokens)
- 1 week ago: Getting fuzzy → Base mode (256 tokens)
- 1 year ago: Nearly forgotten → Tiny mode (64 tokens)
This enables "theoretically infinite context windows" by letting distant memories naturally fade, just like human cognition.
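A toy schedule makes the idea concrete. The age thresholds and per-page budgets below are hypothetical; the paper proposes the mechanism, not this particular decay curve.

```python
def mode_for_age(age_hours: float) -> tuple[str, int]:
    """Map a memory's age to a resolution mode and vision-token budget."""
    if age_hours < 24:
        return ("Gundam", 800)      # recent: keep full detail
    if age_hours < 24 * 7:
        return ("Base", 256)        # days old: compress
    return ("Tiny", 64)             # old: heavily blurred, nearly forgotten

# Three remembered "pages": one hour, two days, and about a year old.
ages = [1, 48, 24 * 365]
budget = sum(mode_for_age(a)[1] for a in ages)
print(budget)  # 800 + 256 + 64 = 1120 tokens, vs. 2400 at uniform Gundam
```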
Open Source and Community Impact
Fully open-sourced under the MIT license:
- GitHub: https://github.com/deepseek-ai/DeepSeek-OCR
- HuggingFace: https://huggingface.co/deepseek-ai/DeepSeek-OCR
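For readers who want to try it, here is a minimal usage sketch following the trust_remote_code pattern from the Hugging Face model card. The `model.infer` arguments and prompt string mirror the card at release time and may have changed since; treat the linked repo README as authoritative.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# The released model ships its own modeling code, hence trust_remote_code.
name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModel.from_pretrained(name, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

# Prompt and arguments follow the model card at the time of writing:
# base_size/image_size select the resolution mode, crop_mode enables
# Gundam-style tiling for large pages.
prompt = "<image>\n<|grounding|>Convert the document to markdown."
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="page.png",       # your input document image
    output_path="./ocr_out",
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
)
```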
The project rapidly gained 3.3K+ GitHub stars and became a Hacker News sensation. Community discussion focuses not just on the implementation but on the paradigm shift: "vision as an information compression medium."
Conclusion: Revisiting the Question
Is text the best way to compress information?
DeepSeek-OCR's answer: Not necessarily.
From information theory:
- Visual tokens convey more bits per token
- Images are 2D; text is 1D
- Visual tokens operate in semantic space; text tokens are just subword slices
From evolutionary biology:
- Vision is humanity's primary information processing channel
- Our ancestors relied on vision to survive for hundreds of thousands of years before writing existed
- Egyptian hieroglyphs and Dunhuang murals were themselves forms of compression
DeepSeek-OCR is doing what humans did millennia ago, except this time it's AI learning from human wisdom.