Tags: DeepSeek · Information Theory · Visual Compression · AI Architecture

Why One Visual Token Beats Ten Text Tokens: Information Theory Lessons from DeepSeek-OCR

October 20, 2025 · 金色传说大聪明 · 10 min read

Is Text Really the Best Way to Compress Information?

This deceptively simple question cuts deep. DeepSeek-OCR finally provides data-driven answers: visual tokens can be more efficient than text tokens.

Understanding Visual Compression Through Information Theory

A highly upvoted Hacker News comment captured the key insight:

Text tokens = discrete lookup table:

  • Small integer (token ID) → lookup → vector
  • Limited token space: typically ~100K possible tokens
  • Each token maps to a few UTF-8 bytes
  • Most tokenizers don't create cross-word-boundary tokens

Visual tokens = continuous value vectors:

  • No lookup table—direct image-to-vector encoding
  • Massive token space: high-dimensional float vectors with many possible values per dimension
  • Can convey more bits per token

This explains DeepSeek-OCR's 10x compression capability.
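The capacity gap can be made concrete with a back-of-envelope sketch. The vocabulary size matches the "~100K tokens" figure above; the visual-token dimensionality and effective bits per dimension are illustrative assumptions, not numbers from the paper:

```python
import math

# A text token is an index into a finite vocabulary, so its information
# content is bounded by log2(vocab_size) bits, no matter how large the
# embedding vector it maps to.
vocab_size = 100_000
text_bits = math.log2(vocab_size)  # ~16.6 bits per text token

# A visual token is a continuous vector emitted directly by the encoder.
# Even a conservative estimate (d dimensions, b "effective" bits per
# dimension after quantization and noise) gives far more capacity.
d, b = 1024, 4  # illustrative assumptions
visual_bits = d * b

print(f"text token   ~ {text_bits:.1f} bits")
print(f"visual token <= {visual_bits} bits (loose upper bound)")
print(f"capacity ratio ~ {visual_bits / text_bits:.0f}x")
```

The exact ratio depends heavily on how many bits per dimension survive training, but the asymmetry itself is why one visual token can stand in for many text tokens.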

DeepEncoder: Elegant Three-Stage Architecture

DeepSeek-OCR's core is the DeepEncoder architecture—just 380M parameters but meticulously designed:

Stage 1: Low-Activation Local Processing

  • 80M SAM-base + windowed attention
  • 1024×1024 images → 4096 patch tokens
  • Controlled activation memory

Stage 2: 16× Compression

  • 2-layer convolutional module, 16× downsampling
  • 4096 tokens → 256 tokens
  • Drastically reduces computation before global attention

Stage 3: Global Semantic Understanding

  • 300M CLIP-large + global attention
  • Deep understanding of 256 compressed tokens
  • Acceptable compute thanks to reduced input size
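The token counts in the three stages above follow from simple arithmetic, sketched here assuming the standard 16×16-pixel ViT/SAM patch size (the patch size is our assumption; the 4096 and 256 figures are from the article):

```python
# Token-count flow through DeepEncoder's three stages.
image_size = 1024
patch_size = 16  # assumed: standard ViT/SAM patching

# Stage 1: SAM-base with windowed attention over raw patches
patch_tokens = (image_size // patch_size) ** 2
print(patch_tokens)  # 4096

# Stage 2: 2-layer conv module, 16x token downsampling
# (4x along each spatial axis)
compressed_tokens = patch_tokens // 16
print(compressed_tokens)  # 256

# Stage 3: CLIP-large global attention now runs on 256 tokens, not 4096.
# Since attention cost scales quadratically with sequence length, this
# cuts the global-attention FLOPs by (4096/256)^2 = 256x.
attention_savings = (patch_tokens / compressed_tokens) ** 2
print(attention_savings)  # 256.0
```

Placing the compressor *between* local and global attention is the key design choice: the expensive quadratic stage only ever sees the short sequence.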

The brilliance lies in efficiency:

  • Most VLMs: 72B-76B activated parameters
  • DeepSeek-OCR decoder: 3B parameters, only 570M activated
  • MoE architecture activates sparse experts per inference

Multi-Resolution Support: Tiny to Gundam

Several resolution modes support various document types:

| Mode   | Resolution | Tokens | Use Case                    |
|--------|------------|--------|-----------------------------|
| Tiny   | 512×512    | 64     | Simple slides, docs         |
| Small  | 640×640    | 100    | General documents           |
| Base   | 1024×1024  | 256    | Complex documents           |
| Large  | 1280×1280  | 400    | High-quality docs           |
| Gundam | Dynamic    | 800+   | Newspapers, ultra-high-res  |

One model adapts "compression strength" to document complexity.
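The fixed-resolution rows all follow one formula: patch the image into 16×16 tiles, then apply the 16× token compressor. A quick sketch (the patch size is our assumption, consistent with the 4096→256 figures earlier):

```python
def vision_tokens(resolution: int, patch: int = 16, compression: int = 16) -> int:
    """Token count for a square input: (resolution/patch)^2 patches,
    divided by the 16x compressor. Patch size is an assumption."""
    return (resolution // patch) ** 2 // compression

for name, res in {"Tiny": 512, "Small": 640, "Base": 1024, "Large": 1280}.items():
    print(name, vision_tokens(res))
# Tiny 64, Small 100, Base 256, Large 400 — matching the table
```

Gundam mode is the exception: it tiles the page dynamically, so its token count grows with document size rather than following this formula.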

Compression-Accuracy Trade-offs

Fox benchmark data reveals compression boundaries:

10× compression: ~97% accuracy—the sweet spot for most documents

20× compression: ~60% accuracy—degradation from complex layouts and low-resolution blur

The blur effect naturally mimics a "forgetting mechanism"—foreshadowing long-context applications.
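"Compression ratio" here means ground-truth text tokens divided by vision tokens. A small sketch of how the regimes above play out when choosing a mode (the helper names and exact cutoffs are illustrative, not from the paper):

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    # The quantity the Fox benchmark sweeps: how many text tokens
    # each vision token must carry.
    return text_tokens / vision_tokens

def expected_regime(ratio: float) -> str:
    # Rough regimes from the figures quoted above (illustrative cutoffs).
    if ratio <= 10:
        return "high fidelity (~97% accuracy)"
    if ratio <= 20:
        return "lossy (~60% accuracy)"
    return "heavily degraded"

# A 1,000-token page in Base mode (256 vision tokens): ~3.9x, comfortably safe.
print(expected_regime(compression_ratio(1000, 256)))
# The same page in Tiny mode (64 vision tokens): ~15.6x, entering the lossy zone.
print(expected_regime(compression_ratio(1000, 64)))
```

The practical takeaway: pick the mode so the ratio stays at or below ~10× for content you cannot afford to lose.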

Beyond OCR: Deep Document Parsing

DeepSeek-OCR goes far beyond text recognition:

  • Chart conversion: Financial reports → structured data (bar/line/pie charts)
  • Chemical formulas: Molecular diagrams → SMILES format (critical for research)
  • Geometric figures: Educational applications
  • 100+ languages: Exceptional performance on both common and rare languages

The Most Imaginative Part: Memory Forgetting

The paper's coolest proposal: simulate human memory decay through resolution reduction.

Human memory fades with time. DeepSeek-OCR can mimic this:

  • 1 hour ago: Crystal clear → Gundam mode (800+ tokens)
  • 1 week ago: Getting fuzzy → Base mode (256 tokens)
  • 1 year ago: Nearly forgotten → Tiny mode (64 tokens)

This enables "theoretically infinite context windows" by letting distant memories naturally fade—just like human cognition.
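One way to picture the decay schedule as code. This is a hypothetical sketch: the age cutoffs and the mode mapping are our own illustration of the idea, not an API from the paper:

```python
# Hypothetical resolution-based memory decay: older context is re-rendered
# at lower resolution, so it occupies fewer tokens in the context window.
HOUR, WEEK, YEAR = 3600, 7 * 86400, 365 * 86400

def memory_mode(age_seconds: float) -> tuple[str, int]:
    """Map a memory's age to an (illustrative) mode and token budget."""
    if age_seconds <= HOUR:
        return ("Gundam", 800)  # crystal clear
    if age_seconds <= WEEK:
        return ("Base", 256)    # getting fuzzy
    if age_seconds <= YEAR:
        return ("Tiny", 64)     # nearly forgotten
    return ("discard", 0)       # fully forgotten

# Three memories of increasing age cost 800 + 256 + 64 = 1120 tokens,
# instead of 3 x 800 = 2400 at full resolution.
ages = [600, 3 * 86400, 90 * 86400]
print(sum(memory_mode(a)[1] for a in ages))  # 1120
```

Because old memories shrink instead of being truncated, total context cost grows sublinearly with history length, which is what makes the "theoretically infinite" framing plausible.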

Open Source and Community Impact

Fully open-sourced under the MIT license, DeepSeek-OCR rapidly gained 3.3K+ GitHub stars and became a Hacker News sensation. Community discussion focuses not just on the implementation, but on the paradigm shift: "vision as an information compression medium."

Conclusion: Revisiting the Question

Is text the best way to compress information?

DeepSeek-OCR's answer: Not necessarily.

From information theory:

  • Visual tokens convey more bits per token
  • Images are 2D; text is 1D
  • Visual tokens operate in semantic space; text tokens are just subword slices

From evolutionary biology:

  • Vision is humanity's primary information processing channel
  • Our ancestors survived by seeing for hundreds of thousands of years before writing existed
  • Egyptian hieroglyphs and Dunhuang murals were themselves forms of compression

DeepSeek-OCR is doing what humans did millennia ago—except this time, it's AI learning from human wisdom.

About 金色传说大聪明

Author of the WeChat official account 赛博禅心

https://mp.weixin.qq.com/s/Vw1DJq0kB_GgyebwYEg0dA