Tags: DeepSeek · Silicon Valley · AI Innovation · JPEG Moment

AI's JPEG Moment: Why Silicon Valley Can't Stop Raving About DeepSeek-OCR

October 21, 2025 · 一水 · 15 min read


Silicon Valley is absolutely raving about DeepSeek's latest open-source model!

It's quintessentially DeepSeek: 3B parameters, order-of-magnitude efficiency gains, elegant simplicity. Some even claim it open-sourced one of Google Gemini's closely guarded trade secrets.

The only problem? It's held back by the modest "OCR" name.

Core Innovation: Vision as Text Compression Medium

DeepSeek-OCR tackles the computational explosion of long-context processing. Despite its small parameter count, it gets outsized leverage from one simple idea: compress everything visually.

Core insight:

  • One image can carry a large amount of text while using far fewer tokens
  • Vision as text compression medium
  • Like speed readers scanning pages instantly, not word-by-word
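The arithmetic behind this insight fits in a few lines of Python. The word count and tokens-per-word ratio below are illustrative assumptions; the 100-visual-token figure comes from the article's "Small" mode.

```python
# Back-of-envelope: token cost of a document page as text vs as an image.
# All figures are illustrative; the 100-token mode is from the article.

def text_token_estimate(n_words: int, tokens_per_word: float = 1.3) -> int:
    """Rough BPE-style estimate: ~1.3 tokens per English word (assumption)."""
    return round(n_words * tokens_per_word)

def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each visual token stands in for."""
    return text_tokens / vision_tokens

page_words = 600                                # a dense page (assumption)
text_tokens = text_token_estimate(page_words)   # ~780 tokens as plain text
vision_tokens = 100                             # "Small" mode, per the article

print(text_tokens, vision_tokens)               # 780 100
print(compression_ratio(text_tokens, vision_tokens))  # 7.8x compression
```

At that ratio the page sits comfortably inside the sub-10× regime where the article reports ~97% decoding accuracy.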

Stunning Compression Results

DeepSeek's findings:

Under 10× compression:

  • Text tokens up to 10× the number of visual tokens
  • OCR decoding accuracy: ~97%

At 20× compression:

  • Accuracy still holds at ~60%
  • Remarkably capable for such aggressive compression

Production efficiency:

  • Single A100-40G GPU
  • Generates 200K+ pages of quality LLM/VLM training data daily

GitHub and HuggingFace Explode

Post-launch:

  • GitHub: 3.3K+ stars
  • HuggingFace: Trending #2
  • Hacker News: Hot topic

Karpathy and Community Response

Andrej Karpathy (Former Tesla AI Director):

"I really like this... especially images being better LLM inputs than words, brilliant."

Community comments:

  • "This is AI's JPEG moment"
  • "Opens new paths for AI memory architecture"
  • "Google Gemini's core trade secret, now open-sourced"

Two Core Components

Encoder: DeepEncoder. Converts images into highly compressed visual tokens.

Design highlights:

  • Local processing: 80M SAM-base with windowed attention
  • 16× compression: 2-layer conv module, 4096 → 256 tokens
  • Global understanding: 300M CLIP-large with global attention
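The 16× compression sits between the two attention stages, so the expensive global attention only ever sees the downsampled tokens. A rough token-accounting sketch of that pipeline (the patch size and conv strides are assumptions for illustration; the 4096 → 256 figure is from the article):

```python
# Token accounting through DeepEncoder (Base mode, 1024x1024 input).
# Patch size and stride details are assumptions; 4096 -> 256 is the
# article's reported compression for the 2-layer conv module.

def patch_tokens(image_size: int, patch: int = 16) -> int:
    """Tokens produced by a patch-based tokenizer (assumed 16x16 patches)."""
    side = image_size // patch
    return side * side

def conv_downsample(tokens: int, layers: int = 2, stride: int = 2) -> int:
    """Each stride-2 conv halves both spatial dims: 4x fewer tokens per layer."""
    for _ in range(layers):
        tokens //= stride * stride
    return tokens

sam_tokens = patch_tokens(1024)            # 4096 tokens -> windowed attention
clip_tokens = conv_downsample(sam_tokens)  # 256 tokens -> global attention
print(sam_tokens, clip_tokens)             # 4096 256
```

This ordering is the design's point: windowed attention handles the cheap local pass over 4096 tokens, and quadratic global attention only pays for 256.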

Key advantages:

  • Most VLMs: 72B-76B activated parameters
  • DeepSeek-OCR decoder: 3B parameters, only 570M activated
  • MoE architecture activates sparse experts per inference

Decoder: DeepSeek-3B-MoE. Reconstructs text from compressed visual tokens.

  • Activated parameters: 570M
  • Expressive capacity: Equivalent to 3B model
  • Inference efficiency: Comparable to a ~500M dense model
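Why a 3B MoE can decode at the cost of a ~500M dense model: only the routed experts run per token. A toy parameter-accounting sketch; the expert count, expert size, and shared-parameter split below are assumptions, since the article gives only the 3B total and ~570M activated figures.

```python
# Sparse-MoE accounting: total capacity vs the slice active per token.
# Expert counts and sizes are illustrative assumptions chosen to land
# near the article's 3B total / ~570M activated figures.

def moe_params(n_experts: int, expert_params: float, shared_params: float) -> float:
    """Total capacity: shared layers plus every expert's weights."""
    return shared_params + n_experts * expert_params

def active_params(top_k: int, expert_params: float, shared_params: float) -> float:
    """Per-token compute: shared layers plus only the top-k routed experts."""
    return shared_params + top_k * expert_params

shared = 250e6       # attention/embeddings always run (assumption)
per_expert = 45e6    # one FFN expert (assumption)
n_experts, top_k = 64, 7

print(moe_params(n_experts, per_expert, shared) / 1e9)   # ~3.1B capacity
print(active_params(top_k, per_expert, shared) / 1e6)    # ~565M per token
```

The gap between the two numbers is the whole trick: expressive capacity scales with total experts, inference cost with the few that fire.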

OmniDocBench: New SOTA

On the mainstream document parsing benchmark:

| Comparison  | DeepSeek-OCR | GOT-OCR2.0 | MinerU2.0    |
|-------------|--------------|------------|--------------|
| Tokens      | 100          | 256        | 7000+        |
| Performance | Surpasses    | Baseline   | Outperformed |

Detailed comparison:

  • 100 tokens → Surpasses GOT-OCR2.0's 256 tokens
  • 400 tokens (285 effective) → Matches previous SOTA
  • 800 tokens → Far exceeds MinerU2.0's 7000+ tokens

Multi-Resolution Support: Tiny to Gundam

| Mode   | Resolution | Tokens | Compression | Use Case                   |
|--------|------------|--------|-------------|----------------------------|
| Tiny   | 512×512    | 64     | ~20×        | Simple docs, slides        |
| Small  | 640×640    | 100    | ~15×        | General books, reports     |
| Base   | 1024×1024  | 256    | ~10×        | Standard documents         |
| Large  | 1280×1280  | 400    | ~7×         | High-quality docs          |
| Gundam | Dynamic    | 800+   | ~5×         | Newspapers, ultra-high-res |

Practical performance:

  • Books and reports: 100 visual tokens for good performance
  • Most documents: Under 1000 text tokens
  • Best results: Visual token compression ≤10×
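The mode table plus the ≤10× guideline suggests a simple selection rule: pick the cheapest mode that keeps compression in the high-accuracy regime. The helper below is hypothetical, not part of the released API; the mode names and token counts come from the article's table.

```python
# Pick the cheapest DeepSeek-OCR mode that keeps compression <= 10x,
# per the article's guidance that accuracy stays ~97% below that ratio.
# This helper is a hypothetical sketch, not the released API.

MODES = {            # mode -> visual tokens (from the article's table)
    "Tiny": 64, "Small": 100, "Base": 256, "Large": 400, "Gundam": 800,
}

def pick_mode(estimated_text_tokens: int, max_ratio: float = 10.0) -> str:
    for name, vtokens in MODES.items():   # dict preserves cheapest-first order
        if estimated_text_tokens / vtokens <= max_ratio:
            return name
    return "Gundam"                       # fall back to the largest mode

print(pick_mode(600))    # Tiny   (600/64  ≈ 9.4×)
print(pick_mode(900))    # Small  (900/100 = 9.0×)
print(pick_mode(5000))   # Gundam (5000/800 ≈ 6.3×)
```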

Beyond OCR: Deep Parsing Capabilities

DeepSeek-OCR transcends text recognition:

1. Chart conversion: Financial reports → structured data (bar/line/pie charts)

2. Chemical formulas: Molecular diagrams → SMILES format (critical for research)

3. Mathematical geometry: Geometric figure recognition (educational applications)

4. Multilingual support: Nearly 100 languages (common and rare)

5. General image understanding: Description, object detection, grounding

Memory Forgetting: Simulating Human Intelligence

DeepSeek proposes a mind-bending idea: simulate human forgetting through optical compression.

Core analogy:

  • Human memory: Decays over time
  • Visual perception: Degrades with spatial distance
  • Both exhibit progressive information loss patterns

Implementation:

| Time Dimension | Memory Clarity   | Mode   | Tokens | Compression |
|----------------|------------------|--------|--------|-------------|
| Very recent    | Crystal clear    | Gundam | 800+   | Low         |
| Recent         | Basically clear  | Large  | 400    | Medium-low  |
| Medium-term    | Getting fuzzy    | Base   | 256    | Medium      |
| Distant        | Very fuzzy       | Small  | 100    | Medium-high |
| Ancient        | Nearly forgotten | Tiny   | 64     | High        |

Theoretical significance:

  • Recent information maintains high fidelity
  • Distant memories progressively compress, naturally fading
  • Could enable "theoretically infinite context windows"
  • Not infinite expansion, but natural information decay
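Read as a policy, the idea is to re-encode older context at progressively smaller modes so the token budget decays with age instead of growing without bound. A minimal sketch, assuming hypothetical age thresholds (the mode ladder follows the article's table, but the turn counts are invented for illustration):

```python
# Optical "forgetting": older context pages get re-encoded at smaller
# modes, so their token cost decays with age. Age thresholds here are
# illustrative assumptions; the mode/token ladder is from the article.

LADDER = [            # (max_age_in_turns, mode, visual tokens) - assumption
    (2,   "Gundam", 800),
    (10,  "Large",  400),
    (50,  "Base",   256),
    (200, "Small",  100),
]

def tokens_for_age(age: int) -> tuple:
    for max_age, mode, vtokens in LADDER:
        if age <= max_age:
            return mode, vtokens
    return "Tiny", 64             # ancient context: nearly forgotten

# A 300-turn history under decay vs storing everything at full fidelity:
decayed = sum(tokens_for_age(age)[1] for age in range(1, 301))
flat = 300 * 800
print(decayed, flat)              # 36440 240000
```

Under these assumed thresholds the decayed history costs roughly 15% of the flat one, which is the sense in which the context window becomes "theoretically infinite": old information fades rather than accumulating.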

Three Modest Authors

Haoran Wei: Formerly at StepFun, where he led GOT-OCR2.0 development; continues that technical line with DeepSeek-OCR

Yaofeng Sun: Contributed to DeepSeek R1, V3, and multiple other models; a long-standing core team member

Yukun Li: Nearly 10K Google Scholar citations; participated in DeepSeek V2/V3 development

Why "AI's JPEG Moment"?

JPEG's historical lesson:

  • 1992 JPEG standard released
  • Lossy compression drastically reduced image file sizes
  • Compression of 10:1 or more, with losses imperceptible to the human eye
  • Revolutionized image storage and transmission

DeepSeek-OCR parallel:

  • Visual tokens compress text tokens
  • 97% accuracy at 10:1 compression
  • Revolutionizes multimodal model efficiency
  • Novel approach to long-context problems

Practical Value and Deployment

Data generation efficiency:

  • 20 compute nodes (8× A100-40G each)
  • Daily generation: 33 million pages of training data
  • Single GPU: 200K+ pages per day
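The cluster figure follows directly from the single-GPU figure, as a quick check confirms:

```python
# Cluster throughput check: 20 nodes x 8 A100-40G, ~200K pages/GPU/day.
nodes, gpus_per_node, pages_per_gpu = 20, 8, 200_000
daily_pages = nodes * gpus_per_node * pages_per_gpu
print(daily_pages)   # 32000000 -> the article's "33 million" rounds this up
```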

Application scenarios:

  1. LLM/VLM pre-training data generation
  2. Document deep parsing: Financial reports, research papers
  3. Scientific document processing: Chemical formulas, mathematical equations
  4. Multilingual document parsing: Nearly 100 languages supported
  5. Long conversation systems: Leveraging memory forgetting mechanism

Open Source and Future

Open source info: Code and model weights are publicly released on GitHub and Hugging Face.

Future directions:

  1. Higher compression: Explore 20×+ compression possibilities
  2. Refine forgetting mechanism: Validate in long-context scenarios
  3. Expand applications: From OCR to general visual-text compression
  4. Optimize inference: Further reduce computational costs

Conclusion: Paradigm Shift

DeepSeek-OCR isn't just a technical breakthrough—it's a paradigm shift:

From 1D to 2D: Text is linear; images enable parallel understanding

From discrete to continuous: Text tokens are discrete lookup-table entries; visual tokens live in continuous vector spaces

From memory to forgetting: Traditional AI pursues infinite memory; DeepSeek-OCR learns human forgetting

From perfection to efficiency: Not pursuing 100% accuracy—achieving 10× efficiency at 97% accuracy

As the community says, this may be "AI's JPEG moment"—not perfect losslessness, but revolutionary efficiency gains at acceptable quality loss.

DeepSeek proves once again: Elegant simplicity, efficiency rules.
