Karpathy Speaks: Did We Feed AI the Wrong "Diet" from the Start?
Karpathy's Disruptive Take: Are Pixels the Ideal LLM Input?
An OCR Review That's "Not About OCR"
On October 20, 2025, DeepSeek released the DeepSeek-OCR paper. Conventionally, industry experts would focus on metrics like "recognition rate improvements" or "which models were outperformed."
But Andrej Karpathy—former Tesla AI Director and OpenAI co-founder—took a completely unexpected angle. He tweeted:
"I quite like the new DeepSeek-OCR paper. It's a good OCR model (maybe a bit worse than DOTS), and yes data collection etc., but anyway it doesn't matter. The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language person) is: Are pixels better inputs to LLMs than text? Are text tokens wasteful and just terrible, at the input?"
This is packed with implications. Karpathy essentially waved his hand and said: "OCR performance? Doesn't matter." What truly excited him was a fundamental hypothesis inadvertently validated by DeepSeek-OCR—we've been feeding AI the wrong "diet" from the start.
Core Argument: Why Pixels Over Text?
Karpathy proposed a bold vision: Maybe all LLM inputs should only ever be images (pixels). Even if you have pure text input, maybe you'd prefer to render it and then feed that in.
Sounds counterintuitive, right? Why convert perfectly good text into images?
Karpathy offers four reasons:
1. Superior Information Compression
DeepSeek-OCR revealed a stunning metric: 100 visual tokens can be accurately "decompressed" back into content equivalent to roughly 1000 text tokens.
It's like feeding AI:
- Text input: A verbose "instruction manual" (1000 tokens)
- Pixel input: A compact "information energy bar" (100 tokens)
Shorter context windows mean higher efficiency. As Karpathy noted: "More information compression => shorter context windows, more efficiency."
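As a back-of-the-envelope sketch of that claim (all numbers below are illustrative assumptions, not DeepSeek-OCR's actual pipeline: roughly 4 characters per subword token, 16x16-pixel patches, and a 16x compressor between the patch embedder and the language model):

```python
# Back-of-the-envelope comparison of text tokens vs. vision tokens for the
# same page. The constants are illustrative assumptions, not DeepSeek-OCR's
# exact numbers.
CHARS_PER_TEXT_TOKEN = 4   # rough average for English subword tokenizers
PATCH = 16                 # ViT-style patch edge, in pixels
TOKEN_COMPRESSION = 16     # assumed downsampling between raw patches and the LLM

page_chars = 4_000            # a reasonably dense page of text
page_w, page_h = 1024, 1024   # resolution the page is rendered at

text_tokens = page_chars // CHARS_PER_TEXT_TOKEN
vision_tokens = (page_w // PATCH) * (page_h // PATCH) // TOKEN_COMPRESSION

print(f"text tokens   ~ {text_tokens}")    # ~1000
print(f"vision tokens ~ {vision_tokens}")  # 64 * 64 / 16 = 256
# Denser pages (more characters on the same canvas) push the ratio toward
# the ~10x compression the paper reports.
```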
2. More General, More Faithful Information Stream
Imagine asking AI to read a webpage.
Current text input: Like reading webpage content to AI over the phone. All visual information—bold, colors, fonts, layouts—is lost.
Pixel input: Like sending AI a screenshot directly.
Karpathy argues pixels provide a "significantly more general information stream"—handling not just text, but:
- Bold text
- Colored text
- Arbitrary images
This is the advantage of "information fidelity": what you see, AI can "see" too.
3. Unlocking Bidirectional Attention
This is more technical.
Current text tokens typically pass through autoregressive (causal) attention: each token can only attend to the tokens before it, never to what comes after.
Pixel input makes bidirectional attention natural, like human reading where you survey the whole page first, grasp the global structure, then focus on details.
Karpathy believes this approach is "a lot more powerful."
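A minimal NumPy sketch of the difference (purely illustrative, not tied to any particular model): with a causal mask, position i can only attend to positions up to i; with no mask, attention is bidirectional and every position sees the whole sequence.

```python
import numpy as np

def attention_weights(scores: np.ndarray, causal: bool) -> np.ndarray:
    """Row-wise softmax over attention scores, optionally with a causal mask."""
    n = scores.shape[0]
    if causal:
        # Position i may only attend to positions j <= i (no looking ahead).
        mask = np.tril(np.ones((n, n), dtype=bool))
        scores = np.where(mask, scores, -np.inf)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
print(np.round(attention_weights(scores, causal=True), 2))   # lower-triangular: left-to-right only
print(np.round(attention_weights(scores, causal=False), 2))  # dense: every token sees the whole page
```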
4. Delete the Tokenizer
This is Karpathy's most passionate point. He declared bluntly:
"Delete the tokenizer (at the input)!! I already ranted about how much I dislike the tokenizer. Tokenizers are ugly, separate, not end-to-end stage. It 'imports' all the ugliness of Unicode, byte encodings, inherits historical baggage, security/jailbreak risk (e.g. continuation bytes)... The tokenizer must go."
Why does Karpathy hate tokenizers so intensely?
The Five Crimes of Tokenizers:
Crime 1: Distorting Information Perception
A smiling emoji "😀":
- Via tokenizer: AI sees a cryptic internal code like [tok482], so it can't leverage the knowledge about "faces" and "smiles" it learned from vision (no transfer learning).
- Via pixel input: AI's "vision" immediately recognizes: oh, that's a smiling face.
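To make the tokenizer side concrete, here is a quick check using the tiktoken library (an assumption for illustration; any subword tokenizer shows the same effect). The exact IDs depend on the vocabulary; the point is only that the emoji reaches the model as opaque integers rather than as a picture of a face.

```python
# Requires: pip install tiktoken (assumed available)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # a common byte-pair vocabulary
ids = enc.encode("😀")

# The model never "sees" a face: it gets opaque integer IDs derived from the
# emoji's UTF-8 bytes. The exact values depend on the vocabulary used.
print(ids)
print(enc.decode(ids) == "😀")   # round-trips fine, but the IDs carry no visual meaning
```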
Crime 2: Identical-Looking Characters, Internally Different
Latin "A" vs Greek "Α" (Alpha) look nearly identical to human eyes, but tokenizers map them to completely different tokens.
Crime 3: Historical Baggage
Unicode, byte encodings, various character sets... tokenizers inherit all these "legacy problems," forcing models to handle task-irrelevant complexity.
Crime 4: Security Risks
Karpathy mentions "continuation bytes": attackers can exploit tokenizer encoding quirks to construct malicious inputs that bypass safety checks (jailbreak).
Crime 5: Not End-to-End
Tokenizers are "middlemen" forcibly inserted between "raw text" and "AI brain," violating deep learning's "end-to-end learning" philosophy.
Karpathy's verdict: The tokenizer must go.
New AI Architecture Vision: "Input with Eyes, Output with Mouth"
Based on this analysis, Karpathy envisions a new AI architecture:
- Input (User Message): only receives images (pixels)
- Output (Assistant Response): remains text
Why this design?
Input: Why Pixels?
- OCR is just one of many "vision→text" tasks. Others include chart understanding, handwriting recognition, scene text extraction...
- "Text→text" tasks can become "vision→text" tasks, not vice versa.
In other words: visual input is a more universal "superset."
Output: Why Still Text?
Karpathy admits: "It's a lot less obvious how to output pixels realistically... or if you'd want to."
Simple reasons:
- Input task: "Understanding an image" is relatively easy, with mature vision encoders.
- Output task: "Generating a realistic image" is extremely hard, requiring generative models, high cost, unstable results.
Moreover, for most applications (chatbots, document analysis, code generation), users need text answers, not image outputs.
Thus, "Input with eyes (pixels), output with mouth (text)" leverages visual input advantages while maintaining text output practicality.
How Does This Relate to DeepSeek-OCR?
Karpathy sees DeepSeek-OCR as a "Proof-of-Concept":
It experimentally demonstrated that using "vision" to "read" is feasible and potentially more efficient.
DeepSeek-OCR's key metrics:
- 10x compression with 97% accuracy
- 100 tokens outperform GOT-OCR2.0's 256 tokens
This isn't just "text-to-text" becoming "vision-to-text." It suggests a fundamental shift—AI's primary information gateway is moving from "language" to "vision".
Community Response: From "Makes Sense" to "I Want to Try"
Karpathy's perspective sparked heated AI community discussion.
Chinese tech blogger Baoyu (@dotey) provided a detailed interpretation, summarizing Karpathy's core points:
- Disruptive idea: We fed AI the wrong "diet" from the start
- Efficiency: Pixels are "high-density information bars," shorter context windows
- Fidelity: Pixel input preserves visual information such as styling and layout
- Bypass tokenizer: Let AI "see is believing," avoid tokenizer distortion
- Input shift: AI's main gateway moving from "language" to "vision"
Baoyu's interpretation spread widely in Chinese AI circles, prompting developers to question: "Are text tokens really optimal?"
Karpathy himself admitted he now has to "fight the urge to side quest an image-input-only version of nanochat."
This bit of programmer humor reflects something real: even as a top AI researcher with no shortage of "serious projects," Karpathy finds the direction opened by DeepSeek-OCR tempting enough that he wants to experiment with it right away.
What Could Change?
If Karpathy's vision materializes, AI architecture could fundamentally transform:
1. Multimodal Models Become Default
Future "language models" may not purely process language, but natively possess visual understanding.
2. Context Window Problem Alleviated
If 100 visual tokens replace 1000 text tokens, a model handling 100K tokens could theoretically process information equivalent to 1M tokens.
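Spelling out that arithmetic (taking the roughly 10x figure at face value; real ratios would vary with content and rendering resolution):

```python
# Illustrative only: if one vision token stands in for ~10 text tokens' worth
# of content, a fixed context budget stretches roughly 10x further.
COMPRESSION_RATIO = 10            # ~100 vision tokens ~ 1000 text tokens
context_window_tokens = 100_000   # the model's actual token budget

effective_text_tokens = context_window_tokens * COMPRESSION_RATIO
print(f"{context_window_tokens:,} vision tokens ~ {effective_text_tokens:,} text-token equivalents")
# 100,000 vision tokens ~ 1,000,000 text-token equivalents
```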
3. Tokenizers Might Actually Disappear
At least for input, future models may directly receive pixels, skipping tokenization entirely.
4. AI "Memory" Mechanisms Redesigned
DeepSeek-OCR's "memory forgetting mechanism" (high-res images for recent info, low-res for distant info) could solve long-context problems.
Conclusion: An "OCR Paper" Sparking Paradigm Reflection
DeepSeek-OCR was a technical paper about optical character recognition, yet Karpathy's commentary turned it into a paradigm-level discussion of "how AI should perceive the world."
As Karpathy stated:
"OCR is just one of many useful vision→text tasks. And text→text tasks can be made to be vision→text tasks. Not vice versa."
Maybe we really did feed AI the wrong "diet" from the start.
Maybe future AI should be like humans—using "eyes" to "see" the world, not just "ears" to "hear" words.
This isn't just technical optimization—it's a cognitive revolution.
References:
- Andrej Karpathy's original tweet: https://x.com/karpathy/status/1980397031542989305
- Baoyu's interpretation: https://x.com/dotey/status/1981156753191403606
- DeepSeek-OCR paper: https://github.com/deepseek-ai/DeepSeek-OCR