DeepSeek OCR 2: Advanced Optical Character Recognition Technology

    
By vramkickedin | January 30, 2026 at 7:38 pm | 2 min read

DeepSeek AI has unveiled DeepSeek OCR 2, an optical character recognition (OCR) system with significant improvements in vision token processing and encoder capabilities. The stated aim is for the model to read an image in a more human-like way.

The new model builds on the previous DeepSeek OCR framework, with changes aimed at better text extraction from images. The research team used a multi-stage training approach covering encoder pretraining, query enhancement, and decoder specialization.

Technical Architecture and Features

DeepSeek OCR 2 uses an attention mask that combines bidirectional and causal attention when processing visual tokens. Key features include:

DeepEncoder V2 with Visual Causal Flow
Two-stage reasoning loop
Small 3B parameter size
Document and layout handling
Token and resolution strategy
Multilingual support
Multi-format input supporting PDFs, images, and more
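
To make the mixed attention mask more concrete, here is a minimal sketch in plain Python. The layout is an assumption, not something the article specifies: a prefix of tokens that attend to each other bidirectionally, followed by tokens that see the whole prefix and attend causally among themselves. The function name and arguments are hypothetical.

```python
def build_mixed_mask(num_bidirectional: int, num_causal: int) -> list[list[bool]]:
    """Return mask[i][j], True when token i may attend to token j.

    Assumed layout (illustrative only): the first `num_bidirectional`
    tokens form a fully visible prefix; the remaining `num_causal`
    tokens see the prefix plus earlier tokens in their own group.
    """
    total = num_bidirectional + num_causal
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_bidirectional:
                # Every token may attend to the entire bidirectional prefix.
                mask[i][j] = True
            elif i >= num_bidirectional:
                # Inside the causal group, only current and earlier positions.
                mask[i][j] = j <= i
    return mask


if __name__ == "__main__":
    # Print a small mask: 1 = may attend, 0 = masked out.
    for row in build_mixed_mask(3, 3):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

In this sketch the prefix tokens never attend to the causal group, which keeps the prefix representation independent of whatever follows it.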

Training Methodology and Performance

DeepSeek AI trained DeepSeek OCR 2 with a three-stage pipeline:

Encoder pretraining using language modeling objectives
Query enhancement with unified data loading
Continued LLM training with frozen encoder parameters
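
The three stages above can be written down as a small configuration sketch. The article only states that the encoder is frozen in the last stage; the second stage's setting is marked unknown, and treating it as trainable here is an assumption. The parameter names and the `encoder.` prefix are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    # True or False when the article makes it clear, None when it does not.
    encoder_frozen: Optional[bool]


PIPELINE = [
    Stage("encoder pretraining (language modeling objective)", encoder_frozen=False),
    Stage("query enhancement (unified data loading)", encoder_frozen=None),
    Stage("continued LLM training", encoder_frozen=True),
]


def trainable_params(param_names: list[str], stage: Stage) -> list[str]:
    """Return the parameter names that would receive gradient updates.

    Assumption: encoder parameters share the hypothetical prefix
    "encoder.", and an unknown setting (None) is treated as trainable.
    """
    if stage.encoder_frozen:
        return [name for name in param_names if not name.startswith("encoder.")]
    return list(param_names)


if __name__ == "__main__":
    names = ["encoder.block0.weight", "decoder.block0.weight"]
    for stage in PIPELINE:
        print(stage.name, "->", trainable_params(names, stage))
```

In a real framework the same idea is usually expressed by disabling gradients on the frozen module rather than filtering names.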

Training ran on a whopping 160 A100 GPUs across 20 nodes and processed approximately 100 million image-text pairs. The setup was geared toward feature extraction and token compression, with a focus on improving visual knowledge representation.
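
For a sense of scale, the figures above can be turned into per-node and per-GPU numbers. This is back-of-the-envelope arithmetic only: it assumes GPUs are spread evenly across nodes and that the samples are split evenly across GPUs in a single pass, neither of which the article confirms.

```python
TOTAL_GPUS = 160
NODES = 20
TOTAL_SAMPLES = 100_000_000  # "approximately 100 million" in the article


def gpus_per_node(total_gpus: int, nodes: int) -> int:
    """Even split of GPUs across nodes (assumed)."""
    return total_gpus // nodes


def samples_per_gpu(total_samples: int, total_gpus: int) -> int:
    """Even single-pass split of samples across GPUs (assumed)."""
    return total_samples // total_gpus


if __name__ == "__main__":
    print(gpus_per_node(TOTAL_GPUS, NODES))            # 8
    print(samples_per_gpu(TOTAL_SAMPLES, TOTAL_GPUS))  # 625000
```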

Model Specifications

DeepSeek OCR 2 keeps the previous model's decoder structure: a 3B-parameter Mixture of Experts (MoE) model with approximately 500M active parameters. The model processes up to 1120 visual tokens, a budget comparable to other advanced vision-language models such as Gemini-3-Pro.
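
The gap between total and active parameters is easier to see as a ratio. The short sketch below uses the rounded figures quoted above (3B total, roughly 500M active, 1120 visual tokens); the helper names are hypothetical and the result is only as precise as those rounded inputs.

```python
TOTAL_PARAMS = 3_000_000_000   # 3B, as quoted
ACTIVE_PARAMS = 500_000_000    # approximately 500M, as quoted
MAX_VISUAL_TOKENS = 1120       # maximum visual tokens, as quoted


def active_fraction(active: int, total: int) -> float:
    """Share of parameters used per token in an MoE model."""
    return active / total


def within_token_budget(num_visual_tokens: int) -> bool:
    """Check a visual token count against the quoted maximum."""
    return 0 < num_visual_tokens <= MAX_VISUAL_TOKENS


if __name__ == "__main__":
    print(f"{active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.1%}")  # 16.7%
    print(within_token_budget(1120))  # True
    print(within_token_budget(1121))  # False
```

Roughly one sixth of the weights are active for any given token, which is why a 3B MoE model can run at a cost closer to that of a much smaller dense model.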