IBM Granite-4.0-3B-Vision Streamlines Document Data Extraction


Granite-4.0-3B-Vision is a new vision-language model from IBM Research designed specifically for extracting structured data from documents, charts, and tables. The model converts visual information into machine-readable formats like CSV files, JSON objects, and HTML tables, making it useful for enterprise document processing workflows.

IBM built this model to handle complex extraction tasks that smaller models often struggle with, such as pulling data from charts with unusual layouts or tables spanning multiple columns. The architecture combines a 3.5 billion parameter base language model with 0.5 billion parameter vision adapters, allowing a single deployment to serve both image-based and text-only requests.

What Granite-4.0-3B-Vision can do

  • Converts charts into CSV tables, Python code, or natural language summaries.
  • Extracts tables from document images into JSON, HTML, or OTSL formats.
  • Pulls key-value pairs from documents using custom JSON schemas.
  • Uses simple task tags that expand into full prompts automatically.
  • Supports both multimodal and text-only workloads from one deployment.
  • Works with vLLM for faster inference in production settings.
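
As a rough illustration of what "machine-readable" means downstream, the sketch below parses CSV text of the kind a chart-to-CSV request might return. The sample output and the `parse_chart_csv` helper are hypothetical. They are not taken from IBM's documentation, and real model output may need extra cleanup.

```python
from __future__ import annotations

import csv
import io


def parse_chart_csv(model_output: str) -> list[dict[str, str]]:
    """Parse CSV text from a chart-to-CSV response into a list of row dicts."""
    # Drop Markdown fence lines in case the response is wrapped in a code fence
    # (an assumption about possible output, not documented behavior).
    lines = [
        line
        for line in model_output.strip().splitlines()
        if not line.strip().startswith("```")
    ]
    reader = csv.DictReader(io.StringIO("\n".join(lines)))
    return [dict(row) for row in reader]


# Hypothetical response for a simple bar chart.
sample = "quarter,revenue\nQ1,120\nQ2,135\n"
rows = parse_chart_csv(sample)
print(rows)
```

All values come back as strings, so numeric columns still need an explicit conversion step before analysis.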

Small agencies and professionals who process invoices, reports, or financial documents can use this tool to automate data entry and reduce manual extraction work. The task tag system means users do not need to write long instructions: a short tag in the prompt tells the model exactly what format to return.
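
The tag mechanism can be pictured as a lookup from a short tag to a full instruction. The sketch below is only a mental model: the tag names and prompt texts are invented placeholders, not the model's actual tags, and in the real model the expansion is presumably handled by its prompt template rather than by user code.

```python
# Hypothetical tag names and prompt texts, for illustration only.
TASK_PROMPTS = {
    "<example_chart_csv>": "Convert the chart in the image to a CSV table.",
    "<example_table_json>": "Extract the table in the image as a JSON object.",
}


def expand_task_tag(user_text: str) -> str:
    """Replace a known task tag with its full prompt; pass other text through unchanged."""
    return TASK_PROMPTS.get(user_text.strip(), user_text)


print(expand_task_tag("<example_chart_csv>"))
print(expand_task_tag("Summarize this page."))
```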

Technical details and considerations

The model uses a SigLIP2 vision encoder that processes images in 384×384 patches, with visual features injected at eight different points throughout the language model. IBM trained Granite-4.0-3B-Vision on its Blue Vela supercomputing cluster using NVIDIA H100 GPUs over roughly 200 hours.
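
To get a feel for what 384×384 processing means for a full page, the sketch below counts how many 384-pixel squares are needed to cover an image. It assumes a plain non-overlapping grid. The model's actual preprocessing may resize, pad, or cap the number of regions, so treat the result as a rough estimate.

```python
from __future__ import annotations

import math

TILE = 384  # region edge in pixels, as stated in the article


def tile_grid(width: int, height: int, tile: int = TILE) -> tuple[int, int]:
    """Return (columns, rows) of tile-sized squares needed to cover an image."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    return math.ceil(width / tile), math.ceil(height / tile)


# A US Letter page (8.5 x 11 in) scanned at 200 dpi is 1700 x 2200 pixels.
cols, rows = tile_grid(1700, 2200)
print(cols, rows, cols * rows)  # prints: 5 6 30
```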

However, the team notes important limitations. 'Outputs should be validated before use in automated pipelines, particularly for high-stakes document processing,' IBM warns. The model only supports English instructions and may produce degraded results for documents in other languages. It is also designed for structured extraction tasks and may not generalize well to open-ended vision-language work.
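
One way to act on that warning is to check every extracted record before it enters a pipeline. The sketch below is a minimal example using only the Python standard library. The invoice fields, the sample output, and the `validate_extraction` helper are hypothetical, and a production system would likely use a full JSON Schema validator plus business-rule checks.

```python
from __future__ import annotations

import json


def validate_extraction(raw: str, required: dict[str, type]) -> tuple[dict | None, list[str]]:
    """Parse model output as JSON and check required keys and value types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["expected a JSON object"]
    errors = []
    for key, expected in required.items():
        if key not in data:
            errors.append(f"missing field: {key}")
        elif not isinstance(data[key], expected):
            errors.append(f"wrong type for {key}: expected {expected.__name__}")
    # Only hand the record on when every check passed.
    return (data if not errors else None), errors


# Hypothetical invoice schema and model output: the total arrives as a string.
schema = {"invoice_number": str, "total": float}
record, problems = validate_extraction('{"invoice_number": "A-17", "total": "98.50"}', schema)
print(record, problems)
```

Records that fail can be routed to manual review instead of being silently written to a database.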

Get Granite-4.0-3B-Vision on Hugging Face.