Datalab-to Introduces Lift To Pull Neat Data From Messy Documents

The new release of Lift pulls structured data out of PDFs and images. Users supply a standard JSON schema, and the model generates output that matches it, with decoding constrained to the schema so the result stays valid. It can also read a long document in a single pass and return structured fields such as text, numbers, and lists.
The development team at Datalab-to created this project to help users easily gather information from complex documents. They built a system that supports both local hardware processing and remote server options for better speed and accuracy. The developers also included a web application to let people visually build and test their data formats.
Extracting data from multiple pages
- Pulls structured data from digital documents.
- Accepts any standard JSON schema.
- Handles multi-page documents in one pass.
- Offers local and remote processing options.
- Provides a visual studio for schema testing.
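To make the schema idea concrete, here is a minimal sketch of what "output that matches a JSON schema" means. It does not call Lift at all: the invoice schema, its field names, and the `conforms` helper are hypothetical, and the checker covers only a small subset of JSON Schema (`type`, `properties`, `required`, `items`) using the Python standard library.

```python
import json

# Hypothetical schema for an invoice; the field names are illustrative only.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["vendor", "total", "line_items"],
}

# Mapping from JSON Schema type names to Python types.
_PY_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def conforms(value, schema):
    """Check a value against a small subset of JSON Schema."""
    expected = schema.get("type")
    if expected is not None:
        # bool is a subclass of int in Python, so reject it for numeric types.
        if expected in ("number", "integer") and isinstance(value, bool):
            return False
        if not isinstance(value, _PY_TYPES[expected]):
            return False
    if isinstance(value, dict):
        # Every required key must be present.
        if any(key not in value for key in schema.get("required", [])):
            return False
        # Each declared property that is present must match its sub-schema.
        for key, sub in schema.get("properties", {}).items():
            if key in value and not conforms(value[key], sub):
                return False
    if isinstance(value, list) and "items" in schema:
        return all(conforms(item, schema["items"]) for item in value)
    return True


# Stand-in for model output; a real result would come from the extraction tool.
sample_output = json.loads(
    '{"vendor": "Acme Corp", "total": 1249.5, "line_items": ["Widget", "Shipping"]}'
)
is_valid = conforms(sample_output, INVOICE_SCHEMA)
```

A checker like this only verifies output after the fact. Schema-constrained decoding is a different mechanism that restricts generation itself, so a post-hoc check is mainly useful as a sanity test in a pipeline.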
This tool is built for people who need to turn messy document files into neat data and would rather not rely on cloud services. Running it locally lets individuals process sensitive files entirely on their own hardware. Small teams can also use the command line tools to process entire folders of documents automatically.
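The folder workflow can be sketched in a few lines of Python. This is not the project's command line tool, whose exact commands are not described here. The `extract_folder` function, the `extract_fn` callable, and the list of file extensions are all assumptions made for illustration; `extract_fn` stands in for whatever call actually performs the extraction.

```python
import json
from pathlib import Path

# Assumed set of input types (PDFs and common image formats).
SUPPORTED = {".pdf", ".png", ".jpg", ".jpeg"}


def extract_folder(folder, extract_fn, out_dir):
    """Run extract_fn on each supported file in folder and save results as JSON.

    extract_fn is a hypothetical callable that takes a Path and returns
    a JSON-serializable dict.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for doc in sorted(Path(folder).iterdir()):
        if doc.is_file() and doc.suffix.lower() in SUPPORTED:
            result = extract_fn(doc)
            # Keep the full file name so a.pdf and a.png do not collide.
            target = out_path / f"{doc.name}.json"
            target.write_text(json.dumps(result, indent=2), encoding="utf-8")
            written.append(target)
    return written
```

Because everything runs through a local callable and local paths, nothing in this loop requires the files to leave the machine, which is the point of the local option.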
License details and performance notes
Testing shows this model achieves 90.2 percent accuracy on individual fields across a large benchmark of complex documents. The software code is open source, but the model weights have a special license that restricts competitive commercial use. Startups with less than five million dollars in funding can use the weights for free for their own internal purposes.
"Pass any JSON schema and lift returns a JSON object matching it, using schema-constrained decoding to guarantee valid, well-typed output." Source: Hugging Face