Intelligent Document Processing Pipeline

Async OCR service that extracts structured financial data from scanned documents using computer vision preprocessing and LLM-powered extraction. Replaced manual 2-3 day processing with sub-minute automation.

Technologies

Python, FastAPI, PydanticAI, PostgreSQL, AWS S3, OpenCV, PyMuPDF, Docker

The Problem

A financial services company needed to extract structured data from thousands of bank statements, invoices, and remittance documents daily. The existing process was manual — operators would open each document, read it, and type the data into the system. Processing time: 2-3 days per batch. Error rate: unacceptable.

Previous automation attempts failed because scanned documents varied wildly in quality, format, and layout. Traditional OCR alone couldn’t handle the variance.

The Solution

I designed and built an async document processing service that combines computer vision preprocessing with LLM-powered structured extraction.

Processing pipeline:

  1. Document upload to S3 via async API
  2. Image preprocessing with CLAHE (Contrast Limited Adaptive Histogram Equalization) for scanned documents — dramatically improving OCR accuracy on low-quality scans
  3. Multi-engine text extraction (PyMuPDF for digital PDFs, OpenCV + OCR for scanned images)
  4. LLM-powered structured extraction via PydanticAI — converting raw text into validated, typed financial records
  5. Results stored in PostgreSQL with full audit trail
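
To make step 4 concrete, here is a minimal sketch of the validation half of structured extraction. It uses plain Pydantic rather than PydanticAI and leaves out the LLM call entirely: the JSON string stands in for whatever the model returns. The `Transaction` schema, its field names, and the `parse_transaction` helper are hypothetical, chosen for illustration rather than taken from the production service.

```python
import json
from datetime import date
from decimal import Decimal
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class Transaction(BaseModel):
    """Hypothetical schema for one extracted statement line."""

    posted_on: date                       # ISO date strings are parsed into date objects
    description: str = Field(min_length=1)
    amount: Decimal                       # Decimal avoids float rounding on money
    currency: str = Field(min_length=3, max_length=3)


def parse_transaction(raw_json: str) -> Optional[Transaction]:
    """Return a validated record, or None if the output violates the schema."""
    try:
        return Transaction(**json.loads(raw_json))
    except (json.JSONDecodeError, ValidationError, TypeError):
        # Invalid JSON, a non-object payload, or a schema violation:
        # reject the record instead of storing it.
        return None
```

A record with `"amount": "1250.00"` parses into a typed `Decimal`, while one with `"amount": "twelve"` or a missing field comes back as `None` and can be routed to exception handling.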

Architecture decisions:

  • Task-based async processing instead of synchronous OCR — enables scalable handling of hundreds of concurrent document uploads
  • Pydantic models for output validation: LLM extraction results are validated against strict schemas before storage, so hallucinated or malformed values that violate the schema are rejected at the boundary
  • Multi-format support (PDF, Excel, CSV, scanned images) through a unified pipeline with format-specific preprocessing
  • Health-aware deployment with connection pooling, graceful degradation, and structured logging
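
The task-based approach in the first bullet can be sketched with the standard library alone. This is an illustrative outline under stated assumptions, not the service's actual code: `Job`, `process_document`, and `run_batch` are hypothetical names, and `asyncio.sleep(0)` stands in for the real preprocessing, OCR, and extraction work.

```python
import asyncio
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    """Hypothetical record tracking one uploaded document."""

    document_id: str
    status: str = "queued"


async def process_document(job: Job) -> Job:
    """Placeholder for preprocessing, OCR, and extraction."""
    job.status = "processing"
    await asyncio.sleep(0)  # stands in for real I/O-bound work
    job.status = "done"
    return job


async def run_batch(document_ids: List[str], max_concurrent: int = 10) -> List[Job]:
    """Run one task per document, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(doc_id: str) -> Job:
        # The semaphore caps concurrency so a spike in uploads
        # cannot exhaust connections or memory.
        async with semaphore:
            return await process_document(Job(doc_id))

    tasks = [asyncio.create_task(guarded(d)) for d in document_ids]
    # gather returns results in the same order as the submitted tasks.
    return await asyncio.gather(*tasks)
```

Calling `asyncio.run(run_batch(["a", "b", "c"], max_concurrent=2))` returns three jobs in submission order. The cap on concurrent tasks is the design choice that matters here: uploads are accepted immediately, while the heavy work proceeds at a rate the downstream services can sustain.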

Business Impact

  • Processing time: 2-3 days per batch → under 1 minute per document; the operations team was reallocated from data entry to exception handling
  • Eliminated manual data entry errors that previously caused reconciliation mismatches and delayed financial reporting
  • Scales with the business — handles volume spikes without hiring temporary staff
  • 10+ document formats from multiple financial institutions processed through a single unified pipeline
  • Full audit trail — every extraction decision is traceable, meeting compliance requirements out of the box