Intelligent Document Processing Pipeline
Async OCR service that extracts structured financial data from scanned documents using computer vision preprocessing and LLM-powered extraction. Replaced manual 2-3 day processing with sub-minute automation.
The Problem
A financial services company needed to extract structured data from thousands of bank statements, invoices, and remittance documents daily. The existing process was manual — operators would open each document, read it, and type the data into the system. Processing time: 2-3 days per batch. Error rate: unacceptable.
Previous automation attempts failed because scanned documents varied wildly in quality, format, and layout. Traditional OCR alone couldn’t handle the variance.
The Solution
I designed and built an async document processing service that combines computer vision preprocessing with LLM-powered structured extraction.
Processing pipeline:
- Document upload to S3 via async API
- Image preprocessing with CLAHE (Contrast Limited Adaptive Histogram Equalization) for scanned documents, which boosts local contrast and so improves OCR accuracy on faded or unevenly lit scans
- Multi-engine text extraction (PyMuPDF for digital PDFs, OpenCV + OCR for scanned images)
- LLM-powered structured extraction via PydanticAI — converting raw text into validated, typed financial records
- Results stored in PostgreSQL with full audit trail
Architecture decisions:
- Task-based async processing instead of synchronous OCR — enables scalable handling of hundreds of concurrent document uploads
- Pydantic models for output validation: LLM extraction results are validated against strict schemas before storage, so malformed or schema-violating output (a common symptom of hallucination) is rejected at the boundary
- Multi-format support (PDF, Excel, CSV, scanned images) through a unified pipeline with format-specific preprocessing
- Health-aware deployment with connection pooling, graceful degradation, and structured logging
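The schema-validation decision can be sketched as follows. The `InvoiceRecord` model, its fields, and the `validate_extraction` helper are hypothetical names invented for this example; the real schemas are not shown in this write-up. The sketch uses only plain Pydantic and leaves out the PydanticAI agent that would produce the raw dictionary.

```python
from datetime import date
from decimal import Decimal
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class InvoiceRecord(BaseModel):
    """Hypothetical schema for one extracted invoice."""

    invoice_number: str = Field(min_length=1)
    issue_date: date                                   # must parse as a real calendar date
    currency: str = Field(min_length=3, max_length=3)  # e.g. an ISO 4217 code
    total: Decimal = Field(gt=0)                       # exact decimal, strictly positive


def validate_extraction(raw: dict) -> Optional[InvoiceRecord]:
    """Return a typed record, or None if the LLM output violates the schema."""
    try:
        return InvoiceRecord(**raw)
    except ValidationError:
        # In a real service the error details would be logged for the audit
        # trail and the document routed to manual exception handling.
        return None


good = validate_extraction(
    {"invoice_number": "INV-001", "issue_date": "2024-03-15",
     "currency": "USD", "total": "1250.00"}
)
bad = validate_extraction(
    {"invoice_number": "INV-002", "issue_date": "not a date",
     "currency": "USD", "total": "-5"}
)
```

Here `good` is a typed `InvoiceRecord` and `bad` is `None`. A check like this catches missing fields, wrong types, and out-of-range values; it cannot catch a value that is well-formed but wrong, so it complements exception handling by operators and does not replace it.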
Business Impact
- Processing time: 2-3 days per batch → under 1 minute per document, freeing the operations team to move from data entry to exception handling
- Eliminated manual data entry errors that previously caused reconciliation mismatches and delayed financial reporting
- Scales with the business — handles volume spikes without hiring temporary staff
- 10+ document formats from multiple financial institutions processed through a single unified pipeline
- Full audit trail — every extraction decision is traceable, meeting compliance requirements out of the box