Intelligent Document Processing Pipeline

Async OCR service that extracts structured financial data from scanned documents using computer vision preprocessing and LLM-powered extraction. Replaced manual 2-3 day processing with sub-minute automation.

Technologies

Python, FastAPI, PydanticAI, PostgreSQL, AWS S3, OpenCV, PyMuPDF, Docker

The Problem

A financial services company needed to extract structured data from thousands of bank statements, invoices, and remittance documents daily. The existing process was manual — operators would open each document, read it, and type the data into the system. Processing time: 2-3 days per batch. Error rate: unacceptable.

Previous automation attempts failed because scanned documents varied wildly in quality, format, and layout. Traditional OCR alone couldn’t handle the variance.

The Solution

I designed and built an async document processing service that combines computer vision preprocessing with LLM-powered structured extraction.

Processing pipeline:

1. Document upload to S3 via async API
2. Image preprocessing with CLAHE (Contrast Limited Adaptive Histogram Equalization) for scanned documents, which boosts local contrast and improves OCR accuracy on low-quality scans
3. Multi-engine text extraction (PyMuPDF for digital PDFs, OpenCV + OCR for scanned images)
4. LLM-powered structured extraction via PydanticAI, converting raw text into validated, typed financial records
5. Results stored in PostgreSQL with full audit trail

Architecture decisions:

Task-based async processing instead of synchronous OCR — enables scalable handling of hundreds of concurrent document uploads
Pydantic models validate LLM extraction results against strict schemas before storage, catching malformed or hallucinated values that violate the schema at the boundary
Multi-format support (PDF, Excel, CSV, scanned images) through a unified pipeline with format-specific preprocessing
Health-aware deployment with connection pooling, graceful degradation, and structured logging
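
The task-based async pattern from the first decision above can be sketched with the standard library alone. Everything here is hypothetical and only illustrates the idea: `TaskQueue`, `extract`, the status strings, and the concurrency limit are assumed names and values, and a production service would persist task state (for example in PostgreSQL) instead of keeping it in memory.

```python
import asyncio
import uuid


async def extract(doc_key: str) -> None:
    """Placeholder for download, OCR, and LLM extraction (hypothetical)."""
    if not doc_key:
        raise ValueError("empty document key")
    await asyncio.sleep(0.01)  # stands in for real I/O-bound work


class TaskQueue:
    """Accepts documents immediately and processes them in the background."""

    def __init__(self, max_concurrent: int = 4) -> None:
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self._status: dict[str, str] = {}
        self._tasks: set[asyncio.Task] = set()

    def submit(self, doc_key: str) -> str:
        """Schedule processing and return a task id without waiting for the result."""
        task_id = uuid.uuid4().hex
        self._status[task_id] = "queued"
        task = asyncio.create_task(self._run(task_id, doc_key))
        # Keep a reference so the task is not garbage-collected mid-flight.
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return task_id

    async def _run(self, task_id: str, doc_key: str) -> None:
        async with self._semaphore:  # cap the number of documents in flight
            self._status[task_id] = "processing"
            try:
                await extract(doc_key)
                self._status[task_id] = "done"
            except Exception:
                self._status[task_id] = "failed"

    def status(self, task_id: str) -> str:
        return self._status.get(task_id, "unknown")

    async def drain(self) -> None:
        """Wait for all currently scheduled tasks to finish."""
        await asyncio.gather(*list(self._tasks))
```

In an API setting, an upload handler would call `submit` and return the task id right away, while a separate status endpoint would call `status`. The semaphore keeps a spike of uploads from overwhelming the OCR and LLM stages.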

Business Impact

Processing time: 2-3 days per batch → under 1 minute per document, freeing the operations team to move from data entry to exception handling
Eliminated manual data entry errors that previously caused reconciliation mismatches and delayed financial reporting
Scales with the business — handles volume spikes without hiring temporary staff
10+ document formats from multiple financial institutions processed through a single unified pipeline
Full audit trail — every extraction decision is traceable, meeting compliance requirements out of the box
