AI Document Processing Pipeline for Insurance
End-to-end automated insurance document processing system with NLP, reducing claim processing time from 3-5 days to <10 minutes
Problem
A project for a mid-sized insurance company that was drowning in the manual processing of insurance claim documents. The bottleneck was a familiar one. The company received data in a chaotic mix of formats:
- Dozens of Excel files from agents, each with its own data model
- PDF documents: policies, medical reports, inspection reports
- Photos of damage and phone-camera scans of documents
A team of roughly eight employees spent 3-5 days processing a single complex case: they had to manually combine data from different sources, bring it to a unified format, and only then validate the case and begin analysis. These delays inevitably hurt the speed and quality of work on simpler claims as well.
Solution
I developed an automated document processing system that works in three stages:
Stage 1: Import and Normalization
The system automatically determines the document type and selects the correct processing method. Excel and CSV files are processed with standard data parsing. Scanned PDFs and photos first go through OCR to extract text.
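A minimal sketch of how this routing could look. The `ocr_utils` import is a hypothetical placeholder for the OCR step (Tesseract, a cloud OCR service, or a vision-capable model); the real pipeline's extraction stack may differ.

```python
from pathlib import Path

import pandas as pd

# Hypothetical helper, not a real library: wraps whatever OCR backend is used.
from ocr_utils import extract_text_from_scan


def ingest_document(path: str) -> dict:
    """Route an incoming file to the cheapest adequate extraction method."""
    suffix = Path(path).suffix.lower()

    if suffix in {".xlsx", ".xls"}:
        # Structured sources: plain parsing, no LLM involved.
        sheets = pd.read_excel(path, sheet_name=None)  # dict of all sheets
        return {"kind": "tabular", "payload": sheets}

    if suffix == ".csv":
        return {"kind": "tabular", "payload": {"data": pd.read_csv(path)}}

    if suffix in {".pdf", ".jpg", ".jpeg", ".png"}:
        # Scans and photos: OCR first, structuring happens in Stage 2.
        return {"kind": "unstructured", "payload": extract_text_from_scan(path)}

    raise ValueError(f"Unsupported document type: {suffix}")
```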
Stage 2: Cleaning and Structuring
A micro-pipeline built around Schema-Guided Reasoning analyzes the extracted data and brings it to a unified format: it normalizes column names, infers data types, and identifies actual or likely connections between data from different sources. Each inference carries a confidence score, which makes the output easier for the company's specialists to review and build on.
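For illustration, a sketch of the Schema-Guided Reasoning step using Pydantic models and an OpenAI-style structured-output client. The field names, schema, and model choice are examples, not the production setup.

```python
from pydantic import BaseModel, Field
from openai import OpenAI


class ColumnMapping(BaseModel):
    """Maps one raw source column onto the unified claim schema."""
    source_column: str
    canonical_name: str      # e.g. "policy_number", "claim_amount"
    inferred_type: str       # "string" | "number" | "date" | ...
    confidence: float = Field(ge=0.0, le=1.0)


class CrossSourceLink(BaseModel):
    """A suspected join between records coming from different documents."""
    left_source: str
    right_source: str
    join_key: str
    confidence: float = Field(ge=0.0, le=1.0)


class NormalizationResult(BaseModel):
    mappings: list[ColumnMapping]
    links: list[CrossSourceLink]
    notes: list[str] = Field(default_factory=list)  # anything the model was unsure about


def normalize_schema(raw_headers: dict[str, list[str]]) -> NormalizationResult:
    """Let the LLM reason about raw headers, but only inside the schema above."""
    client = OpenAI()
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Map raw insurance document columns onto the unified claim schema."},
            {"role": "user", "content": str(raw_headers)},
        ],
        response_format=NormalizationResult,
    )
    return completion.choices[0].message.parsed
```

Constraining the model to a typed schema is what keeps this stage predictable: the LLM reasons freely, but the output is always machine-checkable and carries explicit confidence values.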
Stage 3: Validation and Storage
Data is checked against the company’s internal checklist, structured into typed models, and saved to the database with a full audit trail. If validation fails, the case is sent back to the earlier stages along with the detected issues (a feedback loop with a retry threshold).
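A simplified sketch of that feedback loop. `structure_documents`, `run_checklist`, and the storage helpers are hypothetical stand-ins for the real pipeline components; only the control flow is the point here.

```python
# Hypothetical pipeline components; names are placeholders for the real modules.
from pipeline import (
    ingest_document,         # Stage 1 (see the routing sketch above)
    structure_documents,     # Stage 2: Schema-Guided Reasoning step
    run_checklist,           # the company's internal validation rules
    save_with_audit_trail,   # typed models -> database, with provenance
    flag_for_manual_review,  # hand-off to a human specialist
)

MAX_RETRIES = 3  # repetition threshold for the feedback loop


def process_case(case_files: list[str]) -> None:
    """Run the three stages with a bounded correction loop."""
    documents = [ingest_document(p) for p in case_files]   # Stage 1
    structured = structure_documents(documents)            # Stage 2

    issues: list[str] = []
    for _ in range(MAX_RETRIES):
        issues = run_checklist(structured)                  # Stage 3: validation
        if not issues:
            save_with_audit_trail(structured, sources=case_files)
            return
        # Feed the validation issues back into Stage 2 so it can correct itself.
        structured = structure_documents(documents, feedback=issues)

    # Threshold reached: stop looping and hand the case to a person.
    flag_for_manual_review(structured, issues)
```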
The system provides a clear interface and an API so analysts can work further with the processed data.
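As a rough illustration, the analyst-facing API can stay very small. The framework, routes, and helpers below are hypothetical choices, not the actual service.

```python
from fastapi import FastAPI, UploadFile

# Hypothetical storage/queue/database helpers; the real service layer will differ.
from service import save_upload, enqueue_processing, load_case_from_db

app = FastAPI(title="Claims intake API")  # illustrative name


@app.post("/cases/{case_id}/documents")
async def upload_document(case_id: str, file: UploadFile):
    """Accept a raw document and queue it for the three-stage pipeline."""
    stored_path = await save_upload(case_id, file)
    enqueue_processing(case_id, stored_path)
    return {"status": "queued", "case_id": case_id}


@app.get("/cases/{case_id}")
async def get_case(case_id: str):
    """Return the normalized, validated case data for analysts."""
    return load_case_from_db(case_id)
```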
The case is interesting because most of the documents are, in one way or another, structured data in Excel, and fast, cheap processing methods handle them well.
The LLM is involved only where it is genuinely needed: for structured understanding of data schemas, for handling complex cases, and for extracting data where transparent heuristic methods are not feasible, namely scans and photos of documents. Strict accuracy requirements were not imposed on the latter, because both the pipeline output and the original documents are ultimately reviewed by human specialists.
Even so, the company’s throughput increased: the time to prepare a case for analysis dropped from several days to under 10 minutes.
After refining the pipeline, the analysis accuracy reached a more than satisfactory level. Employees were finally able to focus fully on analytical and client work, and the constant firefighting inside the company came to an end.