A production-ready document intelligence platform that automates invoice and receipt data extraction using OCR and Large Language Models. The system processes document images asynchronously, extracts structured data (merchant, date, total, line items), and provides both REST API and web UI interfaces for viewing results. Fully containerized with Docker Compose for easy deployment and scaling.
Architecture Overview
The platform follows a microservices architecture with clear separation of concerns:
1. Frontend Layer (React + Nginx)
React Application: Modern single-page application for uploading documents and viewing extraction results
Nginx Web Server: Serves static React build files and handles reverse proxying
User Interface: File upload component, receipt status checker, and results display with extracted fields
2. Backend API (FastAPI)
RESTful API: FastAPI-based backend providing endpoints for:
POST /receipts/upload: Upload invoice/receipt images (multipart/form-data)
GET /receipts/{receipt_id}: Retrieve processing status and extracted data
GET /health: Health check endpoint for monitoring
Async Request Handling: Non-blocking I/O for concurrent uploads
File Management: Temporary storage and validation of uploaded images
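The upload validation step could look roughly like the sketch below. It is not the project's actual code: the function name validate_upload, the acceptance of ".jpeg" alongside ".jpg", and the check of leading signature bytes are all assumptions made for illustration.

```python
from pathlib import Path

# Assumption: PNG and JPEG only, matching the formats named in this document
# (".jpeg" is added here as a common alias for ".jpg").
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg"}

# Leading signature bytes that identify each image format.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"


def validate_upload(filename: str, content: bytes) -> str:
    """Return the detected image type ("png" or "jpeg") or raise ValueError."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file extension: {suffix or '(none)'}")

    if content.startswith(PNG_SIGNATURE):
        detected = "png"
    elif content.startswith(JPEG_SIGNATURE):
        detected = "jpeg"
    else:
        raise ValueError("file content is not a PNG or JPEG image")

    # Reject files whose extension disagrees with their actual content.
    expected = "png" if suffix == ".png" else "jpeg"
    if detected != expected:
        raise ValueError(f"extension {suffix} does not match {detected} content")
    return detected
```

Checking the content as well as the extension means a renamed non-image file is rejected before it reaches the OCR worker.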
3. Async Processing Pipeline (Celery + Redis)
Task Queue: Redis as message broker for Celery tasks
Worker Processes: Celery workers handle CPU-intensive OCR and LLM processing
Status Tracking: Real-time status updates (pending → processing → processed/error)
Scalability: Horizontal scaling by adding more Celery workers
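The status lifecycle can be modelled as a small state machine. The sketch below is illustrative only: the names ReceiptStatus, ALLOWED_TRANSITIONS, and advance are hypothetical, and the success state is called "processed" to match the API examples in this document.

```python
from enum import Enum


class ReceiptStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    PROCESSED = "processed"
    ERROR = "error"


# Which states each state may move to; terminal states have no successors.
ALLOWED_TRANSITIONS = {
    ReceiptStatus.PENDING: {ReceiptStatus.PROCESSING, ReceiptStatus.ERROR},
    ReceiptStatus.PROCESSING: {ReceiptStatus.PROCESSED, ReceiptStatus.ERROR},
    ReceiptStatus.PROCESSED: set(),
    ReceiptStatus.ERROR: set(),
}


def advance(current: ReceiptStatus, new: ReceiptStatus) -> ReceiptStatus:
    """Return the new status if the transition is legal, else raise ValueError."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new
```

Guarding transitions in one place keeps a late or retried worker task from moving a finished receipt back to an earlier state.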
4. OCR Processing (Tesseract)
Text Extraction: Tesseract OCR engine extracts raw text from invoice images
Image Preprocessing: Pillow library for image optimization (resize, contrast, grayscale)
Language Support: Configurable OCR language models
Error Handling: Robust error handling for low-quality or corrupted images
5. AI Data Extraction (LLM)
Dual Mode Support:
Cloud Mode: OpenAI GPT models (GPT-3.5, GPT-4) via API
Local Mode: Ollama for on-premise LLM inference (e.g., Llama 3, Mistral)
Structured Extraction: LLM parses OCR text to extract:
Merchant/Vendor name
Transaction date
Total amount
Line items (optional)
Tax information
Prompt Engineering: Optimized prompts for consistent JSON output
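As a rough illustration of the extraction step, the sketch below builds a prompt from OCR text and parses the model's reply into a typed record. The prompt wording, the JSON key names, and the names Extraction, build_prompt, and parse_llm_reply are assumptions, not the project's actual prompt or schema. The call to OpenAI or Ollama is deliberately left out.

```python
import json
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical prompt; the real project prompt is not shown in this document.
PROMPT_TEMPLATE = (
    "Extract the following fields from the receipt text and reply with JSON only, "
    'using the keys "merchant", "date" (YYYY-MM-DD), "total", "tax", and '
    '"line_items" (a list of objects with "description" and "amount"). '
    "Use null for any field that is not present.\n\nReceipt text:\n{ocr_text}"
)


@dataclass
class Extraction:
    merchant: Optional[str]
    date: Optional[str]
    total: Optional[float]
    tax: Optional[float] = None
    line_items: list = field(default_factory=list)


def build_prompt(ocr_text: str) -> str:
    """Insert the OCR text into the prompt template."""
    return PROMPT_TEMPLATE.format(ocr_text=ocr_text.strip())


def parse_llm_reply(reply: str) -> Extraction:
    """Parse the first-to-last brace span of the reply as a JSON object."""
    # Models sometimes add prose or a code fence around the JSON; keep only the object.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in LLM reply")
    data = json.loads(reply[start:end + 1])

    def to_float(value):
        # Accept numbers or numeric strings; keep missing values as None.
        return None if value is None else float(value)

    return Extraction(
        merchant=data.get("merchant"),
        date=data.get("date"),
        total=to_float(data.get("total")),
        tax=to_float(data.get("tax")),
        line_items=data.get("line_items") or [],
    )
```

Parsing defensively matters most in local mode, where smaller models are more likely to surround the JSON with extra text.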
6. Data Persistence (PostgreSQL)
Database Schema: SQLAlchemy ORM models for:
Receipt metadata (ID, upload timestamp, status)
Original file paths
OCR-extracted text (full text)
LLM-extracted structured fields (JSON)
Migrations: Alembic for database schema versioning and migrations
Query Optimization: Indexed fields for fast status lookups
Technology Stack
Backend & Processing
FastAPI: Modern Python web framework with automatic OpenAPI documentation
Celery: Distributed task queue for asynchronous job processing
Redis: In-memory data store serving as Celery broker and result backend
PostgreSQL: Production-grade relational database with ACID compliance
SQLAlchemy: Python ORM for database abstraction and query building
Alembic: Database migration tool for schema management
OCR & Image Processing
Tesseract OCR: Open-source OCR engine (via pytesseract Python wrapper)
Pillow (PIL): Python imaging library for image preprocessing and manipulation
AI/ML
OpenAI API: Cloud-based LLM service (GPT-3.5, GPT-4) for data extraction
Ollama: Local LLM server supporting models like Llama 3, Mistral, CodeLlama
LangChain (potential): Framework for chaining LLM operations and prompt management
Frontend
React: Component-based UI library for building interactive interfaces
Nginx: High-performance web server and reverse proxy
Infrastructure
Docker: Containerization for consistent deployment across environments
Docker Compose: Multi-container orchestration for all services
Environment Variables: Secure configuration management via .env files
How It Works
Processing Workflow
Upload Phase:
User uploads invoice image via React frontend
FastAPI receives file, validates format (.png, .jpg)
File stored temporarily, receipt record created in PostgreSQL with status "pending"
Receipt ID returned to user
OCR Phase (Celery Task):
Celery worker picks up OCR task from Redis queue
Image preprocessed (resize, enhance contrast if needed)
Tesseract OCR extracts all text from image
Raw OCR text stored in database
Status updated to "processing"
AI Extraction Phase (Celery Task):
OCR text passed to LLM (OpenAI or Ollama based on configuration)
LLM prompt instructs extraction of structured fields
LLM returns JSON with merchant, date, total, etc.
Extracted data stored in PostgreSQL
Status updated to "processed"
Retrieval Phase:
User queries receipt status via UI or API
FastAPI retrieves data from PostgreSQL
Results displayed with extracted fields formatted
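The OCR and extraction phases above can be sketched as a single function. This is a simplified stand-in, not the project's Celery task code: the receipt is a plain dict instead of a database row, and run_ocr and run_llm are hypothetical callables standing in for the Tesseract and LLM steps.

```python
from typing import Callable


def process_receipt(
    receipt: dict,
    run_ocr: Callable[[str], str],
    run_llm: Callable[[str], dict],
) -> dict:
    """Run OCR, then LLM extraction, recording the status after each phase."""
    try:
        # OCR phase: extract raw text, store it, then mark the receipt as processing.
        receipt["ocr_text"] = run_ocr(receipt["file_path"])
        receipt["status"] = "processing"

        # AI extraction phase: turn the raw text into structured fields.
        receipt["extracted"] = run_llm(receipt["ocr_text"])
        receipt["status"] = "processed"
    except Exception as exc:
        # Any failure in either phase ends in the "error" state with a message.
        receipt["status"] = "error"
        receipt["error"] = str(exc)
    return receipt
```

Passing the OCR and LLM steps in as callables makes the flow easy to exercise with stubs, without Tesseract, Redis, or an API key.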
Configuration & Setup
Environment Variables (.env)
# Database Configuration
DATABASE_URL=postgresql://postgres:postgres@db:5432/invoice_db
# Celery Configuration
CELERY_BROKER_URL=redis://redis:6379/0
CELERY_RESULT_BACKEND=redis://redis:6379/0
# LLM Configuration
# Options: 'cloud' or 'local'
LLM_MODE=cloud
# OpenAI (if LLM_MODE=cloud)
OPENAI_API_KEY=your_openai_api_key_here
# Ollama (if LLM_MODE=local)
OLLAMA_BASE_URL=http://host.docker.internal:11434
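One way the backend might read these variables at startup is sketched below. The Settings class, the load_settings function, and the default of "cloud" when LLM_MODE is unset are assumptions for illustration; the project may load its configuration differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    database_url: str
    celery_broker_url: str
    celery_result_backend: str
    llm_mode: str
    openai_api_key: Optional[str] = None
    ollama_base_url: Optional[str] = None


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, failing fast on bad config."""
    # Assumption: default to cloud mode when LLM_MODE is not set.
    mode = env.get("LLM_MODE", "cloud").strip().lower()
    if mode not in ("cloud", "local"):
        raise ValueError(f"LLM_MODE must be 'cloud' or 'local', got {mode!r}")
    if mode == "cloud" and not env.get("OPENAI_API_KEY"):
        raise ValueError("OPENAI_API_KEY is required when LLM_MODE=cloud")
    if mode == "local" and not env.get("OLLAMA_BASE_URL"):
        raise ValueError("OLLAMA_BASE_URL is required when LLM_MODE=local")

    return Settings(
        database_url=env["DATABASE_URL"],
        celery_broker_url=env["CELERY_BROKER_URL"],
        celery_result_backend=env["CELERY_RESULT_BACKEND"],
        llm_mode=mode,
        openai_api_key=env.get("OPENAI_API_KEY"),
        ollama_base_url=env.get("OLLAMA_BASE_URL"),
    )
```

Validating the mode-specific variables at startup surfaces a missing key immediately instead of on the first extraction task.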
Deployment
# Build and start all services
docker-compose build
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down
Potential Improvements & Enhancements
1. Advanced OCR Models
Upgrade to Cloud OCR Services:
Google Cloud Vision API: Higher accuracy, supports 200+ languages, handwriting recognition
AWS Textract: Specialized for document analysis, table extraction, form understanding
Azure Form Recognizer (now Azure AI Document Intelligence): Pre-built invoice models, automatic field extraction
Benefits: Better accuracy on low-quality images, table extraction, multi-language support
2. Enhanced LLM Integration
Structured Output Models:
Use OpenAI's Structured Outputs (schema-constrained responses) or JSON mode for consistent parsing
Fine-tune models on invoice-specific datasets for better accuracy
Specialized Models:
Anthropic Claude: Better at following complex extraction instructions
Local Fine-tuned Models: Train on invoice datasets for domain-specific performance
Multi-step Extraction:
First pass: Extract basic fields (merchant, date, total)
Second pass: Extract line items, tax breakdown, payment terms
3. Document Understanding Frameworks
LangChain Document Loaders: Integrate with document processing pipelines
LlamaIndex: Build RAG (Retrieval-Augmented Generation) for context-aware extraction
Haystack: End-to-end NLP framework for document QA and extraction
4. Advanced Features
Multi-format Support: PDF parsing (pypdf, the successor to PyPDF2, or pdfplumber), scanned documents
Batch Processing: Upload multiple invoices, process in parallel
Export Functionality: Export to CSV, Excel, JSON, or accounting software formats (QuickBooks, Xero)
Validation & Verification: Cross-check extracted data, flag anomalies
Duplicate Detection: Identify duplicate invoices using fuzzy matching
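Duplicate detection with fuzzy matching could start from something like the sketch below, which uses the standard library's difflib. The function names, the 0.85 similarity threshold, and the rule of requiring an identical date and total are assumptions chosen for illustration, not a tuned policy.

```python
from difflib import SequenceMatcher


def merchant_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score, ignoring case and extra whitespace."""
    def normalize(name: str) -> str:
        return " ".join(name.lower().split())

    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_probable_duplicate(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Flag two receipts as duplicates when date and total match and merchants are similar."""
    same_date = a["date"] == b["date"]
    # Compare totals with a small tolerance to avoid floating-point noise.
    same_total = abs(a["total"] - b["total"]) < 0.005
    similar_merchant = merchant_similarity(a["merchant"], b["merchant"]) >= threshold
    return same_date and same_total and similar_merchant
```

Fuzzy matching on the merchant name tolerates OCR noise such as stray spaces or case differences, while the exact date and total checks keep false positives down.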
5. Performance Optimizations
Caching: Redis cache for frequently accessed receipts
Image Optimization: Compress images before OCR to reduce processing time
Parallel Processing: Process multiple receipts concurrently with worker pools
CDN Integration: Serve frontend assets via CDN for faster load times
6. Monitoring & Observability
Logging: Structured logging with correlation IDs for request tracing
Metrics: Prometheus metrics for task queue depth, processing times, error rates
Alerting: Set up alerts for failed extractions, queue backups
Dashboard: Grafana dashboard for real-time system monitoring
7. Security Enhancements
Authentication: JWT-based auth for API endpoints
File Validation: Virus scanning, file type verification
Data Encryption: Encrypt sensitive invoice data at rest
Rate Limiting: Prevent abuse with request rate limits
8. User Experience
Real-time Updates: WebSocket/SSE for live status updates
Progress Indicators: Show OCR and LLM processing progress
Error Recovery: Retry failed extractions, manual correction interface
Bulk Operations: Upload folder of invoices, process all at once
Use Cases
Accounting Automation: Automate expense report processing
AP/AR Management: Streamline accounts payable and receivable workflows
Tax Preparation: Extract invoice data for tax filing
Expense Tracking: Personal or business expense categorization
Audit Trail: Maintain searchable database of all invoices
API Examples
# Upload invoice
curl -X POST http://localhost:8000/receipts/upload \
-F "file=@invoice.jpg"
# Response: {"receipt_id": "abc123", "status": "pending"}
# Check status
curl http://localhost:8000/receipts/abc123
# Response:
# {
# "receipt_id": "abc123",
# "status": "processed",
# "merchant": "Amazon",
# "date": "2024-01-15",
# "total": 129.99
# }
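The same status check can be scripted. The following sketch polls the GET /receipts/{receipt_id} endpoint shown above until the receipt reaches a final state. It uses only the Python standard library; the function names, the polling interval, and the timeout are assumptions, and "processed" and "error" are treated as the final states.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"
TERMINAL_STATUSES = {"processed", "error"}


def fetch_receipt(receipt_id: str, base_url: str = BASE_URL) -> dict:
    """GET /receipts/{receipt_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/receipts/{receipt_id}") as response:
        return json.load(response)


def wait_for_result(receipt_id, fetch=fetch_receipt, interval=2.0, timeout=60.0, sleep=time.sleep):
    """Poll until the receipt reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        receipt = fetch(receipt_id)
        if receipt.get("status") in TERMINAL_STATUSES:
            return receipt
        if time.monotonic() >= deadline:
            raise TimeoutError(f"receipt {receipt_id} still {receipt.get('status')!r}")
        sleep(interval)
```

With the services running, wait_for_result("abc123") would return the final JSON for that receipt. The fetch and sleep parameters exist so the loop can be tested without a live server.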