Invoice Analysis & Export Platform
Personal Project
Project Details
A production-ready document intelligence platform that automates data extraction from invoices and receipts using OCR and large language models (LLMs). The system processes document images asynchronously, extracts structured data (merchant, date, total, line items), and exposes the results through both a REST API and a web UI. All services are containerized with Docker Compose for straightforward deployment and scaling.
Architecture Overview
The platform follows a microservices architecture with clear separation of concerns:
1. Frontend Layer (React + Nginx)
- React Application: Modern single-page application for uploading documents and viewing extraction results
- Nginx Web Server: Serves static React build files and handles reverse proxying
- User Interface: File upload component, receipt status checker, and results display with extracted fields
2. Backend API (FastAPI)
- RESTful API: FastAPI-based backend providing endpoints for:
  - POST /receipts/upload: Upload invoice/receipt images (multipart/form-data)
  - GET /receipts/{receipt_id}: Retrieve processing status and extracted data
  - GET /health: Health check endpoint for monitoring
- Async Request Handling: Non-blocking I/O for concurrent uploads
- File Management: Temporary storage and validation of uploaded images
3. Async Processing Pipeline (Celery + Redis)
- Task Queue: Redis as message broker for Celery tasks
- Worker Processes: Celery workers handle CPU-intensive OCR and LLM processing
- Status Tracking: Real-time status updates (pending → processing → processed/error)
- Scalability: Horizontal scaling by adding more Celery workers
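The status lifecycle above can be sketched as a small state machine. This is a minimal illustration, not the project's actual code: the status names follow the workflow and API example in this document (pending, processing, processed, error), while `ReceiptStatus`, `ALLOWED_TRANSITIONS`, and `advance` are hypothetical names introduced here.

```python
from enum import Enum


class ReceiptStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    PROCESSED = "processed"
    ERROR = "error"


# Hypothetical transition table: each status maps to the statuses it may move to.
ALLOWED_TRANSITIONS = {
    ReceiptStatus.PENDING: {ReceiptStatus.PROCESSING, ReceiptStatus.ERROR},
    ReceiptStatus.PROCESSING: {ReceiptStatus.PROCESSED, ReceiptStatus.ERROR},
    ReceiptStatus.PROCESSED: set(),  # terminal
    ReceiptStatus.ERROR: set(),      # terminal
}


def advance(current: ReceiptStatus, new: ReceiptStatus) -> ReceiptStatus:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new
```

A worker would call `advance` before writing a status change, so a late or repeated task cannot move a finished receipt back to an earlier state.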
4. OCR Processing (Tesseract)
- Text Extraction: Tesseract OCR engine extracts raw text from invoice images
- Image Preprocessing: Pillow library for image optimization (resize, contrast, grayscale)
- Language Support: Configurable OCR language models
- Error Handling: Robust error handling for low-quality or corrupted images
5. AI Data Extraction (LLM)
- Dual Mode Support:
  - Cloud Mode: OpenAI GPT models (GPT-3.5, GPT-4) via API
  - Local Mode: Ollama for on-premise LLM inference (e.g., Llama 3, Mistral)
- Structured Extraction: LLM parses OCR text to extract:
  - Merchant/Vendor name
  - Transaction date
  - Total amount
  - Line items (optional)
  - Tax information
- Prompt Engineering: Optimized prompts for consistent JSON output
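Even with a prompt tuned for JSON, model replies can arrive wrapped in prose or with amounts formatted as strings, so the reply usually needs defensive parsing. The sketch below shows one way to do that with the standard library. The key names (`merchant`, `date`, `total`, `tax`, `line_items`) and the `ExtractedReceipt` and `parse_llm_reply` names are assumptions for illustration, not the project's confirmed schema.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExtractedReceipt:
    merchant: Optional[str]
    date: Optional[str]
    total: Optional[float]
    tax: Optional[float] = None
    line_items: list = field(default_factory=list)


def _to_float(value):
    # Accept numbers or strings such as "$1,299.50"; None stays None.
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return float(value)
    return float(str(value).replace("$", "").replace(",", "").strip())


def parse_llm_reply(reply: str) -> ExtractedReceipt:
    """Parse an LLM reply that should contain exactly one JSON object."""
    # Keep only the span from the first "{" to the last "}" so that
    # surrounding prose or code fences are ignored.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in LLM reply")
    data = json.loads(reply[start : end + 1])
    return ExtractedReceipt(
        merchant=data.get("merchant"),
        date=data.get("date"),
        total=_to_float(data.get("total")),
        tax=_to_float(data.get("tax")),
        line_items=data.get("line_items") or [],
    )
```

Malformed JSON raises `json.JSONDecodeError`, a subclass of `ValueError`, so a caller can catch `ValueError` once and mark the receipt as failed.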
6. Data Persistence (PostgreSQL)
- Database Schema: SQLAlchemy ORM models for:
  - Receipt metadata (ID, upload timestamp, status)
  - Original file paths
  - OCR-extracted text (full text)
  - LLM-extracted structured fields (JSON)
- Migrations: Alembic for database schema versioning and migrations
- Query Optimization: Indexed fields for fast status lookups
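To make the storage layout concrete, here is a rough sketch of a receipts table with an index on the status column. The project itself uses SQLAlchemy models on PostgreSQL; this example swaps in the standard-library `sqlite3` module purely so it runs without dependencies. The table name, column names, and helper functions are assumptions.

```python
import json
import sqlite3

# Hypothetical schema mirroring the fields listed above.
SCHEMA = """
CREATE TABLE receipts (
    id          TEXT PRIMARY KEY,
    uploaded_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    status      TEXT NOT NULL DEFAULT 'pending',
    file_path   TEXT NOT NULL,
    ocr_text    TEXT,
    extracted   TEXT
);
CREATE INDEX ix_receipts_status ON receipts (status);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and create the receipts table and status index."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_extraction(conn: sqlite3.Connection, receipt_id: str, fields: dict) -> None:
    """Store the LLM-extracted fields as JSON text and mark the receipt processed."""
    conn.execute(
        "UPDATE receipts SET extracted = ?, status = 'processed' WHERE id = ?",
        (json.dumps(fields), receipt_id),
    )
    conn.commit()
```

On PostgreSQL the `extracted` column would more naturally be a JSON or JSONB column rather than text.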
Technology Stack
Backend & Processing
- FastAPI: Modern Python web framework with automatic OpenAPI documentation
- Celery: Distributed task queue for asynchronous job processing
- Redis: In-memory data store serving as Celery broker and result backend
- PostgreSQL: Production-grade relational database with ACID compliance
- SQLAlchemy: Python ORM for database abstraction and query building
- Alembic: Database migration tool for schema management
OCR & Image Processing
- Tesseract OCR: Open-source OCR engine (via pytesseract Python wrapper)
- Pillow (PIL): Python imaging library for image preprocessing and manipulation
AI/ML
- OpenAI API: Cloud-based LLM service (GPT-3.5, GPT-4) for data extraction
- Ollama: Local LLM server supporting models like Llama 3, Mistral, CodeLlama
- LangChain (potential): Framework for chaining LLM operations and prompt management
Frontend
- React: Component-based UI library for building interactive interfaces
- Nginx: High-performance web server and reverse proxy
Infrastructure
- Docker: Containerization for consistent deployment across environments
- Docker Compose: Multi-container orchestration for all services
- Environment Variables: Secure configuration management via .env files
How It Works
Processing Workflow
Upload Phase:
- User uploads invoice image via React frontend
- FastAPI receives file, validates format (.png, .jpg)
- File stored temporarily, receipt record created in PostgreSQL with status "pending"
- Receipt ID returned to user
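The format check in the upload phase could look roughly like the following. It is a sketch under assumptions: the document only states that .png and .jpg are accepted, so the additional check of the file's leading bytes and the `validate_upload` name are illustrative choices rather than the project's confirmed behavior.

```python
from pathlib import Path

# Extensions named in the upload phase above.
ALLOWED_EXTENSIONS = {".png", ".jpg"}

# Standard file signatures for PNG and JPEG data.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"


def validate_upload(filename: str, content: bytes) -> str:
    """Return "png" or "jpg" if the upload looks valid, else raise ValueError."""
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported file extension: {ext or '(none)'}")
    if content.startswith(PNG_MAGIC):
        kind = "png"
    elif content.startswith(JPEG_MAGIC):
        kind = "jpg"
    else:
        raise ValueError("file content is not a PNG or JPEG image")
    if ext != f".{kind}":
        raise ValueError("file extension does not match file content")
    return kind
```

Checking the content as well as the extension rejects renamed files before they reach the OCR worker.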
OCR Phase (Celery Task):
- Celery worker picks up OCR task from Redis queue
- Image preprocessed (resize, enhance contrast if needed)
- Tesseract OCR extracts all text from image
- Raw OCR text stored in database
- Status updated to "processing"
AI Extraction Phase (Celery Task):
- OCR text passed to LLM (OpenAI or Ollama based on configuration)
- LLM prompt instructs extraction of structured fields
- LLM returns JSON with merchant, date, total, etc.
- Extracted data stored in PostgreSQL
- Status updated to "processed"
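The prompt step in this phase might be assembled along these lines. The wording of the template, the field list, and the `build_prompt` helper are all assumptions made for illustration; the project's real prompt is not shown in this document.

```python
# Hypothetical field list; adjust to whatever the extraction schema requires.
FIELDS = ("merchant", "date", "total", "tax", "line_items")

PROMPT_TEMPLATE = (
    "You extract data from receipts and invoices.\n"
    "Return ONLY a JSON object with these keys: {keys}.\n"
    "Use null for any value you cannot find. Format dates as YYYY-MM-DD.\n"
    "\n"
    "OCR text:\n"
    "<<<\n"
    "{ocr_text}\n"
    ">>>\n"
)


def build_prompt(ocr_text: str, max_chars: int = 4000) -> str:
    """Build the extraction prompt, truncating very long OCR output."""
    # The OCR text is passed as a format argument, so braces inside it
    # are inserted literally and cannot break the template.
    trimmed = ocr_text.strip()[:max_chars]
    return PROMPT_TEMPLATE.format(keys=", ".join(FIELDS), ocr_text=trimmed)
```

The same prompt string can be sent to either backend, which keeps the cloud and local modes interchangeable.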
Retrieval Phase:
- User queries receipt status via UI or API
- FastAPI retrieves data from PostgreSQL
- Results are displayed in the UI with the extracted fields formatted for reading
Configuration & Setup
Environment Variables (.env)
# Database Configuration
DATABASE_URL=postgresql://postgres:postgres@db:5432/invoice_db
# Celery Configuration
CELERY_BROKER_URL=redis://redis:6379/0
CELERY_RESULT_BACKEND=redis://redis:6379/0
# LLM Configuration
# Options: 'cloud' or 'local'
LLM_MODE=cloud
# OpenAI (if LLM_MODE=cloud)
OPENAI_API_KEY=your_openai_api_key_here
# Ollama (if LLM_MODE=local)
OLLAMA_BASE_URL=http://host.docker.internal:11434
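On the application side, these variables could be read and validated at startup roughly as follows. This is a sketch, not the project's code: `LLMSettings` and `load_llm_settings` are hypothetical names, and falling back to "cloud" when LLM_MODE is unset is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMSettings:
    mode: str
    openai_api_key: str = ""
    ollama_base_url: str = ""


def load_llm_settings(env=None) -> LLMSettings:
    """Read LLM settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Assumption: default to cloud mode when LLM_MODE is not set.
    mode = env.get("LLM_MODE", "cloud").strip().lower()
    if mode == "cloud":
        key = env.get("OPENAI_API_KEY", "")
        if not key:
            raise ValueError("LLM_MODE=cloud requires OPENAI_API_KEY")
        return LLMSettings(mode=mode, openai_api_key=key)
    if mode == "local":
        url = env.get("OLLAMA_BASE_URL", "http://host.docker.internal:11434")
        return LLMSettings(mode=mode, ollama_base_url=url.rstrip("/"))
    raise ValueError(f"LLM_MODE must be 'cloud' or 'local', got {mode!r}")
```

Failing fast on a missing key or an unknown mode surfaces configuration mistakes when the container starts instead of on the first extraction task.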
Deployment
# Build and start all services
docker-compose build
docker-compose up -d
# View logs
docker-compose logs -f
# Stop services
docker-compose down
Potential Improvements & Enhancements
1. Advanced OCR Models
- Upgrade to Cloud OCR Services:
  - Google Cloud Vision API: Higher accuracy, broad language coverage, handwriting recognition
  - AWS Textract: Specialized for document analysis, table extraction, form understanding
  - Azure AI Document Intelligence (formerly Form Recognizer): Pre-built invoice models, automatic field extraction
- Benefits: Better accuracy on low-quality images, table extraction, multi-language support
2. Enhanced LLM Integration
- Structured Output Models:
  - Use OpenAI's JSON mode or Structured Outputs (schema-constrained responses) for consistent parsing
  - Fine-tune models on invoice-specific datasets for better accuracy
- Specialized Models:
  - Anthropic Claude: Better at following complex extraction instructions
  - Local Fine-tuned Models: Train on invoice datasets for domain-specific performance
- Multi-step Extraction:
  - First pass: Extract basic fields (merchant, date, total)
  - Second pass: Extract line items, tax breakdown, payment terms
3. Document Understanding Frameworks
- LangChain Document Loaders: Integrate with document processing pipelines
- LlamaIndex: Build RAG (Retrieval-Augmented Generation) for context-aware extraction
- Haystack: End-to-end NLP framework for document QA and extraction
4. Advanced Features
- Multi-format Support: PDF parsing (pypdf, the successor to PyPDF2, or pdfplumber), scanned documents
- Batch Processing: Upload multiple invoices, process in parallel
- Export Functionality: Export to CSV, Excel, JSON, or accounting software formats (QuickBooks, Xero)
- Validation & Verification: Cross-check extracted data, flag anomalies
- Duplicate Detection: Identify duplicate invoices using fuzzy matching
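As one possible starting point for the duplicate detection idea, the sketch below compares two extracted receipts using the standard library's `difflib`. The rule (same date, same total, similar merchant name), the 0.85 threshold, and the function names are assumptions chosen for illustration and would need tuning on real data.

```python
from difflib import SequenceMatcher


def merchant_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two merchant names, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_probable_duplicate(new: dict, existing: dict, threshold: float = 0.85) -> bool:
    """Flag two receipts as likely duplicates.

    Assumed rule: identical date, totals equal to the cent, and merchant
    names whose similarity ratio reaches the threshold.
    """
    same_date = new["date"] == existing["date"]
    same_total = abs(float(new["total"]) - float(existing["total"])) < 0.005
    similar_name = merchant_similarity(new["merchant"], existing["merchant"]) >= threshold
    return same_date and same_total and similar_name
```

Fuzzy matching on the merchant name tolerates small OCR differences, while the exact date and total checks keep unrelated purchases from the same vendor apart.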
5. Performance Optimizations
- Caching: Redis cache for frequently accessed receipts
- Image Optimization: Compress images before OCR to reduce processing time
- Parallel Processing: Process multiple receipts concurrently with worker pools
- CDN Integration: Serve frontend assets via CDN for faster load times
6. Monitoring & Observability
- Logging: Structured logging with correlation IDs for request tracing
- Metrics: Prometheus metrics for task queue depth, processing times, error rates
- Alerting: Set up alerts for failed extractions, queue backups
- Dashboard: Grafana dashboard for real-time system monitoring
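Structured logging with correlation IDs can be prototyped with the standard `logging` module alone. In this sketch, `JsonFormatter`, `get_task_logger`, and the logger name "receipts" are hypothetical; using the receipt ID as the correlation ID is likewise an assumption.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set by the LoggerAdapter below; None if logged without one.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def get_task_logger(correlation_id: str) -> logging.LoggerAdapter:
    """Return a logger that stamps every record with the correlation ID."""
    logger = logging.getLogger("receipts")
    if not logger.handlers:  # configure the handler only once
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logging.LoggerAdapter(logger, {"correlation_id": correlation_id})
```

For example, `get_task_logger("abc123").info("ocr finished")` emits one JSON line that can be filtered by receipt across the API and worker logs.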
7. Security Enhancements
- Authentication: JWT-based auth for API endpoints
- File Validation: Virus scanning, file type verification
- Data Encryption: Encrypt sensitive invoice data at rest
- Rate Limiting: Prevent abuse with request rate limits
8. User Experience
- Real-time Updates: WebSocket/SSE for live status updates
- Progress Indicators: Show OCR and LLM processing progress
- Error Recovery: Retry failed extractions, manual correction interface
- Bulk Operations: Upload folder of invoices, process all at once
Use Cases
- Accounting Automation: Automate expense report processing
- AP/AR Management: Streamline accounts payable and receivable workflows
- Tax Preparation: Extract invoice data for tax filing
- Expense Tracking: Personal or business expense categorization
- Audit Trail: Maintain searchable database of all invoices
API Examples
# Upload invoice
curl -X POST http://localhost:8000/receipts/upload \
-F "file=@invoice.jpg"
# Response: {"receipt_id": "abc123", "status": "pending"}
# Check status
curl http://localhost:8000/receipts/abc123
# Response:
# {
# "receipt_id": "abc123",
# "status": "processed",
# "merchant": "Amazon",
# "date": "2024-01-15",
# "total": 129.99
# }
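Because processing is asynchronous, a client normally polls the status endpoint until the receipt reaches a final state. The following sketch does this with the standard library only. It assumes the base URL and response shape from the curl examples above and treats "processed" and "error" as the final statuses; `fetch_receipt` and `wait_for_receipt` are hypothetical helper names.

```python
import json
import time
import urllib.request

BASE_URL = "http://localhost:8000"  # same host as the curl examples


def fetch_receipt(receipt_id: str) -> dict:
    """GET /receipts/{receipt_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/receipts/{receipt_id}") as resp:
        return json.load(resp)


def wait_for_receipt(receipt_id, fetch=fetch_receipt, interval=1.0,
                     timeout=60.0, sleep=time.sleep) -> dict:
    """Poll until the receipt is processed or failed, or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        receipt = fetch(receipt_id)
        if receipt.get("status") in ("processed", "error"):
            return receipt
        if time.monotonic() >= deadline:
            raise TimeoutError(f"receipt {receipt_id} not finished after {timeout}s")
        sleep(interval)
```

The `fetch` and `sleep` parameters exist so the loop can be exercised without a running server, for instance by passing a function that returns canned responses.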