Personal Project

Invoice Analysis & Export Platform

A production-ready document intelligence platform that automates invoice and receipt data extraction using OCR and Large Language Models (LLMs). The system processes document images asynchronously, extracts structured data (merchant, date, total, line items), and exposes the results through both a REST API and a web UI. It is fully containerized with Docker Compose for straightforward deployment and scaling.

Architecture Overview

The platform follows a microservices architecture with clear separation of concerns:

1. Frontend Layer (React + Nginx)

React Application: Modern single-page application for uploading documents and viewing extraction results
Nginx Web Server: Serves static React build files and handles reverse proxying
User Interface: File upload component, receipt status checker, and results display with extracted fields

2. Backend API (FastAPI)

RESTful API: FastAPI-based backend providing endpoints for:
- POST /receipts/upload: Upload invoice/receipt images (multipart/form-data)
- GET /receipts/{receipt_id}: Retrieve processing status and extracted data
- GET /health: Health check endpoint for monitoring
Async Request Handling: Non-blocking I/O for concurrent uploads
File Management: Temporary storage and validation of uploaded images
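
The upload validation step can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function names, the 10 MB size cap, and the acceptance of `.jpeg` alongside `.jpg` are assumptions. It checks the extension and then the leading magic bytes, so a renamed non-image file is rejected.

```python
from pathlib import Path
from typing import Optional

# Magic-byte prefixes for the two image formats the upload step accepts.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"

# Accepting ".jpeg" in addition to ".jpg" is an assumption.
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def detect_image_type(data: bytes) -> Optional[str]:
    """Return 'png' or 'jpeg' based on the leading bytes, else None."""
    if data.startswith(PNG_MAGIC):
        return "png"
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    return None


def validate_upload(filename: str, data: bytes, max_bytes: int = 10 * 1024 * 1024) -> str:
    """Validate an uploaded file and return its detected image type.

    Raises ValueError with a readable message when the file is rejected.
    The 10 MB default limit is a hypothetical value.
    """
    ext = Path(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported extension: {ext or '(none)'}")
    if len(data) > max_bytes:
        raise ValueError("file too large")
    kind = detect_image_type(data)
    if kind is None:
        raise ValueError("content is not a PNG or JPEG image")
    expected = "png" if ext == ".png" else "jpeg"
    if kind != expected:
        raise ValueError(f"extension {ext} does not match {kind} content")
    return kind
```

In a FastAPI handler, a `ValueError` raised here would typically be translated into an HTTP 400 response.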

3. Async Processing Pipeline (Celery + Redis)

Task Queue: Redis as message broker for Celery tasks
Worker Processes: Celery workers run the OCR and LLM steps outside the request cycle, so slow jobs do not block the API
Status Tracking: Status is updated as a receipt moves through the pipeline (pending → processing → processed, or error on failure)
Scalability: Horizontal scaling by adding more Celery workers
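
One way to keep status tracking consistent is a small state machine that rejects illegal transitions. The sketch below is an assumption about how this could be modeled, not the project's code. The status names follow the processing workflow and API example (`pending`, `processing`, `processed`), plus `error` for failures; `ReceiptStatus` and `advance` are hypothetical names.

```python
from enum import Enum


class ReceiptStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    PROCESSED = "processed"
    ERROR = "error"


# Allowed forward transitions; terminal states map to an empty set.
ALLOWED_TRANSITIONS = {
    ReceiptStatus.PENDING: {ReceiptStatus.PROCESSING, ReceiptStatus.ERROR},
    ReceiptStatus.PROCESSING: {ReceiptStatus.PROCESSED, ReceiptStatus.ERROR},
    ReceiptStatus.PROCESSED: set(),
    ReceiptStatus.ERROR: set(),
}


def advance(current: ReceiptStatus, new: ReceiptStatus) -> ReceiptStatus:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {new.value}")
    return new
```

A worker would call `advance` before writing the new status to the database, so a late or duplicated task cannot move a finished receipt back to an earlier state.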

4. OCR Processing (Tesseract)

Text Extraction: Tesseract OCR engine extracts raw text from invoice images
Image Preprocessing: Pillow library for image optimization (resize, contrast, grayscale)
Language Support: Configurable OCR language models
Error Handling: Robust error handling for low-quality or corrupted images

5. AI Data Extraction (LLM)

Dual Mode Support:
- Cloud Mode: OpenAI GPT models (GPT-3.5, GPT-4) via API
- Local Mode: Ollama for on-premise LLM inference (e.g., Llama 3, Mistral)
Structured Extraction: LLM parses OCR text to extract:
- Merchant/Vendor name
- Transaction date
- Total amount
- Line items (optional)
- Tax information
Prompt Engineering: Optimized prompts for consistent JSON output
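
The prompt and parsing side of structured extraction might look like the following. The prompt wording, the field names (`merchant`, `date`, `total`, `tax`, `line_items`), and both function names are illustrative assumptions; the actual prompts used by the project are not shown in this write-up. The parser tolerates extra prose around the JSON object, which LLMs sometimes add.

```python
import json
import re

# Fields treated as mandatory in this sketch (an assumption).
REQUIRED_FIELDS = ("merchant", "date", "total")

PROMPT_TEMPLATE = (
    "Extract the following fields from the receipt text and reply with a "
    "single JSON object only: merchant (string), date (YYYY-MM-DD), "
    "total (number), tax (number or null), line_items (list, may be empty).\n\n"
    "Receipt text:\n{ocr_text}"
)


def build_prompt(ocr_text: str) -> str:
    """Insert the OCR text into the extraction prompt."""
    return PROMPT_TEMPLATE.format(ocr_text=ocr_text.strip())


def parse_llm_json(reply: str) -> dict:
    """Pull the JSON object out of an LLM reply and check the required keys."""
    # Greedy match from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    data = json.loads(match.group(0))
    missing = [key for key in REQUIRED_FIELDS if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # Normalize the total to a float; a non-numeric value raises ValueError.
    data["total"] = float(data["total"])
    return data
```

Because `json.JSONDecodeError` is a subclass of `ValueError`, callers can treat every failure mode here as a single exception type and mark the receipt as errored.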

6. Data Persistence (PostgreSQL)

Database Schema: SQLAlchemy ORM models for:
- Receipt metadata (ID, upload timestamp, status)
- Original file paths
- OCR-extracted text (full text)
- LLM-extracted structured fields (JSON)
Migrations: Alembic for database schema versioning and migrations
Query Optimization: Indexed fields for fast status lookups
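
The shape of the schema can be illustrated with the standard-library `sqlite3` module. This is a stand-in for demonstration only: the project itself uses PostgreSQL with SQLAlchemy models and Alembic migrations, and the table name, column names, and helper functions below are assumptions.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE receipts (
    id          TEXT PRIMARY KEY,
    uploaded_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    status      TEXT NOT NULL DEFAULT 'pending',
    file_path   TEXT NOT NULL,
    ocr_text    TEXT,
    extracted   TEXT  -- JSON document with the LLM-extracted fields
);
-- Index the status column so status lookups avoid a full table scan.
CREATE INDEX ix_receipts_status ON receipts (status);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and create the receipts table and its index."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def save_extraction(conn: sqlite3.Connection, receipt_id: str, fields: dict) -> None:
    """Store the extracted fields as JSON and mark the receipt as processed."""
    conn.execute(
        "UPDATE receipts SET extracted = ?, status = 'processed' WHERE id = ?",
        (json.dumps(fields), receipt_id),
    )
    conn.commit()
```

In PostgreSQL the `extracted` column would more naturally be a `JSONB` column, which allows querying inside the extracted fields.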

Technology Stack

Backend & Processing

FastAPI: Modern Python web framework with automatic OpenAPI documentation
Celery: Distributed task queue for asynchronous job processing
Redis: In-memory data store serving as Celery broker and result backend
PostgreSQL: Production-grade relational database with ACID compliance
SQLAlchemy: Python ORM for database abstraction and query building
Alembic: Database migration tool for schema management

OCR & Image Processing

Tesseract OCR: Open-source OCR engine (via pytesseract Python wrapper)
Pillow (PIL): Python imaging library for image preprocessing and manipulation

AI/ML

OpenAI API: Cloud-based LLM service (GPT-3.5, GPT-4) for data extraction
Ollama: Local LLM server supporting models like Llama 3, Mistral, CodeLlama
LangChain (potential): Framework for chaining LLM operations and prompt management

Frontend

React: Component-based UI library for building interactive interfaces
Nginx: High-performance web server and reverse proxy

Infrastructure

Docker: Containerization for consistent deployment across environments
Docker Compose: Multi-container orchestration for all services
Environment Variables: Secure configuration management via .env files

How It Works

Processing Workflow

Upload Phase:
- User uploads invoice image via React frontend
- FastAPI receives file, validates format (.png, .jpg)
- File stored temporarily, receipt record created in PostgreSQL with status "pending"
- Receipt ID returned to user
OCR Phase (Celery Task):
- Celery worker picks up OCR task from Redis queue
- Image preprocessed (resize, enhance contrast if needed)
- Tesseract OCR extracts all text from image
- Raw OCR text stored in database
- Status updated to "processing"
AI Extraction Phase (Celery Task):
- OCR text passed to LLM (OpenAI or Ollama based on configuration)
- LLM prompt instructs extraction of structured fields
- LLM returns JSON with merchant, date, total, etc.
- Extracted data stored in PostgreSQL
- Status updated to "processed"
Retrieval Phase:
- User queries receipt status via UI or API
- FastAPI retrieves data from PostgreSQL
- Results are displayed with the extracted fields formatted for reading
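
The OCR and AI extraction phases can be summarized as one function that takes the two processing steps as parameters. This is a simplified sketch under stated assumptions: the real system runs these phases as Celery tasks and persists to PostgreSQL, whereas here the record is a plain dictionary and `run_ocr` and `run_llm` are hypothetical callables standing in for Tesseract and the LLM client.

```python
from typing import Callable, Dict


def process_receipt(
    record: Dict,
    run_ocr: Callable[[str], str],
    run_llm: Callable[[str], Dict],
) -> Dict:
    """Run the OCR and extraction phases on one receipt record, in place."""
    try:
        # OCR phase: store the raw text, then mark the receipt as processing.
        record["ocr_text"] = run_ocr(record["file_path"])
        record["status"] = "processing"
        # AI extraction phase: store structured fields, then mark as processed.
        record["extracted"] = run_llm(record["ocr_text"])
        record["status"] = "processed"
    except Exception as exc:  # any failure marks the receipt as errored
        record["status"] = "error"
        record["error"] = str(exc)
    return record
```

Passing the steps in as callables keeps the orchestration testable without Tesseract or an LLM available, and makes the cloud or local choice a matter of which `run_llm` is supplied.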

Configuration & Setup

Environment Variables (`.env`)

# Database Configuration
DATABASE_URL=postgresql://postgres:postgres@db:5432/invoice_db

# Celery Configuration
CELERY_BROKER_URL=redis://redis:6379/0
CELERY_RESULT_BACKEND=redis://redis:6379/0

# LLM Configuration
# Options: 'cloud' or 'local'
LLM_MODE=cloud

# OpenAI (if LLM_MODE=cloud)
OPENAI_API_KEY=your_openai_api_key_here

# Ollama (if LLM_MODE=local)
OLLAMA_BASE_URL=http://host.docker.internal:11434
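
A loader that reads these variables and fails fast on an inconsistent configuration could look like this. The variable names come from the `.env` listing above; the `Settings` class, the `load_settings` function, and the fallback to `cloud` when `LLM_MODE` is unset are assumptions made for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    database_url: str
    celery_broker_url: str
    celery_result_backend: str
    llm_mode: str
    openai_api_key: Optional[str]
    ollama_base_url: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, validating the LLM mode."""
    # Defaulting to "cloud" when LLM_MODE is unset is an assumption.
    mode = env.get("LLM_MODE", "cloud").strip().lower()
    if mode not in ("cloud", "local"):
        raise ValueError(f"LLM_MODE must be 'cloud' or 'local', got {mode!r}")
    if mode == "cloud" and not env.get("OPENAI_API_KEY"):
        raise ValueError("OPENAI_API_KEY is required when LLM_MODE=cloud")
    if mode == "local" and not env.get("OLLAMA_BASE_URL"):
        raise ValueError("OLLAMA_BASE_URL is required when LLM_MODE=local")
    return Settings(
        database_url=env["DATABASE_URL"],
        celery_broker_url=env["CELERY_BROKER_URL"],
        celery_result_backend=env["CELERY_RESULT_BACKEND"],
        llm_mode=mode,
        openai_api_key=env.get("OPENAI_API_KEY"),
        ollama_base_url=env.get("OLLAMA_BASE_URL"),
    )
```

Calling `load_settings()` once at startup surfaces a missing key immediately instead of on the first extraction task.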

Deployment

# Build and start all services
docker-compose build
docker-compose up -d

# View logs
docker-compose logs -f

# Stop services
docker-compose down

Potential Improvements & Enhancements

1. Advanced OCR Models

Upgrade to Cloud OCR Services:
- Google Cloud Vision API: Higher accuracy, supports 200+ languages, handwriting recognition
- AWS Textract: Specialized for document analysis, table extraction, form understanding
- Azure AI Document Intelligence (formerly Form Recognizer): Pre-built invoice models, automatic field extraction
Benefits: Better accuracy on low-quality images, table extraction, multi-language support

2. Enhanced LLM Integration

Structured Output Models:
- Use OpenAI's Structured Outputs (or the older JSON mode) for consistent parsing
- Fine-tune models on invoice-specific datasets for better accuracy
Specialized Models:
- Anthropic Claude: Better at following complex extraction instructions
- Local Fine-tuned Models: Train on invoice datasets for domain-specific performance
Multi-step Extraction:
- First pass: Extract basic fields (merchant, date, total)
- Second pass: Extract line items, tax breakdown, payment terms

3. Document Understanding Frameworks

LangChain Document Loaders: Integrate with document processing pipelines
LlamaIndex: Build RAG (Retrieval-Augmented Generation) for context-aware extraction
Haystack: End-to-end NLP framework for document QA and extraction

4. Advanced Features

Multi-format Support: PDF parsing (pypdf, the successor to PyPDF2, or pdfplumber), scanned documents
Batch Processing: Upload multiple invoices, process in parallel
Export Functionality: Export to CSV, Excel, JSON, or accounting software formats (QuickBooks, Xero)
Validation & Verification: Cross-check extracted data, flag anomalies
Duplicate Detection: Identify duplicate invoices using fuzzy matching
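
Duplicate detection with fuzzy matching could start from something as simple as the standard library's `difflib`. This sketch is one possible approach, not an existing feature: the 0.85 similarity threshold, the field names, and the rule that date and total must match exactly are all assumptions.

```python
from difflib import SequenceMatcher


def merchant_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two merchant names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def is_probable_duplicate(new: dict, existing: dict, threshold: float = 0.85) -> bool:
    """Flag two receipts as likely duplicates.

    Requires the same date, the same total (to within half a cent), and a
    merchant name similarity at or above the threshold.
    """
    same_date = new["date"] == existing["date"]
    same_total = abs(float(new["total"]) - float(existing["total"])) < 0.005
    similar = merchant_similarity(new["merchant"], existing["merchant"]) >= threshold
    return same_date and same_total and similar
```

Fuzzy matching on the merchant name absorbs small OCR differences such as a dropped period, while the exact checks on date and total keep the false-positive rate low.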

5. Performance Optimizations

Caching: Redis cache for frequently accessed receipts
Image Optimization: Compress images before OCR to reduce processing time
Parallel Processing: Process multiple receipts concurrently with worker pools
CDN Integration: Serve frontend assets via CDN for faster load times

6. Monitoring & Observability

Logging: Structured logging with correlation IDs for request tracing
Metrics: Prometheus metrics for task queue depth, processing times, error rates
Alerting: Set up alerts for failed extractions, queue backups
Dashboard: Grafana dashboard for real-time system monitoring
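
Structured logging with correlation IDs can be built on the standard `logging` module. The sketch below is illustrative: the logger name `invoice_platform`, the JSON field names, and the helper `get_request_logger` are assumptions. Each request gets a `LoggerAdapter` that stamps its correlation ID on every log line.

```python
import json
import logging
import uuid
from typing import Optional


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Set by the LoggerAdapter below; None for records without one.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(payload)


def get_request_logger(correlation_id: Optional[str] = None) -> logging.LoggerAdapter:
    """Return a logger that attaches a correlation ID to every record."""
    logger = logging.getLogger("invoice_platform")
    if not logger.handlers:  # configure the shared logger only once
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    extra = {"correlation_id": correlation_id or uuid.uuid4().hex}
    return logging.LoggerAdapter(logger, extra)
```

Passing the same correlation ID from the API handler into the Celery task arguments would let one upload be traced across the API and worker logs.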

7. Security Enhancements

Authentication: JWT-based auth for API endpoints
File Validation: Virus scanning, file type verification
Data Encryption: Encrypt sensitive invoice data at rest
Rate Limiting: Prevent abuse with request rate limits
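
Rate limiting is often implemented as a token bucket. The following is a generic in-process sketch of that technique, not something the project currently includes; the class name and parameters are hypothetical. A production deployment with several API replicas would more likely keep the counters in Redis.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: int, clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        """Consume one token if available and report whether the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill based on elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Keeping one bucket per client (for example keyed by API token or IP address) and returning HTTP 429 when `allow()` is false gives basic abuse protection on the upload endpoint.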

8. User Experience

Real-time Updates: WebSocket/SSE for live status updates
Progress Indicators: Show OCR and LLM processing progress
Error Recovery: Retry failed extractions, manual correction interface
Bulk Operations: Upload folder of invoices, process all at once
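
Retrying failed extractions is commonly done with exponential backoff. The helper below is a generic sketch with hypothetical names and defaults; Celery also offers built-in task retry options that would be the more natural fit inside a worker.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    task: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `task` until it succeeds, doubling the delay after each failure.

    Re-raises the last exception once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay
    raise ValueError("attempts must be at least 1")
```

Wrapping only the LLM call (for example `retry(lambda: run_llm(text))`, where `run_llm` is a placeholder) avoids repeating the OCR step when the failure is a transient API timeout.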

Use Cases

Accounting Automation: Automate expense report processing
AP/AR Management: Streamline accounts payable and receivable workflows
Tax Preparation: Extract invoice data for tax filing
Expense Tracking: Personal or business expense categorization
Audit Trail: Maintain searchable database of all invoices

API Examples

# Upload invoice
curl -X POST http://localhost:8000/receipts/upload \
  -F "file=@invoice.jpg"

# Response: {"receipt_id": "abc123", "status": "pending"}

# Check status
curl http://localhost:8000/receipts/abc123

# Response:
# {
#   "receipt_id": "abc123",
#   "status": "processed",
#   "merchant": "Amazon",
#   "date": "2024-01-15",
#   "total": 129.99
# }
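
The same status check can be scripted in Python by polling until the receipt reaches a final state. This is a client-side sketch using only the standard library: the base URL and endpoint come from the curl examples above, while the function names, the polling interval, the poll limit, and the assumption that `processed` and `error` are the terminal statuses are illustrative.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000"
TERMINAL_STATUSES = {"processed", "error"}  # assumed final states


def fetch_receipt(receipt_id: str) -> dict:
    """GET /receipts/{receipt_id} and decode the JSON body."""
    with urllib.request.urlopen(f"{BASE_URL}/receipts/{receipt_id}") as resp:
        return json.load(resp)


def wait_for_receipt(
    receipt_id: str,
    fetch: Callable[[str], dict] = fetch_receipt,
    interval: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll the receipt until it is processed or errored, then return it."""
    for _ in range(max_polls):
        receipt = fetch(receipt_id)
        if receipt.get("status") in TERMINAL_STATUSES:
            return receipt
        sleep(interval)
    raise TimeoutError(f"receipt {receipt_id} not finished after {max_polls} polls")
```

With the services running, `wait_for_receipt("abc123")` would block until extraction finishes and then return the same JSON shown in the response above.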
