AI Data Extractor | Deepak Kumar

Built a full-stack, web-based GenAI tool for a financial company to automate data extraction from a variety of complex documents. Users upload PDFs, images, or DOCX files; an AI-powered pipeline extracts key fields (details, tax breakdown, etc.) and saves the results to a consolidated spreadsheet. To help protect privacy, uploaded files are not stored: data is processed in memory and held only temporarily. The site is fully responsive across mobile, tablet, and desktop. The project emphasizes cost-effectiveness, and work on new features and further cost optimization is ongoing.
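
As a rough illustration of the "consolidated spreadsheet, held only in memory" idea, the sketch below merges per-document rows into a single sheet without writing anything to disk. It is not the project's code: the project lists SheetJS for Excel generation in the browser, while this stand-in uses CSV from the Python standard library, and the function name `rows_to_csv` and the field names in the usage note are hypothetical.

```python
import csv
import io


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize extracted rows into one CSV document held entirely in memory."""
    # Union of all keys in first-seen order, so documents that yield
    # different fields still fit into the same sheet.
    columns = list(dict.fromkeys(key for row in rows for key in row))

    # StringIO keeps the output in memory; no file is created on disk.
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)  # fields missing from a row are left blank
    return buffer.getvalue()
```

For example, `rows_to_csv([{"file": "a.pdf", "total": 10.0}, {"file": "b.png", "tax": 1.5}])` produces a sheet with the columns `file`, `total`, and `tax`, with blank cells where a document did not yield a given field.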

What it does

Accepts uploaded documents or images (PDFs, PNGs, scanned figures)
Runs OCR to extract raw text and identify structured regions
Uses a GenAI model to interpret and reformat extracted content into structured output (tables, key-value pairs, numerical data)
Returns machine-readable output ready for further analysis
Offers limited free access
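
The interpretation step in the list above can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's implementation: it assumes the GenAI model has been prompted to reply with a single JSON object, and the field names in `EXPECTED_FIELDS`, along with the helpers `to_number` and `parse_model_reply`, are hypothetical.

```python
import json
import re

# Hypothetical schema; the real field list is not given in this document.
EXPECTED_FIELDS = ("invoice_number", "date", "subtotal", "tax", "total")
NUMERIC_FIELDS = ("subtotal", "tax", "total")


def to_number(value):
    """Coerce values such as '$1,234.50' to a float; return None if not possible."""
    if value is None:
        return None
    if isinstance(value, (int, float)):
        return float(value)
    # Drop currency symbols, thousands separators, and whitespace.
    cleaned = re.sub(r"[^\d.\-]", "", str(value))
    try:
        return float(cleaned)
    except ValueError:
        return None


def parse_model_reply(reply: str) -> dict:
    """Turn a model reply (assumed to contain one JSON object) into a flat row."""
    # Take the span from the first '{' to the last '}', which also discards
    # any Markdown fence or commentary the model wrapped around the JSON.
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(match.group(0))

    # Keep only the expected keys; fields the model omitted become None.
    row = {field: data.get(field) for field in EXPECTED_FIELDS}
    for field in NUMERIC_FIELDS:
        row[field] = to_number(row[field])
    return row
```

Normalizing every reply to the same set of keys is one way to keep the output machine-readable even when documents differ in layout; fields the model could not find simply come back as `None`.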

Why it’s useful

For researchers dealing with legacy data locked in paper figures, or anyone trying to aggregate information from heterogeneous document sources, manual extraction is the bottleneck. This pipeline reduces that bottleneck without requiring domain-specific training for each document type.

Stack

Python · OCR (Tesseract / EasyOCR) · GenAI API · Pandas · Vanilla HTML/CSS/JavaScript · Cloudflare Workers (serverless edge backend) · Cloudflare KV (user + session store) · Google Gemini Flash (vision OCR) · pdf.js (client-side PDF to image conversion) · Mammoth.js (DOCX text extraction) · SheetJS (Excel generation in the browser) · Wrangler CLI (Cloudflare local dev + deploy)

Links