Hackathon · Data Engineering · ML/AI

Adobe Hackathon 2025 — PDF Intelligence Pipeline

Enterprise-grade PDF intelligence system with semantic search, multilingual support, and clause-level extraction. Round-2 finalist among 4,000+ teams.

Python · TF-IDF · FastAPI · Docker · FAISS · Pandas


01 — Problem

The Challenge

Organizations that process large volumes of PDF documents (contracts, research papers, compliance documents) face a critical bottleneck: finding specific information across thousands of pages is slow, error-prone, and tied to whichever language the search tool was built for. Standard keyword search misses semantic context, and most tools cannot handle multilingual documents.

In regulated industries like finance and legal, failure to locate a specific clause or condition in a document can result in compliance failures and significant financial risk. The problem is not just search — it is intelligent, reliable, structured extraction at scale.

02 — Approach

How I Approached It

I built a document intelligence pipeline that pairs TF-IDF vectorization with a FAISS vector index for fast approximate nearest-neighbor retrieval. TF-IDF weights each term by how distinctive it is across the corpus, so ranking goes beyond raw keyword matching, although it remains a lexical signal rather than a learned embedding. The system parses PDFs into structured chunks, applies multilingual preprocessing, indexes the chunks, and returns clause-level results ranked by relevance. It is deployed as a containerized FastAPI service for consistent, offline-capable operation.

Architecture

01 PDF ingestion layer: parse raw documents into structured JSON chunks
02 Text preprocessing: language detection, normalization, stop-word filtering
03 TF-IDF vectorization with FAISS indexing for sub-100ms query response
04 FastAPI REST interface exposing search and extraction endpoints
05 Docker containerization for environment-agnostic deployment
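
The vectorize-and-rank step in the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's code: the function names (`tokenize`, `build_index`, `vectorize`, `search`), the smoothed IDF formula, and the three sample clauses are all hypothetical, and an exhaustive scan stands in for the FAISS index.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Simplified tokenizer for the English toy data below. Scripts with
    # combining marks (for example Devanagari) need a different tokenizer.
    return re.findall(r"\w+", text.lower())

def vectorize(tokens, idf):
    # Term frequency times IDF, then L2-normalize so that a dot product
    # between two vectors equals their cosine similarity.
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def build_index(chunks):
    tokenized = [tokenize(c) for c in chunks]
    n = len(chunks)
    # Document frequency: in how many chunks each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant) keeps every weight positive.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return idf, [vectorize(tokens, idf) for tokens in tokenized]

def search(query, chunks, idf, vectors, top_k=3):
    q = vectorize(tokenize(query), idf)
    scored = []
    for i, vec in enumerate(vectors):
        score = sum(w * vec.get(t, 0.0) for t, w in q.items())
        if score > 0:
            scored.append((score, i))
    # Highest score first; ties broken by original chunk order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(chunks[i], round(score, 3)) for score, i in scored[:top_k]]

# Hypothetical clause-level chunks, standing in for parsed PDF output.
CHUNKS = [
    "The tenant shall pay rent on the first day of each month.",
    "Either party may terminate this agreement with thirty days notice.",
    "The landlord is responsible for structural repairs.",
]
IDF, VECTORS = build_index(CHUNKS)
RESULTS = search("terminate agreement notice", CHUNKS, IDF, VECTORS)
```

Here `RESULTS` holds only the termination clause, because the other two chunks share no terms with the query. In a FAISS-backed version the same normalized vectors would be stored as dense arrays and the scoring loop replaced by an index lookup.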

03 — Technology

Technology Choices and Why

Python

Mature PDF processing ecosystem (PyMuPDF, pdfplumber) and ML tooling

TF-IDF + FAISS

TF-IDF captures term importance; FAISS enables fast vector search at scale without GPU dependency

FastAPI

Asynchronous Python framework; auto-generates OpenAPI docs; production-grade performance

Docker

Guarantees reproducible deployment across environments; required for offline enterprise use

Pandas

Efficient tabular manipulation of extracted clause metadata and search result ranking

04 — Challenges

Obstacles and Solutions

Multilingual document handling

Integrated language-aware tokenization; tested against English, Hindi, and mixed documents; normalized Unicode characters before vectorization
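
A rough sketch of what Unicode normalization before vectorization might look like, using only the standard library. The helper names `normalize_text` and `script_counts` are hypothetical, and the choice of NFKC plus case folding is an assumption rather than the documented behavior of the pipeline.

```python
import re
import unicodedata
from collections import Counter

def normalize_text(text):
    # NFKC folds compatibility forms (full-width letters, ligatures) and
    # maps canonically equivalent sequences to one representation.
    text = unicodedata.normalize("NFKC", text)
    # Remove zero-width space and byte-order mark. Zero-width joiners are
    # left alone because they can be meaningful in Indic scripts.
    text = re.sub("[\u200b\ufeff]", "", text)
    # Collapse runs of whitespace left over from PDF extraction.
    text = re.sub(r"\s+", " ", text).strip()
    # casefold() is a more aggressive lower() intended for matching.
    return text.casefold()

def script_counts(text):
    # Crude script detection: the first word of each character's Unicode
    # name is its script for letters and marks, e.g. LATIN or DEVANAGARI.
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts

CLEANED = normalize_text("\uff26\uff21\uff29\uff33\uff33  Index\u200b")
MIXED = script_counts("rent किराया")
```

Without this step, two visually identical strings (for example a precomposed Devanagari letter and its base-plus-nukta sequence) would become different TF-IDF terms and silently miss each other at query time.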

Search latency on large corpora

Replaced brute-force cosine similarity with FAISS approximate nearest neighbor; achieved sub-100ms response on 10,000+ document chunks
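
To illustrate why approximate search is faster, the sketch below contrasts an exhaustive cosine scan with an inverted-file (IVF) style lookup that scores only the vectors in the cells nearest to the query. FAISS offers several index types, and the IVF layout is assumed here purely for illustration. The random vectors, the sampled centroids (a stand-in for trained k-means centroids), and all function names are hypothetical.

```python
import math
import random

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def exact_search(query, vectors, top_k):
    # Brute force: score every vector against the query.
    ranked = sorted(range(len(vectors)), key=lambda i: -cosine(query, vectors[i]))
    return ranked[:top_k]

def build_cells(vectors, centroids):
    # Assign each vector to the cell of its most similar centroid.
    cells = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
        cells[best].append(i)
    return cells

def approx_search(query, vectors, centroids, cells, top_k, nprobe=1):
    # Visit only the nprobe cells closest to the query, then rank within them.
    order = sorted(range(len(centroids)), key=lambda c: -cosine(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in cells[c]]
    candidates.sort(key=lambda i: -cosine(query, vectors[i]))
    return candidates[:top_k], len(candidates)

rng = random.Random(0)  # fixed seed so the toy data is reproducible
VECTORS = [[rng.random() for _ in range(8)] for _ in range(200)]
CENTROIDS = rng.sample(VECTORS, 8)  # stand-in for trained centroids
CELLS = build_cells(VECTORS, CENTROIDS)

QUERY = VECTORS[17]
EXACT = exact_search(QUERY, VECTORS, top_k=5)
APPROX, SCANNED = approx_search(QUERY, VECTORS, CENTROIDS, CELLS, top_k=5, nprobe=2)
```

`SCANNED` is the number of vectors the approximate search actually scored, which is smaller than the full corpus. The cost is that a true neighbor sitting in an unvisited cell is missed, so raising `nprobe` trades speed back for recall.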

Containerized offline deployment

Pre-built Docker image with all models bundled; eliminated runtime internet dependency; reduced cold-start time through model caching

05 — Results

Outcomes

- Adobe Hackathon Round-2 Finalist: among the top performers from 4,000+ competing teams
- Sub-100ms query response on 10,000+ indexed document chunks
- Multilingual support across 3+ languages with consistent extraction accuracy
- Fully offline-capable containerized deployment

06 — Learnings

What I Learned

- Production readiness requires explicit decisions about deployment environment constraints early in design
- The tradeoff between exact similarity search and approximate nearest-neighbor search matters significantly at scale
- Containerization is not just DevOps; it is a design decision that shapes how the system handles dependencies

Skills Used

Python · Data Engineering · ML/AI · Docker · FastAPI · Search Systems
