ML/AI

Airline Fare Prediction System

ML system that predicts airline ticket prices using ensemble models trained on 50,000+ fare records. Includes booking lead time analysis and an interactive Streamlit interface.

Python, Scikit-Learn, Streamlit, Pandas, RandomForest, GradientBoosting

01 — Problem

The Challenge

Airline ticket pricing is dynamic and opaque. Prices fluctuate with route, travel time, booking lead time, airline, and seat availability. Travelers lack tools to objectively assess whether a current fare is reasonable or likely to change, which leads to poorly timed purchases.

For frequent travelers and budget-conscious individuals, even a 15-20% improvement in fare prediction accuracy translates to meaningful savings. The problem is also well defined and public fare datasets are available, which makes it a good fit for supervised ML.

02 — Approach

How I Approached It

Trained ensemble ML models (RandomForest and GradientBoosting) on a dataset of 50,000+ historical fare records. Engineered features including booking lead time, route characteristics, day-of-week, and airline class. Built a Streamlit interface for interactive fare prediction with confidence intervals.
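The feature engineering described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names (`booking_date`, `departure_date`, `source`, `destination`, `airline`), the sample rows, and the lead time bucket edges are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical raw fare records; column names and values are illustrative only.
raw = pd.DataFrame({
    "airline": ["IndiGo", "Vistara", "IndiGo", "Air India"],
    "source": ["DEL", "DEL", "BOM", "BLR"],
    "destination": ["BOM", "BLR", "DEL", "DEL"],
    "booking_date": pd.to_datetime(
        ["2023-01-01", "2023-01-10", "2023-02-01", "2023-02-20"]),
    "departure_date": pd.to_datetime(
        ["2023-01-03", "2023-02-09", "2023-02-11", "2023-02-21"]),
})


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive lead time, temporal, and route features from raw records."""
    out = df.copy()
    # Booking lead time: days between booking and departure.
    out["lead_days"] = (out["departure_date"] - out["booking_date"]).dt.days
    # Lead time buckets; the bin edges here are an assumption.
    out["lead_bucket"] = pd.cut(
        out["lead_days"],
        bins=[-1, 3, 14, 30, 365],
        labels=["0-3d", "4-14d", "15-30d", "31d+"],
    )
    # Temporal features: day of week (Monday=0) and a weekend flag.
    out["dep_dow"] = out["departure_date"].dt.dayofweek
    out["is_weekend"] = out["dep_dow"].isin([5, 6]).astype(int)
    # Route as a single categorical key, then one-hot encoded.
    out["route"] = out["source"] + "-" + out["destination"]
    cols = ["lead_days", "lead_bucket", "dep_dow", "is_weekend", "route", "airline"]
    return pd.get_dummies(out[cols], columns=["lead_bucket", "route", "airline"])


features = engineer_features(raw)
print(features.columns.tolist())
```

One-hot encoding is only one option for route and airline; tree-based models can also work with ordinal codes, which keeps the feature matrix narrower.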

Architecture

  • 01 Data ingestion and cleaning pipeline using Pandas
  • 02 Feature engineering: lead time buckets, route encoding, temporal features
  • 03 Model training: RandomForest and GradientBoosting with cross-validation
  • 04 Ensemble combination: weighted average of model predictions
  • 05 Streamlit web interface for interactive prediction and visualization
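Steps 03 and 04 can be sketched as below. The data is synthetic, and the weighting rule (weights inversely proportional to each model's cross-validated RMSE) is an assumption for illustration; the write-up does not state how the ensemble weights were chosen.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the engineered feature matrix and fare target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(400, 4))
y = 3000 + 4000 * X[:, 0] ** 2 + 1500 * X[:, 1] * X[:, 2] + rng.normal(0, 200, 400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "rf": RandomForestRegressor(n_estimators=100, random_state=0),
    "gb": GradientBoostingRegressor(random_state=0),
}

# Step 03: 5-fold cross-validation on the training set, then a final fit.
cv_rmse = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X_train, y_train, cv=5, scoring="neg_root_mean_squared_error")
    cv_rmse[name] = -scores.mean()
    model.fit(X_train, y_train)

# Step 04: weighted average; lower CV error earns a larger weight (assumed rule).
inverse = {name: 1.0 / rmse for name, rmse in cv_rmse.items()}
total = sum(inverse.values())
weights = {name: value / total for name, value in inverse.items()}

ensemble_pred = sum(weights[name] * models[name].predict(X_test) for name in models)
print(weights)
```

Because the weights are non-negative and sum to one, each ensemble prediction lies between the two individual model predictions.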

03 — Technology

Technology Choices and Why

RandomForest

Handles non-linear fare relationships well; robust to outliers in pricing data

GradientBoosting

Captures complex feature interactions; typically outperforms a single decision tree on tabular pricing data

Streamlit

Rapid prototyping of interactive ML interfaces without frontend engineering overhead

Pandas

Efficient manipulation of 50K+ row dataset; rich API for feature engineering operations

04 — Challenges

Obstacles and Solutions

High variance in fare data

Applied a log transformation to the price target variable; reduced RMSE by approximately 23% compared with predicting raw price
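One way to apply such a transformation in scikit-learn is `TransformedTargetRegressor`, which fits on the transformed target and maps predictions back to the original price scale. The sketch below uses synthetic right-skewed fares and `log1p`/`expm1` as the transform pair; both choices are assumptions, and it does not attempt to reproduce the reported RMSE change.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic right-skewed fares standing in for the real price column.
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 3))
y = np.exp(8.0 + 1.2 * X[:, 0] + rng.normal(0, 0.3, size=300))

# The wrapper trains the regressor on log1p(price) and applies expm1
# to its outputs, so predictions come back in the original currency units.
model = TransformedTargetRegressor(
    regressor=GradientBoostingRegressor(random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)

pred = model.predict(X[:5])
print(pred)
```

Any RMSE comparison against a raw-price model should be computed on the original price scale, which this wrapper makes straightforward because `predict` already returns prices.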

Feature leakage risk

Used a strict temporal train/test split; verified that no features derived from future dates entered the training set
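A temporal split of this kind might look like the following sketch. The helper name `temporal_split`, the `booking_date` column, and the 80/20 ratio are hypothetical; the point is only that every training row precedes every test row in time.

```python
import pandas as pd


def temporal_split(df: pd.DataFrame, date_col: str, test_fraction: float = 0.2):
    """Split so that all training rows are strictly earlier than all test rows."""
    ordered = df.sort_values(date_col).reset_index(drop=True)
    n_test = max(1, round(len(ordered) * test_fraction))
    # Split on the date itself so rows sharing the cutoff date stay together.
    cutoff = ordered.loc[len(ordered) - n_test, date_col]
    train = ordered[ordered[date_col] < cutoff]
    test = ordered[ordered[date_col] >= cutoff]
    return train, test


# Hypothetical fare records, one per day.
fares = pd.DataFrame({
    "booking_date": pd.date_range("2023-01-01", periods=10, freq="D"),
    "price": [5000, 5200, 4800, 6100, 5900, 7000, 6500, 7200, 8000, 7600],
})

train, test = temporal_split(fares, "booking_date", test_fraction=0.2)

# Leakage check: the latest training date must precede the earliest test date.
assert train["booking_date"].max() < test["booking_date"].min()
print(len(train), len(test))
```

Any statistics used in feature engineering, such as route-level average fares, would also need to be computed from the training portion only to keep the split leak-free.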

Model interpretability

Added SHAP value analysis to explain top features driving individual predictions; useful for communicating model behavior

05 — Results

Outcomes

  • Predictions fall within 15% of the actual fare for 78% of test cases
  • Trained on 50,000+ fare records across 6 major Indian routes
  • Booking lead time identified as highest-impact feature for fare variance
  • Interactive Streamlit UI with confidence interval display
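The "within 15%" figure can be read as the share of test cases whose relative error, measured against the actual fare, is at most 0.15. That reading is an assumption, and the function name and sample fares below are made up for illustration.

```python
import numpy as np


def fraction_within_tolerance(y_true, y_pred, tolerance=0.15):
    """Share of predictions whose relative error is at most `tolerance`."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    relative_error = np.abs(y_pred - y_true) / y_true
    return float(np.mean(relative_error <= tolerance))


# Made-up fares: four of the five predictions are within 15% of the actual value.
actual = [5000, 6200, 4100, 8800, 7300]
predicted = [5400, 6100, 5200, 8000, 7250]

share = fraction_within_tolerance(actual, predicted)
print(share)
```

This metric is easier to communicate to travelers than RMSE, since it answers directly how often a prediction is close enough to act on.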

06 — Learnings

What I Learned

  • Target variable transformation (log-price) often matters more than model selection for skewed financial data
  • Feature engineering contributes more to performance than hyperparameter tuning in most tabular ML problems
  • Model interpretability is a product requirement, not just a nice-to-have
