Customer Lifetime Value Prediction

A Big Data & Machine Learning Pipeline for Customer Lifetime Value Prediction in Retail


Presented By

Lyaba Farooq

Muskan Asad

Yusra Khan

(Students of Data Science – IMSciences, Peshawar)

1. Problem Identification

This use case is critical because it allows the business to move beyond treating all customers equally, instead focusing efforts on those identified as the most valuable.

Use Case Selected

We selected Customer Lifetime Value (CLV) Prediction from the retail analytics domain.

Business Goal

The aim is to use data analysis to identify high-value customers by predicting their future spending patterns.

Strategic Action

Prediction enables targeted marketing offers, prioritized sales efforts, and proactive customer service to improve retention for key customers.

Project Scope

Design a scalable pipeline that converts raw transaction history into a predicted numerical CLV score for each customer.

2. Data Sourcing

The chosen dataset is Online Retail II, a publicly available, real-world transactional dataset well-suited for a big data application.

Dataset Description

It contains detailed transaction records from a UK-based online retailer, covering December 2009 to December 2011. With roughly one million line items, its volume makes it well-suited to calculating the RFM metrics (Recency, Frequency, Monetary Value) necessary for CLV prediction.

The dataset is publicly available on Kaggle.

3. Pipeline Design

We are designing a robust, scalable pipeline suitable for large, transactional data that facilitates subsequent machine learning tasks.

1. Data Ingestion

Batch processing using Apache Spark to load raw CSV files from HDFS or AWS S3 (distributed reading).

2. Storage Layer

Raw data in HDFS/S3 (Data Lake). Processed, aggregated data in Parquet columnar format for optimized reading speed.

3. Processing (ETL)

Distributed ETL using PySpark. Includes cleaning (removing missing IDs, negative values) and aggregating data to create RFM features.
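
The RFM aggregation at the heart of this ETL step can be sketched compactly in pandas (the PySpark version is a direct translation of the same group-by). The five transactions below are a made-up stand-in for the cleaned Online Retail II table.

```python
import pandas as pd

# Toy transactions standing in for the cleaned dataset; column names follow
# Online Retail II, values are illustrative.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2],
    "Invoice": ["A", "B", "C", "D", "E"],
    "InvoiceDate": pd.to_datetime(
        ["2011-01-05", "2011-03-10", "2011-02-01", "2011-02-20", "2011-03-25"]),
    "Quantity": [2, 1, 5, 3, 4],
    "Price": [10.0, 25.0, 4.0, 6.0, 5.0],
})
tx["Amount"] = tx["Quantity"] * tx["Price"]

# Snapshot date: one day after the last transaction in the observation window.
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

# Aggregate to one row per customer: the RFM feature table.
rfm = tx.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    Frequency=("Invoice", "nunique"),
    Monetary=("Amount", "sum"),
)
print(rfm)
```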

4. Analytics & Modeling

Training regression models (primarily XGBoost Regressor) on feature-engineered data, with Linear Regression as a baseline.

5. Visualization Layer

Dashboards using Tableau or Power BI to present CLV predictions and customer segments for stakeholders.

Full Architectural Diagram is included in the Appendix of the PDF submission.

[Figure: Big Data Pipeline Architecture Diagram]

4. Machine Learning Methodology

Pre-processing Strategy

  • Handling Missing Values: Rows with missing `CustomerID` are removed, and invalid records (negative quantity/price) are also removed.
  • Categorical Variables: The `Country` variable uses One-Hot Encoding because it is a nominal (non-ordered) variable.
  • Feature Engineering: The creation of RFM features (Recency, Frequency, Monetary) is the fundamental step for CLV.
  • Scaling Strategy: Standardization (Z-score scaling) is applied to all numerical features. Tree-based models such as XGBoost are largely insensitive to feature scale, so this step mainly benefits the Linear Regression baseline and keeps features comparable across models.
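
The pre-processing strategy above can be sketched with scikit-learn; the four-customer frame is a made-up illustration of the engineered feature table.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature table after RFM engineering.
X = pd.DataFrame({
    "Recency":   [5, 40, 200, 12],
    "Frequency": [30, 4, 1, 15],
    "Monetary":  [5200.0, 340.0, 25.0, 1800.0],
    "Country":   ["United Kingdom", "France", "United Kingdom", "Germany"],
})

pre = ColumnTransformer([
    # Z-score scaling for the numerical RFM features.
    ("num", StandardScaler(), ["Recency", "Frequency", "Monetary"]),
    # One-hot encoding for the nominal Country variable; unseen countries at
    # prediction time are encoded as all-zeros rather than raising an error.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
])

X_t = pre.fit_transform(X)
print(X_t.shape)  # (4, 6): 3 scaled numerics + 3 country dummies
```

Wrapping both steps in one ColumnTransformer ensures the identical transformation is replayed on new data at prediction time.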

Algorithm Recommendation: XGBoost Regressor

Recommended Model: XGBoost Regressor (Gradient Boosted Decision Tree)

Justification:

  • Problem Type: CLV is a regression task that predicts a continuous future spending value.
  • Data Suitability: XGBoost excels on structured/tabular data common in retail analytics.
  • Robustness: It is highly robust to outliers and skewed distributions, which are common in CLV data.

Dataset Analysis (Online Retail II)

Time-Series Nature

Data is structured, multivariate, and time-series in nature, requiring time-based feature calculation and modeling.

High Imbalance

The target CLV variable is highly right-skewed: a small fraction of customers accounts for most of the spend. Tree-based models such as XGBoost handle this skew robustly.
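
A common complement to a skew-robust model (an editorial aside, not part of the original plan) is to train on a log-transformed target and invert the transform at prediction time:

```python
import numpy as np

# Right-skewed CLV-like target: most customers spend little, a few spend a lot.
clv = np.array([20.0, 35.0, 50.0, 80.0, 120.0, 15000.0])

y = np.log1p(clv)        # compressed target the model would train on
restored = np.expm1(y)   # exact inverse, applied to model predictions

assert np.allclose(restored, clv)
print(y.round(2))
```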

Dimensionality

Raw data is low-dimensional, but feature engineering creates a medium-dimensional dataset (15-25 features).

5. Implementation Plan

Library Selection

  • Data Processing: pyspark (large-scale ETL and aggregation) and pandas (in-memory manipulation)
  • Modeling: xgboost (main regressor) and scikit-learn (scaling and train/test splitting)
  • Visualization: matplotlib and seaborn (feature importance and diagnostic plots)

Pseudo-Code / Logic Flow


1. LOAD raw transaction data into a PySpark DataFrame.
2. CLEAN data: Remove canceled invoices, missing CustomerID, or invalid values.
3. CREATE customer-level features: Group by CustomerID and compute RFM features.
4. DEFINE TARGET (CLV): Calculate future total spend as the CLV label.
5. PRE-PROCESS: Apply One-Hot Encoding and StandardScaler to features.
6. SPLIT the feature set into training (80%) and testing (20%) sets.
7. TRAIN the main model: Fit the XGBoost Regressor on the training set.
8. EVALUATE model: Predict CLV on the test set and calculate evaluation metrics.
9. INTERPRET results: Analyze feature importance from XGBoost.
10. DEPLOY: Save the final trained model for production use.
            

6. Evaluation Metrics

Since CLV is a regression problem, metrics that measure the error magnitude in currency terms are required.

RMSE (Primary Metric)

Root Mean Squared Error. It is the most critical metric because it heavily penalizes large prediction errors. Minimizing large errors on high-value customers is the top business priority.

MAE

Mean Absolute Error. Provides an easily interpretable average error in currency units, useful for business reporting.

R-squared

Coefficient of Determination. It reports the percentage of variance in customer value that the model accounts for.
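
All three metrics can be computed with scikit-learn; the actual and predicted CLV values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual vs. predicted CLV values (in currency units).
y_true = np.array([120.0, 80.0, 300.0, 45.0])
y_pred = np.array([100.0, 95.0, 280.0, 50.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors
mae = mean_absolute_error(y_true, y_pred)           # average error in currency
r2 = r2_score(y_true, y_pred)                       # fraction of variance explained

print(round(rmse, 2), round(mae, 2), round(r2, 3))
```

Taking the square root of `mean_squared_error` keeps the snippet compatible across scikit-learn versions.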

7. Conclusion

This proposal details a comprehensive, scalable Customer Lifetime Value prediction solution utilizing a PySpark-based pipeline and the high-performance XGBoost Regressor. The focus on robust feature engineering (RFM) and the critical evaluation metric of RMSE ensures the resulting model is highly accurate and directly supports strategic business decisions.

Accurate, Scalable, Business-Oriented CLV Prediction Pipeline
