A Big Data & Machine Learning Pipeline for Customer Lifetime Value Prediction in Retail
We selected Customer Lifetime Value (CLV) Prediction from the retail analytics domain. The aim is to use data analysis to identify high-value customers by predicting their future spending patterns. This use case is critical because it allows the business to move beyond treating all customers equally and instead focus effort on those identified as the most valuable.
Prediction enables targeted marketing offers, prioritized sales efforts, and proactive customer service to improve retention for key customers.
Design a scalable pipeline that converts raw transaction history into a predicted numerical CLV score for each customer.
The chosen dataset is Online Retail II, a publicly available, real-world transactional dataset well-suited for a big data application.
It contains detailed transaction records from a UK-based online retail store. Its large volume makes it ideal for calculating RFM metrics (Recency, Frequency, Monetary Value) necessary for CLV prediction.
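The RFM metrics mentioned above can be computed with a single pass over the transaction log. The sketch below uses a handful of hypothetical transactions (illustrative values, not drawn from Online Retail II) to show the definitions: Recency is days since the last purchase relative to a snapshot date, Frequency is the invoice count, and Monetary Value is total spend.

```python
from datetime import date

# Hypothetical transactions: (customer_id, invoice_date, amount).
transactions = [
    ("C1", date(2011, 1, 10), 50.0),
    ("C1", date(2011, 3, 5), 30.0),
    ("C2", date(2011, 2, 20), 200.0),
]
snapshot = date(2011, 4, 1)  # reference date for computing recency

# Per-customer (recency_days, frequency, monetary) accumulated in one pass.
rfm = {}
for cust, d, amount in transactions:
    rec, freq, mon = rfm.get(cust, (None, 0, 0.0))
    days_since = (snapshot - d).days
    rec = days_since if rec is None else min(rec, days_since)
    rfm[cust] = (rec, freq + 1, mon + amount)

# rfm["C1"] == (27, 2, 80.0): last purchase 27 days before the snapshot,
# 2 invoices, 80.0 total spend.
```

On the real dataset this same group-by logic runs as a distributed aggregation in Spark rather than a Python loop, but the metric definitions are identical.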
We are designing a robust, scalable pipeline suitable for large transactional data that facilitates subsequent machine learning tasks.
Batch processing using Apache Spark to load raw CSV files from HDFS or AWS S3 (distributed reading).
Raw data in HDFS/S3 (Data Lake). Processed, aggregated data in Parquet columnar format for optimized reading speed.
Distributed ETL using PySpark. Includes cleaning (removing missing IDs, negative values) and aggregating data to create RFM features.
Training regression models (primarily XGBoost Regressor) on feature-engineered data, with Linear Regression as a baseline.
Dashboards using Tableau or Power BI to present CLV predictions and customer segments for stakeholders.
Full Architectural Diagram is included in the Appendix of the PDF submission.
Recommended Model: XGBoost Regressor (Gradient Boosted Decision Tree)
Justification:
Data is structured, multivariate, and time-series in nature, requiring time-based feature calculation and modeling.
The target CLV variable is heavily right-skewed, with a small share of customers accounting for most revenue. Tree-based models such as XGBoost handle this non-linear, skewed target more robustly than linear baselines.
Raw data is low-dimensional, but feature engineering creates a medium-dimensional dataset (15-25 features).
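One common complementary tactic for a right-skewed regression target (an assumption here, not stated in the proposal) is to train on log(1 + CLV) and invert the model's predictions back to currency units. A minimal round-trip sketch:

```python
import math

# Hypothetical, heavily right-skewed CLV values: most customers spend little,
# a few spend a lot.
clv = [10.0, 25.0, 40.0, 5000.0]

# Train-time transform: log1p compresses the long right tail so large
# customers do not dominate the squared-error loss.
log_clv = [math.log1p(v) for v in clv]

# Inference-time inverse: expm1 maps model output back to currency units.
recovered = [math.expm1(v) for v in log_clv]
```

The round trip is exact up to floating point, so the transform costs nothing in interpretability while stabilising training on skewed targets.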
1. LOAD raw transaction data into a PySpark DataFrame.
2. CLEAN data: Remove cancelled invoices, rows with missing CustomerID, and rows with negative or invalid quantities or prices.
3. CREATE customer-level features: Group by CustomerID and compute RFM features.
4. DEFINE TARGET (CLV): Calculate future total spend as the CLV label.
5. PRE-PROCESS: Apply One-Hot Encoding and StandardScaler to features.
6. SPLIT the feature set into training (80%) and testing (20%) sets.
7. TRAIN the main model: Fit the XGBoost Regressor on the training set.
8. EVALUATE model: Predict CLV on the test set and calculate evaluation metrics.
9. INTERPRET results: Analyze feature importance from XGBoost.
10. DEPLOY: Save the final trained model for production use.
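Step 4 above is the subtle one: the CLV label must be computed only from transactions *after* a cutoff date, while features come from transactions before it, otherwise the label leaks into the features. A minimal sketch of that split, using hypothetical data:

```python
from datetime import date

# Hypothetical transactions: (customer_id, invoice_date, amount).
transactions = [
    ("C1", date(2010, 12, 1), 100.0),  # observation window: feeds RFM features
    ("C1", date(2011, 6, 1), 60.0),    # future window: contributes to the label
    ("C2", date(2010, 11, 15), 80.0),  # observation window only
]
cutoff = date(2011, 1, 1)

features_window = [t for t in transactions if t[1] < cutoff]
label_window = [t for t in transactions if t[1] >= cutoff]

# CLV label = total future spend per customer; customers with no future
# purchases keep a label of 0.0 (they are still valid training rows).
customers = {t[0] for t in features_window}
clv_label = {c: 0.0 for c in customers}
for cust, _, amount in label_window:
    if cust in clv_label:
        clv_label[cust] += amount

# clv_label == {"C1": 60.0, "C2": 0.0}
```

In the full pipeline this windowing is a filter plus a group-by in PySpark; the important design choice is that the train/test split of step 6 happens across customers, while this cutoff splits each customer's history in time.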
Since CLV prediction is a regression problem, metrics that measure the error magnitude in currency terms are required.
Root Mean Squared Error. It is the most critical metric because it heavily penalizes large prediction errors. Minimizing large errors on high-value customers is the top business priority.
Mean Absolute Error. Provides an easily interpretable average error in currency units, useful for business reporting.
R-squared (Coefficient of Determination). Explains the percentage of variance in customer value that the model accounts for.
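The three metrics are simple enough to state directly; a self-contained sketch (in practice the pipeline would use a library implementation such as scikit-learn's, assumed here to match these definitions):

```python
import math

def rmse(y_true, y_pred):
    # Root Mean Squared Error: penalises large errors quadratically.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    # Mean Absolute Error: average error in currency units.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    # Coefficient of determination: share of target variance explained.
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy predictions, each off by 10 currency units:
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 310.0]
# rmse == 10.0 and mae == 10.0 here; RMSE exceeds MAE whenever errors vary
# in size, which is exactly why it punishes large misses on high-value customers.
```

Reporting all three together covers the business's needs: RMSE for model selection, MAE for stakeholder communication, and R-squared for overall fit.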
This proposal details a comprehensive, scalable Customer Lifetime Value prediction solution utilizing a PySpark-based pipeline and the high-performance XGBoost Regressor. The focus on robust feature engineering (RFM) and the critical evaluation metric of RMSE ensures the resulting model is highly accurate and directly supports strategic business decisions.
Accurate, Scalable, Business-Oriented CLV Prediction Pipeline