Data Quantity: The Decisive Factor in Selecting Your Machine Learning Approach
In my fifth semester, I took a course on Knowledge-Based Engineering, which quickly became my favorite because it aligned so closely with my interests. For the final project, I had to select one of the topics provided by the lecturer. After reviewing the options, my attention was drawn to ‘Pengembangan Sistem Explainable AI untuk Prediksi Permintaan Dalam Supply Chain Management’ (Development of an Explainable AI System for Demand Prediction in Supply Chain Management). This topic is particularly intriguing because it addresses forecasting, a widely researched theme in machine learning, and extends it with additional AI approaches. The topic’s package included both a dataset and suggested methodologies.
This dataset is a snapshot of a fictional retail landscape, capturing essential attributes that drive retail operations and customer interactions. It includes key details such as Transaction ID, Date, Customer ID, Gender, Age, Product Category, Quantity, Price per Unit, and Total Amount. These attributes enable a multifaceted exploration of sales trends, demographic influences, and purchasing behaviors.
- Transaction ID: A unique identifier for each transaction, allowing tracking and reference.
- Date: The date when the transaction occurred, providing insights into sales trends over time.
- Customer ID: A unique identifier for each customer, enabling customer-centric analysis.
- Gender: The gender of the customer (Male/Female), offering insights into gender-based purchasing patterns.
- Age: The age of the customer, facilitating segmentation and exploration of age-related influences.
- Product Category: The category of the purchased product (e.g., Electronics, Clothing, Beauty), helping understand product preferences.
- Quantity: The number of units of the product purchased, contributing to insights on purchase volumes.
- Price per Unit: The price of one unit of the product, aiding in calculations related to total spending.
- Total Amount: The total monetary value of the transaction, showcasing the financial impact of each purchase.
Each column in this dataset plays a pivotal role in unraveling the dynamics of retail operations and customer behavior. By exploring and analyzing these attributes, you’ll unearth trends, patterns, and correlations that shed light on the complex interplay between customers and products in a retail setting.
An overview of the dataset shows that it is remarkably clean: there are no missing values, no duplicate rows, and every entry is valid. Given this, the EDA does not need to be extensive, and a naive forecasting baseline can be produced quickly as a first touch.
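To double-check this claim, a quick inspection with pandas is enough. The snippet below is my own minimal sketch rather than part of the original project, and the file name retail_sales_dataset.csv is an assumption.
import pandas as pd
# Load the dataset (the file name is an assumption for illustration)
df = pd.read_csv('retail_sales_dataset.csv')
# Verify the claim: no missing values, no duplicate rows, sensible dtypes
print(df.isna().sum())        # missing values per column
print(df.duplicated().sum())  # count of fully duplicated rows
print(df.dtypes)              # data type of each column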
Given this kind of dataset, my idea for the solution was to use classical regression, because the data volume is nowhere near big-data scale; data quantity is exactly what decides the approach here. So I chose XGBoost with basic optimization techniques, namely hyperparameter tuning and a cross-validation scheme.
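The modeling code below assumes a feature matrix X and a target y already exist. As a bridge, here is a minimal sketch of one way they could be prepared; reusing df from the snippet above, the choice of encoded columns, and Total Amount as the regression target are all my assumptions.
# Feature preparation sketch (assumptions: df from the snippet above,
# 'Total Amount' chosen as the regression target)
df['Date'] = pd.to_datetime(df['Date'])
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year
# One-hot encode the categorical columns; keep the numeric ones as-is
X = pd.get_dummies(
    df[['Gender', 'Age', 'Product Category', 'Quantity',
        'Price per Unit', 'Month', 'Year']],
    columns=['Gender', 'Product Category']
)
y = df['Total Amount']
With X and y in place, the modeling itself follows: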
import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Hyperparameter grid for tuning
param_grid = {
    'n_estimators': [100, 500, 1000],
    'learning_rate': [0.01, 0.1, 0.3],
    'max_depth': [3, 5, 7]
}
# Initialize and train the XGBoost model with 3-fold cross-validated grid search
model = xgb.XGBRegressor(objective='reg:squarederror', eval_metric='rmse')
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3,
                           scoring='neg_mean_squared_error', verbose=1)
grid_search.fit(X_train, y_train)
# Keep the best estimator found by the search (reused below for SHAP)
best_model = grid_search.best_estimator_
# Evaluate on the held-out test set
y_pred = best_model.predict(X_test)
print('RMSE:', mean_squared_error(y_test, y_pred) ** 0.5)
print('R2:', r2_score(y_test, y_pred))
Running this gives us the evaluation scores. The results are fascinating: the R² score is nearly one. To gain more confidence, we can visualize the predictions against the actual values on a line chart:
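The chart from the original post is not reproduced here, but a comparison like it could be drawn roughly as follows; keeping the test-set dates retrievable via the DataFrame index is my assumption.
import matplotlib.pyplot as plt
# Aggregate actual vs predicted sales per year-month (assumes X_test
# kept the original DataFrame index, so dates can be looked up in df)
results = pd.DataFrame({
    'period': df.loc[X_test.index, 'Date'].dt.to_period('M'),
    'actual': y_test,
    'predicted': best_model.predict(X_test)
})
monthly = results.groupby('period').sum(numeric_only=True)
monthly.plot(kind='line', marker='o')
plt.xlabel('Year-Month')
plt.ylabel('Total Sales')
plt.title('Actual vs Predicted Sales per Month')
plt.show()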
Looking at the predicted and actual sales values over each year-month, the two series track each other closely. Furthermore, we can apply explainable AI with SHAP to gain insight into which attributes contribute most to the model's predictions.
import shap
# Explainable AI with SHAP: explain the tuned model's predictions
explainer = shap.Explainer(best_model)
shap_values = explainer(X_test)
# Plot SHAP values for a single prediction
shap.plots.waterfall(shap_values[0])
# Summary plot of SHAP values for all predictions
shap.plots.beeswarm(shap_values)