This project predicted daily sales for product categories at Walmart stores in California. The input data (06/19/2015–06/19/2016) covered item IDs, item sales, item prices, departments, product categories, store IDs, and holiday/special events.
- Investigated the impact of holidays on daily sales
- Explored trends at the state, category, and store levels
- Trained ARIMA, Decision Tree, Random Forest, and LightGBM
Impacts of Holidays
Before starting to refine this project, I believed people stocked up on goods a few days before holidays/special events and might keep shopping for a few days afterwards, so I planned to change the feature engineering by marking the days around holidays as “special” too. The data did not support this.
The above image shows the daily total sales for the Hobbies_1 category at one CA Walmart store (store id: CA_1). Holidays overlapped with both local minima and local maxima. Furthermore, the series is seasonal, so I could not tell whether the local extremes were part of the seasonality, the real impact of holidays, or both. This observation is confirmed by zooming into different parts of the graph:
The above graph zooms into the rightmost part of the first graph, showing the most recent sales records. It confirms that holidays coincided with both local maxima and local minima, and that when holidays fell close together, their total daily sales were close too. I therefore dropped the original plan of treating the days around holidays as holidays and relied more on seasonality analysis.
Similar and Dissimilar Trend
With a massive dataset of sales records for each item per day at each Walmart store over one year, I obviously had to choose the level at which to make predictions. Not every level is feasible: my work from half a year ago showed that predicting the daily sales of each item with basic machine learning algorithms is impractical, both because the data are sparse (many items went unsold on many days) and because item-level sales show little pattern. What could I do with this dataset then?
By plotting the trend of daily total sales at the state, category, and store levels, I chose the category level as the basis for my models, because the trends of the three categories were clearly separable from one another and showed obvious seasonality:
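Collapsing the sparse item-level records into category-level daily totals is a single groupby. A minimal pandas sketch, using hypothetical column names (`item_id`, `cat_id`, `date`, `sales`) rather than the dataset's actual schema:

```python
import pandas as pd

# Hypothetical long-format sales frame: one row per item per day.
# The column names and values here are illustrative, not the real schema.
sales = pd.DataFrame({
    "item_id": ["FOODS_1_001", "HOBBIES_1_002", "FOODS_1_001", "HOBBIES_1_002"],
    "cat_id":  ["FOODS", "HOBBIES", "FOODS", "HOBBIES"],
    "date":    pd.to_datetime(["2015-06-19", "2015-06-19",
                               "2015-06-20", "2015-06-20"]),
    "sales":   [3, 1, 5, 0],
})

# Collapse item-level records into one daily total per category.
category_daily = (sales
                  .groupby(["cat_id", "date"], as_index=False)["sales"]
                  .sum())
```

Each category's rows then form one dense daily time series, which sidesteps the item-level sparsity problem.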
In the above graph, dashed vertical lines mark holidays/special events between Thanksgiving and Christmas in 2015. There is a significant drop around Christmas, especially for the Foods category.
Half a year ago, the only model I was able to train on the original massive dataset was LightGBM, because of the long runtime. This time, by working at the category level (Foods, Households, Hobbies), I was able to explore more options. I focused on the Foods category.
The classic ARIMA came to mind, because extra variables such as holidays turned out to be less useful than expected. I first used the Dickey-Fuller test to check the differencing order. It turned out that only the Hobbies time series was stationary (p-value below 0.05); the others (Foods, Households) needed positive differencing orders. Which order? I created first-, second-, and third-differenced series and found that the p-value jumped up and down in a cycle. So I trained ARIMA only on the Hobbies category rather than spend more time on higher differencing orders. Below are the ACF and PACF plots. The PACF plot was close to ideal: there is a cutoff at lag 1 or 2 (setting p=1 or p=2 in ARIMA(p, d, q) gives similar results). The ACF, however, was the problem: starting from the first lag, it was hard to tell whether it decayed gradually or had a sharp cutoff.
As guessed, predictive results of ARIMA were not good:
Does SARIMA (ARIMA with seasonality) improve the prediction at all? No! Without a clean series with clear autoregressive orders, SARIMA couldn't help and generated similar forecasts.
Tree Algorithms
I love forests. They're forgiving: you throw a lot of features at them, they split nodes at the appropriate points, and they generate good results for you. In this case, I created first and second lags of sales and threw them in together with the other features.
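The lag features can be built with pandas `shift`. A small sketch with made-up daily totals; the real feature set includes the other variables mentioned above:

```python
import pandas as pd

# Illustrative daily category-level totals (values are made up).
daily = pd.DataFrame({"sales": [40, 42, 39, 45, 44, 41]},
                     index=pd.date_range("2015-06-19", periods=6))

# First and second lags of sales as model features; rows without a
# full set of lags are dropped before training.
daily["lag_1"] = daily["sales"].shift(1)
daily["lag_2"] = daily["sales"].shift(2)
features = daily.dropna()
```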
Not surprisingly, Random Forest outperformed a single Decision Tree (0.25 vs. 0.27 mean absolute percentage error). This is expected, because a Random Forest is a bagged ensemble of decision trees, essentially an upgrade built on top of a single tree. I used mean absolute percentage error (MAPE) because I wanted the error expressed relative to the true values. The only reason I didn't use the popular root-mean-square error (RMSE) is that daily sales stay within a seasonal range; there are no extremely large or small predicted values to worry about.
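The comparison can be sketched with scikit-learn and a hand-rolled MAPE. The data here is a synthetic regression problem standing in for the real lag features, not the actual sales data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

def mape(y_true, y_pred):
    """Mean absolute percentage error: error relative to the true values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs((y_true - y_pred) / y_true))

# Toy feature matrix standing in for the lagged-sales features.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = 50 + X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = X[:160], X[160:], y[:160], y[160:]

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
```

On held-out data the bagged forest averages away the single tree's variance, which is the same effect behind the 0.25 vs. 0.27 MAPE gap above.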
Finally, I tuned the hyperparameters of LightGBM. Its performance was very similar to Random Forest's (both 0.25 MAPE). LightGBM is not my favorite model, because you have to tune many more hyperparameters. I do love the ideas of growing trees leaf-wise and of focusing on the errors of the previous stage, though!
Thanks for reading :)