The goal of the project was to build a proof of concept (POC) for predicting a supplier's shipment volume over the next 3 months within a specific industry. The data set consisted of bill of lading (BOL) records from various countries over the past couple of years. Each BOL entry contained information on the exporter and importer, as well as the merchandise classification and volume.
- Data exploration: Not all entries had a valid HS code, which indicates the merchandise type, assigned to them (this mostly affected data prior to 2015 and half of 2016). Additionally, the same importer seemed to appear under slightly different name variations.
- Determining which exporters to use: There were over 20k different exporters with data between China and the US alone. Many also shipped to other countries and had a variety of merchandise types assigned to them.
- Feature Engineering: I attempted to create various exporter-level features from the shipping data, which could be used to improve the accuracy of the models and/or to help segment the exporters into groups that could or could not be predicted.
- Organization & Optimization: There were many options to consider when testing the models: how much data to feed, whether to transform the data to make it more stationary, which features to add (if any), and the hyperparameters of the models themselves. I needed to name the pickled objects that contained the results carefully so I would remember what they represented. Next time I would also keep a separate file that goes over the specifics of each pickled result. I think this would have better enabled me to be more systematic in my approach.
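The record-keeping idea from the last point could look roughly like the sketch below. It is not what was used in the project. The helper name `save_result`, the manifest file name `manifest.jsonl`, and the config keys are all hypothetical, chosen only for illustration. Each result is pickled under a name derived from its configuration, and the same configuration is appended to a plain-text manifest so every pickle can be identified later.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_result(result, config, out_dir):
    """Pickle `result` and log `config` in a JSON-lines manifest (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Build a readable, deterministic file name from the sorted config keys.
    # Assumption: config values are short and safe to use in file names.
    stem = "__".join(f"{key}-{config[key]}" for key in sorted(config))
    pickle_path = out_dir / f"{stem}.pkl"

    with pickle_path.open("wb") as fh:
        pickle.dump(result, fh)

    # Append one JSON record per saved result so the manifest stays greppable.
    with (out_dir / "manifest.jsonl").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"file": pickle_path.name, **config}) + "\n")

    return pickle_path


# Usage with placeholder values (not results from the project).
with tempfile.TemporaryDirectory() as tmp:
    path = save_result(
        {"note": "placeholder result"},
        {"model": "moving_average", "window": 3, "differenced": False},
        tmp,
    )
    saved_name = path.name
    manifest_lines = (Path(tmp) / "manifest.jsonl").read_text(encoding="utf-8").splitlines()
```

Sorting the keys keeps the file name stable regardless of the order in which options are specified, and the manifest can hold details that are too long for a file name.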
Achievements (according to KPIs)
For each exporter, the relative error was calculated, and models were compared based on the share of exporters with a relative error within +/- 20%. The baseline model used the moving average of the past 3, 4, or 12 months, and 2018 Q3 was predicted. Despite not having great results (around 37% when tested on the initial 1k-exporter segment), the baseline proved tough to beat. The rate increased to around 42% on the same group when the exporters were divided into segments based on model performance, and was around 40% for simple RNNs.
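A minimal sketch of the baseline and the KPI, under stated assumptions, is given below. The write-up does not define the relative error or how the quarterly figure was derived from the moving average. Here the relative error is assumed to be (predicted - actual) / actual, and the quarterly forecast is assumed to be three times the monthly moving average. The exporter names and volumes are made up for illustration.

```python
from statistics import mean


def moving_average_forecast(history, window):
    """Forecast the next month as the mean of the last `window` months."""
    if window <= 0 or len(history) < window:
        raise ValueError("history must contain at least `window` observations")
    return mean(history[-window:])


def relative_error(predicted, actual):
    """Signed relative error (assumed definition); undefined for a zero actual."""
    if actual == 0:
        raise ValueError("relative error is undefined when the actual volume is zero")
    return (predicted - actual) / actual


def hit_rate(predictions, actuals, tolerance=0.20):
    """Share of exporters whose relative error is within +/- `tolerance`."""
    hits = sum(
        abs(relative_error(predictions[name], actuals[name])) <= tolerance
        for name in predictions
    )
    return hits / len(predictions)


# Hypothetical monthly volumes per exporter (illustrative only).
monthly_history = {
    "exporter_a": [100, 120, 80, 100, 110, 90],
    "exporter_b": [10, 10, 10, 40, 50, 60],
    "exporter_c": [20, 20, 20, 20, 20, 20],
}
# Hypothetical actual quarterly volumes for the predicted quarter.
actual_quarter = {"exporter_a": 330, "exporter_b": 100, "exporter_c": 55}

# Assumption: quarterly forecast = 3 x the 3-month moving average.
quarter_forecast = {
    name: 3 * moving_average_forecast(history, window=3)
    for name, history in monthly_history.items()
}

rate = hit_rate(quarter_forecast, actual_quarter)
```

With these made-up numbers, two of the three exporters fall within the tolerance. The same `hit_rate` function can score any other model's predictions, which keeps the comparison against the baseline consistent.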
Linking the exporters to their Chinese names, in order to obtain additional non-shipment data about them, was not attempted. Linking the two sources could perhaps provide meaningful features that improve the predictive power.