- Building a database of 2M+ US contractors based on their online presence in marketplaces and databases such as Yelp, Google My Business, Angie’s List, and BBB, using APIs and scraping.
- Cleaning and merging data from multiple data sources to generate a unified contractor profile.
- Generating an offline lifetime value ML model for each contractor, based on their probability of staying in business over the next one year and three years.
- [Added during the project: using synthetic data, showing that project-based insurance is a more profitable business model than traditional annual insurance.]
- Soon after the project started, it became clear that its aims were ambitious; with more time and resources they might have been achievable.
- Public and proprietary data were scarce, in particular the historical data needed to predict the probability of a contractor staying in business. We were able to access a historical public dataset of businesses in Phoenix, AZ, published by Yelp in 2013, de-anonymise the businesses, and query the current Yelp database through its API to determine whether they remained open.
- Once we had constructed our database, we discovered that the dataset was extremely imbalanced: most contractors had remained open over the 5-year period. We explored several techniques to address this (oversampling, SMOTE, feature selection), which enabled us to create a relatively accurate model.
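
As an illustration of the oversampling approach mentioned above, here is a minimal sketch of random oversampling of the minority class using only the Python standard library. The function name `oversample_minority` and the toy data are hypothetical and not taken from the project; the project's actual pipeline is not described here. SMOTE differs in that it generates synthetic minority examples rather than duplicating existing ones.

```python
import random
from collections import Counter


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)

    # Group feature rows by their class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap to the largest class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Hypothetical toy data: label 1 = still open, label 0 = closed.
rows = [[i] for i in range(10)]
labels = [1] * 8 + [0] * 2

balanced_rows, balanced_labels = oversample_minority(rows, labels)
print(Counter(balanced_labels))
```

In a sketch like this, oversampling would normally be applied to the training split only, so that duplicated rows do not leak into the evaluation data.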
Achievements (according to KPIs)
- A database of 150k contractors built from the Yelp API
- A deck with charts describing the competency model
- A deck with charts detailing the project-based business model proof of concept
With more time, we would have
- Explored data sources other than Yelp
- Revisited the database in a year to determine which contractors remained open
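
The open/closed check used to label the historical Yelp businesses, and the proposed revisit a year later, could be sketched roughly as follows. This is an assumption-laden sketch, not the project's code: the endpoint URL and the `is_closed` field reflect our understanding of the Yelp Fusion business details response and should be checked against the current Yelp documentation; the function names and sample records are hypothetical.

```python
import json
import urllib.error
import urllib.request

# Assumed Yelp Fusion business details endpoint; verify against Yelp's docs.
YELP_BUSINESS_URL = "https://api.yelp.com/v3/businesses/{business_id}"


def fetch_business(business_id, api_key):
    """Return the parsed business record, or None if the listing
    is no longer found (HTTP 404)."""
    request = urllib.request.Request(
        YELP_BUSINESS_URL.format(business_id=business_id),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def survival_label(record):
    """Map a business record to 'open', 'closed', or 'unknown'."""
    if record is None:
        # Assumption: a missing listing is not counted as a closure.
        return "unknown"
    is_closed = record.get("is_closed")
    if is_closed is None:
        return "unknown"
    return "closed" if is_closed else "open"


# Hypothetical records standing in for API responses (no network call).
sample_records = {
    "biz-a": {"is_closed": False},
    "biz-b": {"is_closed": True},
    "biz-c": None,
}
survival = {bid: survival_label(rec) for bid, rec in sample_records.items()}
print(survival)
```

Keeping the labelling step separate from the network call means the same `survival_label` function can be rerun on a later snapshot of the database without changes.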