Using scraped data and public APIs to predict small business closure

Project by Richard

Abstract

  • Building a database of 2M+ US contractors based on their online presence in marketplaces and databases such as Yelp, Google My Business, Angie’s List, BBB, etc., using APIs and scraping.
  • Cleaning and merging data from multiple data sources to generate a unified contractor profile.
  • Generating an offline lifetime-value ML model for each contractor, based on their probability of staying in business over the next 1 / 3 years.
  • [Added during the project: Using synthetic data, show that project-based insurance is a more profitable business model than traditional annual insurance]
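The cleaning-and-merging step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the field names and the normalized (name, phone) matching key are assumptions for the example.

```python
import re

def normalize(name: str) -> str:
    """Lowercase, strip punctuation and common legal suffixes so the
    same contractor matches across sources (illustrative heuristic)."""
    name = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return re.sub(r"\b(llc|inc|co|corp)\b", "", name).strip()

def merge_profiles(records):
    """Merge records from multiple sources, keyed on a normalized
    (name, phone) pair, into one unified profile per contractor."""
    profiles = {}
    for rec in records:
        key = (normalize(rec["name"]), rec.get("phone"))
        profile = profiles.setdefault(key, {"sources": []})
        profile["sources"].append(rec["source"])
        # Keep the first non-empty value seen for each field
        for field, value in rec.items():
            if field != "source" and value and field not in profile:
                profile[field] = value
    return list(profiles.values())

# Hypothetical records for the same plumber from two sources
records = [
    {"source": "yelp", "name": "Acme Plumbing LLC",
     "phone": "602-555-0100", "rating": 4.5},
    {"source": "bbb", "name": "ACME Plumbing",
     "phone": "602-555-0100", "accredited": True},
]
merged = merge_profiles(records)  # one profile combining both sources
```

In practice, fuzzy matching on addresses and business names is usually needed on top of a key like this, since listings rarely agree exactly across marketplaces.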


Challenges

  • It became clear soon after the start of the project that its aims were ambitious; with more time and resources they might have been achievable.
  • Lack of available public or proprietary data, in particular the historical data needed to predict the probability of a contractor staying in business. We were able to access a historical public dataset of businesses in Phoenix, AZ, published by Yelp in 2013, de-anonymise it, and query the current Yelp database via the API to determine whether each business remained open.
  • Once we had constructed our database, we discovered that the dataset was extremely imbalanced: most contractors had remained open over the 5-year period. We explored various techniques (oversampling, SMOTE, feature selection) that enabled us to build a reasonably accurate model.
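The open/closed labelling step can be sketched against the Yelp Fusion API (v3). The business-details endpoint and its `is_closed` field are from Yelp's public API documentation; the `API_KEY` placeholder and the labelling helper are assumptions for the example, not the project's actual code.

```python
import json
import urllib.request

API_KEY = "YOUR_YELP_API_KEY"  # placeholder; the Fusion API requires a bearer token

def fetch_business(business_id: str) -> dict:
    """Fetch business details from the Yelp Fusion API (v3)."""
    req = urllib.request.Request(
        f"https://api.yelp.com/v3/businesses/{business_id}",
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

def label_open(business: dict) -> int:
    """Binary target for the model: 1 if still open, else 0.
    `is_closed` is the flag returned by the Fusion API."""
    return 0 if business.get("is_closed", False) else 1
```

Matching each de-anonymised 2013 business to a current Yelp ID is the hard part; once an ID is in hand, a single details call per business yields the label.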
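The SMOTE idea mentioned above, generating synthetic minority-class (closed-business) samples by interpolating between near neighbours, can be shown in a minimal pure-Python sketch. Real work would use a standard library implementation; the feature values here are made up for illustration.

```python
import random

def smote_oversample(minority, n_new, k=3, rng=None):
    """Naive SMOTE sketch: each synthetic sample interpolates between a
    random minority point and one of its k nearest neighbours."""
    rng = rng or random.Random(0)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        neighbours = sorted((q for q in minority if q is not p),
                            key=lambda q: sq_dist(p, q))[:k]
        q = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(p, q)))
    return synthetic

# Hypothetical 2-feature vectors for the rare "closed" class
closed = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]
extra = smote_oversample(closed, n_new=6)  # 6 synthetic closed samples
```

Because synthetic points are interpolations, they stay inside the convex hull of the minority class rather than duplicating existing rows, which is what distinguishes SMOTE from plain random oversampling.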


Achievements (according to KPIs)

  • A database of 150k contractors based on Yelp API
  • A deck with charts describing the competency model
  • A deck with charts detailing the project-based business model proof of concept


Further development

With more time, we would have

  • Explored data sources other than Yelp
  • Revisited the database in a year to determine which contractors remained open
