Data Science Fellows Projects 2019

Developing document processing algorithms to reduce operational costs and eliminate human error

Project by Elliot

Abstract

Zeitgold has an app that is meant to replace a bookkeeper for small businesses. To process the documents that a human bookkeeper would, Zeitgold scans in documents such as payroll, invoices and receipts and extracts the necessary parts. Some of the extraction work is still done by people. The project covered two tasks: classifying financial documents into their respective categories using the documents' content and metadata, and extracting textual information from unstructured financial documents.

The general objective of this project was to reduce the number of documents and fields that need to be extracted by people, thereby improving the scalability of the Zeitgold app.

 

Challenges

  • Lack of familiarity with the codebase. Early on, I wrote code that duplicated functionality already available in the codebase. I checked in with the project mentor periodically to reduce this, and as the project progressed I reused more existing functions without needing to be pointed to them.
  • Unexpected behaviour of an API. In a couple of cases, the OCR API in use missed characters and numbers that were present on the scanned document. I traced the root cause through debugging and fixed it, after which the algorithms that consume the API's output worked properly.
  • Functions that did not generalize to my use case. Some existing functions were meant to be general but did not cover my use case, so I rewrote them to handle it.

 

Achievements (according to KPIs)

  • Went from no automation to fully automated extraction of information from payroll documents, with close to 100% precision and 100% recall.
  • Improved the automation of certain fields in end-of-day reports, increasing recall by close to 40%.
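
For readers unfamiliar with these KPIs, here is a minimal sketch of how precision and recall could be computed for automated field extraction. The definitions are an assumption and not necessarily those used at Zeitgold: a field counts as attempted when the extractor returns a value instead of deferring to a person, precision is the share of attempted fields that are correct, and recall is the share of all fields that were extracted correctly. The function name and sample values are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def precision_recall(
    predicted: Sequence[Optional[str]],
    expected: Sequence[str],
) -> Tuple[float, float]:
    """Return (precision, recall) for a batch of extracted fields.

    A predicted value of None means the extractor abstained and the
    field would be left to a human (an assumed convention).
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")

    # Fields the extractor attempted to fill in automatically.
    attempted = sum(p is not None for p in predicted)
    # Attempted fields whose value matches the ground truth.
    correct = sum(p is not None and p == e for p, e in zip(predicted, expected))

    precision = correct / attempted if attempted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical sample: one field skipped, one field extracted incorrectly.
predicted = ["1200.00", None, "2019-03-31", "DE123"]
expected = ["1200.00", "450.50", "2019-03-31", "DE124"]
precision, recall = precision_recall(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under these definitions, an extractor that abstains on uncertain fields keeps precision high at the cost of recall, which is why both numbers are reported together.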

 

Further development

Full automation of end-of-day reports. With the knowledge of the report structure gained through working with these reports, a natural next step would be to tackle more of the fields that still need to be extracted.

 
