Data Science Fellows Projects 2019

Developing document processing algorithms to reduce operational costs and eliminate human error

Project by Ziv

Abstract

The project had two aims: classifying financial documents into their respective categories using the documents’ content and metadata, and extracting textual information from unstructured financial documents.

The project included two subprojects:

  1. Extraction of information from invoices

The project goal was to extract the supplier’s VAT registration id from invoices.

This task was previously handled by a paid service, which produced poor results and required costly human effort.

  2. Classification of invoices

The project goal was to assign each invoice to the correct budget using its content.

 

Challenges

Extraction Project

  • The documents contain multiple identifiers that resemble the target id
  • Variety of document and VAT registration id formats
  • The available benchmark contained mistakes and did not cover the whole dataset
  • The data included bounding-box coordinates, which were taken into account during extraction
  • Becoming familiar with the existing codebase and integrating into it
  • Adjusting to work with Django
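
To illustrate the first challenge, here is a minimal sketch of candidate-based extraction. It is not the project's extractor. It assumes the German VAT ID format ("DE" followed by nine digits), and the label pattern, the function names and the `window` parameter are hypothetical choices made for this example. The idea is to collect every string that looks like a VAT id and prefer the one that follows a VAT label most closely.

```python
import re

# Assumption: German VAT ID format, "DE" plus nine digits, optionally spaced in groups of three.
VAT_ID_PATTERN = re.compile(r"\bDE\s?(\d{3})\s?(\d{3})\s?(\d{3})\b")
# Hypothetical label that often precedes a VAT ID on German invoices (e.g. "USt-IdNr.").
LABEL_PATTERN = re.compile(r"USt-?\s?Id", re.IGNORECASE)


def find_vat_candidates(text):
    """Return (normalized_id, start_offset) for every VAT-like match in the text."""
    return [
        ("DE" + "".join(match.groups()), match.start())
        for match in VAT_ID_PATTERN.finditer(text)
    ]


def pick_vat_id(text, window=40):
    """Prefer the candidate that starts closest after a label, within `window` characters.

    Falls back to the first candidate when no label is nearby, and returns None
    when the text contains no candidate at all.
    """
    candidates = find_vat_candidates(text)
    if not candidates:
        return None
    label_ends = [match.end() for match in LABEL_PATTERN.finditer(text)]
    best_id, best_distance = None, None
    for vat_id, start in candidates:
        for end in label_ends:
            distance = start - end
            if 0 <= distance <= window and (best_distance is None or distance < best_distance):
                best_id, best_distance = vat_id, distance
    return best_id if best_id is not None else candidates[0][0]


sample = (
    "Kunden-Nr. DE 987 654 321\n"
    "USt-IdNr.: DE123456789\n"
    "IBAN DE89 3704 0044 0532 0130 00"
)
chosen = pick_vat_id(sample)
```

In this made-up sample the customer number also matches the pattern, which is exactly the ambiguity described above. The label proximity rule is only one possible tie-breaker. Bounding-box coordinates, where available, could serve the same purpose.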

Invoice Classification Project

  • Variety of categories, customers, suppliers and invoice formats
  • Low signal-to-noise ratio – most of the content was uninformative
  • The content was in German, which popular NLP tools support less well than English

Achievements (according to KPIs)

  • VAT registration id extraction
    • The extractor outperformed the benchmark service and is now running in production
      • Precision – The extractor matched the benchmark performance, and was correct in many of the cases where the two disagreed
      • Recall – The extractor was effective for many documents that were not covered by the benchmark service
  • Invoice classification
    • Designed a model for a simplified problem, targeting the categories that cover a large proportion of the problem space
      • Precision: 0.93
      • Recall: 0.8
    • Built a pipeline and infrastructure for future models
    • Collected a set of German-specific NLP tools
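
The post does not say how the precision and recall figures were computed. As a rough sketch, one common definition for a classifier that covers only some categories and may abstain is shown below. The function name, the use of `None` for "abstained" or "outside the covered categories", and the toy labels are all assumptions made for illustration, not project data.

```python
def precision_recall(true_labels, predicted_labels):
    """Micro-averaged precision and recall for a classifier that may abstain.

    A predicted label of None means the model abstained. A true label of None
    means the invoice belongs to a category the simplified model does not cover.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    predicted = sum(1 for p in predicted_labels if p is not None)
    relevant = sum(1 for t in true_labels if t is not None)
    correct = sum(
        1 for t, p in zip(true_labels, predicted_labels) if p is not None and p == t
    )
    precision = correct / predicted if predicted else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall


# Toy labels with hypothetical budget names.
true_labels = ["travel", "travel", "office", "office", None]
predicted_labels = ["travel", None, "office", "travel", None]
precision, recall = precision_recall(true_labels, predicted_labels)
```

Under this definition, abstaining on uncertain invoices raises precision at the cost of recall, which fits a setup where uncovered invoices are still handled by human operators.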

Further development

  • Solving the classification task at a finer granularity of categories, to enable recommendations to human operators or even full automation of the task
  • Creating more advanced models, e.g.:
    • Incorporating more probabilistic properties of the indicative terms
    • Working with word embeddings to allow grouping of similar terms and inclusion of out-of-vocabulary terms
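
One standard way to give indicative terms a probabilistic weight is multinomial naive Bayes with Laplace smoothing. The sketch below shows that technique on toy data. It is not the project's model, and the function names, the `alpha` parameter, the German tokens and the budget names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(documents):
    """Count terms and documents per category from (tokens, category) pairs."""
    term_counts = defaultdict(Counter)
    doc_counts = Counter()
    for tokens, category in documents:
        term_counts[category].update(tokens)
        doc_counts[category] += 1
    return term_counts, doc_counts


def classify(tokens, term_counts, doc_counts, alpha=1.0):
    """Return the category with the highest smoothed log-probability."""
    vocabulary = {term for counts in term_counts.values() for term in counts}
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category, counts in term_counts.items():
        total_terms = sum(counts.values())
        score = math.log(doc_counts[category] / total_docs)  # log prior
        for token in tokens:
            if token not in vocabulary:
                continue  # out-of-vocabulary terms carry no evidence here
            score += math.log(
                (counts[token] + alpha) / (total_terms + alpha * len(vocabulary))
            )
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Toy training set with hypothetical tokens and budgets.
training = [
    (["hotel", "flug"], "travel"),
    (["drucker", "papier"], "office"),
    (["flug", "bahn"], "travel"),
]
term_counts, doc_counts = train(training)
first = classify(["flug", "rechnung"], term_counts, doc_counts)
second = classify(["papier"], term_counts, doc_counts)
```

Note that this sketch simply skips terms it has never seen, such as "rechnung" above. That is the gap word embeddings are meant to close, by mapping an unseen term to similar known ones.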
