Developing document processing algorithms to reduce operational costs and eliminate human error

Project by Ziv

April 14, 2019, 7:47 am, Fellows 2018

Abstract

Classifying financial documents into their respective categories using the documents’ content and metadata. Extracting textual information from unstructured financial documents.

The project included two subprojects:

Extraction of information from invoices

The project goal was to extract the supplier’s VAT registration ID from invoices.

This task was previously handled by a paid service, which achieved poor results and required costly human resources.

Classification of invoices

The project goal was to assign each invoice to the correct budget based on its content.

Challenges

Extraction Project

The documents contain multiple identifiers that resemble the target ID
Variety of document and VAT registration ID formats
The available benchmark contained mistakes and did not cover the whole dataset
The data included bounding-box coordinates, which were taken into account for the extraction
Familiarizing and integrating with the existing codebase
Adjusting to work with Django
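
The first two challenges above (look-alike identifiers and varied formats) can be illustrated with a minimal sketch. It is not the project's extractor. It assumes German-style VAT IDs ("DE" followed by nine digits), a hypothetical `Token` type carrying OCR text with bounding-box coordinates, and a simple heuristic that prefers the candidate closest to a "USt-Id"/"VAT" label. All names and the sample values are illustrative only.

```python
import re
from dataclasses import dataclass
from math import hypot
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical OCR token: text plus the top-left corner of its bounding box."""
    text: str
    x: float
    y: float


# Assumption: the target looks like a German VAT ID, "DE" + 9 digits.
VAT_ID = re.compile(r"DE\d{9}")
# Assumption: a nearby label such as "USt-IdNr." or "VAT" marks the target field.
LABEL = re.compile(r"ust-?id|vat", re.IGNORECASE)


def normalize(text: str) -> str:
    """Remove spaces, dots and dashes so 'DE 123 456 789' becomes 'DE123456789'."""
    return re.sub(r"[\s.\-]", "", text).upper()


def extract_vat_id(tokens: List[Token]) -> Optional[str]:
    """Return the VAT-like candidate closest to a label token, or None."""
    candidates = [t for t in tokens if VAT_ID.fullmatch(normalize(t.text))]
    if not candidates:
        return None
    labels = [t for t in tokens if LABEL.search(t.text)]
    if not labels:
        # No label found: fall back to the first candidate in reading order.
        return normalize(candidates[0].text)

    def distance_to_label(tok: Token) -> float:
        return min(hypot(tok.x - lab.x, tok.y - lab.y) for lab in labels)

    return normalize(min(candidates, key=distance_to_label).text)


# Made-up page: a customer number that resembles an ID, and two VAT-like strings.
sample_tokens = [
    Token("Kundennr.", 10, 20), Token("1234567890", 80, 20),
    Token("USt-IdNr.:", 10, 700), Token("DE 123 456 789", 90, 700),
    Token("DE987654321", 400, 50),
]
print(extract_vat_id(sample_tokens))
```

Proximity to a label is only one possible use of the bounding boxes; distinguishing the supplier's ID from the customer's would need additional signals.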

Invoice Classification Project

Variety of categories, customers, suppliers and invoice formats
Low signal-to-noise ratio – most of the content was uninformative
The content was in German, which is less supported by popular NLP tools relative to English

Achievements (according to KPIs)

VAT registration ID extraction
- The extractor outperformed the benchmark service and is now running in production
    - Precision – The extractor matched the benchmark performance, and was correct in many of the cases where there was a disagreement
    - Recall – The extractor was effective for many documents that were not covered by the benchmark service
Invoice classification
- Designed a model for a simplified problem, targeting categories that cover a large proportion of the problem space
  - Precision: 0.93
  - Recall: 0.8
- Built a pipeline and infrastructure for future models
- Collected a set of German-specific NLP tools
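
For reference, the precision and recall figures above follow the standard definitions: precision is the share of predicted assignments to a category that were correct, and recall is the share of true members of that category that were found. The sketch below computes both for one category; the budget names and labels are invented for illustration and are not project data.

```python
from typing import List, Tuple


def precision_recall(y_true: List[str], y_pred: List[str], positive: str) -> Tuple[float, float]:
    """Precision and recall for a single category, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical budgets and predictions, for illustration only.
y_true = ["travel", "travel", "office", "travel", "office"]
y_pred = ["travel", "office", "office", "travel", "travel"]
precision, recall = precision_recall(y_true, y_pred, positive="travel")
print(f"precision={precision:.2f} recall={recall:.2f}")
```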

Further development

Solving the classification task for a higher granularity of categories in order to allow recommendation to human operators, or even complete automation of the task
Creating more advanced models, e.g.:
- Incorporating more probabilistic properties of the indicative terms
- Working with word embeddings to allow grouping of similar terms and inclusion of out of vocabulary terms
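
The word-embedding idea above can be sketched as follows: an out-of-vocabulary term is mapped to the most similar known indicative term by cosine similarity, provided the similarity clears a threshold. This is an illustration under assumptions, not the project's design. The three-dimensional vectors, the terms, and the 0.8 threshold are made up; real embeddings would come from a pretrained model.

```python
from math import sqrt
from typing import Dict, List, Optional


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_known_term(vector: List[float],
                       known: Dict[str, List[float]],
                       threshold: float = 0.8) -> Optional[str]:
    """Return the known term most similar to `vector`, or None if below threshold."""
    best_term, best_score = None, threshold
    for term, vec in known.items():
        score = cosine(vector, vec)
        if score >= best_score:
            best_term, best_score = term, score
    return best_term


# Toy vectors, invented for illustration only.
known_terms = {"hotel": [0.9, 0.1, 0.0], "drucker": [0.0, 0.2, 0.9]}
unseen_vector = [0.8, 0.3, 0.1]  # e.g. an unseen term such as "pension"
print(nearest_known_term(unseen_vector, known_terms))
```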
