Abstract
This project classified financial documents into their respective categories using the documents’ content and metadata, and extracted textual information from unstructured financial documents.
The project comprised two subprojects:
- Extraction of information from invoices
The goal was to extract the supplier’s VAT registration id from invoices.
This task was previously handled by a paid service, which achieved poor results and required costly human effort.
- Classification of invoices
The goal was to assign each invoice to the correct budget based on its content.
Challenges
Extraction Project
- The documents contain multiple identifiers that resemble the target id
- Variety of document and VAT registration id formats
- The available benchmark contained mistakes and did not cover the whole dataset
- The data included bounding-box coordinates, which had to be taken into account during extraction
- Becoming familiar with the existing codebase and integrating with it
- Adjusting to work with Django
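The first two challenges above (look-alike identifiers and varied id formats) can be illustrated with a minimal sketch. It is not the extractor that was built. It assumes German-format VAT ids ("DE" followed by nine digits), and the function name `extract_supplier_vat_id`, the label keywords, the `customer_vat_id` parameter, and the context window are all hypothetical choices made for illustration.

```python
import re

# Assumption: German-format VAT ids, "DE" followed by nine digits, possibly
# written with spaces or dots between digit groups.
VAT_PATTERN = re.compile(r"\bDE[ .]?(?:\d[ .]?){8}\d\b")
# Hypothetical label keywords that may precede a VAT id on a German invoice.
LABEL_PATTERN = re.compile(r"ust\.?-?\s?id|umsatzsteuer-id|vat", re.IGNORECASE)


def normalize(raw):
    """Strip the separators so differently formatted ids compare equal."""
    return re.sub(r"[ .]", "", raw)


def extract_supplier_vat_id(text, customer_vat_id=None, window=40):
    """Return the best-scoring VAT id candidate, or None if there is none.

    Candidates equal to the (known) customer id are skipped, and candidates
    preceded by a label keyword within `window` characters score higher.
    Ties are resolved in favour of the earliest candidate.
    """
    best, best_score = None, -1
    for match in VAT_PATTERN.finditer(text):
        candidate = normalize(match.group())
        if candidate == customer_vat_id:
            continue  # the customer's own id resembles the target but is not it
        context = text[max(0, match.start() - window):match.start()]
        score = 1 if LABEL_PATTERN.search(context) else 0
        if score > best_score:
            best, best_score = candidate, score
    return best


sample = (
    "Rechnung an Kunde GmbH, USt-IdNr. DE111111111\n"
    "IBAN DE89 3704 0044 0532 0130 00\n"
    "Lieferant AG, USt-IdNr.: DE 123 456 789"
)
print(extract_supplier_vat_id(sample, customer_vat_id="DE111111111"))
```

In this toy text the IBAN is rejected because it has too many digits, and the customer's id is rejected only because it is passed in explicitly. Without that hint both VAT ids score the same, which is the ambiguity the first challenge describes.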
Invoice Classification Project
- Variety of categories, customers, suppliers and invoice formats
- Low signal-to-noise ratio – most of the content was uninformative
- The content was in German, which popular NLP tools support less well than English
Achievements (according to KPIs)
- VAT registration id extraction
- The extractor outperformed the benchmark service and is now running in production
- Precision – The extractor matched the benchmark performance, and was correct in many of the cases where there was a disagreement
- Recall – The extractor was effective for many documents that were not covered by the benchmark service
- Invoice classification
- Designed a model for a simplified problem, targeting the categories that cover a large proportion of the problem space
- Precision: 0.93
- Recall: 0.8
- Built a pipeline and infrastructure for future models
- Collected a set of German-specific NLP tools
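For readers less familiar with the two KPIs, the sketch below computes precision and recall for a single category using the standard definitions. How the project aggregated these figures across categories is not stated here, so this is only an illustration. The function name `precision_recall` and the toy labels are hypothetical and are not project data.

```python
def precision_recall(true_labels, predicted_labels, target):
    """Precision and recall for one target category.

    precision = TP / (TP + FP): how often a prediction of `target` is right.
    recall    = TP / (TP + FN): how many true `target` items were found.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if p == target and t == target)
    fp = sum(1 for t, p in pairs if p == target and t != target)
    fn = sum(1 for t, p in pairs if p != target and t == target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels, invented purely for illustration.
true_labels = ["travel", "travel", "office", "travel", "office"]
predicted = ["travel", "office", "office", "travel", "travel"]
precision, recall = precision_recall(true_labels, predicted, "travel")
print(f"precision={precision:.2f} recall={recall:.2f}")
```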
Further development
- Solving the classification task at a finer granularity of categories, to enable recommendations to human operators or even full automation of the task
- Creating more advanced models, e.g.:
- Incorporating more probabilistic properties of the indicative terms
- Working with word embeddings to allow grouping of similar terms and inclusion of out-of-vocabulary terms
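One possible reading of "probabilistic properties of the indicative terms" is a smoothed log-odds score per term, sketched below. This is an assumption about what such a model could look like, not a description of existing or planned code. The function name `term_log_odds`, the smoothing constant `alpha`, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def term_log_odds(docs_in_category, docs_elsewhere, alpha=1.0):
    """Smoothed log-odds of a term occurring in the category versus elsewhere.

    Each document is a list of tokens. A score near zero marks a term as
    uninformative, a positive score marks it as indicative of the category.
    """
    in_counts = Counter(term for doc in docs_in_category for term in set(doc))
    out_counts = Counter(term for doc in docs_elsewhere for term in set(doc))
    n_in, n_out = len(docs_in_category), len(docs_elsewhere)
    scores = {}
    for term in set(in_counts) | set(out_counts):
        # Additive smoothing keeps unseen terms from producing log(0).
        p_in = (in_counts[term] + alpha) / (n_in + 2 * alpha)
        p_out = (out_counts[term] + alpha) / (n_out + 2 * alpha)
        scores[term] = math.log(p_in / p_out)
    return scores


# Toy tokenized invoices, invented purely for illustration.
travel_docs = [["hotel", "rechnung"], ["flug", "rechnung"]]
other_docs = [["papier", "rechnung"], ["toner", "rechnung"]]
scores = term_log_odds(travel_docs, other_docs)
for term, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term}: {score:+.2f}")
```

In this toy data "rechnung" appears everywhere and scores zero, while "hotel" and "flug" score positive. Scores of this kind could also be computed over groups of similar terms found through word embeddings, rather than over individual tokens.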