Text event clustering for finance – using fine-tuned BERT embeddings and named entity recognition to define similarity measures

Project by Kevin

Abstract 

The purpose of this project was to build similarity measures and custom clustering algorithms to identify similar sentences in financial text. We defined sentence similarity with three approaches: embeddings, named entity recognition and naive rule-based methods. We also built a pipeline that enables the company to generate more labelled data for training the classification and clustering algorithms.

Timeline 

First week: build a Flask application that takes two input sentences and outputs a Word Mover's Distance graph and a heat-map of the distances between the words of the two sentences, using fine-tuned BERT embeddings.
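
To make the heat-map and the distance score concrete, here is a minimal sketch of the computation behind them. It is not the project's code: the 2-d vectors are hypothetical stand-ins for BERT token embeddings, cosine distance is an assumed choice of metric, and `relaxed_wmd` is a simplified one-directional relaxation of Word Mover's Distance with uniform word weights rather than the full optimal-transport formulation.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distance_matrix(vecs_a, vecs_b):
    """Rows index the words of sentence A, columns the words of sentence B.
    This matrix is the data a word-to-word heat-map would display."""
    return [[cosine_distance(u, v) for v in vecs_b] for u in vecs_a]

def relaxed_wmd(matrix):
    """Simplified relaxation: every word of A moves entirely to its
    nearest word in B, and the per-word costs are averaged."""
    return sum(min(row) for row in matrix) / len(matrix)

# Toy vectors (hypothetical values, not real embeddings).
sentence_a = {"profit": [1.0, 0.0], "rose": [0.0, 1.0]}
sentence_b = {"earnings": [1.0, 0.0], "fell": [0.0, -1.0]}

matrix = distance_matrix(list(sentence_a.values()), list(sentence_b.values()))
score = relaxed_wmd(matrix)
```

With these toy vectors, "profit" and "earnings" are at distance 0 while "rose" and "fell" are at the maximum cosine distance of 2, which is the kind of contrast the heat-map is meant to surface.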

Second and third weeks: build similarity scores between two sentences based on custom named entity recognition. We built a global score from the following entity types: DATE, ORG, PERSON, MONEY and FINANCIAL.
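
One way such a global score could be assembled is sketched below. The aggregation rule is an assumption, since the post does not describe it: each entity type gets a Jaccard score over the extracted entity strings, types absent from both sentences are skipped, and the rest are combined with hypothetical uniform weights. The entity dictionaries are hand-written examples standing in for the output of the custom NER model.

```python
ENTITY_LABELS = ["DATE", "ORG", "PERSON", "MONEY", "FINANCIAL"]

def jaccard(a, b):
    """Jaccard similarity of two sets; None when both are empty."""
    if not a and not b:
        return None  # entity type absent from both sentences: no evidence
    return len(a & b) / len(a | b)

def entity_similarity(ents_a, ents_b, weights=None):
    """Weighted average of per-type Jaccard scores (weights are hypothetical)."""
    weights = weights or {label: 1.0 for label in ENTITY_LABELS}
    total, weight_sum = 0.0, 0.0
    for label in ENTITY_LABELS:
        score = jaccard(set(ents_a.get(label, [])), set(ents_b.get(label, [])))
        if score is None:
            continue
        total += weights[label] * score
        weight_sum += weights[label]
    # No shared entity type at all: fall back to zero similarity.
    return total / weight_sum if weight_sum else 0.0

# Hand-written entities, as a NER model might return them.
ents_a = {"ORG": ["Apple"], "MONEY": ["$5bn"], "DATE": ["Q3 2019"]}
ents_b = {"ORG": ["Apple"], "MONEY": ["$3bn"], "DATE": ["Q3 2019"]}

global_score = entity_similarity(ents_a, ents_b)
```

Here ORG and DATE match exactly and MONEY does not, so the global score is 2/3. Raising the weight of one entity type would let it dominate the score.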

Fourth and fifth weeks: build an active learning process that uses the Typeform API to query labels from the user, retrieve the labels in the database, and set rules based on Jaccard similarity or a most-common-words approach to generate the next batch of questions.
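
The Jaccard rule for choosing the next batch of questions might look like the sketch below. It leaves out the Typeform and database steps entirely and only shows the selection logic. The function names, the whitespace tokenisation and the ranking rule (highest overlap with any already-labelled sentence) are illustrative assumptions, and the sentences are made up.

```python
def tokens(sentence):
    """Lower-cased whitespace tokens as a set (a deliberately naive tokeniser)."""
    return set(sentence.lower().split())

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def next_batch(labelled, pool, k=2):
    """Return the k unlabelled sentences most similar to any labelled one.
    Assumes `labelled` is non-empty."""
    scored = []
    for candidate in pool:
        best = max(jaccard(tokens(candidate), tokens(s)) for s in labelled)
        scored.append((best, candidate))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:k]]

labelled = ["Company X acquires company Y"]
pool = [
    "Company Z acquires company W",
    "The central bank raised rates",
    "Company X reports quarterly earnings",
]

batch = next_batch(labelled, pool, k=2)
```

In this example the two sentences that share words with the labelled acquisition sentence are selected, and the unrelated central-bank sentence is left out.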

Challenges & Achievements 

Automated process for sentence labelling: we got a pipeline running reliably over several iterations, each generating a new batch of similar questions.

Feature engineering: we created custom similarity measures based on embeddings, entities and naive rules such as term frequency in a sentence, clustering with named entities, Jaccard similarity or cosine similarity.

Clustering: text event clustering on financial sentences using BERT embeddings and classical clustering methods.
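
As an illustration of what a classical method over sentence embeddings involves, here is a small k-means (Lloyd's algorithm) sketch. The post does not say which clustering algorithms were used, so k-means is an assumed example, and the 2-d points are hypothetical stand-ins for sentence-level BERT embeddings.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm with caller-supplied initial centroids.
    Returns (cluster index per point, final centroids)."""
    centroids = [list(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

# Toy "sentence embeddings" forming two obvious groups (hypothetical values).
embeddings = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
initial = [embeddings[0], embeddings[2]]

labels, centers = kmeans(embeddings, initial)
```

Real embeddings have hundreds of dimensions and no obvious initial centroids, so in practice a library implementation with a proper initialisation scheme would be the more sensible choice.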
