The purpose of this project was to build similarity measures and custom clustering algorithms to identify similar sentences in financial text. We defined sentence similarity using embeddings, named entity recognition, and naive methods, and we built a pipeline that enables the company to generate more labelled data for training the classification and clustering algorithms.
First week: built a Flask application that takes two input sentences and outputs a Word Mover's Distance graph and a heat map of the distances between the words of the two sentences, using fine-tuned BERT embeddings.
Second and third weeks: built similarity scores between two sentences based on custom named entity recognition. We built a global score that combines the following entity types: DATE, ORG, PERSON, MONEY, and FINANCIAL.
Fourth and fifth weeks: built an active learning process that uses the Typeform API to query labels from the user, retrieves the labels in the database, and applies rules based on Jaccard similarity or a most-common-words approach to generate the next batch of poll questions.
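To make the entity-based score and the Jaccard rule more concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the project's actual code: the function names (`jaccard`, `token_jaccard`, `entity_score`), the unweighted average across entity types, and the whitespace tokenization are all hypothetical choices, and the entity sets are written by hand rather than produced by the custom NER model.

```python
from typing import Dict, Set

# Entity types named in the project; everything else below is an assumption.
ENTITY_LABELS = ("DATE", "ORG", "PERSON", "MONEY", "FINANCIAL")


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|, taken as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def token_jaccard(s1: str, s2: str) -> float:
    """Jaccard similarity on lowercased, whitespace-split tokens (a naive tokenizer)."""
    return jaccard(set(s1.lower().split()), set(s2.lower().split()))


def entity_score(e1: Dict[str, Set[str]], e2: Dict[str, Set[str]]) -> float:
    """Unweighted mean of per-label Jaccard scores.

    Labels that appear in neither sentence are skipped so they do not
    pull the score toward zero. The equal weighting is an assumption.
    """
    scores = []
    for label in ENTITY_LABELS:
        a, b = e1.get(label, set()), e2.get(label, set())
        if a or b:
            scores.append(jaccard(a, b))
    return sum(scores) / len(scores) if scores else 0.0


# Hand-written entities standing in for the output of an NER model.
ents_a = {"ORG": {"Acme Corp"}, "MONEY": {"$2 billion"}, "DATE": {"2019"}}
ents_b = {"ORG": {"Acme Corp"}, "MONEY": {"$3 billion"}, "DATE": {"2019"}}

score = entity_score(ents_a, ents_b)
overlap = token_jaccard(
    "Acme Corp raised $2 billion in 2019",
    "Acme Corp raised $3 billion in 2019",
)
print(round(score, 3), overlap)
```

In this toy pair the ORG and DATE entities match and the MONEY entities differ, so the entity score lands below the token-level overlap. A rule for choosing the next questions could, for instance, rank candidate sentence pairs by either score, though the exact rule used in the project is not specified here.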
Challenges & Achievements
– Automated process for sentence labelling: we got a working pipeline that ran for several iterations and generated new batches of similar questions.
– Feature engineering: we created custom similarity measures based on embeddings, entities, and naive rules such as term frequency in a sentence, clustering with named entities, Jaccard similarity, and cosine similarity.
– Clustering: text event clustering on financial sentences using BERT embeddings and classical clustering methods.
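As a rough illustration of how cosine similarity on sentence embeddings can drive clustering, the sketch below groups vectors with a greedy similarity threshold. This is a stand-in chosen for brevity, not one of the clustering methods used in the project (which are not named here). The vectors are toy 3-dimensional values rather than BERT embeddings, and `threshold_clusters` and its `threshold` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def threshold_clusters(
    vectors: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[List[int]]:
    """Greedy single-pass clustering (illustrative only).

    Each vector joins the first cluster whose seed (first member) has
    cosine similarity >= threshold; otherwise it starts a new cluster.
    Returns clusters as lists of indices into `vectors`.
    """
    clusters: List[List[int]] = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy vectors standing in for sentence embeddings.
toy = [
    [1.0, 0.0, 0.1],
    [0.9, 0.1, 0.1],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
]
groups = threshold_clusters(toy, threshold=0.9)
print(groups)  # [[0, 1], [2, 3]]
```

The greedy pass depends on input order and on the threshold, which is why a classical method such as k-means or hierarchical clustering is usually preferable on real embeddings.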