Project by Aviv Kadair, Data Science Fellow June 2020
Extracting a relationship between two entities: My project focused on “subsidiary”/“owned by” relationships, and it delivers the names of the entities that hold the relationship. The project used distant supervision as part of the data collection, followed by a bag-of-words model to coarsely select acquisition-related paragraphs. As a final stage, I applied NER and a BERT model to the selected paragraphs to extract the full relationship, including the corresponding entities.
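The coarse bag-of-words stage can be pictured with a minimal sketch. This is not the project's actual model: the cue words, their weights, the `min_score` cutoff, and the company names are all hypothetical, and a trained classifier would learn its weights from labeled paragraphs rather than use a hand-written table.

```python
import re
from collections import Counter

# Hypothetical cue words and weights, for illustration only.
# A trained bag-of-words model would learn these from labeled data.
CUE_WEIGHTS = {
    "acquired": 2.0,
    "acquisition": 2.0,
    "acquires": 2.0,
    "subsidiary": 1.5,
    "merger": 1.0,
    "bought": 1.0,
}


def bag_of_words(text):
    """Lowercase the text and count word occurrences, ignoring word order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def acquisition_score(text):
    """Sum the weights of the cue words found, counting repeats."""
    counts = bag_of_words(text)
    return sum(weight * counts[word] for word, weight in CUE_WEIGHTS.items())


def is_acquisition_paragraph(text, min_score=2.0):
    """Coarse filter: keep paragraphs whose score reaches min_score (assumed value)."""
    return acquisition_score(text) >= min_score


# Made-up example paragraphs with fictional company names.
paragraphs = [
    "AcmeCorp acquired Widgetly last spring.",
    "The company reported strong quarterly earnings.",
]
kept = [p for p in paragraphs if is_acquisition_paragraph(p)]
```

Only the paragraphs in `kept` would move on to the more expensive NER and BERT stage.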
Challenges (at least two)
- Lack of data: As there is no freely available dataset, I scraped the web for text mentioning tech acquisitions and mergers (positive examples) and used distant supervision to expand the positive set and create the negative set.
- Dataset imbalance: I adjusted the class weights during training to compensate for the imbalanced dataset.
- Extracting only entity-specific relationships: I applied an NER process to each tagged paragraph, to ensure that only paragraphs mentioning specific entities would be selected and not those talking broadly about acquisitions.
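The first two challenges can be illustrated with a small sketch. It assumes a simple form of distant supervision (a paragraph is labeled positive when it mentions both members of a known parent/subsidiary pair) and the common "balanced" class-weight heuristic, n_samples / (n_classes * class_count). The seed pairs, the paragraphs, and both function names are hypothetical; the project's actual labeling rules and weights are not given here.

```python
from collections import Counter

# Hypothetical seed pairs (parent, subsidiary) with fictional names.
KNOWN_PAIRS = [("AcmeCorp", "Widgetly"), ("Globex", "Initech")]


def distant_label(paragraph, pairs=KNOWN_PAIRS):
    """Return 1 if both entities of any known pair appear in the paragraph, else 0."""
    text = paragraph.lower()
    return int(any(a.lower() in text and b.lower() in text for a, b in pairs))


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up paragraphs for illustration.
paragraphs = [
    "AcmeCorp completed its purchase of Widgetly last week.",
    "Globex opened a new office in Berlin.",
    "Initech released a software update.",
    "Analysts expect more deals in the sector next year.",
]
labels = [distant_label(p) for p in paragraphs]
weights = balanced_class_weights(labels)
```

With one positive and three negatives, the rare positive class receives the larger weight, so its errors count more during training.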
Achievements (according to KPIs)
- A bag-of-words model that identifies acquisitions mentioned in free text
- A BERT model that outputs the names of the entities and the subsidiary relationship between them, if one exists
- Precision of 0.9 and recall of 0.62
Future project development
- Introducing coreference resolution (entity linking) to extract relationships whose entities are further apart in a paragraph.
- Improving recall by increasing the number of available samples and by testing different confidence thresholds (currently set to 0.85)
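To show why the confidence cutoff matters for recall, here is a minimal sketch of a threshold sweep. The scores and gold labels are invented for illustration and are not the project's data; only the 0.85 value comes from the text above. The `precision_recall` helper is likewise hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Compute precision and recall when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y == 1 for p, y in zip(predicted, labels))
    fp = sum(p and y == 0 for p, y in zip(predicted, labels))
    fn = sum((not p) and y == 1 for p, y in zip(predicted, labels))
    # By convention here, report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical model confidences and gold labels (1 = relationship present).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40]
labels = [1, 1, 1, 0, 1, 0]

# Compare the current cutoff (0.85) with two lower, assumed alternatives.
sweep = {t: precision_recall(scores, labels, t) for t in (0.85, 0.75, 0.55)}
for threshold, (precision, recall) in sweep.items():
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, lowering the cutoff recovers more true relationships (higher recall) until false positives start to reduce precision, which is the trade-off a real sweep on held-out data would quantify.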