Relationship Extraction & Record Linkage: Finding Relations between Companies, Pipl

Project by Aviv Kadair, Data Science Fellow, June 2020


Extracting a relationship between two entities: My project focused on “subsidiary”/“owned by” relationships, and delivers the names of the entities that hold the relationship. The project used distant supervision as part of the data collection, followed by a bag-of-words model to coarsely select acquisition-related paragraphs. As a final stage, I applied NER and a BERT model to the chosen paragraphs to extract the full relationship, including the corresponding entities.
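
To illustrate the distant supervision step, here is a minimal sketch and not the project's actual code. It assumes a small knowledge base of known (parent, subsidiary) pairs and labels a paragraph as positive when both names of a pair appear in it. The company names, `KNOWN_PAIRS`, `mentions` and `distant_label` are all hypothetical and exist only for this example.

```python
import re

# Hypothetical knowledge base of known (parent, subsidiary) pairs.
# The company names are made up for illustration.
KNOWN_PAIRS = {("Acme Corp", "Widget Labs"), ("Globex", "Initech")}


def mentions(name, text):
    """Return True if `name` occurs in `text` as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(name) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def distant_label(paragraph, known_pairs=KNOWN_PAIRS):
    """Return 1 if both members of any known pair co-occur in the paragraph, else 0."""
    for parent, subsidiary in known_pairs:
        if mentions(parent, paragraph) and mentions(subsidiary, paragraph):
            return 1
    return 0


paragraphs = [
    "Acme Corp announced that it has acquired Widget Labs.",
    "Initech released a new version of its accounting software.",
    "The tech sector saw many acquisitions last year.",
]
labels = [distant_label(p) for p in paragraphs]
print(labels)
```

Only the first paragraph mentions both members of a known pair, so it is the only one labeled positive. Labels produced this way are noisy, since co-occurrence does not guarantee that the sentence actually states the relationship.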

Challenges

  1. Lack of data: As there is no freely available dataset, I scraped the web for text mentioning tech acquisitions and mergers (positive examples) and used distant supervision to expand the positive set and create the negative set.
  2. Dataset imbalance: I adjusted the class weights during training to compensate for the imbalanced dataset.
  3. Extracting only entity-specific relationships: I applied an NER step to each tagged paragraph to ensure that only paragraphs mentioning specific entities were selected, and not those discussing acquisitions in general.
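
The class-weight adjustment in challenge 2 can be sketched as follows. The post does not say which weighting scheme was used, so this example assumes the common "balanced" heuristic, where each class is weighted by n_samples / (n_classes * class_count). The function name and the 90/10 label split are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive larger weights, so their errors count more in the loss.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced dataset: 90 negative and 10 positive paragraphs.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With this split the positive class gets a weight of 5.0 and the negative class a weight of about 0.56, so a misclassified positive example costs roughly nine times as much as a misclassified negative one.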


Achievements (according to KPIs)

  1. A bag-of-words model that identifies acquisitions mentioned in free text
  2. A BERT model that outputs the names of the entities and the subsidiary relationship between them, if one exists
  3. Precision of 0.9 and recall of 0.62
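
For readers who prefer a single summary number, the reported precision and recall can be combined into an F1 score (their harmonic mean). The F1 value is not reported in the project itself; it is derived here purely from the two figures above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the achievements list.
f1 = f1_score(0.9, 0.62)
print(round(f1, 3))  # 0.734
```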

Future project development 

  1. Introducing coreference resolution to link different mentions of the same entity, so that relationships whose entities are further apart in a paragraph can be extracted.
  2. Improving recall: by increasing the number of available samples, and by testing different confidence thresholds (currently set to 0.85)
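
The effect of the confidence threshold on recall can be explored with a simple sweep. The sketch below uses made-up scores and labels, not the project's data, and the helper `precision_recall_at` is hypothetical. It shows the usual trade-off: lowering the threshold tends to raise recall at the cost of precision.

```python
def precision_recall_at(threshold, scores, labels):
    """Compute (precision, recall) when predicting positive for score >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    # Convention: precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up model confidences and gold labels, for illustration only.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40]
labels = [1, 1, 1, 0, 1, 0]

for threshold in (0.85, 0.75, 0.55):
    precision, recall = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, recall rises from 0.50 at a threshold of 0.85 to 1.00 at 0.55, while precision drops from 1.00 to 0.80. Running the same kind of sweep on a held-out validation set would show which threshold gives an acceptable balance.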

