Using NLP methods to extract structured and unstructured data from scientific articles

Project by Felipe


The project consisted of extracting information from medical articles. My part was to extract gene variants (mutations) and to develop a text classifier that identifies which articles include a functional study (one that validates the gene's behavior in animal, yeast or petri dish models).
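As a rough illustration of what variant extraction can look like, here is a minimal rule-based sketch. It is not the extractor built in the project: the two regular expressions and the `extract_variants` helper are assumptions made for this example, and they cover only two common HGVS-style notations, while real articles write variants in many more ways.

```python
import re

# Hypothetical patterns for two common HGVS-style notations.
# Illustrative only: real articles use many more forms.
VARIANT_PATTERNS = [
    # cDNA substitution, e.g. "c.76A>T"
    re.compile(r"\bc\.\d+[ACGT]>[ACGT]\b"),
    # Protein change with three-letter amino acid codes, e.g. "p.Arg117His"
    re.compile(r"\bp\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2}\b"),
]


def extract_variants(text):
    """Return variant mentions in order of appearance, without duplicates."""
    found = []
    for pattern in VARIANT_PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group()))
    unique = []
    for _, mention in sorted(found):
        if mention not in unique:
            unique.append(mention)
    return unique


sample = "The c.350G>A variant (p.Arg117His) was tested in a yeast model."
print(extract_variants(sample))
```

A rule-based baseline like this is easy to inspect and needs no labeled data, which makes it a useful reference point before trying learned models.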


  • Lack of labeled data. To overcome this I did the following:

    ○ Asked geneticists from the team to help label the data.

    ○ Used simpler tools to make labeling easier and to better understand the problem.

    ○ Used clustering techniques.

  • Working in a field unknown to me. To overcome this we had a few genetics “classes” and could ask the professionals at any time.
  • Need to benchmark already existing tools. To overcome this I compared them with the one I created.
  • Understand other bottlenecks. To overcome this I looked into some competitors' tools, scraped information and compared them to ours.
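
The benchmarking step can be sketched as scoring each tool's output against expert-labeled variants. The snippet below is only an assumption about how such a comparison might be set up: the `precision_recall_f1` helper, the tool names and the toy data are all invented for illustration and are not results from the project.

```python
def precision_recall_f1(predicted, gold):
    """Score a set of extracted variants against expert-labeled ones."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data, invented for this example only.
gold = {"c.76A>T", "c.350G>A", "p.Arg117His"}
tool_outputs = {
    "existing_tool": {"c.76A>T"},
    "new_extractor": {"c.76A>T", "c.350G>A", "c.1A>G"},
}

for name, predicted in tool_outputs.items():
    p, r, f1 = precision_recall_f1(predicted, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Reporting precision and recall separately shows whether a tool misses variants or invents them, which a single accuracy number would hide.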

Achievements (according to KPIs)

  • Improved the variant (mutation) extractor by approximately 20%
  • Developed an algorithm that successfully identifies functional studies, which will be implemented in the platform

Further development

Beyond extracting information from text, it is important to improve the extraction of meaning and relations between entities (e.g. genes and mutations).
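
One very naive starting point for relation extraction is sentence-level co-occurrence: pair a gene with a variant whenever both appear in the same sentence. The sketch below is an assumption for illustration, not something built in the project. The `GENES` lexicon, the loose `VARIANT` pattern and the `cooccurring_pairs` helper are hypothetical, and co-occurrence will produce false pairs whenever a sentence mentions several genes.

```python
import re

# Hypothetical gene lexicon; a real system would use a curated resource.
GENES = {"BRCA1", "CFTR"}
# Loose pattern for mentions such as "c.350G>A" or "p.Arg117His".
VARIANT = re.compile(r"\b[cp]\.[A-Za-z0-9>]+")


def cooccurring_pairs(text):
    """Pair each gene with each variant found in the same sentence."""
    pairs = []
    # Split after sentence-ending punctuation that is followed by whitespace,
    # so the dot inside "c.350G>A" does not break a sentence.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        genes = [g for g in sorted(GENES)
                 if re.search(rf"\b{re.escape(g)}\b", sentence)]
        variants = VARIANT.findall(sentence)
        pairs.extend((gene, variant) for gene in genes for variant in variants)
    return pairs


text = "The variant c.350G>A in CFTR was studied. BRCA1 was only mentioned in passing."
print(cooccurring_pairs(text))
```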

Also, it would be very valuable to organize a community of geneticists to label data, so that we can train ML models more easily and get better results.
