The project consisted of extracting information from medical articles. My assignment was to extract gene variants (mutations) and to develop a text classifier that identifies which articles include a functional study (one that validates the gene's behavior in animal, yeast or petri dish models).
- Lack of labeled data. To overcome this I did the following:
○ Asked a geneticist from the team to help label data.
○ Used simpler tools to make labeling easier and to better understand the problem.
○ Used clustering techniques.
- Working in a field unknown to me. To overcome this we had a few genetics “classes” and could ask the geneticists questions at any time.
- Need to benchmark existing tools. To overcome this I compared them with the tool I created.
- Need to understand other bottlenecks. To overcome this I looked into some competitors' tools, scraped information from them and compared them to ours.
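The benchmarking step mentioned above can be illustrated with a small sketch. This is not the comparison used in the project; it assumes each tool's output and the geneticist-labeled reference are available as sets of variant strings, and it scores them with precision, recall and F1. The function name `score_extractor` and all variant strings are hypothetical.

```python
from typing import Dict, Iterable


def score_extractor(predicted: Iterable[str], gold: Iterable[str]) -> Dict[str, float]:
    """Compare extracted variants against a labeled reference set."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical data: labels from a geneticist and the output of two tools.
gold = {"c.76A>T", "p.Arg117His", "c.1521_1523del"}
existing_tool = {"c.76A>T", "p.Arg117His", "p.Gly12Asp"}
new_tool = {"c.76A>T", "p.Arg117His", "c.1521_1523del"}

for name, output in [("existing", existing_tool), ("new", new_tool)]:
    print(name, score_extractor(output, gold))
```

Scoring every tool against the same labeled set keeps the comparison fair, although the result is only as reliable as the labels themselves.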
Achievements (according to KPIs)
- Improved the variant (mutation) extractor by approximately 20%
- Developed an algorithm that successfully identifies functional studies, which will be implemented in the platform
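To give an idea of what variant extraction involves, here is a minimal sketch based on regular expressions. It is not the extractor developed in the project; it assumes variants are written in a simplified HGVS-like style and covers only two forms (coding DNA substitutions and three-letter protein changes). The names `VARIANT_PATTERN` and `extract_variants` and the sample sentence are hypothetical.

```python
import re
from typing import List

# Simplified patterns; real variant nomenclature covers many more forms
# (deletions, insertions, duplications, one-letter amino acid codes, ...).
VARIANT_PATTERN = re.compile(
    r"""
    \b(?:
        c\.\d+[ACGT]>[ACGT]                        # coding DNA substitution
      | p\.[A-Z][a-z]{2}\d+(?:[A-Z][a-z]{2}|\*)    # protein change, three-letter codes
    )
    """,
    re.VERBOSE,
)


def extract_variants(text: str) -> List[str]:
    """Return all variant mentions found in the text, in order of appearance."""
    return VARIANT_PATTERN.findall(text)


# Hypothetical sentence, for illustration only.
sentence = "The variants c.76A>T, p.Arg117His and p.Gln39* were reported."
print(extract_variants(sentence))
```

A pattern-based approach like this is easy to inspect and extend, but it misses any notation it was not written for, which is one reason labeled data matters for measuring improvements.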
Beyond extracting information from text, it is important to improve the extraction of meaning and relations between entities (e.g. genes and mutations).
It would also be very valuable to organize a community of geneticists to label data, so that we can train ML models more easily and get better results.