Abstract
The purpose of the project is to build a visualization tool for the data science team. The dashboard gives users access to several kinds of information: the databases used for the different models, the results, the details of each sample, and everything else related to the data used in the algorithm implementation. The project also provides a service for in-depth analysis of false positives (FP): the server displays sample information (id, features used for the training phase, sample source …) so that the FP can be clustered, in order to improve predictions and sample analysis.
Challenges
- Understanding the basics. The first challenge was to understand the data we are using in order to get something useful out of it
- Imagining what could be useful for the users: graphs, boards …
- Developing front-end skills. Because I built a dashboard, I had to create a server that everyone in the data science team could use; back-end skills alone were not enough for that
- Adding some intelligence to the tool in order to make it more useful: clustering false positives
- Improving coding skills. I learned to write ‘beautiful code’ by adding unit tests and documentation, in order to get code ready for production
Achievements
- Built a simple dashboard that is already used by the team
- Made it possible to improve model predictions by analyzing samples one by one through the new server
- Detected some anomalies thanks to this server
Further development
The next step is to develop the clustering part, which could be very useful for sample verification. CHECKPOINT tries to reduce the number of false positives (samples classified as malicious although they are not). The clustering part of the project will produce different groups of false positives, which will allow us to determine why these samples were misclassified and then try to improve our models.
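
The clustering step described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the `Sample` record, its field names, and the choice of plain k-means are hypothetical, since the project does not specify which algorithm or data layout will be used. The sketch selects the false positives (predicted malicious, actually benign) and groups them by feature similarity.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Sample:
    # Hypothetical record layout; the real sample schema is not given in this document.
    sample_id: str
    features: Sequence[float]
    is_malicious: bool         # ground-truth label
    predicted_malicious: bool  # model prediction


def false_positives(samples: List[Sample]) -> List[Sample]:
    """Keep samples classified as malicious although they are not."""
    return [s for s in samples if s.predicted_malicious and not s.is_malicious]


def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Sequence[float]], k: int, iterations: int = 20) -> List[int]:
    """Plain k-means; centroids start at the first k points, so runs are deterministic."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: _sq_dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_false_positives(samples: List[Sample], k: int) -> Dict[int, List[str]]:
    """Return a mapping from cluster index to the ids of the false positives in it."""
    fps = false_positives(samples)
    if not fps:
        return {}
    k = min(k, len(fps))
    labels = kmeans([s.features for s in fps], k)
    groups: Dict[int, List[str]] = {}
    for sample, label in zip(fps, labels):
        groups.setdefault(label, []).append(sample.sample_id)
    return groups
```

Each resulting group can then be inspected in the dashboard to look for a shared reason for the misclassification. In practice the number of clusters and the feature scaling would need to be tuned on real data.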