Data Science Fellows Projects 2019

NLP-based mobile app classification

Project by Shay

Abstract

Currently, Cellebrite’s technology can read and analyze most of the content of a locked phone. However, the list of apps installed on the phone is just a list of app IDs. The goal of this project is to find apps from several categories that might be of interest to the researcher, such as apps for hiding photos or faking phone numbers.

The apps are found based on their descriptions in the app store (the descriptions of all iTunes and Play Store apps were supplied to us) and a few known examples of each category. We lemmatized and cleaned the descriptions, vectorized them with tf-idf, and applied several supervised and unsupervised models to extract different results. We ran the models over a few iterations, adding the strong predictions and the manually tagged predictions to our dataset.
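As a rough illustration of the cleaning and tf-idf steps, here is a minimal, dependency-free sketch. It is not the project's actual code: the function names, the `min_df` threshold, the sample descriptions, and the seller name `AcmeSoft` are all assumptions made for the example. In practice a library such as scikit-learn's `TfidfVectorizer` would likely do this work.

```python
import math
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TOKEN_RE = re.compile(r"[a-z]+")


def clean(description, seller_name=""):
    """Lowercase, drop URLs and the seller name, and return word tokens."""
    text = URL_RE.sub(" ", description.lower())
    if seller_name:
        text = text.replace(seller_name.lower(), " ")
    return TOKEN_RE.findall(text)


def tfidf(docs, min_df=2):
    """Return one {term: weight} dict per tokenized document.

    Terms that appear in fewer than `min_df` documents are dropped,
    which is one possible way to remove overly rare terms.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(t for t in doc if doc_freq[t] >= min_df)
        total = sum(counts.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical descriptions, invented for this example.
descriptions = [
    "Hide photos in a secret vault. Visit http://example.com by AcmeSoft",
    "Secret vault to hide private photos and videos",
    "Photos editor with filters and stickers",
    "Free puzzle game with daily levels and photos of cats",
]
tokens = [clean(d, seller_name="AcmeSoft") for d in descriptions]
vectors = tfidf(tokens, min_df=2)
```

In this toy corpus the word "photos" appears in every description, so its weight is zero, while "hide" keeps a positive weight because only half of the descriptions contain it.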

Challenges

  • “Needle in a haystack” – we had to find a few thousand apps among a few million irrelevant ones. We countered this by using tf-idf, which highlights rare keywords. To avoid overfitting, we removed overly rare terms from the descriptions, as well as things like URLs and the seller name.
  • A small and biased examples dataset – we started with just 170 examples of “malicious” apps, belonging to 10 different categories. In addition, the dataset was biased because all the apps came from the Play Store and were found using only 1 or 2 search terms. We countered this by tagging more examples from the Play Store and iTunes, using different search terms.
  • No real test set – all of our apps were found using similar methods. Even though we used different search terms to find them, the search was still mediated by the search algorithm of the app store, whose role is to figure out which apps we are interested in. Together with the small dataset of tagged apps, this meant we could not have a real test set. At most, we could have held out some sub-category (based on the store or the search term used to find the example), but this would have meant introducing bias into our model on purpose. Basically, any other train/test split had too much data leakage.
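To make the held-out sub-category idea concrete, here is a small sketch of a split that holds out every example found with one search term at a time. It only illustrates what such a split would look like; the field names (`app_id`, `store`, `search_term`) and the sample records are hypothetical, and nothing here suggests the project adopted this split. scikit-learn's `LeaveOneGroupOut` offers the same behavior for array data.

```python
from collections import defaultdict


def leave_one_group_out(examples, group_key):
    """Yield (held_out_group, train, test), holding out one whole group each time."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex[group_key]].append(ex)
    for name in sorted(groups):
        test = groups[name]
        train = [
            ex
            for other, items in groups.items()
            if other != name
            for ex in items
        ]
        yield name, train, test


# Hypothetical tagged examples, invented for this sketch.
examples = [
    {"app_id": "a1", "store": "play", "search_term": "hide photos"},
    {"app_id": "a2", "store": "itunes", "search_term": "hide photos"},
    {"app_id": "a3", "store": "play", "search_term": "fake number"},
    {"app_id": "a4", "store": "itunes", "search_term": "fake number"},
    {"app_id": "a5", "store": "play", "search_term": "secret vault"},
]

splits = list(leave_one_group_out(examples, "search_term"))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Passing `"store"` instead of `"search_term"` as the group key would hold out one store at a time. Either way, the held-out group is missing from training entirely, which is the deliberate bias described above.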

 

Achievements

  • Quick POC – we managed to prove in the first week that our method could differentiate between the categories and find new examples (based on visualizing the known examples with t-SNE and, of course, on examples of new apps we found).
  • Researching two strategies, supervised and unsupervised, as well as advanced embedding techniques such as fastText and doc2vec. Overall, we managed to check a significant number of methods and get a good intuition about the project.
  • Building a prediction pipeline that takes raw data and outputs the predictions.
  • Presenting the project to Cellebrite’s CEO for an hour and a half, and getting good feedback.

 

Further development

I think the next step should be working at a finer resolution. Instead of predicting the category of a full description, we can split it into paragraphs or sentences and predict each one. The logic behind this proposal is that the signal we are looking for is rarely longer than a single sentence, while the description as a whole might be full of irrelevant text. This would also enable the use of advanced NLP methods such as word2vec and fastText, since there would be much less noise within the positive samples (i.e. the sentences taken from the description that actually say the app does what we are looking for).
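A minimal sketch of this sentence-level idea, under stated assumptions: the regex-based sentence splitter is deliberately naive, and `keyword_score` is only a hypothetical stand-in for whatever trained model would score each sentence. The sample description and keywords are invented.

```python
import re

# Naive splitter: break after ., ! or ? followed by whitespace, or on newlines.
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+|\n+")


def split_sentences(description):
    """Split a description into non-empty, stripped sentences."""
    return [p.strip() for p in SENTENCE_RE.split(description) if p and p.strip()]


def keyword_score(sentence, keywords):
    """Stand-in scorer: fraction of the keywords present in the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return len(words & keywords) / len(keywords)


def best_sentence(description, score):
    """Return (sentence, score) for the highest-scoring sentence."""
    sentences = split_sentences(description)
    if not sentences:
        return None, 0.0
    scored = [(score(s), s) for s in sentences]
    top_score, top_sentence = max(scored, key=lambda pair: pair[0])
    return top_sentence, top_score


# Hypothetical description and keyword set, invented for this example.
description = (
    "Beautiful themes and fast performance. "
    "Hide your private photos behind a calculator. "
    "Rate us five stars!"
)
keywords = {"hide", "photos", "private"}

sentence, score = best_sentence(description, lambda s: keyword_score(s, keywords))
print(sentence, score)
```

Taking the maximum over sentences means a single on-topic sentence can flag the app, however much unrelated text surrounds it.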

 
