NLP based mobile apps classification

Project by Shay

April 14, 2019, 3:59 pm, Fellows 2018

Abstract

Currently, Cellebrite’s technology is able to read most of the content of a locked phone and analyze it. However, the list of apps installed on the phone is just a list of app IDs. The goal of this project is to find apps from several categories that might be of interest to the researcher, such as apps for hiding photos or faking phone numbers.

The apps are found based on their descriptions in the app store (the descriptions of all iTunes and Play Store apps were supplied to us) and a few known examples of each category. We lemmatized and cleaned the descriptions, vectorized them with tf-idf, and applied several supervised and unsupervised models to extract different results. We ran the models in a few iterations, adding the strong predictions and the manually tagged predictions to our dataset.
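The cleaning and tf-idf step described above can be sketched roughly as follows. This is a minimal, standard-library illustration and not the project's actual code: the function names, the `min_df` parameter, and the toy descriptions are all assumptions, and lemmatization is left out for brevity.

```python
import math
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(description):
    """Lowercase, drop URLs, and keep alphabetic tokens only."""
    text = URL_RE.sub(" ", description.lower())
    return TOKEN_RE.findall(text)


def tfidf(descriptions, min_df=1):
    """Return one {term: weight} dict per description.

    Terms that appear in fewer than `min_df` descriptions are dropped
    (an assumed stand-in for removing overly rare terms).
    """
    docs = [tokenize(d) for d in descriptions]
    # Document frequency: in how many descriptions each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        counts = Counter(t for t in doc if df[t] >= min_df)
        total = sum(counts.values())
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical toy descriptions, not taken from the project's data.
descriptions = [
    "Hide your private photos in a secret vault. Visit https://example.com",
    "Edit photos and share your photos with friends",
    "Share recipes with friends",
]
vectors = tfidf(descriptions)
```

In this toy corpus a term that occurs in a single description, such as "vault", gets a higher weight than a term shared by several descriptions, such as "photos", which is the property that makes tf-idf useful for highlighting rare keywords.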

Challenges

“Needle in a haystack” – we had to find a few thousand apps among a few million irrelevant ones. We countered this by using tf-idf, which highlights rare keywords. To avoid overfitting, we removed overly rare terms from the descriptions, as well as things like URLs and the seller name.
A small and biased dataset of examples – we started with just 170 examples of “malicious” apps, belonging to 10 different categories. In addition, the dataset was biased because all the apps came from the Play Store and were found using only 1 or 2 search terms. We countered this by tagging more examples from both the Play Store and iTunes, using different search terms.
No real test set – all of our apps were found using similar methods. Even though we used different search terms to find them, the search was still mediated by the app store’s search algorithm, whose role is to figure out which apps we are interested in. Together with the small dataset of tagged apps, this meant we did not have a real test set. At most we could have held out some sub-category (based on the store or the search term used to find the example), but this would have meant introducing bias into our model on purpose. Basically, any other train/test split had too much data leakage.
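To make the "hold out a sub-category" idea from the last challenge concrete, here is a minimal sketch of a leave-one-group-out split, where a group is the search term (or the store) that surfaced each example. The field names and the example records are hypothetical, and this is not necessarily how the project evaluated its models.

```python
from collections import defaultdict


def leave_one_group_out(examples, group_key):
    """Yield (held_out_group, train, test), holding each group out once."""
    groups = defaultdict(list)
    for example in examples:
        groups[example[group_key]].append(example)
    for held_out in sorted(groups):
        test = groups[held_out]
        train = [
            example
            for group, members in groups.items()
            if group != held_out
            for example in members
        ]
        yield held_out, train, test


# Hypothetical tagged examples; the field names are assumptions.
examples = [
    {"app_id": "a1", "store": "play", "search_term": "hide photos"},
    {"app_id": "a2", "store": "play", "search_term": "hide photos"},
    {"app_id": "a3", "store": "itunes", "search_term": "photo vault"},
    {"app_id": "a4", "store": "play", "search_term": "fake number"},
]

splits = list(leave_one_group_out(examples, "search_term"))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Each split keeps every example found with a given search term on one side only, which limits leakage through that term. The price, as noted above, is that the model never sees the held-out group during training.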

Achievements

Quick POC – within the first week we showed that our method can differentiate between the categories and find new examples (based on visualizing the known examples with t-SNE and, of course, on the examples of new apps we found).
Researching two strategies – supervised and unsupervised – as well as advanced embedding techniques such as fasttext and doc2vec. Overall we managed to check a significant number of methods and get a good intuition about the project.
Building a prediction pipeline that takes raw data and outputs the predictions.
Presenting the project to Cellebrite’s CEO for an hour and a half, and getting good feedback.

Further development

I think that the next step should be working at a higher resolution. Instead of predicting the category of a full description, we can split it into paragraphs or sentences and predict each of them. The logic behind this proposal is that the signal we are looking for is rarely longer than a single sentence, while the description might be full of irrelevant text. This will also enable the use of advanced NLP methods such as word2vec and fasttext, since there will be much less noise within the positive samples (i.e. the sentences taken from the description that actually say the app does what we are looking for).
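A minimal sketch of this proposal, under stated assumptions: the splitter below is a naive regular-expression one, and `keyword_score` with its keyword set is a hypothetical placeholder for whatever per-sentence model would actually be trained.

```python
import re

SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")
PARAGRAPH_RE = re.compile(r"\n\s*\n")


def split_units(description):
    """Split a description into paragraphs, then into sentences."""
    units = []
    for paragraph in PARAGRAPH_RE.split(description):
        for sentence in SENTENCE_END_RE.split(paragraph.strip()):
            if sentence:
                units.append(sentence)
    return units


def best_unit(description, score):
    """Return (sentence, score) for the highest-scoring sentence."""
    scored = ((unit, score(unit)) for unit in split_units(description))
    return max(scored, key=lambda pair: pair[1], default=(None, 0.0))


# Hypothetical keyword scorer standing in for a trained sentence classifier.
KEYWORDS = {"hide", "secret", "vault"}


def keyword_score(sentence):
    tokens = re.findall(r"[a-z]+", sentence.lower())
    if not tokens:
        return 0.0
    return sum(token in KEYWORDS for token in tokens) / len(tokens)


description = (
    "The best gallery app of the year! Organize albums quickly.\n\n"
    "Hide photos in a secret vault. Rate us five stars."
)
sentence, score = best_unit(description, keyword_score)
```

Scoring the description by its best sentence means the marketing text around the relevant sentence no longer dilutes the signal, which is the motivation given above.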
