NLP based mobile apps classification

Project by Olivier

April 14, 2019, 3:50 pm, Fellows 2018

Abstract

In a forensic investigation of a mobile device, the examiner encounters numerous unidentified installed applications. A major part of the analysis is to classify these apps into categories that are usually not published as such in the app/play stores, and to flag the ones that may be worth investigating more thoroughly. A mobile app classifier can indicate whether an app is likely to be of interest during an investigation, saving time and giving investigators useful information.

Challenges

– Managing a large dataset that barely fits in memory (>5M app descriptions) → use sampling and load only the information needed at each step

– Retrieving the data easily → store the data in a SQLite database

– Relative size of the train and test datasets (Train/Test: 99%/1%) → label more data using semi-supervised learning methods (state-of-the-art NLP techniques)

– No prior definition or knowledge of the categories → redefine the categories according to the product team’s requirements

– Finding a metric for the “efficiency” of the iterative semi-supervised learning algorithm → subsample the output and obtain a hyperbola from which to extrapolate precision

– Difficulty evaluating recall (no labeled data) → estimate recall given a certain distribution of apps

– Avoiding bias when merging two datasets (Play Store and iTunes Store), each with different considerations and features, into one → use common features and check, for example, that the distributions are similar; where they are not, take the difference into account

– Working on Windows with technologies made for Linux/macOS → come up with alternative solutions

– Finding the right way to represent an app given no context → use the description as the main feature and encode it with (1) bag-of-words models (TF-IDF) and (2) word embeddings (Word2Vec, Doc2Vec, FastText…)

– Cleaning the descriptions → think through all the possible kinds of noise that appear in the descriptions
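
Several of the points above (SQLite storage, sampling, description cleaning, TF-IDF encoding) can be illustrated together. What follows is a minimal sketch using only the Python standard library, not the project's actual code: the table layout, the app ids, the sample descriptions, and the helper names `clean` and `tfidf` are all assumptions made for illustration, and the cleaning rules are deliberately simplistic.

```python
import math
import re
import sqlite3
from collections import Counter

# Hypothetical toy data; the real dataset holds >5M descriptions.
APPS = [
    ("com.example.chat", "Chat with friends! Send messages & photos. http://example.com"),
    ("com.example.vault", "Hide photos and messages in a secure vault."),
    ("com.example.maps", "Offline maps and GPS navigation."),
    ("com.example.notes", "Simple notes, lists and reminders."),
]


def clean(text):
    """Lowercase, drop URLs, and keep alphabetic tokens only."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return re.findall(r"[a-z]+", text)


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document (tf * idf)."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


# Assumed schema: one row per app, keyed by app id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apps (app_id TEXT PRIMARY KEY, description TEXT)")
conn.executemany("INSERT INTO apps VALUES (?, ?)", APPS)

# Pull a random sample so that only the sampled rows are held in memory.
sample = conn.execute(
    "SELECT app_id, description FROM apps ORDER BY RANDOM() LIMIT ?", (3,)
).fetchall()

tokens = [clean(description) for _, description in sample]
vectors = tfidf(tokens)
```

`ORDER BY RANDOM()` scans and sorts the whole table, so on millions of rows a cheaper sampling scheme (for example, selecting by random row ids) may be preferable. In practice a library implementation of TF-IDF would likely replace the hand-written `tfidf` helper.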

Achievements (according to KPIs)

Deliverables:

– Jupyter notebook that trains a model, based on the dataset and manual annotations, that classifies an app into a set of given categories

– Jupyter notebook that evaluates the accuracy of the model

– Python script for generating a mapping between the app id and the categories. Must support output formats: CSV, LevelDB
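
As an illustration of the mapping deliverable, here is a minimal sketch of the CSV output only. The function name `write_mapping_csv`, the column names, the app ids, and the category labels are hypothetical; the LevelDB output is left out because it needs a third-party binding.

```python
import csv
import io


def write_mapping_csv(mapping, out):
    """Write (app_id, category) rows to a file-like object, one row per category."""
    writer = csv.writer(out)
    writer.writerow(["app_id", "category"])
    for app_id in sorted(mapping):
        for category in mapping[app_id]:
            writer.writerow([app_id, category])


# Hypothetical predictions: app id -> list of categories.
mapping = {
    "com.example.vault": ["vault"],
    "com.example.chat": ["messaging", "social"],
}

buffer = io.StringIO()
write_mapping_csv(mapping, buffer)
print(buffer.getvalue())
```

Writing one row per (app id, category) pair keeps the file flat and easy to load back, at the cost of repeating the app id for apps with several categories.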

Milestones:

– Mapping of app id to app store categories

– Mapping of app id to our own (forensically interesting) categories

– (Stretch goal) App classification model based on data available in the phone extraction (permissions, used APIs…)

Further development

– Discover new categories (LSA, LDA, …)

– Improve the description feature encoding

– Tag more data
