NLP-based mobile app classification

Project by Olivier

Abstract

In a forensic investigation of a mobile device, the examiner encounters numerous unidentified installed applications. A major part of the analysis is to classify these apps into categories that the app/play stores usually do not publish as such, and to flag the ones that merit a more thorough investigation. A mobile app classifier that can tell whether an app is likely to be of interest during an investigation would save time and give investigators useful information.

 

Challenges

– Managing a large dataset that barely fits in memory (more than 5M app descriptions). Approach: use sampling and manipulate only the information needed at the right time.

– Retrieving the data easily. Approach: store the data in a SQLite database.

– Relative size of the train and test datasets (Train/Test: 99%/1%). Approach: label more data using semi-supervised learning methods (state-of-the-art NLP techniques).

– Defining the categories with no prior knowledge about them. Approach: redefine the categories according to the product team's wishes.

– Finding a metric for the “efficiency” of the iterative semi-supervised learning algorithm. Approach: subsample the output, which yielded a hyperbola used to extrapolate precision.

– Evaluating recall without labeled data. Approach: estimate recall given a certain distribution of apps.

– Avoiding bias when merging two datasets (Play Store and iTunes Store), which have different considerations and features, into one. Approach: use common features and check, for example, that the distributions are similar; if they are not, take the difference into account.

– Working on Windows with technologies made for Linux/macOS. Approach: come up with alternative solutions.

– Finding the right way to represent an app given no context. Approach: use the description as the main feature and encode it with (1) bag-of-words models (TF-IDF) or (2) word embeddings (Word2Vec, Doc2Vec, FastText…).

– Cleaning the descriptions. Approach: think thoroughly through all the possible kinds of noise that can appear in the descriptions.
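
Three of the approaches above (keeping the descriptions in SQLite, sampling instead of loading everything, and cleaning noisy text) can be sketched with the Python standard library alone. This is only a minimal illustration: the `apps` table, its columns, the helper names and the specific cleaning rules are assumptions made for the example, not the project's actual schema or pipeline.

```python
import html
import re
import sqlite3

# Hypothetical cleaning rules; real descriptions will need more cases.
_TAG_RE = re.compile(r"<[^>]+>")
_URL_RE = re.compile(r"https?://\S+")
_SPACE_RE = re.compile(r"\s+")


def create_store(path=":memory:"):
    """Open a SQLite database with a hypothetical `apps` table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS apps ("
        "app_id TEXT PRIMARY KEY, store TEXT, description TEXT)"
    )
    return conn


def add_apps(conn, rows):
    """Insert (app_id, store, description) tuples in one transaction."""
    with conn:
        conn.executemany("INSERT OR REPLACE INTO apps VALUES (?, ?, ?)", rows)


def sample_descriptions(conn, n):
    """Return up to n random (app_id, description) pairs.

    Only the sampled rows are returned to Python, so the full table
    never has to be held in memory by the caller.
    """
    cur = conn.execute(
        "SELECT app_id, description FROM apps ORDER BY RANDOM() LIMIT ?", (n,)
    )
    return cur.fetchall()


def clean_description(text):
    """Decode HTML entities, drop tags and URLs, normalize whitespace and case."""
    text = html.unescape(text)
    text = _TAG_RE.sub(" ", text)
    text = _URL_RE.sub(" ", text)
    return _SPACE_RE.sub(" ", text).strip().lower()


if __name__ == "__main__":
    conn = create_store()
    add_apps(conn, [
        ("a1", "play", "<b>Chat</b> &amp; share!"),
        ("a2", "itunes", "Take notes"),
        ("a3", "play", "Offline maps"),
    ])
    for app_id, description in sample_descriptions(conn, 2):
        print(app_id, clean_description(description))
```

`ORDER BY RANDOM()` still scans the whole table, so for millions of rows a sampling scheme based on row ids may be preferable; the query above is simply the shortest way to show the idea.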

 

Achievements (according to KPIs)

Deliverables:

– Jupyter notebook that trains a model based on the dataset and manual annotations that classifies an app into a set of given categories

– Jupyter notebook that evaluates the accuracy of the model

– Python script for generating a mapping between the app id and the categories. Must support output formats: CSV, LevelDB
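
As a rough illustration of the mapping script, the sketch below writes app id to category pairs as CSV using only the standard library. The function name `write_mapping_csv`, the header names and the sample app id are hypothetical. LevelDB output is not shown because it requires a third-party binding.

```python
import csv
import io


def write_mapping_csv(mapping, out):
    """Write {app_id: category} pairs to a file-like object as CSV."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["app_id", "category"])  # assumed header names
    # Sort by app id so the output is stable between runs.
    for app_id, category in sorted(mapping.items()):
        writer.writerow([app_id, category])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_mapping_csv({"com.example.chat": "messaging"}, buffer)
    print(buffer.getvalue())
```

When writing to a real file rather than a `StringIO` buffer, open it with `newline=""` as the `csv` module documentation recommends.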

 

Milestones:

– Mapping of app id to app store categories

– Mapping of app id to our own (forensically interesting) categories

– (Stretch goal) App classification model based on data available in the phone extraction (permissions, used APIs…)

 

Further development

– Discover new categories (LSA, LDA,…)

– Improve the description feature encoding

– Tag more data
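
For readers unfamiliar with the bag-of-words (TF-IDF) encoding of descriptions that this project relies on, here is a minimal pure-Python sketch of the idea. It assumes raw term counts and idf = log(N / df), which is only one of several common variants; library implementations typically add smoothing and normalization. The toy descriptions are invented for the example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = term count in the document * log(N / document frequency).
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return vectors


# Invented toy descriptions, already tokenized on whitespace.
docs = [
    "secure chat messenger".split(),
    "photo editor".split(),
    "chat with friends".split(),
]
vectors = tfidf(docs)
```

Terms that appear in fewer descriptions get a higher weight: here "secure" outweighs "chat" in the first description because "chat" also occurs in the third.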
