Abstract
In a forensic investigation of a mobile device, the examiner encounters numerous unidentified installed applications. A major part of the analysis is to classify these apps into categories that are usually not published as such in the App Store or Play Store, and to flag the ones that warrant closer investigation. A mobile app classifier would indicate whether an app is likely to be of interest during an investigation, saving time and providing investigators with useful information.
Challenges
– Manage a large dataset that barely fits in memory (>5M app descriptions) → use sampling and load only the information needed at each step
– Retrieve the data easily → store the data in a SQLite database
– Imbalance between the train and test datasets (Train/Test: 99%/1%) → label more data using semi-supervised learning methods (state-of-the-art NLP techniques)
– No prior definition or knowledge of the categories → redefine the categories according to the product team’s requirements
– Find a metric for the “efficiency” of the iterative semi-supervised learning algorithm → subsample the output and use the resulting hyperbola to extrapolate precision
– Difficulty evaluating recall (no labeled data) → estimate recall given a certain distribution of apps
– Avoid bias when merging two datasets (Play Store and iTunes Store), with different considerations and features, into one → use common features and check, for example, that the distributions are similar; where they are not, take the difference into account
– Working on Windows with technologies made for Linux/macOS → find alternative solutions
– Find the right way to represent an app given no context → use the description as the main feature and encode it with (1) bag-of-words models (TF-IDF) and (2) word embeddings (Word2Vec, Doc2Vec, FastText…)
– Clean the descriptions → think through all the possible kinds of noise that occur within the descriptions
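Three of the points above (SQLite storage, sampling instead of loading everything, and cleaning the descriptions) can be illustrated with a minimal standard-library sketch. The table name `apps`, its columns, the example rows, and the cleaning rules are all assumptions made for illustration; they are not the project's actual schema or cleaning pipeline.

```python
import html
import re
import sqlite3

# Hypothetical schema: one row per app, holding the raw store description.
SCHEMA = """
CREATE TABLE IF NOT EXISTS apps (
    app_id TEXT PRIMARY KEY,
    store TEXT NOT NULL,          -- e.g. 'play' or 'itunes'
    description TEXT NOT NULL
)
"""

TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+")
SPACE_RE = re.compile(r"\s+")


def clean_description(text):
    """Remove a few common kinds of noise; the rules are illustrative only."""
    text = html.unescape(text)        # decode entities such as &amp;
    text = TAG_RE.sub(" ", text)      # strip HTML tags
    text = URL_RE.sub(" ", text)      # strip URLs
    return SPACE_RE.sub(" ", text).strip().lower()


def sample_descriptions(conn, n):
    """Return n randomly chosen (app_id, cleaned description) pairs.

    Only the sampled rows are loaded into Python memory.
    """
    rows = conn.execute(
        "SELECT app_id, description FROM apps ORDER BY RANDOM() LIMIT ?", (n,)
    )
    return [(app_id, clean_description(desc)) for app_id, desc in rows]


conn = sqlite3.connect(":memory:")  # a file path would be used in practice
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO apps VALUES (?, ?, ?)",
    [
        ("com.example.chat", "play", "<b>Chat</b> with friends &amp; family!"),
        ("com.example.maps", "play", "Offline maps. See https://example.com"),
        ("123456789", "itunes", "Secure   notes\nvault"),
    ],
)
sample = sample_descriptions(conn, 2)
```

`ORDER BY RANDOM()` still scans the whole table, so with several million rows a cheaper sampling scheme (for instance drawing random row ids) may be preferable; the query above is only the simplest form.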
Achievements (according to KPIs)
Deliverables:
– Jupyter notebook that trains a model, based on the dataset and manual annotations, to
classify an app into a set of given categories
– Jupyter notebook that evaluates the accuracy of the model
– Python script for generating a mapping between the app id and the categories. Must support
output formats: CSV, LevelDB
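As a rough illustration of the mapping script, the CSV output could look like the sketch below. The function name `write_mapping_csv`, the column names, the `;` separator for multiple categories, and the example app ids and labels are assumptions, not the delivered script. LevelDB output is left out because it requires a third-party binding.

```python
import csv
import io


def write_mapping_csv(mapping, out):
    """Write app id -> categories rows to a text file object as CSV.

    Multiple categories are joined with ';' (an assumed convention).
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["app_id", "categories"])
    for app_id in sorted(mapping):
        writer.writerow([app_id, ";".join(sorted(mapping[app_id]))])


# Hypothetical predictions: app id -> set of category labels.
mapping = {
    "com.example.chat": {"messaging"},
    "com.example.vault": {"encryption", "file storage"},
}
buffer = io.StringIO()
write_mapping_csv(mapping, buffer)
csv_text = buffer.getvalue()
```

Sorting the app ids and the categories keeps the output deterministic, which makes successive exports easy to compare.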
Milestones:
– Mapping of app id to app store categories
– Mapping of app id to our own (forensically interesting) categories
– (Stretch goal) App classification model based on data available in the phone extraction
(permissions, used APIs…)
Further development
– Discover new categories (LSA, LDA, …)
– Improve the description feature encoding
– Tag more data
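As a reference point for improved encodings, the TF-IDF weighting named under Challenges can be sketched without external libraries. This is the textbook form tf × log(N / df) on whitespace tokens, without the smoothing and normalisation that library implementations usually add; the function name `tfidf` and the example descriptions are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight = (term count / document length) * log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Hypothetical, already-cleaned descriptions.
docs = ["secure chat app", "photo editor app", "secure vault app"]
vectors = tfidf(docs)
```

In this toy corpus "app" occurs in every description and therefore receives weight zero, while "chat", which occurs in only one, is weighted higher than "secure", which occurs in two.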