The project aimed to solve the following problem: given a statically linked binary file, determine with high probability which Linux libraries were compiled into it.
We decided to use TF-IDF to find similarities between files, where each document is the collection of strings extracted from a compiled binary file (either an executable or a library).
- Data pipeline: built from scratch. Stages include:
  - Extraction of files from Ubuntu packages, filtering for binary files only
  - Extraction of strings from those binaries
  - Reading and parsing the strings so they can be fed into the TF-IDF model
  - Training the TF-IDF model; computing cosine similarity between binaries and libraries; saving intermediate results to avoid repeated retraining
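The core of the pipeline can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names, the toy byte strings, and the minimum-length threshold of 4 are all assumptions.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

MIN_LEN = 4  # assumed minimum string length; the real threshold may differ

def extract_strings(data: bytes, min_len: int = MIN_LEN):
    """Mimic the `strings` tool: runs of printable ASCII of >= min_len chars."""
    return [s.decode("ascii") for s in re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)]

def to_document(data: bytes) -> str:
    """One 'document' per binary: its extracted strings, newline-joined."""
    return "\n".join(extract_strings(data))

# Toy stand-ins for real binaries (hypothetical content, not real libraries).
lib_a = b"\x00\x01zlib_inflate\x00crc32_compute\x00\x02"
lib_b = b"\x00ssl_handshake\x00x509_parse\x00\x7f"
binary = b"\x00main\x00zlib_inflate\x00crc32_compute\x00"

library_docs = [to_document(lib_a), to_document(lib_b)]

vectorizer = TfidfVectorizer(token_pattern=r"\S+")   # each extracted string is one token
lib_matrix = vectorizer.fit_transform(library_docs)  # fit on the library corpus
bin_vector = vectorizer.transform([to_document(binary)])

# Rank candidate libraries by cosine similarity to the query binary.
scores = cosine_similarity(bin_vector, lib_matrix)[0]
```

Here the toy binary shares all its vocabulary with the first library and none with the second, so the first library ranks highest.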
- Size of data: Ubuntu has a great many packages (~4k), containing some 120k binary libraries. Total data: 70M strings, of which 8.4M were unique.
- Result measurement: used normalized Discounted Cumulative Gain (nDCG) to measure results; we also devised a statistic of our own.
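For reference, nDCG treats the ranked list of candidate libraries as a search result and rewards placing relevant items near the top. A minimal sketch (the graded relevance values below are hypothetical, not from the project):

```python
import numpy as np

def dcg(relevances):
    """Discounted Cumulative Gain of a ranked list of relevance scores."""
    rel = np.asarray(relevances, dtype=float)
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(rank + 1), ranks start at 1
    return float(np.sum(rel / discounts))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending-sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevances of retrieved libraries, in ranked order.
retrieved = [3, 0, 2, 1]
score = ndcg(retrieved)  # 1.0 would mean a perfect ordering
```

Scikit-Learn also ships `sklearn.metrics.ndcg_score` if computing it by hand is undesirable.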
Achievements (according to KPIs)
Good results when comparing statically compiled binaries (cat, cut, ld, vim…) to Ubuntu-native counterparts.
Mediocre results when trying to discern which libraries were compiled into the binary itself.
- Data preprocessing: we took all strings from the binaries as-is, filtering only by a minimum length. This leaves in many compilation artifacts, so there is much room for improvement in keeping only relevant strings.
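One possible direction is a heuristic filter applied before vectorization. The patterns below are illustrative guesses at what "compilation artifacts" might look like (compiler banners, source file names, ELF section names), not rules the project validated:

```python
import re

# Illustrative artifact heuristics only; real filtering rules would need
# investigation against the actual corpus.
ARTIFACT_PATTERNS = [
    re.compile(r"^GCC: \("),                             # compiler version banners
    re.compile(r"\.(c|cc|cpp|o)$"),                      # source/object file names
    re.compile(r"^\.?(text|data|bss|rodata|comment)$"),  # ELF section names
]

def is_relevant(s: str, min_len: int = 4) -> bool:
    """Keep a string only if it is long enough and matches no artifact pattern."""
    if len(s) < min_len:
        return False
    return not any(p.search(s) for p in ARTIFACT_PATTERNS)

# Hypothetical extracted strings; only the symbol-like one survives.
strings = ["GCC: (Ubuntu 9.4.0)", ".rodata", "main.c", "zlib_inflate", "ssl"]
kept = [s for s in strings if is_relevant(s)]
```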
- Model parameter tuning: TF-IDF has many parameters that could be tuned to play to the strengths of our dataset, but we did not have time to try them.
- Feature extraction: it may well be possible to take into account additional features for a more complete solution, such as file path, string length, compilation time etc. to determine which library/version was used for compilation.
Research background – Roy started by reading two relevant research papers to understand the domain problem.
Dataset generation – Roy explored the data at hand from different directions, including trying to statically compile a few libraries. He ended up identifying a relevant dataset that served as "ground truth" for the next stages of the project. Throughout this stage, Roy was proactive and independent in finding suitable data.
Data preprocessing – During the project, Roy wrote multiple scripts in bash and Python to extract the required information from the files and to aggregate the different data into an easier-to-use structure.
Model training and scoring – Roy used Python's Scikit-Learn library to train a TF-IDF vectorizer. He explored several ideas for optimizing run time and tuning the vectorizer's parameters.
Evaluation of the results – We brainstormed how to evaluate the results and decided to treat the task as a search problem. Roy visualized the algorithm's results using a metric he developed.
Demo – Roy demonstrated his project to the engineering team in a clear and communicative manner.
Overall – The task of identifying similarities between compiled binary files is challenging. Roy managed to make good progress in his project, demonstrating very good technical and software skills, combined with a great problem-solving attitude. He has good knowledge of different data science techniques and tools.
I believe he can combine his software skills and data science background to become a great end-to-end data science engineer.