
Given a statically linked binary file, determine which Linux libraries it uses and which were compiled into it, using TF-IDF similarities

Project by Roy

Abstract

Snyk’s tools help developers automatically find and fix open source vulnerabilities. For languages with built-in package dependency files (Python – requirements.txt, JavaScript – package.json) this is relatively straightforward; however, for compiled languages (C/C++, C#), where you only have the resulting binary, this is a significantly harder task.

The project aimed to solve this problem: given a statically linked binary file, determine with high probability which Linux libraries it uses and which were compiled into it.

We decided to use TF-IDF to find similarities between files, where each document is the collection of strings extracted from a compiled binary file (either an executable or a library).
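
Below is a minimal sketch of this idea, assuming each binary’s extracted strings have already been written to one text file per binary (one string per line); the file names are illustrative placeholders, not files from the project:

```python
# Minimal sketch: score candidate libraries against a statically linked binary
# by TF-IDF cosine similarity over their extracted strings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def load_strings(path):
    """Treat the whole file of extracted strings as a single document."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return f.read()

# Hypothetical inputs: one statically linked binary and candidate libraries.
docs = {
    "static_binary": load_strings("strings/vim_static.txt"),
    "libpython":     load_strings("strings/libpython3.10.txt"),
    "libncurses":    load_strings("strings/libncurses.txt"),
}

names = list(docs)
# Each line of a document is one extracted string; treat whole lines as tokens.
vectorizer = TfidfVectorizer(analyzer=lambda doc: doc.splitlines())
tfidf = vectorizer.fit_transform([docs[n] for n in names])

# Cosine similarity of the binary against every candidate library.
sims = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
for lib, score in sorted(zip(names[1:], sims), key=lambda x: -x[1]):
    print(f"{lib}: {score:.3f}")
```

Treating each extracted string as a single token (rather than splitting on whitespace) keeps multi-word strings such as error messages intact.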

Challenges

  • Data pipeline: built from scratch. Stages include:
    • Extraction of files from Ubuntu packages, filtering for binary files only
    • Extraction of strings from said binaries (a rough equivalent is sketched after this list)
    • Reading and parsing those strings so they can be fed into the TF-IDF model
  • Training the TF-IDF model, computing cosine similarity between binaries and libraries, and saving intermediate results to avoid repeated retraining
  • Size of data: there are a great many Linux packages (~4k), and within them some 120k binary libraries. Total data: 70M strings, of which 8.4M were unique
  • Result measurement: we used normalized Discounted Cumulative Gain (nDCG) to score results, and also devised our own statistic (see the sketch after this list)
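
The string-extraction stage above can be done with the standard strings utility; a rough Python equivalent is sketched below, with an illustrative minimum-length threshold that is not necessarily the one used in the project:

```python
# Rough Python equivalent of the `strings` utility: pull runs of printable
# ASCII characters of at least MIN_LEN bytes out of a compiled binary.
import re
import sys

MIN_LEN = 4  # illustrative minimum length, not necessarily the project's value
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{%d,}" % MIN_LEN)

def extract_strings(path):
    with open(path, "rb") as f:
        data = f.read()
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]

if __name__ == "__main__":
    for s in extract_strings(sys.argv[1]):
        print(s)
```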
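
For the nDCG measurement, the sketch below uses scikit-learn’s ndcg_score on a single made-up query; the relevance labels and similarity scores are placeholders, not project results:

```python
# Minimal sketch of scoring one query ("which libraries does this binary use?")
# with normalized Discounted Cumulative Gain.
import numpy as np
from sklearn.metrics import ndcg_score

# 1 = this candidate library really is compiled into the binary, 0 = it isn't.
true_relevance = np.asarray([[1, 0, 1, 0, 0]])
# Cosine similarities produced by the TF-IDF model for the same candidates.
predicted_scores = np.asarray([[0.81, 0.42, 0.37, 0.15, 0.05]])

print(ndcg_score(true_relevance, predicted_scores))        # full ranking
print(ndcg_score(true_relevance, predicted_scores, k=3))   # top-3 only
```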

Achievements (according to KPIs)

Good results when comparing statically compiled binaries (cat, cut, ld, vim, …) to their Ubuntu-native counterparts.

Mediocre results when trying to discern which libraries were compiled into the binary itself.

Further development 

  • Data preprocessing: we took all strings from the binary sources as-is, filtering only by minimum length. This keeps many compilation artifacts, and there is much room for improvement in selecting only relevant strings.
  • Model parameter tuning: TF-IDF has many parameters that could be tuned to play to the strengths of our dataset; alas, we didn’t have time to try them out (some candidate knobs are sketched after this list).
  • Feature extraction: it may well be possible to take additional features into account for a more complete solution, such as file path, string length, compilation time, etc., to determine which library and version were used for compilation.
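
To illustrate the tuning surface mentioned above, here are a few of scikit-learn’s TfidfVectorizer parameters that could be explored; the values are arbitrary starting points, not recommendations derived from the project:

```python
# Some TfidfVectorizer knobs that could be tuned for this dataset; the values
# shown are arbitrary starting points, not results from the project.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    analyzer=lambda doc: doc.splitlines(),  # one extracted string per token
    min_df=2,           # drop strings that appear in only one binary
    max_df=0.8,         # drop strings shared by most binaries (likely artifacts)
    sublinear_tf=True,  # use 1 + log(tf) to dampen very frequent strings
    max_features=1_000_000,  # cap vocabulary size to keep memory in check
)
```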


Supervisor Feedback

Research background – Roy started by understanding the domain problem through reading two relevant research papers.

Dataset generation – Roy explored the data at hand from different directions, including trying to statically compile a few libraries. He ended up identifying a relevant dataset that was used as the “ground truth” for the next stages of the project. During that stage, Roy was proactive and independent in looking up suitable data to use.

Data preprocessing – During the project, Roy wrote multiple scripts in Bash and Python to extract the required information from the files and to aggregate the different data into an easier-to-use structure.

Model training and scoring – Roy used Python’s Scikit-Learn library to train a TF-IDF vectorizer. He explored a few ideas for optimizing the runtime and the different parameters of the vectorizer.

Evaluation of the results – We brainstormed how to evaluate the results and decided to treat it as a search problem. Roy visualized the algorithm’s results using a metric he developed.

Demo – Roy demonstrated his project to the engineering team, in a very communicative and clear manner.

Overall – The task of identifying similarities between compiled binary files is challenging. Roy managed to make good progress in his project, demonstrating very good technical and software skills, combined with a great problem-solving attitude. He has good knowledge of different data science techniques and tools. 

I believe he can use the combination of his software skills and data science background to become a great end-to-end data science engineer.
