Abstract
Deep reinforcement learning is a family of algorithms for training agents to make optimal decisions in dynamic, high-dimensional environments. The underlying theory relies on Markov decision processes (MDPs). A policy function (π) or a future-reward function (Q) is estimated by training one or more neural networks over multiple episodes, with a degree of randomness in the agent's decisions.
In this project, we adapted three algorithms (DQN, REINFORCE, and Actor-Critic) to perform real-time malware prevention in both a deterministic and a stochastic environment. The data was provided as vectors of integers, with each integer corresponding to an API call made by a program.
Challenges
- The data did not satisfy the Markov property, so we instead relied on implementation techniques for partially observable Markov decision processes (POMDPs), such as stacking a history of states.
- We had to define the agent's action space and the reward function according to the needs of the project.
- The agent had to take a sequence of actions with no natural feedback provided.
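The history-stacking technique mentioned in the first challenge can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes each observation is a single integer API-call id, and the class name `HistoryStacker`, the window size `k`, and the padding id `pad_id` are hypothetical.

```python
from collections import deque
from typing import Deque, List


class HistoryStacker:
    """Keeps the last `k` API-call ids and exposes them as one state vector.

    Hypothetical helper: names and padding scheme are assumptions.
    """

    def __init__(self, k: int, pad_id: int = 0) -> None:
        self.k = k
        self.pad_id = pad_id
        # A bounded deque drops the oldest entry automatically.
        self.history: Deque[int] = deque(maxlen=k)
        self.reset()

    def reset(self) -> List[int]:
        """Start a new episode (file) with a fully padded history."""
        self.history.clear()
        self.history.extend([self.pad_id] * self.k)
        return self.state()

    def push(self, api_id: int) -> List[int]:
        """Append the newest API call and return the stacked state."""
        self.history.append(api_id)
        return self.state()

    def state(self) -> List[int]:
        """Return the history as a list, oldest call first."""
        return list(self.history)
```

Feeding the agent the stacked state instead of the latest call alone gives it a short window of context, which approximates the Markov property when a single API call is not informative enough on its own.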
Achievements (according to KPIs)
- Implemented all 3 algorithms.
- Defined and implemented visualization metrics that allow the models to be compared and improved.
Further development
- In the DQN algorithm, replace the fully connected neural network with a memory-based neural network (LSTM), as was done by M. Hausknecht and P. Stone (https://arxiv.org/pdf/1507.06527.pdf).
- Implement a variation of the Observation wrapper in which one vector contains the aggregated history of all API calls for a given file and another vector contains the last x API calls.
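One possible shape for the proposed observation is sketched below. It is an assumption-laden illustration rather than the project's wrapper: the aggregation is taken to be a per-API call count, API ids are assumed to lie in `0..vocab_size-1`, and the names `AggregatedHistoryObservation`, `vocab_size`, `last_x`, and `pad_id` are hypothetical.

```python
from collections import deque
from typing import Deque, List, Tuple


class AggregatedHistoryObservation:
    """Builds two vectors per step: call counts so far and the last x calls.

    Hypothetical sketch: counting is only one way to aggregate history.
    """

    def __init__(self, vocab_size: int, last_x: int, pad_id: int = -1) -> None:
        self.vocab_size = vocab_size
        self.last_x = last_x
        self.pad_id = pad_id
        self.counts: List[int] = []
        self.recent: Deque[int] = deque(maxlen=last_x)
        self.reset()

    def reset(self) -> Tuple[List[int], List[int]]:
        """Clear both vectors at the start of a new file."""
        self.counts = [0] * self.vocab_size
        self.recent.clear()
        self.recent.extend([self.pad_id] * self.last_x)
        return self.observation()

    def update(self, api_id: int) -> Tuple[List[int], List[int]]:
        """Record one API call and return the updated observation."""
        if not 0 <= api_id < self.vocab_size:
            raise ValueError(f"api_id {api_id} outside 0..{self.vocab_size - 1}")
        self.counts[api_id] += 1
        self.recent.append(api_id)
        return self.observation()

    def observation(self) -> Tuple[List[int], List[int]]:
        """Return copies of (aggregated counts, last x calls, oldest first)."""
        return list(self.counts), list(self.recent)
```

Unlike a plain stack of recent states, the count vector keeps a fixed size regardless of how long the file's call sequence grows, while the second vector preserves the ordering of the most recent calls.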