Deep Reinforcement Learning Research for Real-Time Malware Prevention

Project by Jeremy


Deep reinforcement learning is a family of algorithms for training agents to make optimal decisions in dynamic, high-dimensional environments. The underlying theory relies on Markov decision processes (MDPs). A policy function (π) or an action-value function (Q), which estimates expected future reward, is approximated by training one or more neural networks over multiple episodes, with a degree of randomness in the decisions so that the agent keeps exploring.
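To make the roles of Q and of randomness in decisions concrete, here is a minimal sketch. It assumes a tabular Q function rather than the neural networks used in the project, and the names `epsilon_greedy` and `q_update`, as well as the default values of `alpha` and `gamma`, are hypothetical choices made for illustration.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        # Exploration: ignore the estimates and pick any action.
        return rng.randrange(len(q_values))
    # Exploitation: pick the action with the highest estimated future reward.
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_update(q_table, state, action, reward, next_state, done,
             alpha=0.1, gamma=0.99):
    """Apply one temporal-difference update to a tabular Q function."""
    # The target is the immediate reward plus the discounted best next value,
    # unless the episode has ended.
    if done:
        target = reward
    else:
        target = reward + gamma * max(q_table[next_state])
    q_table[state][action] += alpha * (target - q_table[state][action])
    return q_table[state][action]
```

In a deep variant such as DQN, the table lookup would be replaced by a network that maps a state to one Q value per action, while the exploration rule stays the same.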

In this project, we adapted three algorithms (DQN, REINFORCE, and Actor-Critic) to perform real-time malware prevention in both a deterministic and a stochastic environment. The data was provided as vectors of integers, each integer corresponding to an API call made by a program.

Challenges



  • The data did not satisfy the Markov assumption, so we instead relied on implementation techniques used for partially observable Markov decision processes (POMDPs), such as stacking a history of states.
  • We had to define the agent's action space and the reward function according to the needs of the project.
  • The agent had to make a sequence of actions with no natural feedback signal provided.
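
The history stacking mentioned in the list above can be sketched as follows. This is only an illustration under stated assumptions: the class name `HistoryStack`, the window size `k`, and the padding value are hypothetical and are not taken from the project's code.

```python
from collections import deque


class HistoryStack:
    """Keep the last k observations and expose them as one fixed-length state."""

    def __init__(self, k, pad_value=0):
        self.k = k
        self.pad_value = pad_value  # assumed filler for steps before the episode start
        self.buffer = deque(maxlen=k)

    def reset(self, first_obs):
        # Pad the window so the state has length k from the very first step.
        self.buffer.clear()
        for _ in range(self.k - 1):
            self.buffer.append(self.pad_value)
        self.buffer.append(first_obs)
        return self.state()

    def step(self, obs):
        # Appending to a full deque silently drops the oldest observation.
        self.buffer.append(obs)
        return self.state()

    def state(self):
        return tuple(self.buffer)
```

Feeding the agent the stacked tuple instead of the latest API call alone gives it a short memory, which is a common way to approximate the Markov property when a single observation is not enough.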


Achievements (according to KPIs)

  • Implemented all three algorithms.
  • Defined and implemented visualization metrics that allow the models to be compared and improved.


Further development

In the DQN algorithm, we suggest replacing the fully connected neural network with a memory-based recurrent network (LSTM), as was done by M. Hausknecht and P. Stone.

Implement a variation of the observation wrapper in which one vector contains the aggregated history of all API calls for a given file and another vector contains the last x API calls.
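
One possible shape for such an observation is sketched below. It assumes that "aggregated history" means a per-API call count and that API calls are integers in the range 0 to `num_api_ids - 1`. The class name `AggregatedObservation`, its parameters, and the `-1` padding value are hypothetical.

```python
from collections import deque


class AggregatedObservation:
    """Track call counts over the whole file plus the last x API calls."""

    def __init__(self, num_api_ids, last_x, pad_id=-1):
        # Aggregated history: how many times each API id has been seen so far.
        self.counts = [0] * num_api_ids
        # Recent window: padded with pad_id until x calls have been observed.
        self.recent = deque([pad_id] * last_x, maxlen=last_x)

    def update(self, api_id):
        """Record one API call and return (aggregated vector, recent vector)."""
        self.counts[api_id] += 1
        self.recent.append(api_id)
        return list(self.counts), list(self.recent)
```

The aggregated vector has a fixed length regardless of how long the file runs, so the observation size stays constant while still summarising the full history.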
