**Abstract**

Barak Capital is one of Israel’s foremost investment groups. Its core business is proprietary trading in securities, derivatives, and other financial instruments.

The goal of this project was to develop machine learning models that predict short-term price changes in a European target instrument from signals in US assets. The next step was to make these predictions profitable by teaching the models when to trade, and at what size, given each detected signal. The hope is that this type of model can generalize to other markets as well.

**Challenges**

- Selecting from a very restricted set of models, because inference had to complete within each signal’s short window of opportunity. Neural networks and other computationally intensive models were therefore not considered.
- Finding the right metric to evaluate model performance: one that accounts for the percentage of winning trades, the size of the profit on each trade, and the trading frequency, which determines how strongly per-trade commissions affect the PnL.
- Evaluating the strength and probability of success of each prediction.
- Understanding the underlying assets and available data well enough to engineer the right features that help in making these predictions.
- Finding the right balance in the training data, which directly influenced how well the model predicted both profitable price movements and the absence of movement. The latter makes up over 90% of the highly imbalanced data.
- Dealing with over 500 million lines of data (Big Data) with the limited computational resources available.
- Developing the entire utility infrastructure for the data exploration/preprocessing, modeling and evaluation/monitoring environments for this new project in development.
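
The metric challenge above can be made concrete with a small sketch. It is not the metric used in the project, which this summary does not specify. It only shows how win rate, average profit per trade, and trade count combine once a flat per-trade commission is charged. The input format (a list of gross PnLs, one per trade) and the names `TradeStats` and `evaluate_trades` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class TradeStats:
    n_trades: int
    win_rate: float       # fraction of trades with positive gross PnL
    avg_gross_pnl: float  # mean PnL per trade before commissions
    net_pnl: float        # total PnL after per-trade commissions


def evaluate_trades(gross_pnls: Sequence[float],
                    commission_per_trade: float) -> TradeStats:
    """Summarize per-trade gross PnLs (hypothetical input format)."""
    n = len(gross_pnls)
    if n == 0:
        # A model that never trades earns nothing and pays nothing.
        return TradeStats(0, 0.0, 0.0, 0.0)
    winners = sum(1 for pnl in gross_pnls if pnl > 0)
    total_gross = sum(gross_pnls)
    return TradeStats(
        n_trades=n,
        win_rate=winners / n,
        avg_gross_pnl=total_gross / n,
        # Every trade pays the commission, so frequent trading with a
        # small edge can turn a positive gross PnL into a net loss.
        net_pnl=total_gross - commission_per_trade * n,
    )
```

Comparing `net_pnl` across models, rather than `win_rate` alone, penalizes a model that wins often but trades too frequently for its average profit to cover commissions.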

**Achievements (according to KPIs)**

Delivered a series of notebooks that make up the entire pipeline of the project. The notebooks are divided as follows:

- Utility notebook: Containing all the functions and classes used in the data science process of the project.
- Data Exploration notebook: Containing all sorts of visualizations of the data in order to better understand it.
- Hyperparameter optimization notebook: With many deep analyses to find the most optimal hyperparameters while regularizing each model’s overfit.
- Modeling notebook: Contains the whole process of data fetching, data preprocessing, modeling and exporting the models into PMML for deployment into Java. This was done for every type of model implemented such as Linear Regression, Random Forest Regression, XGBoost Regression and more complex ensemble methods.
- Model Interpretation notebook: Many visualizations and analyses for interpreting the results of all the models mentioned above.

The new models were mainly profitable in the simulations and exceeded the profitability of the vanilla model originally implemented.
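
To illustrate what the PMML export in the Modeling notebook involves, here is a hand-written sketch of the PMML document for a linear regression, followed by a function that reads it back and scores a row. The project most likely used a converter library, which this summary does not name, so this is not the actual export code. The feature names, the helper names, and the choice of PMML 4.4 are assumptions, and the Python reader only stands in for the Java-side scorer.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_4"  # PMML 4.4 namespace (assumed version)


def linear_model_to_pmml(intercept, coefficients, target="y"):
    """Serialize a linear regression as a PMML RegressionModel string.

    `coefficients` maps feature name -> weight; all names are illustrative.
    """
    root = ET.Element("PMML", {"version": "4.4", "xmlns": PMML_NS})
    ET.SubElement(root, "Header", {"description": "hand-written sketch"})
    data_dict = ET.SubElement(root, "DataDictionary")
    model = ET.SubElement(root, "RegressionModel",
                          {"functionName": "regression"})
    schema = ET.SubElement(model, "MiningSchema")
    # Declare every input field, then the target, in both sections.
    for name in list(coefficients) + [target]:
        ET.SubElement(data_dict, "DataField",
                      {"name": name, "optype": "continuous",
                       "dataType": "double"})
        attrs = {"name": name}
        if name == target:
            attrs["usageType"] = "target"
        ET.SubElement(schema, "MiningField", attrs)
    # repr(float) keeps enough digits for an exact round trip.
    table = ET.SubElement(model, "RegressionTable",
                          {"intercept": repr(float(intercept))})
    for name, weight in coefficients.items():
        ET.SubElement(table, "NumericPredictor",
                      {"name": name, "coefficient": repr(float(weight))})
    return ET.tostring(root, encoding="unicode")


def predict_from_pmml(pmml_text, row):
    """Re-read the PMML and evaluate it on one row of feature values."""
    ns = {"p": PMML_NS}
    table = ET.fromstring(pmml_text).find(".//p:RegressionTable", ns)
    result = float(table.get("intercept"))
    for predictor in table.findall("p:NumericPredictor", ns):
        result += float(predictor.get("coefficient")) * row[predictor.get("name")]
    return result
```

Because the file carries only field declarations and fitted parameters, any PMML-aware runtime can reproduce the prediction without the Python training code, which is what makes the format suitable for handing models over to a Java deployment.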

**Further development**

There are many steps that can improve this project. The main ones are:

- More feature engineering. Several features that are harder to implement were left out for lack of time, such as short-term market speed, correlations, and volatility.
- Testing neural networks and deep learning frameworks, which have not been tried yet.
- Testing different time spans for the price movements used to train the models.
- Improving outlier removal with complex approaches such as HDBSCAN for further insight into the real non-linear trends of the data.
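
Two of the features proposed above, short-term volatility and correlations, could be computed along the following lines. This is a minimal sketch using only the standard library. The window length, the use of log returns, and the function names are assumptions, and a production version would more likely use a vectorized library to cope with the data volume described earlier.

```python
import math
from typing import List, Optional, Sequence


def log_returns(prices: Sequence[float]) -> List[float]:
    """Log return between each pair of consecutive prices."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def rolling_volatility(returns: Sequence[float],
                       window: int) -> List[Optional[float]]:
    """Sample standard deviation over each trailing window.

    Entries are None until a full window of observations is available.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    out: List[Optional[float]] = []
    for i in range(len(returns)):
        if i + 1 < window:
            out.append(None)
            continue
        chunk = returns[i + 1 - window:i + 1]
        mean = sum(chunk) / window
        var = sum((r - mean) ** 2 for r in chunk) / (window - 1)
        out.append(math.sqrt(var))
    return out


def rolling_correlation(x: Sequence[float], y: Sequence[float],
                        window: int) -> List[Optional[float]]:
    """Pearson correlation of x and y over each trailing window.

    Entries are None while the window is filling, or when either
    series is constant inside the window (correlation undefined).
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    out: List[Optional[float]] = []
    for i in range(len(x)):
        if i + 1 < window:
            out.append(None)
            continue
        xs = x[i + 1 - window:i + 1]
        ys = y[i + 1 - window:i + 1]
        mx, my = sum(xs) / window, sum(ys) / window
        cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        vx = sum((a - mx) ** 2 for a in xs)
        vy = sum((b - my) ** 2 for b in ys)
        out.append(None if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy))
    return out
```

Both functions look only at past and current observations within the window, so the resulting features do not leak future prices into the training data.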

**Supervisor Feedback**

Yohan is a very talented man who is very passionate about data science and has a deep understanding of the mathematical side of ML and the algorithms involved. When presented with the challenge, he jumped straight into the problem at hand, quickly deciphering the meaning of the different parameters and thresholds. He is a very hardworking and organized professional who takes his work very seriously. His disciplined work ethic is evidenced by his constant willingness to put in the time and effort necessary to complete the work on time while providing quality. Furthermore, he never shies away from expressing his opinion on every aspect of the task, in a clear manner and with very rational arguments. With his insights, we were able to quickly develop a basic infrastructure of Jupyter notebooks to explore, analyze, and model the data and interpret the results in a methodical fashion.

Given the time constraints, we were forced to prioritize and thus chose not to dive into high-complexity models (such as neural networks of all kinds), since the simpler regression models provided us with more than enough material for further analysis.

During the short time I enjoyed working with Yohan, we managed to achieve all the original goals set prior to the challenge, including exporting the models into a tool-independent modeling language (PMML) and reloading the same models, trained in Python, into Java for inference.