The goal of the project is to develop a Python agent that collects data from machine learning models, calculates metrics, and displays their evolution in real time. The monitoring covers the distribution of the input features, the output of the model, and the labels when available. The underlying ML model runs in TensorFlow and performs classification (binary or multiclass).
The agent must be easy to deploy into a production environment.
- Make the agent generic and easy to integrate into an existing production ML algorithm.
- Define time-series style statistics for input, output and label data.
- Adapt to the constraints of an algorithm running in production in real time.
- Make the output of the agent compliant with the Anodot REST API format for future integration.
Achievements (according to KPIs)
- Genericity and simplicity.
I designed the agent so that it has minimal impact on the underlying ML process. It consists of a light client-agent side that ‘hooks’ the data as they flow into and out of the algorithm, and a server-agent side that processes the data asynchronously.
The agent is generic in the sense that it can digest any binary or multiclass classification data from the major ML environments (TensorFlow, sklearn). Its only mandatory input is the predict_proba matrix for each batch of data. Optionally, a user can also monitor the distribution of the inputs and, when labels are available, the classical classification metrics (accuracy, precision, recall, F1).
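As an illustration of the optional label metrics, here is a minimal sketch that derives accuracy, precision, recall, and F1 from a predict_proba matrix and the matching labels. It is not the agent's actual code: the function name `classification_metrics`, the macro averaging over classes, and the assumption that the matrix has one column per class (two columns in the binary case) are all choices made for this example.

```python
def classification_metrics(proba, labels):
    """Accuracy and macro-averaged precision/recall/F1 for one batch.

    proba  : list of rows, one probability per class (assumed layout).
    labels : list of integer class indices, same length as proba.
    """
    n_classes = len(proba[0])
    # Predicted class = index of the highest probability in each row.
    preds = [max(range(n_classes), key=lambda c: row[c]) for row in proba]

    precisions, recalls, f1s = [], [], []
    for c in range(n_classes):
        tp = sum(1 for p, y in zip(preds, labels) if p == c and y == c)
        fp = sum(1 for p, y in zip(preds, labels) if p == c and y != c)
        fn = sum(1 for p, y in zip(preds, labels) if p != c and y == c)
        # Guard against empty denominators (class never predicted or never seen).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    correct = sum(1 for p, y in zip(preds, labels) if p == y)
    return {
        "accuracy": correct / len(labels),
        "precision": sum(precisions) / n_classes,  # macro average (assumption)
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1s) / n_classes,
    }


batch_proba = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
batch_labels = [0, 1, 1, 1]
metrics = classification_metrics(batch_proba, batch_labels)
```

Because only the probability matrix and the labels are needed, the same function works regardless of which framework produced the predictions.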
- Time-series style statistics
The statistics inferred from the data were chosen together with Anodot, drawing on their experience in monitoring real-time data statistics and metrics.
Real-time monitoring requires the flowing data to be ‘bucketed’ so that metrics can be computed for each bucket. The shape and frequency of the buckets have to be chosen by the user to match their production volume.
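The bucketing idea can be sketched as follows. This example assumes fixed-width time buckets whose width is chosen by the user; the names `bucket_records`, `bucket_stats`, and `bucket_seconds`, as well as the particular statistics computed, are hypothetical and only meant to show the principle.

```python
from collections import defaultdict
from statistics import mean


def bucket_records(records, bucket_seconds):
    """Group (timestamp, value) pairs into fixed-width time buckets.

    Each bucket is keyed by the timestamp at which it starts.
    """
    buckets = defaultdict(list)
    for timestamp, value in records:
        start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[start].append(value)
    # Return the buckets in chronological order.
    return dict(sorted(buckets.items()))


def bucket_stats(buckets):
    """Compute simple time-series style statistics for each bucket."""
    return {
        start: {
            "count": len(values),
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
        }
        for start, values in buckets.items()
    }


# Example: model scores observed at various times, bucketed per minute.
records = [(0, 0.2), (30, 0.4), (61, 0.9), (119, 0.7), (120, 0.5)]
stats = bucket_stats(bucket_records(records, bucket_seconds=60))
```

A wider bucket smooths the statistics but delays their availability, so the width is a trade-off that depends on the volume of production traffic.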
- Integration into running production
I designed the agent so that it can be integrated into a running production environment as safely as possible. In addition to keeping the client side simple, I added a server side that computes the metrics of the received buckets asynchronously: the calculation runs in parallel with the ML in production, so it neither slows nor interrupts it. The server can accept multiple agent connections simultaneously.
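The principle of asynchronous processing can be sketched with a background worker. This is a simplified, in-process stand-in that uses a thread and a queue in place of the actual client/server connection; the class name `MonitoringWorker` and its methods are hypothetical.

```python
import queue
import threading


class MonitoringWorker:
    """Computes bucket metrics in a background thread.

    The production code only calls submit(), which enqueues the bucket
    and returns immediately; the computation happens elsewhere.
    """

    def __init__(self, compute):
        self._compute = compute      # function applied to each bucket
        self._queue = queue.Queue()  # unbounded, so put() never blocks
        self.results = []
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, bucket):
        """Hand a bucket over to the worker without waiting for the result."""
        self._queue.put(bucket)

    def _run(self):
        while True:
            bucket = self._queue.get()
            if bucket is None:       # sentinel: stop the worker
                break
            try:
                self.results.append(self._compute(bucket))
            except Exception:
                # A monitoring failure must never reach the production code.
                pass

    def close(self):
        """Process the remaining buckets, then stop the worker."""
        self._queue.put(None)
        self._thread.join()


worker = MonitoringWorker(lambda bucket: sum(bucket) / len(bucket))
worker.submit([1, 2, 3])
worker.submit([])        # raises ZeroDivisionError inside the worker only
worker.submit([4, 6])
worker.close()
```

The two design points illustrated are that the producer never waits for a metric to be computed, and that an error in the monitoring path is contained on the monitoring side.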
- Compliance with the Anodot REST API
The output of the ML monitoring agent is a list of timestamped JSON files that comply with the Anodot integration format. A Python script is also shipped so that users can send their monitoring data to the Anodot API for anomaly detection analysis, provided they supply a valid token.
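Writing one timestamped JSON file per bucket could look like the sketch below. The record layout (`name`, `timestamp`, `value`), the file naming scheme, and the function `write_bucket_metrics` are illustrative placeholders chosen for this example; they are not taken from the Anodot documentation, which should be consulted for the exact schema.

```python
import json
import time
from pathlib import Path


def write_bucket_metrics(out_dir, model_name, metrics, timestamp=None):
    """Write the metrics of one bucket to a timestamped JSON file.

    Returns the path of the file that was written.
    """
    ts = int(time.time()) if timestamp is None else int(timestamp)
    # One record per metric; field names are placeholders (assumption).
    records = [
        {"name": f"{model_name}.{key}", "timestamp": ts, "value": float(value)}
        for key, value in sorted(metrics.items())
    ]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{model_name}_{ts}.json"
    path.write_text(json.dumps(records, indent=2))
    return path
```

Calling `write_bucket_metrics("monitoring_output", "demo", {"accuracy": 0.75})` would create a file named after the model and the current Unix time in the `monitoring_output` directory.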
As the project is intended to be open source, I also shipped a Jupyter notebook that provides a first level of analysis of the stored statistics.
There is still room for improvement in the agent, mostly in the architecture of the solution, which could be enhanced to make the monitoring even more independent of the production process (API, hosted service). The server side could also be upgraded to reduce the complexity of the metrics calculations. I also think that the agent could easily be extended to monitor regression ML algorithms.