Project by Dana Makov
Data Science Fellows June 2020 Cohort
Abstract
Kando provides wastewater intelligence through its network of IoT devices, which supply continuous data on network health. This yields a wealth of time-series data that the data science team mines for predictions, motif and anomaly detection, event fingerprinting, and more. The project involved testing several algorithms to understand the underlying “grammar” of the wastewater by mining the time-series data and analyzing it for small motifs and anomalous signatures.
To find the motifs and anomalies, we trained variational autoencoders (VAEs) to learn a representation of the data in a lower-dimensional space, and used common anomaly detection tools such as Isolation Forest and Extended Isolation Forest to seek candidates in both the original and the lower-dimensional spaces. In addition, we computed the Matrix Profile (again in multiple vector spaces) and used it as a further validation tool for finding and clustering relevant motifs and anomalies. We also created a graphical UI that lets a broad range of company employees explore our findings.
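As an illustration of the Matrix Profile idea mentioned above, here is a minimal sketch in plain Python. It is not the project's implementation: the function names, the window length, the exclusion zone and the synthetic series are all assumptions made for the example, and a real pipeline would more likely rely on an optimized library. For every sliding window it records the z-normalized Euclidean distance to the nearest other window; low values point to motifs and high values to anomalies (discords).

```python
import math

def znorm(window):
    """Z-normalize a window; a constant window maps to all zeros."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(x - mean) / std for x in window]

def matrix_profile(series, m):
    """Naive O(n^2 * m) matrix profile for window length m.

    profile[i] is the distance from window i to its nearest neighbour
    outside an exclusion zone; nearest[i] is that neighbour's start index.
    """
    n = len(series) - m + 1
    windows = [znorm(series[i:i + m]) for i in range(n)]
    exclusion = max(1, m // 2)  # assumed zone that skips trivial self-matches
    profile = [math.inf] * n
    nearest = [-1] * n
    for i in range(n):
        for j in range(n):
            if abs(i - j) <= exclusion:
                continue
            d = math.dist(windows[i], windows[j])
            if d < profile[i]:
                profile[i], nearest[i] = d, j
    return profile, nearest

# Synthetic stand-in for a sensor signal: a periodic wave with one injected spike.
series = [math.sin(2 * math.pi * t / 20) for t in range(120)]
spike_at = 60
series[spike_at] += 5.0

m = 10
profile, nearest = matrix_profile(series, m)
motif_idx = min(range(len(profile)), key=profile.__getitem__)    # best-repeated window
discord_idx = max(range(len(profile)), key=profile.__getitem__)  # most unusual window
print("motif window starts at", motif_idx, "and repeats at", nearest[motif_idx])
print("discord window starts at", discord_idx)
```

On this synthetic series the discord window overlaps the injected spike, while the windows away from the spike repeat almost exactly one period later.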
Challenges
- Much of the information we needed did not yet exist in the company, so we spent considerable time preparing and processing it before building the algorithms.
- Remote work made it hard for people to get to know each other, so we went to the company offices as often as possible; only when we met face to face could we get help and more guidance on the problems we had.
Achievements (according to KPIs)
- We produced a visualization of the anomalies flagged by each of the algorithms so that their results can be compared.
- We found motifs in different places. For each motif, we can retrieve the dates and times at which it occurred, along with the values of all the sensors at those times, so that professionals can study and tag recurring cases and understand what the different graph shapes mean and what causes them.
- We created an interactive map with the relevant data for non-developers to use.
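
The motif lookup described in the achievements above (retrieving the dates and times at which a motif occurred) could be sketched roughly as follows. This is a hedged illustration only: `find_occurrences`, the distance threshold, the single-sensor hourly readings and the timestamps are hypothetical and synthetic, not the project's data or code.

```python
import math
from datetime import datetime, timedelta

def znorm(window):
    """Z-normalize a window; a constant window maps to all zeros."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(x - mean) / std for x in window]

def find_occurrences(timestamps, values, pattern, threshold):
    """Return (start timestamp, distance) for windows that match the pattern.

    A window matches when its z-normalized Euclidean distance to the pattern
    is at most `threshold`. The scan is greedy: after a match it jumps past
    that window, so reported occurrences never overlap.
    """
    m = len(pattern)
    target = znorm(pattern)
    hits = []
    i = 0
    while i + m <= len(values):
        d = math.dist(znorm(values[i:i + m]), target)
        if d <= threshold:
            hits.append((timestamps[i], round(d, 3)))
            i += m
        else:
            i += 1
    return hits

# Synthetic hourly readings: a flat baseline with the same shape planted twice,
# once shifted and once shifted and scaled.
start = datetime(2020, 6, 1)
timestamps = [start + timedelta(hours=h) for h in range(24)]
values = [2.0] * 24
values[3:8] = [2.0, 3.0, 5.0, 3.0, 2.0]
values[15:20] = [2.0, 4.0, 8.0, 4.0, 2.0]

pattern = [0.0, 1.0, 3.0, 1.0, 0.0]  # hypothetical motif shape
hits = find_occurrences(timestamps, values, pattern, threshold=0.5)
for when, dist in hits:
    print(when.isoformat(), dist)
```

Because the windows are z-normalized, both planted copies match even though they differ in offset and amplitude. The same timestamps could then be used to pull the readings of the other sensors for tagging.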