Data Science Fellows February 2021 Cohort
Dataset2Vec takes a dataset of any size and shape and builds a fixed-shape numerical characterisation of that
dataset – an embedding. These embeddings act as a standardised representation of each dataset and
provide the foundation for Explorium to improve existing tools and build new ones.
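The core idea – mapping datasets of any shape to same-length vectors – can be illustrated with a minimal sketch. This toy embedder just aggregates hand-picked summary statistics with NumPy; the real Dataset2Vec embedding is learned by a model, so everything below (the `embed` function and its statistics) is an illustrative assumption, not the actual method.

```python
import numpy as np

def embed(dataset: np.ndarray, dim: int = 6) -> np.ndarray:
    """Toy fixed-shape characterisation of a variable-shape dataset.

    Any (n_rows, n_cols) input maps to the same dim-length vector.
    Illustrative only -- the real Dataset2Vec embedding is learned,
    not hand-crafted from summary statistics.
    """
    flat = dataset.ravel()
    return np.array([
        dataset.shape[0],  # number of rows
        dataset.shape[1],  # number of columns
        flat.mean(),
        flat.std(),
        flat.min(),
        flat.max(),
    ])[:dim]

# Datasets of different shapes yield embeddings of identical shape.
a = embed(np.random.rand(100, 5))
b = embed(np.random.rand(7, 42))
assert a.shape == b.shape == (6,)
```

Because every dataset lands in the same vector space, downstream tools can compare, cluster, or classify datasets directly.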
Challenges
1. Preprocessing – Building a robust, flexible and modular pipeline that can preprocess any dataset
2. Time-to-train – Attempted to automate the initialisation of more powerful AWS EC2 instances so that
multiple training runs could happen in parallel, but DevOps issues prevented this. Training was
therefore limited to a local machine, which reduced the scope for optimisation.
3. Evaluation – This is an unsupervised task, so it had to be evaluated on downstream use cases. Each
use case was a data science task in its own right and required a labelled dataset of datasets. The
embedder also had to be general enough to serve as input to multiple use cases.
Achievements (according to KPIs)
1. As an unsupervised task there are no direct metrics, but using the dataset embeddings as input, a
dataset’s use case was predicted with 72% accuracy.
2. Other achievements: packaged and deployed a pipeline that can be installed using pip and produces a
dataset embedding in two lines of code.
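The use-case prediction above can be sketched as follows. Synthetic embeddings and a simple nearest-centroid classifier stand in for the real pipeline here; the class names, dimensions, and data are all invented for illustration, and the 72% figure comes from the actual labelled datasets, not this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: each dataset is represented by a fixed-length
# embedding (random vectors standing in for real Dataset2Vec output),
# labelled with the use case it supports.
emb_dim, n_per_class = 8, 20
centres = {"churn": rng.normal(0.0, 1.0, emb_dim),
           "fraud": rng.normal(3.0, 1.0, emb_dim)}
X = np.vstack([c + rng.normal(0.0, 0.5, (n_per_class, emb_dim))
               for c in centres.values()])
y = np.array(["churn"] * n_per_class + ["fraud"] * n_per_class)

# Nearest-centroid prediction: a dataset's use case is the class whose
# mean embedding is closest to the dataset's embedding.
def predict(x: np.ndarray) -> str:
    return min(centres, key=lambda k: np.linalg.norm(x - centres[k]))

acc = np.mean([predict(x) == label for x, label in zip(X, y)])
```

On this cleanly separated synthetic data the classifier is near-perfect; real dataset embeddings overlap far more, which is why a labelled dataset of datasets was needed to measure accuracy honestly.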
Future project development
• Further use case POCs:
o Dataset Enrichment Recommendation
o Auto ML model recommendation from Explorium’s current models
o Client conversion prediction based on their data
• Productionisation & deployment of infrastructure to integrate with Explorium’s platform
• Model architecture and input optimisation
• Patent currently being drafted