Dataset2Vec, Explorium

June 30, 2022, 11:07 am, Fellows 2021

Jamie Bamforth

Data Science Fellows February 2021 Cohort

Abstract

Dataset2Vec takes a dataset of any size and shape and builds a fixed-shape numerical characterisation of that
dataset – an embedding. These embeddings act as a standardised representation of each dataset and
provide the foundation for Explorium to improve existing tools and build new ones.
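
To make the idea of a fixed-shape characterisation concrete, here is a minimal sketch. It is not the Dataset2Vec model, whose architecture is not described here. It assumes a purely numeric table and replaces the learned embedder with hand-picked summary statistics; the function names `column_summary` and `embed_dataset` are hypothetical.

```python
import math
import statistics
from typing import List, Sequence


def column_summary(values: Sequence[float]) -> List[float]:
    """Summarise one numeric column as [mean, population std, min, max]."""
    return [
        statistics.fmean(values),
        statistics.pstdev(values),
        min(values),
        max(values),
    ]


def embed_dataset(rows: Sequence[Sequence[float]]) -> List[float]:
    """Map a numeric table of any shape to a vector of length 6.

    Illustrative assumption: summary statistics stand in for a learned
    embedding. The point is only that the output length is constant.
    """
    if not rows or not rows[0]:
        raise ValueError("dataset must have at least one row and one column")
    columns = list(zip(*rows))  # transpose rows into columns
    summaries = [column_summary(col) for col in columns]
    # Average each statistic across columns, so the output length
    # does not depend on how many columns the dataset has.
    pooled = [statistics.fmean(stat) for stat in zip(*summaries)]
    # Keep a compressed record of the original shape as two extra features.
    return [math.log1p(len(rows)), math.log1p(len(columns))] + pooled


small = [[1.0, 2.0], [3.0, 4.0]]            # 2 rows x 2 columns
wide = [[1.0, 2.0, 3.0, 4.0, 5.0]] * 10     # 10 rows x 5 columns
print(len(embed_dataset(small)), len(embed_dataset(wide)))
```

Both calls return vectors of the same length even though the inputs differ in shape, which is the property that lets downstream tools treat every dataset uniformly.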

Challenges

1. Building a robust, flexible and modular pipeline to preprocess any dataset

2. Time-to-train – I attempted to automate the launch of more powerful AWS EC2 instances so that
multiple training runs could happen in parallel, but this ran into DevOps issues and could not be completed.
As a result, training was only possible locally, which reduced the ability to optimise.

3. Evaluation – This is an unsupervised task, so the embeddings had to be tested on downstream use cases. Each use case
required its own data science task and a labelled dataset of datasets. I also had to train
an embedder that is general enough to work as an input to multiple use cases.

Achievements (according to KPIs)

1. Unsupervised task, so no direct metrics, but a model using the dataset embeddings as input predicted
a dataset’s use case with 72% accuracy.

2. Other achievements: packaged and deployed the pipeline so that it can be installed using pip and
produces a dataset embedding in 2 lines of code.
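
The downstream evaluation in item 1 can be sketched as follows. This is an illustration only: the classifier actually used in the project is not stated, so a nearest-centroid classifier is swapped in, and the embeddings and labels (`use_case_a`, `use_case_b`) are made-up toy values, not project data.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence

Vector = Sequence[float]


def fit_centroids(embeddings: Sequence[Vector], labels: Sequence[str]) -> Dict[str, List[float]]:
    """Compute the mean embedding for each use-case label."""
    grouped = defaultdict(list)
    for vec, label in zip(embeddings, labels):
        grouped[label].append(vec)
    return {
        label: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids: Dict[str, List[float]], vec: Vector) -> str:
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


def accuracy(centroids: Dict[str, List[float]],
             embeddings: Sequence[Vector],
             labels: Sequence[str]) -> float:
    """Fraction of held-out embeddings whose use case is predicted correctly."""
    correct = sum(predict(centroids, v) == y for v, y in zip(embeddings, labels))
    return correct / len(labels)


# Toy, hypothetical embeddings: one vector per dataset, one label per dataset.
train_x = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
train_y = ["use_case_a", "use_case_a", "use_case_b", "use_case_b"]
test_x = [[0.1, 0.1], [1.0, 1.0]]
test_y = ["use_case_a", "use_case_b"]

centroids = fit_centroids(train_x, train_y)
print(accuracy(centroids, test_x, test_y))
```

Because the embedder itself is trained without labels, a score like this measures how useful the embeddings are for one particular use case rather than the quality of the embedder in general.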

Future project development

• Further use case POCs:
o Dataset Enrichment Recommendation
o Auto ML model recommendation from Explorium’s current models
o Client conversion prediction based on their data
• Productionisation & deployment of infrastructure to integrate with Explorium’s platform
• Model architecture and input optimisation
• Patent currently being drafted
