Come take a look at the innovative ideas our students chose to demonstrate in their final projects for the Fellows Data Science program!
In every cohort, the students team up to work on a final Data Science project. This is where everything they’ve learned in the program comes together in a practical demonstration.
We’ve gathered the project descriptions and demonstrations from the February cohort:
Classifying cities by images using Convolutional Neural Networks
Dor Meir, Sagi Elfassi, Daniel Saban, Arad Ben Haim, Itamar Bergfreund
We trained several network architectures, used transfer learning, augmented our dataset in different ways, built auto-encoders for outlier detection, and used LIME to interpret the classifier’s decisions visually. The classifier does a good job, especially in the suburbs and on the highways of cities, where it is more difficult for a human to distinguish between different cities. We had great success on the Mapillary dataset: 93% accuracy on 19 classes, on pictures that the model had never seen before. We had partial success on out-of-dataset images: we estimate around 80% accuracy on unseen Google Street View pictures from suburbs and highways, and much lower accuracy in city centers (it differs with each city).
Quora Question Pairs
Detecting duplicate questions from Quora dataset
Mariia Padalko, Anna Roitberg, Galina Blokh, Vladimir Gurevich
Various platforms, such as Quora, SO, or chat-bots, need to find duplicate questions in order to optimize resources. Some services also need to find the closest questions and recommend them to the user. We deployed a model that provides the solution! We used a dataset of more than 400K question pairs to train our models to predict whether a given pair of questions is a duplicate. The preprocessing included optional lemmatization, stop-word removal, and embedding (based on the training corpora). From this data we engineered new features: distance-based pairwise features and features for a single question.
Next, we considered some classic ML algorithms, such as Random Forest and boosting methods, on the initial and balanced data (using SMOTE and weights), and reached an accuracy of up to 84%. In addition, we tested some NNs with different architectures and reached an accuracy of up to 85%. These models allowed us to create an application that not only predicts duplicates but also gives a list of the questions closest to the user’s input.
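To make the idea of distance-based pairwise features more concrete, here is a minimal sketch. It is not the team's actual feature set: it assumes simple bag-of-words tokens instead of the trained embeddings, and the function names (`tokenize`, `pair_features`) and the three features are chosen purely for illustration.

```python
import math
from collections import Counter


def tokenize(question: str) -> list[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(".,!?\"'()") for word in question.lower().split())
    return [t for t in tokens if t]


def jaccard_similarity(q1: str, q2: str) -> float:
    """Fraction of distinct tokens that the two questions share."""
    a, b = set(tokenize(q1)), set(tokenize(q2))
    if not a and not b:
        return 1.0  # two empty questions are treated as identical
    return len(a & b) / len(a | b)


def cosine_similarity(q1: str, q2: str) -> float:
    """Cosine of the angle between the two bag-of-words count vectors."""
    c1, c2 = Counter(tokenize(q1)), Counter(tokenize(q2))
    dot = sum(c1[t] * c2[t] for t in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def pair_features(q1: str, q2: str) -> dict[str, float]:
    """Hypothetical pairwise feature vector for one question pair."""
    return {
        "jaccard": jaccard_similarity(q1, q2),
        "cosine": cosine_similarity(q1, q2),
        "length_diff": float(abs(len(tokenize(q1)) - len(tokenize(q2)))),
    }


features = pair_features("How do I learn Python?", "How can I learn Python?")
```

Features like these can be computed for every pair and passed to a classifier such as Random Forest; with embeddings, the same cosine formula would be applied to the embedding vectors instead of the word counts.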
Classifying a bird’s species given its image
Matan Feldman, Or Matalon, Dror Rosentraub, Amit Feldman
The project’s main goal was to classify a bird’s species given its image, out of 200 possible species. Using transfer learning with state-of-the-art CNN architectures, we obtained an accuracy of over 94%. The model was embedded in an application that lets users identify a bird’s species and provides additional information about it.
Predict the last note of a short musical segment
Nissim Hefetz, Yoav Vollansky, Shachar Mauda, Tsofit Zohar
Generating music is not an easy task, but we tried nonetheless. We first aimed at a relatively easy and small sub-problem in order to iterate on it and fail fast. Our problem definition was to train a model to predict the last note of a short musical segment.
After predicting the last note of the melody, our model can be fed its own prediction in a loop in order to predict the next note. We used an LSTM trained on ~2M short classical music segments in MIDI format. When predicting the ending note of a complex classical music segment, the model’s overall accuracy is 0.17 against a baseline of 0.007. When evaluated by ear, the final note doesn’t sound harmonious, though when we try to predict simple melodies, such as children’s songs, it comes very close. The same is true when we iterate over the generated samples to create longer melodies.
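The loop-feeding step can be sketched roughly as follows. This is an illustration under assumptions, not the team's code: notes are represented as MIDI pitch numbers, `generate_melody` and its `window` parameter are hypothetical, and `toy_predictor` is a trivial stand-in for the trained LSTM.

```python
from typing import Callable, Sequence


def generate_melody(
    seed: Sequence[int],
    predict_next: Callable[[Sequence[int]], int],
    n_new_notes: int,
    window: int,
) -> list[int]:
    """Extend `seed` by repeatedly feeding the model its own predictions."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if not seed:
        raise ValueError("seed must contain at least one note")
    melody = list(seed)
    for _ in range(n_new_notes):
        context = melody[-window:]  # the model only sees the most recent notes
        melody.append(predict_next(context))
    return melody


def toy_predictor(context: Sequence[int]) -> int:
    """Hypothetical stand-in for the LSTM: repeat the last melodic interval."""
    if len(context) < 2:
        return context[-1]
    step = context[-1] - context[-2]
    return max(0, min(127, context[-1] + step))  # MIDI pitches span 0 to 127


melody = generate_melody([60, 62], toy_predictor, n_new_notes=3, window=4)
```

Because each prediction becomes part of the next input, early mistakes are carried forward, which is one plausible reason longer generated melodies are harder than a single final note.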
Terrorism Risk Prediction
Or Gindes, Ariel Holin, Tomer Porat and Dor Sklar
Security forces around the world would like to use their existing resources to the fullest, which can be achieved by prioritizing different terror attacks.
We’ve built a terrorism risk prediction model that is used as an evaluation tool: when given intelligence details that can be known beforehand about a possible attack (for example, if the perpetrator has any suicidal motive), the model provides us with the chance of that attack succeeding, thus telling us something about its potential severity.
The model was trained using information regarding over 180,000 terror attacks spanning over 40 years. It also supports a decision breakdown detailing what specific parameters have led to a specific event being considered a great or minor risk.
Keren Halperin, Yaniv Cohen, Inbar Avni, Yehoshua Cohen, Shai Ben David
An ML/DL NLP based personality type classifier, trained to determine a person’s MBTI (Myers-Briggs Type Indicator) personality type based on their textual input.
The classification process utilizes both a multiple binary classification approach and a multi-class approach, with modeling ranging from classic ML models such as Random Forest and Linear SVC to more SOTA methods such as double-layered LSTM and BERT.
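The difference between the two label formulations can be sketched as follows. The four MBTI dichotomies are standard, but the helper names and the ordering of axes and classes here are assumptions made for illustration, not the team's encoding.

```python
from itertools import product

# The four MBTI dichotomies; in the multiple-binary approach each axis
# gets its own binary classifier.
AXES = [("I", "E"), ("N", "S"), ("T", "F"), ("J", "P")]

# All 16 types; in the multi-class approach these form a single label set.
ALL_TYPES = ["".join(letters) for letters in product(*AXES)]


def type_to_binary_labels(mbti_type: str) -> list[int]:
    """Split a type such as 'ENFP' into four 0/1 labels, one per axis."""
    return [axis.index(letter) for axis, letter in zip(AXES, mbti_type.upper())]


def binary_labels_to_type(labels: list[int]) -> str:
    """Combine the outputs of four binary classifiers into one type string."""
    return "".join(axis[label] for axis, label in zip(AXES, labels))


def type_to_class_index(mbti_type: str) -> int:
    """Map a type to a single class index for the multi-class formulation."""
    return ALL_TYPES.index(mbti_type.upper())


labels = type_to_binary_labels("ENFP")
```

With four binary targets, each classifier sees a simpler two-way problem, while the single 16-way target lets one model capture interactions between the axes.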
The Last Dance
Predicting the success of an article before posting it
Jonas Sala, Roni Chauvart, Moriah Cohen Scali, Jacky Lalou, Georges Feledi
We trained a model on a dataset of 40,000 articles from Mashable news in order to predict the popularity of a new article and detect viral articles based on their content. We used a classification model with 3 categories.
In our application, the user submits the text of an article; we then extract NLP features from it and predict its popularity with our model.
Impressive, isn’t it?
These students are now in the last phase of the Data Science program, in which they work on a project with companies in the tech market.
Our next Fellows Data Science program starts in October – come join the <itc> family!