Text classification is a key part of Outbrain’s technology stack: it helps personalize recommendations, reach targeted audiences, segment users, etc. Outbrain’s topical taxonomy currently consists of around 100 classes covering key concepts such as Sports, Investing, and News. Having accurate categories to describe each document is essential to delivering the best quality of recommendations.
The goal is to train a multi-class text classification model on a newly generated dataset of web pages. The project involves building a complete ML pipeline that processes both the text of the web page itself and the metadata around it to accurately predict the category of the document. NLP techniques and standard machine learning methods are the core of the project, which also involves a fair amount of both feature engineering and text cleaning. Another objective of the project is to judge the quality of the new dataset and the impact of a larger dataset on prediction accuracy.
Step by Step Report
This five-week project consists of three main steps:
1. Data Understanding and Preprocessing
- The data consists of the text of a web page from Outbrain’s traffic and its metadata (title, URL, extracted entities, …); the labels are the best corresponding category (or categories).
- Text cleaning is a crucial part of the pipeline, which is why we tested different types and combinations of cleaning methods: classic cleaning, lowercasing, stopword removal, entity tagging, lemmatization/stemming, and also removal of sentences redundant across documents.
- The feature engineering was also an interesting part, involving different types of features:
- Entity-based features
- URL-based features
- LDA topic modelling
- TF-IDF features on title and content
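To illustrate the "classic cleaning" step above, here is a minimal sketch; the stopword list and regex are simplified assumptions for illustration, not the production pipeline:

```python
import re

# Tiny stopword list for illustration; the real pipeline would use a
# fuller list (e.g. from NLTK or spaCy).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def clean_text(text: str) -> str:
    """Classic cleaning: lowercase, strip punctuation, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # keep only alphanumerics
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(clean_text("The Lakers WON the game, 102-99!"))
# → lakers won game 102 99
```

The cleaned string can then be fed directly into a TF-IDF vectorizer or an LDA topic model.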
2. Modelling and Prediction
- Model Choice: The best performing models were (as is usual in NLP) SVM, Logistic Regression, and Naive Bayes, with a need for normalization between the different types of features.
- Cross Validation: 5-fold cross validation on the new dataset to evaluate accuracy and per-class precision.
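The modelling and cross-validation steps can be sketched with scikit-learn as follows; the toy corpus and labels here are invented stand-ins for the actual web-page dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus standing in for the (proprietary) web-page dataset.
docs = (["the team won the football match"] * 10
        + ["stocks fell as markets reacted"] * 10)
labels = ["Sports"] * 10 + ["Investing"] * 10

# TF-IDF features fed into a linear SVM, evaluated with 5-fold CV.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_val_score(model, docs, labels, cv=5)
print(scores.mean())
```

In practice the pipeline would combine the TF-IDF block with the entity, URL, and LDA features (with appropriate normalization) before the classifier.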
3. Result Analysis
- Comparison of the classification reports (109 classes).
- Head-to-head comparison of the results against the model in production.
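A classification report of the kind compared above can be produced with scikit-learn; the labels below are invented for illustration, whereas the real report covers all 109 classes:

```python
from sklearn.metrics import classification_report

# Invented ground truth and predictions over three taxonomy classes.
y_true = ["Sports", "Sports", "Investing", "News", "Investing", "News"]
y_pred = ["Sports", "News", "Investing", "News", "Investing", "Sports"]

# Per-class precision, recall, F1, and support.
report = classification_report(y_true, y_pred)
print(report)
```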
One of the challenging aspects of this project lies in the dataset itself and in the evaluation process. Since it is a new dataset, my job was simultaneously to use it to get the best lift and to evaluate its quality. Under these conditions, it is hard to discern to what extent an improvement in prediction accuracy is attributable to the new dataset or to the new modelling method. Another challenge was data cleaning: since our data comes directly from web pages, its quality varies, for instance depending on the publisher. This can introduce a bias into the classification, which we want to avoid.
Achievements (according to KPIs)
- Creation and training of new models (linear SVM, Random Forest, Logistic Regression, and Naive Bayes) on the new dataset:
→ The best performing model is a linear SVM trained on TF-IDF features, giving a small lift in accuracy in cross validation and a bigger one in the head-to-head comparison.
- Evaluating the quality of the dataset:
→ The new dataset yields better results, and a fair share of the accuracy lift in the head-to-head comparison is attributable to this new data.
→ Even more data should continue to improve the accuracy of the models.
Next Steps
- Try deep learning models as well as embeddings.
- Further feature engineering.
- Leverage the head-to-head comparison results and perhaps try stacking or another ensemble method.
- Put the model into production.