We tune existing models that classify articles, based on their content, as relevant or not relevant to a particular news feed, and we improve their performance. We do this through hyper-parameter tuning, feature selection and dimensionality reduction. We also create a configurable standalone data analysis tool that lets end users examine and explore the data used by these models more closely. The tool generates a report that enables quick iterations of exploratory data analysis. To this end, the report includes several feature selection techniques and data visualization panels for exploring the distributions of individual features and their correlations with each other.
After examining model performance, the task was changed to building a self-configurable data analysis tool that supports feature exploration and the engineering of future features. The tool needed to interact with the existing system infrastructure while taking complex organizational needs into account.
Midway through development we realized that exploratory data analysis would be extremely difficult with the large number of features involved in the current models (~400 features). To solve this issue we developed a feature selection voting mechanism that produces a small subset of important features, which should then be examined more closely through visualizations and interactions.
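The report does not spell out how the voting mechanism combines the individual selectors, so the following is only a minimal sketch of one plausible scheme: each feature selection method nominates its top features, and a feature is kept if enough methods agree on it. The selector names, the feature names and the `min_votes` parameter are hypothetical and serve only as an illustration.

```python
from collections import Counter


def vote_on_features(selections, min_votes):
    """Return (feature, votes) pairs for features picked by at least `min_votes` selectors."""
    votes = Counter()
    for chosen in selections.values():
        # Deduplicate so that each selector votes at most once per feature.
        votes.update(set(chosen))
    kept = [(name, count) for name, count in votes.items() if count >= min_votes]
    # Most-voted features first; ties are broken by name for a stable order.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept


# Hypothetical top-feature lists from three selectors (all names are made up).
selections = {
    "mutual_information": ["f12", "f87", "f203", "f5"],
    "tree_importance": ["f87", "f12", "f340"],
    "l1_regularization": ["f87", "f203", "f19"],
}

shortlist = vote_on_features(selections, min_votes=2)
print(shortlist)
```

Raising `min_votes` shrinks the shortlist, which is one way to keep the number of features shown in the visualization panels manageable.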
The large number of features, combined with working inside an already well-developed data science infrastructure, meant dealing with many features whose engineering and business significance we did not fully understand.
Many of the features have a large number of missing values. Currently the models fill these missing values with the feature's mean, median, minimum or maximum, but we identified alternatives, which we describe in the “Further development” section.
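As an illustration of the four fill strategies named above, here is a minimal standard-library sketch. It assumes missing values are encoded as `None` or NaN and that a feature is a plain list of numbers; the helper names `is_missing` and `fill_missing` are hypothetical and do not come from the existing models.

```python
import math
import statistics

# The four statistics mentioned in the text, keyed by a strategy name.
FILL_STATISTICS = {
    "mean": statistics.mean,
    "median": statistics.median,
    "min": min,
    "max": max,
}


def is_missing(value):
    """Treat None and NaN as missing (an assumption about the encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_missing(values, strategy):
    """Replace missing entries with a statistic computed from the observed ones."""
    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    fill_value = FILL_STATISTICS[strategy](observed)
    return [fill_value if is_missing(v) else v for v in values]


column = [2.0, None, 4.0, 9.0, float("nan")]
for strategy in FILL_STATISTICS:
    print(strategy, fill_missing(column, strategy))
```

All four strategies write one constant into every gap of a feature, so a feature with many missing values ends up with a spike at that constant.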
Achievements (according to KPIs)
- We were able to improve precision and recall of different classification models by tuning their hyper-parameters.
- Through our data exploration we realized that time is an important component in determining relevancy, though the models do not currently take it into account. We elaborate on this in the “Further development” section.
This project can be developed further in several directions:
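The report does not state which search strategy was used for hyper-parameter tuning. As a hedged sketch, an exhaustive grid search scored by F1 (the harmonic mean of precision and recall) could look like the following. The toy scoring rule, its `weight` and `threshold` parameters and the tiny validation set are assumptions made purely for illustration; in practice the score would come from cross-validation on the real models.

```python
from itertools import product


def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = relevant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def grid_search(param_grid, evaluate):
    """Try every combination in `param_grid` and return the best (params, score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy validation data: two hypothetical feature scores per article, plus labels.
rows = [(0.9, 0.2), (0.8, 0.7), (0.3, 0.9), (0.2, 0.1), (0.6, 0.4), (0.1, 0.3)]
labels = [1, 1, 1, 0, 0, 0]


def evaluate(params):
    """Score a toy rule: relevant if the weighted feature sum reaches the threshold."""
    w, threshold = params["weight"], params["threshold"]
    preds = [int(w * a + (1 - w) * b >= threshold) for a, b in rows]
    return f1(*precision_recall(labels, preds))


param_grid = {"weight": [0.25, 0.5, 0.75], "threshold": [0.4, 0.5, 0.6]}
best_params, best_f1 = grid_search(param_grid, evaluate)
print(best_params, round(best_f1, 3))
```

Selecting on F1 is only one option; if precision matters more than recall for a given feed, `evaluate` can return a different combination of the two.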
- Studying the effects of reducing dimensionality, either by using feature selection methods or by using dimensionality reduction methods (PCA, SVD, LDA, etc.), on model performance.
- Studying the effect of reducing dimensionality by removing features with a large number of missing values. Currently, the models fill missing values using statistics describing the features’ distributions (mean, median, min, max), but dropping such features altogether, or filling missing values using other methods (for example, random draws from the features’ distributions), might improve model performance.
- Studying the effect of running different models, using different subsets of features, on the subsets of observations that have values for those features. This would replace filling missing values, as described above.
- Modelling the data as time series. Exploratory data analysis during the internship suggests that the observations may be time-dependent, though time trends are not currently modeled when classifying relevancy. Examining the time trends and including them in future analyses might improve the models’ predictions.
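The two missing-value alternatives in the list above, dropping sparse features and filling gaps with random draws from the observed values, can be sketched as follows. This is a minimal illustration: it assumes missing values are encoded as `None`, and the `max_missing` cutoff, the function names and the toy table are hypothetical.

```python
import random


def missing_fraction(values):
    """Share of entries that are missing (encoded as None in this sketch)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def drop_sparse_features(table, max_missing):
    """Keep only features whose missing fraction does not exceed `max_missing`."""
    return {
        name: values
        for name, values in table.items()
        if missing_fraction(values) <= max_missing
    }


def fill_by_random_draw(values, rng):
    """Fill each gap with a value drawn at random from the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a feature with no observed values")
    return [rng.choice(observed) if v is None else v for v in values]


# Toy feature table: feature name -> column of values (all names are made up).
table = {
    "f1": [1.0, None, 3.0, 4.0],
    "f2": [None, None, None, 7.0],
    "f3": [0.5, 0.1, 0.9, 0.4],
}

rng = random.Random(0)  # fixed seed so the draws are reproducible
kept = drop_sparse_features(table, max_missing=0.5)
imputed = {name: fill_by_random_draw(values, rng) for name, values in kept.items()}
print(sorted(kept), imputed)
```

Unlike a constant fill, random draws keep the shape of each feature's observed distribution, at the cost of adding noise to individual observations.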