Weeks 3 and 4 at Metis Bootcamp were definitely more demanding. This module combined two major topics — web scraping and linear regression. If you’re like me, with no math or stats background, some of the theory and concepts may seem abstract when explained only with equations. Luckily, I found a channel on YouTube — StatQuest — that explains the concepts with graphs and made them so much easier to understand. I hope it’s helpful to you as well!
As with the last module, there’s a project due by the end of the second week. The second project at Metis is to scrape data from a website and build linear regression models that address a useful prediction and/or interpretation problem in any domain of interest, such as movies or sports.
Disclaimer: I am new to machine learning and also to blogging. So, if there are any mistakes, please do let me know. All feedback is appreciated.
As a movie lover, I have always enjoyed watching and discussing movies with friends. When it comes to professional movie critiquing, the late Roger Ebert was one of the most prolific and best-known movie critics of all time. For this linear regression project, I wanted to analyze what affects Roger Ebert’s ratings and see whether we could predict how he would rate a movie if he were alive today.
The primary dataset was scraped from the film critic’s website with Python’s BeautifulSoup and Selenium libraries. Once the data was collected and cleaned, I realized there weren’t enough features to build a robust model; I also wanted features such as user ratings and box office information. On Kaggle, I found a dataset containing information from two other large movie rating sites — MovieLens and IMDb. Using the IMDb IDs from this dataset, I was able to do a second scrape of IMDb for the additional features. I decided not to scrape Rotten Tomatoes ratings or Metacritic scores, since those averages already incorporate the critics’ ratings. My original dataset contained 7847 data points and 6 features; after merging and cleaning, I had 2191 data points and 11 features.
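The parsing side of the scrape looks roughly like this. The HTML snippet and the class names (`review`, `title`, `star-rating`, `year`) below are hypothetical stand-ins for the structure of a review listing page, not the site’s actual markup — this is just a sketch of how BeautifulSoup pulls one movie’s fields out of a page:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for one entry on a review listing page.
html = """
<div class="review">
  <a class="title" href="/reviews/life-of-pi-2012">Life of Pi</a>
  <span class="star-rating" title="4.0"></span>
  <span class="year">2012</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
review = soup.find("div", class_="review")

# Extract one row of the dataset: title, star rating, and release year.
movie = {
    "title": review.find("a", class_="title").get_text(strip=True),
    "rating": float(review.find("span", class_="star-rating")["title"]),
    "year": int(review.find("span", class_="year").get_text(strip=True)),
}
print(movie)  # {'title': 'Life of Pi', 'rating': 4.0, 'year': 2012}
```

In the real scrape, Selenium handles the dynamically loaded pages and hands the rendered HTML to BeautifulSoup for parsing like this.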
- BeautifulSoup and Selenium for web scraping
- pandas and NumPy for data manipulation
- pickle for data storage
- Matplotlib and seaborn for plotting
- scikit-learn and statsmodels for modeling and testing
Each data point is an individual movie. The target variable is the Ebert rating (on a scale from 0.0 to 4.0). Of the 11 features, 3 are categorical — genre, sub-genre, and MPAA rating (converted into dummy variables during feature engineering). The numerical features are year (of release), runtime (in minutes), MovieLens rating (on a scale from 0.0 to 5.0), IMDb rating (on a scale from 0.0 to 10.0), budget, domestic gross, opening week gross, and worldwide gross.
Looking at Ebert’s rating distribution, he gave almost half of the movies a 3 to 3.5 rating, and he rarely gave very low ratings like 0 to 0.5 stars, which might hurt the model’s predictions at the low end of the scale.
Using a pairplot and a heat map, I found that the MovieLens and IMDb ratings have the highest correlations with the target variable. The other numerical features do not seem to have an apparent linear relationship with the target, which indicates some feature engineering might be needed.
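The workflow behind that check can be sketched as follows. The numbers here are synthetic stand-ins for the real dataset — one feature built to track the target and one that is pure noise — just to show how a correlation matrix surfaces the useful predictors:

```python
import numpy as np
import pandas as pd

# Toy data: imdb_rating drives the target, runtime is unrelated noise.
rng = np.random.default_rng(0)
n = 200
imdb = rng.uniform(3, 9, n)
df = pd.DataFrame({
    "imdb_rating": imdb,
    "runtime": rng.uniform(80, 180, n),
    "ebert_rating": (imdb / 2.5 + rng.normal(0, 0.4, n)).clip(0, 4),
})

# Correlation of every feature with the target.
corr = df.corr()["ebert_rating"].drop("ebert_rating")
print(corr.sort_values(ascending=False))

# The visual version used in the project:
# import seaborn as sns
# sns.heatmap(df.corr(), annot=True)
```

On the real data, the MovieLens and IMDb columns stand out the same way the engineered `imdb_rating` column does here.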
- Mapping the genre and sub-genre columns into fewer categories (to eliminate outliers)
- Converting categorical features to dummy variables
- Creating new features through feature interaction (e.g., opening week gross proportion, calculated as opening week gross divided by cumulative worldwide gross)
- Power-transforming some numerical features to reduce the influence of outliers
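The steps above can be sketched in pandas. The rows and dollar figures are made up, and I use a simple log transform to stand in for the power transform:

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for the cleaned movie dataset ($M figures).
df = pd.DataFrame({
    "genre": ["Drama", "Comedy", "Drama"],
    "opening_week_gross": [22.5, 5.1, 1.2],
    "worldwide_gross": [609.0, 115.4, 4.0],
    "budget": [120.0, 17.0, 1.5],
})

# 1. Categorical -> dummy variables.
df = pd.get_dummies(df, columns=["genre"], drop_first=True)

# 2. Feature interaction: opening week's share of the worldwide gross.
df["opening_proportion"] = df["opening_week_gross"] / df["worldwide_gross"]

# 3. Power transform: log-scale the heavily skewed money columns.
for col in ["opening_week_gross", "worldwide_gross", "budget"]:
    df[col] = np.log1p(df[col])

print(df.columns.tolist())
```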
My baseline model has an R-squared of 0.382, using the 9 features with p-values of less than 0.05. I built another 3 models — polynomial, Ridge, and Lasso — and then used 5-fold cross-validation to evaluate which model performed best. However, the results showed no major difference between the models: R-squared values were similarly low across models and between the train and validation sets, which suggests the models were underfitting. To improve, I increased complexity by adding more features and doing more feature engineering.
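The model comparison follows a standard scikit-learn pattern. The data below is synthetic, so the scores themselves are meaningless — the point is the side-by-side 5-fold comparison:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the 9-feature movie dataset.
X, y = make_regression(n_samples=500, n_features=9, noise=25.0, random_state=0)

models = {
    "linear": LinearRegression(),
    "poly": make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                          LinearRegression()),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1, max_iter=10_000),
}

# Compare mean R^2 across the same 5 folds for every model.
results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name:>6}: mean R^2 = {results[name]:.3f}")
```

Keeping the folds identical across models is what makes the comparison fair.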
Besides the previous 4 models, I also tried Ridge and Lasso regularization on the polynomial features. Since the polynomial model has a higher training score than validation score, it is overfitting. So I tuned the regularization strength of Ridge and Lasso, hoping to find the sweet spot on the bias-variance trade-off curve.
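Tuning the regularization strength amounts to sweeping alpha and keeping the best cross-validation score. Again the data is synthetic, so the chosen alpha here is illustrative only:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data.
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=1)

# Sweep the regularization strength on polynomial features.
best_alpha, best_score = None, float("-inf")
for alpha in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = make_pipeline(
        PolynomialFeatures(degree=2, include_bias=False),
        StandardScaler(),  # Lasso penalizes coefficients, so scale first
        Lasso(alpha=alpha, max_iter=10_000),
    )
    score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"alpha={alpha}: mean R^2 = {score:.3f}")
    if score > best_score:
        best_alpha, best_score = alpha, score
```

Too small an alpha leaves the overfitting in place; too large an alpha shrinks everything toward zero and underfits — the sweep locates the middle ground.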
Model Evaluation and Selection
After the iterative process of model refinement, tuning, and selection on the validation set, I finally had a winner — Lasso. It performed slightly better than the other models in cross-validation.
After retraining the Lasso model, I obtained an R-squared of 0.396 and a mean absolute error of 0.55, both slightly better than the baseline. In layman’s terms, the predictions are off by about half a star on average.
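The final-evaluation pattern is to refit the chosen model on the training split and score it on held-out data. Synthetic data again stands in for the movie features, so the numbers won’t match the 0.396 / 0.55 reported above:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 9-feature movie dataset.
X, y = make_regression(n_samples=500, n_features=9, noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Refit the selected Lasso on the training split only.
model = Lasso(alpha=0.1, max_iter=10_000).fit(X_train, y_train)
pred = model.predict(X_test)

# Report both R^2 and MAE on the held-out set.
r2 = r2_score(y_test, pred)
mae = mean_absolute_error(y_test, pred)
print(f"R^2 = {r2:.3f}, MAE = {mae:.2f}")
```

MAE is the number that translates directly into "off by about half a star," which is why it is worth reporting alongside R-squared.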
The prediction plot also shows that some of the lower-rating predictions are doing slightly better. Given that the data doesn’t have many low-rating examples, it is a challenge for the model to predict the lower ratings accurately.
Here are some examples of how the model did:
Notice that the predictions for Life of Pi and Pitch Perfect are within 0.5 stars of the actual ratings, but A Nightmare on Elm Street, which has a lower rating, gets a weaker prediction.
And here, just for fun, I used the model to predict Ebert’s rating on some of my recent favorites…
Overall, my prediction model is not the best, but I gained some insights from this project. One is that linear regression may not be the best prediction model for this dataset. Also, R-squared values are typically lower than 50% when predicting human behavior, since humans are simply harder to predict than physical processes (see the article “How To Interpret R-squared in Regression Analysis” by Jim Frost).
If I had more time, I’d try a non-linear prediction model, such as tree-based models (Decision Tree, Random Forest, etc.). Secondly, I would get more data points by filling in missing values in the original dataset and scraping more features. Lastly, I would build a Flask app to deploy the prediction model.
Don’t dwell on R-squared values too much. Although R-squared serves as a good validation metric, it has limitations (here’s a good article explaining them). Always look at other metrics, such as mean absolute error and mean squared error, to get a better understanding of the model’s fit.
Overall, I have learned a great deal in the past two weeks. It was definitely not easy. But it was rewarding getting through this module. I hope this project is interesting and insightful to you. Thanks for reading :)
You can find my project code here.