ML Classification Model to Predict Kickstarter Campaign Success
“Predicting the future isn’t magic, it’s artificial intelligence.” — Dave Waters
Everyone loves the idea of being clairvoyant, yet not everyone knows we can tap into that hidden ability with Machine Learning ;)
We learned about supervised learning through regression a few weeks ago. In weeks 7 and 8, we're learning the other category of supervised learning: Classification. And it is a big one! There are not just a lot more algorithms to learn but also the metrics that evaluate model performance (the confusion matrix, for one, is still confusing). Here's a great article to read if you're just getting started with classification. Some YouTube channels like StatQuest are also pretty helpful if you're a visual learner like me.
For the classification project, we were asked to use any appropriate data source and build classification models that address a useful prediction and/or interpretation problem in a domain of interest.
So, without further ado, I’ll dive into it…
Disclaimer: I am new to machine learning and also to blogging. So, if there are any mistakes, please do let me know. All feedback appreciated.
Kickstarter is a powerful crowdfunding platform that helps bring creative projects to life. To date, it has helped launch over 200k projects and raised a total of more than 5 billion dollars. However, the challenge of using Kickstarter is its all-or-nothing funding model: no one is charged for a pledge unless the funding goal is reached, so creators who have invested time and money into a campaign can walk away with nothing. On the other hand, backers wouldn't want to miss out on the limited rewards of a campaign that does succeed.
So, the goal of this project is to build a classification model to predict the success of a Kickstarter campaign.
Data
The datasets used in this project came from the Web Robots website, which compiles web-scraped Kickstarter data monthly. I used the datasets from January through April 2021, which contain 870,114 data points and 38 columns. Since the dataset was large, I used SQLite and SQLAlchemy for data storage and access.
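Below is a minimal sketch of that storage step, assuming the monthly Web Robots CSVs have already been downloaded locally; the file paths and table name here are hypothetical.

```python
import glob

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///kickstarter.db")

# Append each monthly scrape (Jan-Apr 2021) into a single SQLite table.
for path in sorted(glob.glob("data/Kickstarter_2021-0[1-4]*.csv")):
    monthly = pd.read_csv(path)
    monthly.to_sql("campaigns", engine, if_exists="append", index=False)

# Later, pull the stored rows back out for cleaning and feature engineering.
raw = pd.read_sql("SELECT * FROM campaigns", engine)
```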
Target: Success Outcome (whether the campaign is successful or not)
- 59% Success, 41% Failed (Not too imbalanced)
Features:
- Number of backers and pledged amount (not used, since these values are only known in the future, after a campaign has run)
- Campaign goal in USD
- Campaign duration (from Launched date to Deadline)
- Preparation duration (from Created date to Launched date)
- Location of the campaign (US-based or not)
- Length of the campaign description
- Category of the campaign (dummified; see the feature-engineering sketch after this list)
- Getting featured on Kickstarter (also a future value, but it was kept so backers can use it to check on a live campaign and creators can see what their outcome would be with or without it)
Final dataset: 189,162 data points, 20 features
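As a rough illustration, here is how those features could be derived with Pandas. The column names follow the Web Robots scrape but should be treated as assumptions; in particular, `parent_category` stands in for a category field that has to be extracted from the scrape's nested JSON.

```python
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)

    # Target: 1 for successful campaigns, 0 for failed ones (other states dropped upstream).
    out["success"] = (df["state"] == "successful").astype(int)

    # Campaign goal converted to USD.
    out["goal_usd"] = df["goal"] * df["static_usd_rate"]

    # Durations in days, computed from the Unix timestamps in the scrape.
    out["campaign_days"] = (df["deadline"] - df["launched_at"]) / 86400
    out["prep_days"] = (df["launched_at"] - df["created_at"]) / 86400

    # US-based flag, description length, and the "featured" (staff pick) flag.
    out["is_us"] = (df["country"] == "US").astype(int)
    out["blurb_len"] = df["blurb"].fillna("").str.len()
    out["featured"] = df["staff_pick"].astype(int)

    # One-hot encode the category; dropping the first level leaves Art as the reference.
    out = out.join(pd.get_dummies(df["parent_category"], prefix="cat", drop_first=True))
    return out

features = build_features(raw)
```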
Tools and Algorithms Used
Tools
- SQLite and SQLAlchemy for data storage and access
- Pandas and NumPy for data manipulation
- Tableau, Matplotlib, and Seaborn for plotting graphs
- Scikit-learn for modeling and testing
- Flask for building the demo application
Classification Algorithms
- KNN
- Logistic Regression
- Decision Tree, Random Forest
- Naive Bayes — Gaussian, Bernoulli
- XGBoost
Metric Selection
- ROC AUC curve — for model comparison
- F1 score — Since creators wouldn't want the model to predict too many successes that turn out to be failures (minimize False Positives), and backers would want the model to capture as many successes as possible (minimize False Negatives), I wanted a balance between precision and recall
- Confusion matrix — shows the actual breakdown of predictions (a short sketch of computing these metrics follows this list)
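For reference, these metrics can be computed with scikit-learn roughly like this; `model`, `X_test`, and `y_test` are assumed to come from an earlier train/test split.

```python
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # probability of the "success" class

print("F1 score:", f1_score(y_test, y_pred))
print("ROC AUC: ", roc_auc_score(y_test, y_proba))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
```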
Model Evaluation and Selection
For my baseline model, I used logistic regression with regularization on a small subset of features and got an F1 score of 0.75 and AUC of 0.65.
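Something along these lines was the baseline; the exact feature subset and regularization strength here are assumptions, and `features` refers to the engineered table sketched earlier.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

baseline_cols = ["goal_usd", "campaign_days", "is_us"]  # hypothetical small subset
X_train, X_test, y_train, y_test = train_test_split(
    features[baseline_cols], features["success"], test_size=0.2, random_state=42
)

# L2-regularized logistic regression with scaled inputs.
baseline = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))
baseline.fit(X_train, y_train)
```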
After more feature engineering (adding the preparation duration) and using the full set of features, the logistic regression model performed slightly better, with an F1 score of 0.77 and an AUC of 0.77.
Then, I tried various classification algorithms: KNN, Decision Tree, Random Forest, Naive Bayes (Gaussian and Bernoulli), and XGBoost. The top three models were XGBoost (AUC = 0.82), Random Forest (AUC = 0.79), and Logistic Regression (AUC = 0.77).
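The comparison can be set up as a simple loop over candidate models; the hyperparameters shown are illustrative defaults rather than the tuned values, and here `X_train`/`X_test` would hold the full feature set instead of the baseline subset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

candidates = {
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gaussian NB": GaussianNB(),
    "Bernoulli NB": BernoulliNB(),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}

for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"{name:15s} AUC = {auc:.2f}")
```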
With the top three models, I used GridSearchCV to tune the hyperparameters and find the best-performing one.
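For XGBoost, the tuning step might look like the following; the grid below is a small illustrative one, not the exact grid used in the project.

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    "n_estimators": [200, 400],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    XGBClassifier(eval_metric="logloss", random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)

# Keep the refit best estimator for evaluation and interpretation.
xgb_model = search.best_estimator_
```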
The best-performing model was XGBoost, with an F1 of 0.80 and an AUC of 0.82. Compared to the baseline model (F1 = 0.75, AUC = 0.65), both metrics improved, the AUC quite substantially.
The confusion matrix also shows that the model captures most of the True Positives and True Negatives. In layman's terms, the model correctly identified far more actual outcomes than it missed.
Model Interpretation
To interpret the XGBoost model, I used SHAP (SHapley Additive exPlanations) values to examine which features matter most for predicting success.
The color bar shows the magnitude of each feature's value (red is high, blue is low), and the horizontal position shows the feature's impact on the predicted outcome (right: positive, left: negative). A dense cluster of points indicates a larger number of campaigns. For example, setting a smaller goal (in blue) has a positive impact for many campaigns, whereas setting a larger goal (in red) has a negative impact for a smaller proportion of campaigns. A shorter preparation time tends to hurt a campaign. Getting featured boosts the campaigns that have it, while not getting featured doesn't hurt too much. Stretching a campaign out too long can also have a negative impact. Lastly, the categories are compared against Art (the reference level of the dummy variables), so Comics and Games do better than Art while Food and Crafts do worse.
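The summary plot described above can be generated with the shap library roughly as follows, assuming `xgb_model` is the tuned XGBoost classifier and `X_test` the held-out feature matrix.

```python
import shap

# TreeExplainer works efficiently for tree ensembles like XGBoost.
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(X_test)

# Beeswarm-style summary: one dot per campaign per feature, colored by feature value.
shap.summary_plot(shap_values, X_test)
```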
Application Usage
Use case 1
To demonstrate this model, I built a Flask app for creators and backers to check on their campaigns.
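A stripped-down version of the app's prediction endpoint could look like this; the field names, model file, and JSON interface are hypothetical (the real app also serves an HTML form), and the input columns must match the training features.

```python
import pickle

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the tuned classifier saved after training (file name assumed).
with open("xgb_kickstarter.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # e.g. {"goal_usd": 5000, "campaign_days": 30, "prep_days": 14, ...}
    payload = request.get_json()
    X = pd.DataFrame([payload])  # single-row frame in the training column order
    proba = float(model.predict_proba(X)[0, 1])
    return jsonify({"success_probability": round(proba, 3)})

if __name__ == "__main__":
    app.run(debug=True)
```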
Use case 2
In the case where the predicted outcome is bleak, I can provide the creators with suggestions based on the SHAP feature importances.
Future Work
Going forward, I would like to build a more robust model with more data and feature engineering. With a stronger model, I'd be able to deploy the Flask app on Heroku with more functionality and better visuals. Lastly, I would look into how quickly campaigns are fulfilled.
Reference
An In-Depth Guide to Supervised Machine Learning Classification
Takeaways
It's always helpful to scope out the project and understand the end goal before building models. It gives you clarity on what metrics to use, what algorithms to try, and what MVP (minimum viable product) you'd like to see.
XGBoost may seem like the best model in most cases. However, it does require some hyperparameter tuning to reach its best performance. If you're on a time crunch, Random Forest may be a better option.
The past two weeks were very intense yet flew by so fast. We just passed the mid-way mark of the Bootcamp and some of us are definitely feeling that mid-point fatigue/burnout. I think it’s always good to remind ourselves that, this is a marathon (even an ultra marathon, some might say) and not a sprint.
If you're feeling mental fatigue, take a break and try these strategies, which I found pretty useful! If I've learned anything from the pandemic, it's that taking good care of your body and mind is more important than anything else.
Thanks for reading :)
You can find my project work on my GitHub repo