Classifying Subreddits Using Natural Language Processing
Overview
Are you looking for ways to improve your diet or thinking about trying the hot ketogenic diet millennials are obsessing over? r/Nutrition and r/Keto are two popular subreddits, with over 1.5 million members each. r/Nutrition focuses on nutrition science, macro/micro nutrients, health supplements, and overall diets, whereas r/Keto is specific to the ketogenic diet — thoughts, experiences, and keto lifestyle advice.
Using Pushshift’s API and Natural Language Processing (NLP), I gathered subreddit data to build different classification models and determine which model would best classify the origin subreddit of a submission’s text.
TL;DR: Logistic Regression with the TF-IDF Vectorizer, English and custom stop words, max features of 1,000, and an ngram range of 1–2 words were the parameters that produced the best performing model. This model was able to classify the subreddit of Submission posts with 91% accuracy, exceeding the baseline accuracy score of 55%.
Context & Challenge
Background & Description
This project was assigned to our cohort as part of General Assembly’s Data Science Immersive course. We had approximately two weeks to work on the project, whilst simultaneously learning the material for the project.
Problem — Why?
The project was designed to reinforce our learnings of APIs, NLP, and classification modeling throughout the course.
Goals & Objectives
The deliverables for the project consisted of a Python notebook of clean code that included our data cleaning, exploratory data analysis, model testing, and model interpretation based on classification metrics, as well as a presentation for a semi-technical audience that encompassed our findings, recommendations, and future steps to move the project forward.
Project Process, Insights, & Solution
Problem Statement
→ How can we determine the subreddit of a new post based on the text of its Submission, using classification modeling?
The metrics observed to measure success were accuracy, F1-score, precision, and recall. Confusion matrices were also interpreted to determine the best model.
I decided to analyze the text of a subreddit Submission because I figured that would be where the majority of a post’s context would live. In choosing the evaluation metrics, I was mainly focused on the positive values the models would correctly predict.
Data Collection & Cleaning
I collected my data by utilizing the Pushshift API: I created a loop that pulled 100 rows of a subreddit’s Submission posts (subreddit, submission title, and submission text) 80 times (100 rows being the maximum limit for a single API request), as long as the API returned a successful status code (200, aka OK). This resulted in approximately 8,000+ rows from each of the r/Keto and r/Nutrition subreddits. From here, I started the data cleaning process.
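A minimal sketch of that collection loop, using only the standard library, might look like the following. The endpoint and field names reflect Pushshift’s public submission-search API; the function name and the paging bookkeeping (walking backwards in time via `before`) are my own assumptions, not the project’s exact code.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.pushshift.io/reddit/search/submission"

def fetch_submissions(subreddit, n_batches=80, size=100):
    """Pull `size` submissions per request, paging backwards in time."""
    rows, before = [], None
    for _ in range(n_batches):
        params = {"subreddit": subreddit, "size": size}
        if before is not None:
            params["before"] = before  # only fetch posts older than the last batch
        url = BASE_URL + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            if resp.status != 200:  # stop unless the API returns 200 aka OK
                break
            batch = json.load(resp)["data"]
        if not batch:
            break
        rows.extend({"subreddit": post.get("subreddit"),
                     "title": post.get("title"),
                     "selftext": post.get("selftext")} for post in batch)
        before = min(post["created_utc"] for post in batch)  # page to older posts
    return rows

# rows = fetch_submissions("keto") + fetch_submissions("nutrition")
```

The last call is commented out because it hits the live API; 80 batches of 100 rows per subreddit gives the ~8,000 rows per subreddit described above.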
I first inspected the data and filtered out any null values. Upon inspection, I also noticed instances of duplicate subreddit posts, perhaps the result of users reposting the same content. Regardless, I filtered out any duplicate values, as well as any row whose text was ‘[removed]’ or ‘[deleted]’, as those posts were likely taken down. I also noticed, particularly within r/Nutrition, that some posts had been removed for violating the subreddit’s community guidelines, such as dietary activism or providing medical advice. This made sense, as many people within the subreddit post about their health and wellbeing.
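Those filtering steps can be sketched with pandas on a hypothetical mini-frame (the column names mirror the pulled fields; the example rows are invented for illustration):

```python
import pandas as pd

# Stand-in for the pulled submissions: one null, one removed, one duplicate.
df = pd.DataFrame({
    "subreddit": ["keto", "keto", "nutrition", "nutrition", "keto"],
    "selftext": ["Started keto last week", None, "[removed]",
                 "Protein intake question", "Started keto last week"],
})

clean = (
    df.dropna(subset=["selftext"])                         # drop null posts
      .loc[lambda d: ~d["selftext"].isin(["[removed]", "[deleted]"])]
      .drop_duplicates(subset=["selftext"])                # drop reposts
      .reset_index(drop=True)
)
print(len(clean))  # 2 rows survive
```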
After cleaning the data, since my project focused on classifying subreddits based on a submission’s selftext, I lemmatized that feature to reduce each word to its root form. My thought process was to ensure the models wouldn’t count variants of the same word, such as ‘eat’, ‘eating’, and ‘eats’, as separate features.
Exploratory Data Analysis
I began by visualizing the top 10 most common words of each subreddit using a Count Vectorizer and the default English stop words parameter. Initially, I noticed that words such as “like” and “just”, and parts of contractions (“don” and “ve”), appeared among the most common words. Since these words aren’t meaningful, I added custom words to the default stop words list. This made the most common words “cleaner”.
The top 3 common words in r/Keto were “keto”, “eat”, and “weight”. The other common words were associated with people’s keto experience, particularly ‘feel’, which I thought was interesting. The top 3 common words in r/Nutrition were “eat”, “food”, and “protein”. It was also interesting to see “make” in the top common words. Perhaps people are looking for healthy homemade recipes. When looking at the most common words for each subreddit side by side, “diet” and “nutrition” appear within each subreddit’s top words. Additionally, the frequencies of r/Keto’s most common words are significantly higher than those in r/Nutrition, likely because r/Keto focuses on and specializes in the ketogenic diet and lifestyle, whereas r/Nutrition encompasses nutrition and diets as a whole.
Classification Models Tested
I tested several models: Multinomial Naive Bayes with a Count Vectorizer and a TF-IDF Vectorizer, Logistic Regression with a Count Vectorizer and a TF-IDF Vectorizer, and a Random Forest Classifier with a Count Vectorizer. Multinomial Naive Bayes was a mandatory model to test for the project, but it is also a natural fit for text data, since word counts are discrete features and the model generates a probability for each counted word. As for the other models, I chose to test Logistic Regression for its overall simplicity and because it extends to multiclass classification as well, even though this project has a binary target variable. I chose the Random Forest Classifier because it reduces the variance of decision trees by ‘de-correlating’ them: each tree considers only a random subset of features at each split, whereas standard decision trees consider every feature every time, which causes high correlation among the trees.
The baseline accuracy score the models had to beat was 55%. I created a pipeline to vectorize each term within the ‘selftext’ feature and GridSearched over several parameters. (I find utilizing Pipelines and GridSearchCV to be extremely efficient; it cuts out a lot of redundant steps.) Each model utilized English and custom stop words (‘ve’, ‘like’, ‘just’, ‘day’, ‘don’, ‘know’, ‘really’, ‘does’, ‘https’) and ngram ranges of 1, 1–2, and 1–3 words, and was GridSearched over different max_features values plus parameters specific to the model (‘alpha’ for Multinomial Naive Bayes, ‘C’ for Logistic Regression, and ‘max_depth’ and ‘max_features’ for the Random Forest Classifier). Note that max_features means something different for the Random Forest Classifier than for the vectorizers: for the classifier, it is the number of features to consider when looking for the best split, whereas for the Count Vectorizer it keeps only the top X terms by frequency.
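A minimal version of that pipeline-plus-grid-search setup looks like the following. The step names, the six-document toy corpus, and the small grid are placeholders for the real search space described above, not the project’s exact configuration:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("tvec", TfidfVectorizer(stop_words="english")),
    ("logreg", LogisticRegression(max_iter=1000)),
])

params = {
    "tvec__max_features": [500, 1000],
    "tvec__ngram_range": [(1, 1), (1, 2)],
    "logreg__C": [0.1, 1.0],
}

# Toy stand-in corpus; in practice X is the cleaned 'selftext' column.
X = ["keto carbs weight loss", "keto diet fat", "protein vitamins nutrition",
     "nutrition food protein", "keto weight carbs", "food protein diet nutrition"]
y = ["keto", "keto", "nutrition", "nutrition", "keto", "nutrition"]

gs = GridSearchCV(pipe, params, cv=2)  # cv=2 keeps the toy fast
gs.fit(X, y)
print(gs.best_params_)
```

The `tvec__`/`logreg__` prefixes are how GridSearchCV routes each parameter to the matching pipeline step.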
The Results & Recommendations
As stated previously, the goal of the project was to perform data cleaning and exploratory data analysis, test different classification models, interpret the models based on classification metrics, and formulate recommendations and next steps.
Conclusion
The best performing model was Logistic Regression with the TF-IDF Vectorizer. This model can classify the subreddit of Submission posts with 91% accuracy, exceeding our baseline accuracy score of 55%. The parameters that performed best for this model included English stop words plus additional custom stop words (e.g. “‘ve”, “really”, “just”), max features of 1,000, and an ngram range of 1–2 words. Compared to the other models tested, this model had the highest True Negatives (868) and the lowest False Positives (92). That low False Positive count led to the highest precision score among the models tested. This model also had the highest F1-score.
The top 5 words within a selftext that best distinguished r/Nutrition Submission posts included ‘eat’, ‘food’, ‘protein’, ‘fat’, and ‘calorie’. The top 5 words within a selftext that best distinguished r/Keto Submission posts included ‘keto’, ‘eat’, ‘weight’, ‘carb’, and ‘start’.
When comparing each model’s accuracy score against one another, the Multinomial Naive Bayes models scored the lowest, followed by the Random Forest Classifier, and finally, the Logistic Regression models performed the best. Additionally, comparing each model’s confusion matrix, both the Naive Bayes Count Vectorizer and TF-IDF Vectorizer models predicted the lowest number of True Positives and almost twice as many False Negatives, resulting in the lowest precision, recall, and F1-scores of the models tested.
In terms of who could benefit from this project, anyone who wants to effectively break down text to analyze it, interpret it, and formulate business recommendations would benefit from utilizing NLP.
Recommendations
- Lemmatize words prior to analysis so that the model doesn’t count instances of the same root word (‘eat’, ‘eats’, ‘eating’) separately. Lemmatizing also helps the model find patterns across different forms of a word.
- Look at different ngram ranges to see which patterns of words occur most frequently, so that the model can train on those patterns.
Future Project Refinements
There are several different routes I would like to take to further enhance my project and model analyses and generate different or potentially better results:
- Analyze the other subreddit features — post titles or comments, or a combination of the features
- Build and evaluate additional models, such as K-Nearest Neighbors or Decision Trees, with the potential to utilize model boosting (AdaBoostClassifier, GradientBoostingClassifier)
- Further refine the custom stop words list by adding “eat” and “diet” and the subreddit topics’ words, “keto” and “nutrition”
- Try several models without lemmatization of the text