Discussion 7
Yelp data sentiment analysis
Project Goals
The two goals for the project are:
- Provide useful analytics to business owners on Yelp from the datasets provided and, based on these insights, propose data-driven, actionable decisions to help those owners improve their ratings on Yelp.
- Build a web dashboard/widget/web application that visualizes your analysis and makes it easier for business owners to understand.
Datasets
We included four JSON files for you. JSON is the de-facto standard for transmitting data on the Web, and this module will help you learn how to manipulate JSON files. The four JSON files are:
- review.json: contains the reviews. There are 942,027 review entries.
- business.json: contains information about businesses. There are 36,327 business entries.
- user.json: contains information about users. There are 272,024 user entries.
- tip.json: contains tips written by users on businesses. There are 129,571 tips generated by users.
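Each of these files is in JSON Lines format, i.e. one JSON object per line. As a minimal sketch of what manipulating JSON looks like with Python's built-in json module (before we switch to pandas below), you can parse a single record like this:
import json

# Parse the first record of review.json into a Python dictionary
with open('review.json') as f:
    first_review = json.loads(f.readline())

print(first_review.keys())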
In this discussion, we will load the JSON files into Python, combine review.json with business.json, and use the nltk module to parse the reviews.
For illustration purposes, I will sample the first 100,000 lines of reviews in the shell:
head -n 100000 review_city.json > sample_review.json
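If you would rather do this sampling in Python than in the shell, a roughly equivalent sketch using itertools.islice is:
from itertools import islice

# Copy the first 100,000 lines of review_city.json into sample_review.json
with open('review_city.json') as src, open('sample_review.json', 'w') as dst:
    dst.writelines(islice(src, 100000))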
All of today's work is done in Python. If you would rather use R and the tidyverse, feel free to check out this book: Text Mining with R.
Then, we open a Jupyter Notebook and load sample_review.json and business.json into pandas dataframes using the pd.read_json command.
import pandas as pd
# lines=True tells pandas that each line of the file is a separate JSON record
review = pd.read_json('sample_review.json', lines=True)
review.head()
business = pd.read_json('business_city.json', lines=True)
business.head()
Then, similar to a SQL join, we can join tables in pandas using the pd.merge function. For more on merging pandas dataframes, check out here.
df_raw = pd.merge(review, business, how='left', on='business_id')
df_raw.head()
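Because this is a left join, a review whose business_id does not appear in business.json keeps its review columns but gets missing values in the business columns. A quick sanity check (not part of the original walkthrough) is:
# Count reviews whose business information did not match in the left join
df_raw['name'].isna().sum()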
Then, we use the DataFrame.shape attribute to see how many rows and columns there are:
df_raw.shape
# Output: (100000, 22)
Let’s look at the column names:
df_raw.columns
# Output: Index(['review_id', 'user_id', 'business_id', 'stars_x', 'useful', 'funny',
# 'cool', 'text', 'date', 'name', 'address', 'city', 'state',
# 'postal_code', 'latitude', 'longitude', 'stars_y', 'review_count',
# 'is_open', 'attributes', 'categories', 'hours'],
# dtype='object')
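Note the two star columns: pandas appends the default suffixes _x and _y when both tables contain a stars column, so stars_x is the rating given in the individual review (from the left frame, review) and stars_y is the business's overall star rating (from the right frame, business). If you find the suffixes confusing, one purely illustrative option (the renamed columns are not used below) is:
# Rename the merge-suffixed columns to something more descriptive
df_renamed = df_raw.rename(columns={'stars_x': 'review_stars', 'stars_y': 'business_stars'})
df_renamed[['review_stars', 'business_stars']].head()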
We can drop the columns review_id, user_id, funny, cool, date, latitude, longitude, and is_open (along with business_id, review_count, and hours) because they are of little help for our analysis:
df = df_raw.drop(columns=['business_id', 'review_id', 'user_id', 'funny', 'cool', 'date', 'latitude', 'longitude', 'is_open', 'review_count', 'hours'])
Let's now check how many records among the first 100,000 reviews come from Madison:
sum(df.city == 'Madison')
# Output: 11792
Suppose we want to study the restaurants in Madison; we can subset the dataset like this:
Madison = df[df.city == 'Madison']
Madison = Madison.drop(columns=['city', 'state'])
Re-index the Madison dataframe:
Madison.reset_index(drop=True, inplace=True)
nltk (Natural Language Toolkit)
Next, we turn our focus to the review text itself. We will use the nltk package for two reasons:
- Popularity: NLTK is one of the leading platforms for dealing with language data.
- Simplicity: Provides easy-to-use APIs for a wide variety of text preprocessing methods.
First, we assign the text column to a new variable and convert it to a list. We then convert each review to lower case using the str.lower() method:
text = list(Madison.text)
text = [a.lower() for a in text]  # convert each review to lower case
Then, we remove punctuation. We keep only the characters that are not in string.punctuation:
import string
print(string.punctuation)
# !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
We remove the punctuation using a list comprehension:
text = ["".join([char for char in a if char not in string.punctuation]) for a in text]
text[0]
# 'easily my favorite place to eat in madison great laothai food curries are fantastic and make sure to start with a soup theyre always good tip place is tiny and you will wait either get carry out or head across the street to the weary traveler to grab a beer while you wait for your table'
After that, each review can be tokenized into a list of tokens via nltk.word_tokenize:
from nltk import word_tokenize
words = [word_tokenize(a) for a in text]
If nltk throws an error here, you may need to run the following command first:
import nltk
nltk.download('punkt')
Now, words contains the list of tokens for each review.
Then, we add one more step of stop-word filtering, to exclude words that appear everywhere, such as a, an, the, I, you, and can.
There is a list of stopwords in the nltk package, and we can access them through:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)
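The lemmatization step below works on a variable called filtered_words, so we filter the stop words out of each tokenized review here. A minimal sketch, assuming words is the list of token lists built above:
# Drop stop words from every tokenized review; filtered_words keeps the
# same list-of-lists structure as words
stop_set = set(stop_words)
filtered_words = [[x for x in a if x not in stop_set] for a in words]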
To combine different forms of a word, such as organize, organizes, and organizing, we use the technique of lemmatization. First, we need to download the wordnet corpus:
nltk.download('wordnet')
Then we perform lemmatization:
from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
lemmatized_words = [[lmtzr.lemmatize(x) for x in a] for a in filtered_words]
Then, we can create a corpus based on lemmatized_words.
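One simple way to build such a corpus (a sketch, not necessarily what the notebook does) is to join each review's lemmatized tokens back into a single string, and to look at overall word frequencies with collections.Counter:
from collections import Counter

# One document per review, with lemmatized tokens joined by spaces
corpus = [' '.join(a) for a in lemmatized_words]

# Overall word frequencies across all Madison reviews
word_counts = Counter(x for a in lemmatized_words for x in a)
word_counts.most_common(10)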
…
You can find my Jupyter Notebook here.