Discussion 7
Yelp data sentiment analysis
Project Goals
The two goals for the project are:
- Provide useful analytics to business owners on Yelp from the datasets provided and, based on these insights, propose data-driven, actionable decisions to help those owners improve their ratings on Yelp.
- Build a web dashboard/widget/web application that visualizes your analysis and makes it easier for business owners to understand.
Datasets
We included four JSON files for you. JSON is the de-facto standard for transmitting data on the Web, and this module will help you learn how to manipulate JSON files. The four JSON files are:
- review.json: contains the reviews. There are 942,027 review entries.
- business.json: contains information about businesses. There are 36,327 business entries.
- user.json: contains information about users. There are 272,024 user entries.
- tip.json: contains tips written by users on businesses. There are 129,571 tips generated by users.
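Each of these files is in JSON Lines format, i.e. one JSON object per line. As a minimal sketch of what manipulating JSON looks like with Python's built-in json module (before we switch to pandas below), you can parse a single record like this:
import json

# Parse the first record of review.json into a Python dictionary
with open('review.json') as f:
    first_review = json.loads(f.readline())

print(first_review.keys())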
In this discussion, we will load the JSON files into Python, combine review.json with business.json, and use the nltk module to parse the reviews.
For illustration purposes, I will sample the first 100,000 lines of reviews in the shell:
head -n 100000 review_city.json > sample_review.json
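If you would rather do this sampling in Python than in the shell, a roughly equivalent sketch using itertools.islice is:
from itertools import islice

# Copy the first 100,000 lines of review_city.json into sample_review.json
with open('review_city.json') as src, open('sample_review.json', 'w') as dst:
    dst.writelines(islice(src, 100000))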
All of today's work is done in Python. If you would rather use R and the tidyverse, feel free to check out this book: Text Mining with R.
Then, we open a Jupyter Notebook and load sample_review.json and business.json into pandas dataframes using the pd.read_json command.
import pandas as pd
# lines=True tells pandas that each line of the file is a separate JSON record
review = pd.read_json('sample_review.json', lines=True)
review.head()
business = pd.read_json('business_city.json', lines=True)
business.head()
Then, similar to a SQL join, we can join tables in pandas using the pd.merge function. For more on merging pandas dataframes, check out here.
df_raw = pd.merge(review, business, how='left', on='business_id')
df_raw.head()
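Because this is a left join, a review whose business_id does not appear in business.json keeps its review columns but gets missing values in the business columns. A quick sanity check (not part of the original walkthrough) is:
# Count reviews whose business information did not match in the left join
df_raw['name'].isna().sum()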
Then, we use the DataFrame.shape attribute to see how many rows and columns there are:
df_raw.shape
# Output: (100000, 22)
Let’s look at the column names:
df_raw.columns
# Output: Index(['review_id', 'user_id', 'business_id', 'stars_x', 'useful', 'funny',
# 'cool', 'text', 'date', 'name', 'address', 'city', 'state',
# 'postal_code', 'latitude', 'longitude', 'stars_y', 'review_count',
# 'is_open', 'attributes', 'categories', 'hours'],
# dtype='object')
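Note the two star columns: pandas appends the default suffixes _x and _y when both tables contain a stars column, so stars_x is the rating given in the individual review (from the left frame, review) and stars_y is the business's overall star rating (from the right frame, business). If you find the suffixes confusing, one purely illustrative option (the renamed columns are not used below) is:
# Rename the merge-suffixed columns to something more descriptive
df_renamed = df_raw.rename(columns={'stars_x': 'review_stars', 'stars_y': 'business_stars'})
df_renamed[['review_stars', 'business_stars']].head()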
We can drop the columns review_id, user_id, funny, cool, date, latitude, longitude, and is_open (along with business_id, review_count, and hours) because they are of little help for our analysis:
df = df_raw.drop(columns=['business_id', 'review_id', 'user_id', 'funny', 'cool', 'date', 'latitude', 'longitude', 'is_open', 'review_count', 'hours'])
Let's now check how many records among the first 100,000 reviews come from Madison:
sum(df.city == 'Madison')
# Output: 11792
Suppose we want to study the restaurants in Madison; we can subset the dataset like this:
Madison = df[df.city == 'Madison']
Madison = Madison.drop(columns=['city', 'state'])
Re-index the Madison dataframe:
Madison.reset_index(drop=True, inplace=True)
nltk (Natural Language Toolkit)
Next, we turn our focus to the review text itself. We will use the nltk package for two reasons:
- Popularity: NLTK is one of the leading platforms for dealing with language data.
- Simplicity: Provides easy-to-use APIs for a wide variety of text preprocessing methods.
First, we assign the text column to a new variable and convert it to a list. We then convert each review to lower case using the str.lower() method:
text = list(Madison.text)
text = [a.lower() for a in text]  # convert each review to lower case
Then, we remove punctuation. We keep only the characters that are not in string.punctuation:
import string
print(string.punctuation)
# !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
We remove the punctuation using a list comprehension:
text = ["".join([char for char in a if char not in string.punctuation]) for a in text]
text[0]
# 'easily my favorite place to eat in madison great laothai food curries are fantastic and make sure to start with a soup theyre always good tip place is tiny and you will wait either get carry out or head across the street to the weary traveler to grab a beer while you wait for your table'
After that, each review can be tokenized into a list of tokens via nltk.word_tokenize:
from nltk import word_tokenize
words = [word_tokenize(a) for a in text]
If nltk throws an error here, you may need to run the following command first:
import nltk
nltk.download('punkt')
Now, words contains the list of tokens for each review.
Then, we add one more step of stop-word filtering, to exclude words that appear everywhere, such as a, an, the, I, you, and can.
There is a list of stopwords in the nltk package, and we can access them through:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)
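The lemmatization step below works on a variable called filtered_words, so we filter the stop words out of each tokenized review here. A minimal sketch, assuming words is the list of token lists built above:
# Drop stop words from every tokenized review; filtered_words keeps the
# same list-of-lists structure as words
stop_set = set(stop_words)
filtered_words = [[x for x in a if x not in stop_set] for a in words]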
To combine different forms of a word, such as organize, organizes, and organizing, we use the technique of lemmatization. First, we need to download the wordnet corpus:
nltk.download('wordnet')
Then we perform lemmatization:
from nltk.stem.wordnet import WordNetLemmatizer
lmtzr = WordNetLemmatizer()
lemmatized_words = [[lmtzr.lemmatize(x) for x in a] for a in filtered_words]
Then, we can create a corpus based on lemmatized_words.
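One simple way to build such a corpus (a sketch, not necessarily what the notebook does) is to join each review's lemmatized tokens back into a single string, and to look at overall word frequencies with collections.Counter:
from collections import Counter

# One document per review, with lemmatized tokens joined by spaces
corpus = [' '.join(a) for a in lemmatized_words]

# Overall word frequencies across all Madison reviews
word_counts = Counter(x for a in lemmatized_words for x in a)
word_counts.most_common(10)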
…
You can find my Jupyter Notebook here.