Build and Visualize Word2Vec Model on Amazon Reviews

The full code is available on Github.

Word2vec is a very popular Natural Language Processing technique that uses a neural network to learn vector representations of words, called “word embeddings”, from a text corpus.

In this tutorial, we will use the excellent word2vec implementation from the gensim package to build our word2vec model.

Data Presentation

We will use the Amazon review corpus on Health and Personal Care. The dataset is in json format and contains 346,355 reviews.

Combine all review text into one string

We need to combine the text of all the reviews in the dataset into one string. For that, we will first use Pandas to load the dataset, with the following code.

import pandas as pd
# load the data into panda dataframe
data_file_name = "Health_and_Personal_Care_5.json"
raw_df = pd.read_json(data_file_name, lines=True)
print("Data loaded")

After loading the data, we can view information about it by using the info() method of the Pandas DataFrame, as shown below.
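The call below prints a concise summary of the DataFrame; the exact output depends on the pandas version, but it lists the column names, their dtypes and the non-null counts.

# Print a concise summary of the DataFrame: columns, dtypes, non-null counts
raw_df.info()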

We see that the DataFrame has many columns, but we are only interested in the column that contains the review text, named reviewText. We will use Python’s string join method on that column to combine all the review text into one string.

# Convert all the review text into a long string and print its length
raw_corpus = u"".join(raw_df['reviewText']+" ")
print("Raw Corpus contains {0:,} characters".format(len(raw_corpus)))

Tokenization into sentences

gensim’s word2vec expects a sequence of sentences as its input, each one a list of words. We therefore need to split the string obtained in the section above into sentences. For that, we will use NLTK’s punkt tokenizer for sentence splitting. In order to use it, we need to install NLTK and download the relevant training file for punkt. The following code downloads it.

# import natural language toolkit
import nltk
# download the punkt tokenizer
nltk.download('punkt')
print("The punkt tokenizer is downloaded")

We will now load the punkt tokenizer and use it to split our very long string into sentences.

# Load the punkt tokenizer
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
print("The punkt tokenizer is loaded")
# we tokenize the raw string into raw sentences
raw_sentences = tokenizer.tokenize(raw_corpus)
print("We have {0:,} raw sentences".format(len(raw_sentences)))

After tokenization, we see that our dataset contains 1,824,643 sentences.

Clean and split sentences into words

We now need to clean the sentences obtained after tokenization. The cleaning removes punctuation, parentheses, question marks, etc., and leaves only alphabetic characters. Also, gensim’s word2vec expects each sentence to be a list of words, so we convert each cleaned sentence into a list of words.

import re
# Clean and split sentence into words
def clean_and_split_str(string):
    strip_special_chars = re.compile("[^A-Za-z]+")
    string = re.sub(strip_special_chars, " ", string)
    return string.strip().split()
    
# clean each raw sentence and build the list of sentences
sentences = []
for raw_sent in raw_sentences:
    if len(raw_sent) > 0:
        sentences.append(clean_and_split_str(raw_sent))
print("We have {0:,} clean sentences".format(len(sentences)))

The following code counts the number of tokens present in our dataset.

token_count = sum([len(sentence) for sentence in sentences])
print("The dataset corpus contains {0:,} tokens".format(token_count))

Our dataset contains 33,476,197 tokens.

Setting the numerical parameters

Gensim’s Word2Vec API accepts several parameters that affect both training speed and quality. The important parameters are:

  • size: the dimensionality of the resulting word vectors.
  • min_count: the minimum word count threshold. In other words, we ignore all words with a total frequency lower than this.
  • workers: the number of worker threads to run in parallel.
  • window: the maximum distance between the current and predicted word within a sentence.
  • seed: the seed for the random number generator, to make the results reproducible.

The following code shows the value of the parameters of our model:

import multiprocessing

#Dimensionality of the resulting word vectors
num_features = 300

#Minimum word count threshold
min_word_count = 3

#Number of threads to run in parallel
num_workers = multiprocessing.cpu_count()

#Context window length
context_size = 7

#Seed for the RNG, to make the result reproducible
seed = 1

Train our Word2Vec

We will now use the parameters above to initialize our word2vec model. After the initialization, we will first build the vocabulary of our dataset and then train the model.

import gensim

word2vec_model = gensim.models.word2vec.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers, 
    size=num_features, 
    min_count=min_word_count, 
    window=context_size)
    
word2vec_model.build_vocab(sentences=sentences)
print("The vocabulary is built")
print("Word2Vec vocabulary length: ", len(word2vec_model.wv.vocab))

#Start training the model
#(recent gensim releases require total_examples and epochs to be passed explicitly)
word2vec_model.train(
    sentences=sentences,
    total_examples=word2vec_model.corpus_count,
    epochs=word2vec_model.iter)
print("Training finished")

The training of the model may take a few minutes. The vocabulary of our dataset contains 62,972 words.

Storing and loading

After training our model, we will now save it for future use.

#Save the model
word2vec_model.save("word2vec_model_trained_on_Health_and_Personal_Care_5.w2v")
print("Model saved")

We can load our model with the following code.

# Load our word2vec model
import gensim
w2v_model = gensim.models.word2vec.Word2Vec.load("word2vec_model_trained_on_Health_and_Personal_Care_5.w2v")
print("Model loaded")

Model visualization

After the training, we can visualize the learned embeddings using t-SNE. t-SNE is a tool for data visualization that reduces the dimensionality of data to 2 or 3 dimensions so that it can be plotted easily. Because the space complexity of the t-SNE algorithm is quadratic, in this tutorial we will view only a part of our model. We use the following code to select 10,000 words from our vocabulary.

import numpy as np

count = 10000
word_vectors_matrix = np.ndarray(shape=(count, 300), dtype='float64')
word_list = []
i = 0
for word in w2v_model.wv.vocab:
    word_vectors_matrix[i] = w2v_model.wv[word]
    word_list.append(word)
    i = i + 1
    if i == count:
        break
print("word_vectors_matrix shape is ", word_vectors_matrix.shape)

We will now initialize the t-SNE model and compress our word vectors into 2D space.

#Compress the word vectors into 2D space
import sklearn.manifold
tsne = sklearn.manifold.TSNE(n_components=2, random_state=0)
word_vectors_matrix_2d = tsne.fit_transform(word_vectors_matrix)
print("word_vectors_matrix_2d shape is ", word_vectors_matrix_2d.shape)

We build a Pandas Dataframe that contains the selected words and the x and y coordinates of each word.

points = pd.DataFrame(
    [
        (word, coords[0], coords[1])
        for word, coords in zip(word_list, word_vectors_matrix_2d)
    ],
    columns=["word", "x", "y"]
)
print("Points DataFrame built")
points.head(10)

The DataFrame looks as follows:

We can then use the points DataFrame to plot our word vectors.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set_context("poster")
points.plot.scatter("x", "y", s=10, figsize=(20, 12))

We will zoom in on some regions in order to see the similarity of the words. We create a function that takes a bounding box of x and y coordinates and plots only the words that fall within that bounding box.

def plot_region(x_bounds, y_bounds):
    slice = points[
        (x_bounds[0] <= points.x) &
        (points.x <= x_bounds[1]) &
        (y_bounds[0] <= points.y) &
        (points.y <= y_bounds[1]) 
    ]
    
    ax = slice.plot.scatter("x", "y", s=35, figsize=(10, 8))
    for i, point in slice.iterrows():
        ax.text(point.x + 0.005, point.y + 0.005, point.word, fontsize=11)

plot_region(x_bounds=(-2.2, -2.0), y_bounds=(-2.25, -2))

As expected, words that are similar end up clustering nearby each other.

Most similar words

One way to check whether we have a good word2vec model is to use the model to find the most similar words to a specific word. For that, we can use the most_similar function, which returns the 10 most similar words to a given word. Let’s find the most similar words to the word blue.

w2v_model.wv.most_similar("blue")

As expected, the output shows words that are really similar to the given word.
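
Note that the exact neighbours depend on the trained model. most_similar returns a list of (word, cosine similarity) pairs, so we can also print the scores explicitly, for example:

# Print the 10 nearest neighbours of "blue" together with their cosine similarities.
# The actual words and scores depend on the trained model.
for word, score in w2v_model.wv.most_similar("blue", topn=10):
    print("{}\t{:.3f}".format(word, score))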

The full code is available on Github.

References

How to Make Word Vectors from Game of Thrones (LIVE)

Thanks for reading. Please leave feedback and questions in the comments!
