
The full code is available on Github.
Word2vec is a popular Natural Language Processing technique that uses a neural network to learn vector representations of words, called “word embeddings”, from a text corpus.
In this tutorial, we will use the excellent implementation of word2vec from the gensim package to build our word2vec model.
Data Presentation
We will use the Amazon review corpus on Health and Personal Care. The dataset is in JSON format and contains 346,355 reviews.
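Each line of the file is one review record. The exact fields can vary, but a record typically follows the schema sketched below; the values here are made up for illustration, and only the reviewText field matters for this tutorial.

# Illustrative review record (values are invented); each line of
# Health_and_Personal_Care_5.json has this general shape.
sample_review = {
    "reviewerID": "A1XYZ...",
    "asin": "B000...",
    "helpful": [2, 3],
    "reviewText": "These vitamins work great for me ...",
    "overall": 5.0,
    "summary": "Great product",
    "unixReviewTime": 1360108800,
    "reviewTime": "02 6, 2013",
}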
Combining all review text into one string
We need to combine all the review text in the dataset into one string. To do that, we first use Pandas to load the dataset with the following code.
import pandas as pd

# Load the data into a pandas DataFrame
data_file_name = "Health_and_Personal_Care_5.json"
raw_df = pd.read_json(data_file_name, lines=True)
print("Data loaded")
After loading the data, we can view information about it with the DataFrame's info() method.
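A minimal call to inspect the DataFrame; the output lists the columns (including reviewText) along with their non-null counts and dtypes.

# Print a summary of the DataFrame's columns and memory usage
raw_df.info()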
We see that the DataFrame has many columns, but we are only interested in the column that contains the review text, named reviewText. We will use Python's join method on that column to combine all the review text into one string.
# Convert all the review text into one long string and print its length
raw_corpus = u"".join(raw_df['reviewText'] + " ")
print("Raw Corpus contains {0:,} characters".format(len(raw_corpus)))
Tokenization into sentences
gensim’s word2vec expects a sequence of sentences as its input, each one as a list of words. We therefore need to split the string obtained in the previous section into sentences. For that, we will use NLTK’s punkt tokenizer for sentence splitting. To use it, we need to install NLTK and download the relevant training file for punkt. The following code downloads it.
# Import the Natural Language Toolkit
import nltk

# Download the punkt tokenizer
nltk.download('punkt')
print("The punkt tokenizer is downloaded")
We will now load the punkt tokenizer and use it to split our very long string into sentences.
# Load the punkt tokenizer
tokenizer = nltk.data.load("tokenizers/punkt/english.pickle")
print("The punkt tokenizer is loaded")

# Tokenize the raw string into raw sentences
raw_sentences = tokenizer.tokenize(raw_corpus)
print("We have {0:,} raw sentences".format(len(raw_sentences)))
After tokenization, we see that our dataset contains 1,824,643 sentences.
Clean and split sentences into words
We now need to clean the sentences obtained after tokenization. The cleaning consists of removing punctuation, parentheses, question marks, etc., leaving only alphabetic characters. Also, gensim’s word2vec expects each sentence to be a list of words, so we convert each cleaned sentence into a list of words.
import re

# Clean a sentence and split it into words
def clean_and_split_str(string):
    strip_special_chars = re.compile("[^A-Za-z]+")
    string = re.sub(strip_special_chars, " ", string)
    return string.strip().split()

# Clean each raw sentence and build the list of sentences
sentences = []
for raw_sent in raw_sentences:
    if len(raw_sent) > 0:
        sentences.append(clean_and_split_str(raw_sent))
print("We have {0:,} clean sentences".format(len(sentences)))
The following code counts the number of tokens in our dataset.
token_count = sum([len(sentence) for sentence in sentences])
print("The dataset corpus contains {0:,} tokens".format(token_count))
Our dataset contains 33,476,197 tokens.
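As a quick sanity check, we can compare a raw sentence with its cleaned, tokenized counterpart. The index below is arbitrary; the actual text printed depends on the dataset.

# Compare a raw sentence with its cleaned list-of-words version (index chosen arbitrarily)
print(raw_sentences[10])
print(clean_and_split_str(raw_sentences[10]))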
Setting the numerical parameters
Gensim’s Word2Vec API accepts several parameters that affect both training speed and quality. The important parameters are:
- size: the dimensionality of the resulting word vectors.
- min_count: the minimum word count threshold. In other words, we ignore all words with a total frequency lower than this.
- workers: the number of worker threads to run in parallel.
- window: the maximum distance between the current and predicted word within a sentence.
- seed: the seed for the random number generator, to make the results reproducible.
The following code sets the values of these parameters for our model:
import multiprocessing

# Dimensionality of the resulting word vectors
num_features = 300
# Minimum word count threshold
min_word_count = 3
# Number of threads to run in parallel
num_workers = multiprocessing.cpu_count()
# Context window length
context_size = 7
# Seed for the RNG, to make the results reproducible
seed = 1
Train our Word2Vec model
We will now use the parameters above to initialize our word2vec model. After the initialization, we first build the vocabulary of our dataset and then train the model.
import gensim

# Initialize the model (sg=1 selects the skip-gram architecture)
word2vec_model = gensim.models.word2vec.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    size=num_features,
    min_count=min_word_count,
    window=context_size)

# Build the vocabulary from our sentences
word2vec_model.build_vocab(sentences=sentences)
print("The vocabulary is built")
print("Word2Vec vocabulary length: ", len(word2vec_model.vocab))

# Start training the model
word2vec_model.train(sentences=sentences)
print("Training finished")
The training of the model may take a few minutes. The vocabulary of our dataset contains 62,972 words.
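Note that the code above follows the older gensim API that was current when this tutorial was written. If you are running a recent gensim (4.x), a few names have changed; a rough equivalent, assuming the gensim 4 API, would be:

import gensim

# gensim 4.x renamed `size` to `vector_size` and moved the vocabulary to model.wv
word2vec_model = gensim.models.word2vec.Word2Vec(
    sg=1,
    seed=seed,
    workers=num_workers,
    vector_size=num_features,
    min_count=min_word_count,
    window=context_size)

word2vec_model.build_vocab(sentences)
print("Word2Vec vocabulary length: ", len(word2vec_model.wv))

# train() now requires total_examples and epochs to be passed explicitly
word2vec_model.train(sentences,
                     total_examples=word2vec_model.corpus_count,
                     epochs=word2vec_model.epochs)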
Storing and loading
After training our model, we will now save it for future use.
# Save the model
word2vec_model.save("word2vec_model_trained_on_Health_and_Personal_Care_5.w2v")
print("Model saved")
We can load our model with the following code.
# Load our word2vec model
import gensim

w2v_model = gensim.models.word2vec.Word2Vec.load("word2vec_model_trained_on_Health_and_Personal_Care_5.w2v")
print("Model loaded")
Model visualization
After the training, we can visualize the learned embeddings using t-SNE. t-SNE is a tool for data visualization that reduces the dimensionality of data to 2 or 3 dimensions so that it can be plotted easily. Because the space complexity of the t-SNE algorithm is quadratic, in this tutorial we will visualize only part of our model. We use the following code to select 10,000 words from our vocabulary.
import numpy as np

count = 10000
word_vectors_matrix = np.ndarray(shape=(count, 300), dtype='float64')
word_list = []
i = 0
for word in w2v_model.vocab:
    word_vectors_matrix[i] = w2v_model[word]
    word_list.append(word)
    i = i + 1
    if i == count:
        break
print("word_vectors_matrix shape is ", word_vectors_matrix.shape)
We will now initialize a t-SNE model and compress our word vectors into 2D space.
# Compress the word vectors into 2D space
import sklearn.manifold

tsne = sklearn.manifold.TSNE(n_components=2, random_state=0)
word_vectors_matrix_2d = tsne.fit_transform(word_vectors_matrix)
print("word_vectors_matrix_2d shape is ", word_vectors_matrix_2d.shape)
We build a Pandas DataFrame that contains the selected words and the x and y coordinates of each word.
points = pd.DataFrame(
    [
        (word, coords[0], coords[1])
        for word, coords in [
            (word, word_vectors_matrix_2d[word_list.index(word)])
            for word in word_list
        ]
    ],
    columns=["word", "x", "y"]
)
print("Points DataFrame built")
points.head(10)
The first rows of the DataFrame show each word with its x and y coordinates.
We can then use the points DataFrame to plot our word vectors.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_context("poster")
points.plot.scatter("x", "y", s=10, figsize=(20, 12))
We will zoom in on some regions in order to see the similarity of the words. We create a function that takes a bounding box of x and y coordinates and plots only the words inside that bounding box.
def plot_region(x_bounds, y_bounds):
    # Select only the points that fall inside the bounding box
    slice = points[
        (x_bounds[0] <= points.x) &
        (points.x <= x_bounds[1]) &
        (y_bounds[0] <= points.y) &
        (points.y <= y_bounds[1])
    ]
    ax = slice.plot.scatter("x", "y", s=35, figsize=(10, 8))
    for i, point in slice.iterrows():
        ax.text(point.x + 0.005, point.y + 0.005, point.word, fontsize=11)

plot_region(x_bounds=(-2.2, -2.0), y_bounds=(-2.25, -2))
As expected, words that are similar end up clustering nearby each other.
Most similar words
One way to check whether we have a good word2vec model is to use the model to find the most similar words to a specific word. For that, we can use the most_similar function, which returns the 10 most similar words to the given word. Let’s find the most similar words to the word blue.
w2v_model.most_similar("blue")
As expected, the output shows words that are very similar to the given word.
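Beyond most_similar, the trained model exposes a few other handy similarity queries. The word lists below are arbitrary illustrations; on the older gensim API used in this tutorial the calls live directly on the model, while newer versions expose them on model.wv.

# Cosine similarity between two words
print(w2v_model.similarity("blue", "red"))

# Pick the word that does not fit with the others
print(w2v_model.doesnt_match(["blue", "red", "green", "shampoo"]))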
The full code is available on Github.
References
How to Make Word Vectors from Game of Thrones (LIVE)
Thanks for reading. Please leave feedback and questions in the comments!
I have sentences of only two or three words, for example:
line 1: LED Bulb
line 2: Ikea Bed
line 3: white toilet bowl
and billions of records like these.
I am setting the following word2vec hyperparameters:
num_features = 300, min_word_count = 2
num_workers = multiprocessing.cpu_count()
context_size = 2 (since my sentences themselves are only 2 words). I am a bit confused by this parameter, because if the window goes beyond 2 it reaches into an altogether different sentence.
My goal is similarity: if we look up something like LED among so many words, it should give higher similarity to words like bulb, TV, etc.
Do you think this is the right candidate for cosine similarity to solve this? Any info would be appreciated.