DECEPTIVE SPAM REVIEW DETECTION WITH CNN USING TENSORFLOW (PART 1)

The full code is available on GitHub.

In this tutorial we will implement a model similar to the SCNN model from Luyang Li et al.'s paper, Document representation and feature combination for deceptive spam review detection. In that paper, the SCNN model applies a convolutional neural network (CNN) to detect deceptive spam reviews. A deceptive opinion spam is a review with fictitious opinions, deliberately written to sound authentic. Deceptive spam review detection can then be thought of as the exercise of taking a review and determining whether it is deceptive or truthful.

In this first part of the tutorial, we will focus on the data preprocessing phase.

Data Presentation

We will use the first publicly available gold standard corpus of deceptive opinion spam. The dataset consists of truthful and deceptive hotel reviews of 20 Chicago hotels. It contains 400 truthful positive reviews from TripAdvisor; 400 deceptive positive reviews from Mechanical Turk; 400 truthful negative reviews from Expedia, Hotels.com, Orbitz, Priceline, TripAdvisor and Yelp; and 400 deceptive negative reviews from Mechanical Turk.
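Once downloaded and unzipped, the corpus has a layout roughly like the following (a sketch based on the folder names used in the code below; each class folder is split into five fold subdirectories of plain-text reviews):

op_spam_v1.4/
├── positive_polarity/
│   ├── truthful_from_TripAdvisor/fold1 ... fold5/
│   └── deceptive_from_MTurk/fold1 ... fold5/
└── negative_polarity/
    ├── truthful_from_Web/fold1 ... fold5/
    └── deceptive_from_MTurk/fold1 ... fold5/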

Loading data

First, we will build two lists: the first one will contain the paths of all the truthful reviews and the second one the paths of all the deceptive reviews. Since the corpus is organized the same way for each of the four categories, we use a small helper function that collects the file paths found in the fold subdirectories of a given folder.

import os
import re
import pickle

import gensim
import numpy as np
import matplotlib.pyplot as plt

# Paths to the four parts of the corpus
truthful_pos = 'op_spam_v1.4/positive_polarity/truthful_from_TripAdvisor/'
truthful_neg = 'op_spam_v1.4/negative_polarity/truthful_from_Web/'

deceptive_pos = 'op_spam_v1.4/positive_polarity/deceptive_from_MTurk/'
deceptive_neg = 'op_spam_v1.4/negative_polarity/deceptive_from_MTurk/'

def collectReviewLinks(directory):
    """Return the paths of all review files found in the fold subdirectories of directory."""
    links = []
    for fold in os.listdir(directory):
        foldLink = os.path.join(directory, fold)
        if os.path.isdir(foldLink):
            for f in os.listdir(foldLink):
                links.append(os.path.join(foldLink, f))
    return links

truthful_reviews_link = collectReviewLinks(truthful_pos) + collectReviewLinks(truthful_neg)
deceptive_reviews_link = collectReviewLinks(deceptive_pos) + collectReviewLinks(deceptive_neg)

Just to make sure we have considered every file, we can look at the length of the lists we have built.

print('Number of truthful reviews ', len(truthful_reviews_link))
print('Number of deceptive reviews ', len(deceptive_reviews_link))
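Since each of the four categories contains 400 reviews, both counts should come out to 800.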

Build the Vocabulary

Now we will build the vocabulary of our dataset, along with a list that contains the number of words in each review. We will use that list later to compute the average number of words per review. The following piece of code builds the vocabulary and the list.

def clean_str(string):
    """
    Tokenization/string cleaning for all datasets except for SST.
    Original taken from https://github.com/yoonkim/CNN_sentence/blob/master/process_data.py
    """
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)
    string = re.sub(r"\'s", " \'s", string)
    string = re.sub(r"\'ve", " \'ve", string)
    string = re.sub(r"n\'t", " n\'t", string)
    string = re.sub(r"\'re", " \'re", string)
    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)
    string = re.sub(r"\)", " \) ", string)
    string = re.sub(r"\?", " \? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().lower()

def handleFile(filePath):
    """Return the list of tokens in a review file and the number of tokens it contains."""
    with open(filePath, "r") as f:
        lines = f.readlines()
        file_voc = []
        file_numWords = 0
        for line in lines:
            # clean_str already strips and lowercases the line
            words = clean_str(line).split(' ')
            file_numWords = file_numWords + len(words)
            file_voc.extend(words)
    return file_voc, file_numWords


allFilesLinks = truthful_reviews_link + deceptive_reviews_link
vocabulary = []
numWords = []
for fileLink in allFilesLinks:
    file_voc, file_numWords = handleFile(fileLink)
    vocabulary.extend(file_voc)
    numWords.append(file_numWords)

vocabulary = set(vocabulary)
vocabulary = list(vocabulary)

print('The total number of files is ', len(numWords))
print('The total number of words in the files is ', sum(numWords))
print('Vocabulary size is ', len(vocabulary))
print('The average number of words in the files is', sum(numWords)/len(numWords))
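Before moving on, it is worth seeing what clean_str actually produces for a short sentence (a quick illustrative check, not part of the original pipeline):

print(clean_str("Didn't love it, really!"))
# did n't love it , really !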

With the numWords list, we can build a histogram to visualize the data.

"""Visualize the data in histogram format"""
plt.hist(numWords, 50)
plt.xlabel('Sequence Length')
plt.ylabel('Frequency')
plt.show()

From the histogram, we can safely say that most reviews contain fewer than 160 words, so we will set the maximum sequence length to 160.
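To back that up with a number, we can check what fraction of the reviews fits within 160 words (a quick sanity check using the numWords list built above):

# Fraction of reviews that are fully covered by a 160-word window
coverage = np.mean(np.array(numWords) <= 160)
print('Fraction of reviews with at most 160 words:', coverage)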

Word embeddings

We will use pretrained word2vec vectors for our word embeddings. In this tutorial, we will use Google's pretrained Word2Vec model, which was trained on a Google News dataset of about 100 billion words and contains 300-dimensional vectors for 3 million words and phrases.

Since the pretrained word vectors file is quite large (3.6 GB) and contains a lot of words that are unnecessary for us (3 million words, while our vocabulary size is 9687), we will first build a much more manageable matrix that contains only the words we need.

For the words that occur in our vocabulary but not in the pretrained model, we will create a separate randomly initialized word vector, as explained here.

# Note: with gensim >= 1.0, this loader lives in gensim.models.KeyedVectors.load_word2vec_format
w2v_model = gensim.models.Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
wordsVectors = []
notFoundwords = []
for word in vocabulary:
    try:
        vector = w2v_model[word]
        wordsVectors.append(vector)
    except KeyError:
        # The word is not in the pretrained model: use a random vector instead
        notFoundwords.append(word)
        wordsVectors.append(np.random.uniform(-0.25, 0.25, 300))

del w2v_model
wordsVectors = np.asarray(wordsVectors)

print('The number of missing words is ', len(notFoundwords))
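It is also worth confirming the shape of the resulting matrix; since every vector has 300 dimensions, it should be (vocabulary size, 300):

print('Shape of the word vectors matrix ', wordsVectors.shape)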

We will use the pickle library to save the word vectors matrix and the corresponding vocabulary.

pickle_file = os.path.join('/Users/MacBook/Documents/MLTraining/DECEPTIVE_REVIEWS_ON_HOTEL/', 'save.pickle')

try:
    f = open(pickle_file, 'wb')
    save = {
        'wordsVectors': wordsVectors,
        'vocabulary': vocabulary,
        'notFoundwords': notFoundwords
    }
    
    pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
    f.close()
except Exception as e:
    print('Unable to save data to', pickle_file, ':', e)
    raise

statinfo = os.stat(pickle_file)
print('Compressed pickle size:', statinfo.st_size)

The Ids Matrix

We will take each review file and transform it into an ids vector that represents the indexes of its words in the vocabulary. Stacking these vectors gives our ids matrix. At the same time, we will also build the corresponding one-hot labels.

MAX_SEQ_LENGTH = 160
def convertFileToIndexArray(filePath):
    """Convert a review file into a fixed-length vector of vocabulary indexes, truncated at MAX_SEQ_LENGTH words."""
    doc = np.zeros(MAX_SEQ_LENGTH, dtype='int32')
    with open(filePath, "r") as f:
        lines=f.readlines()
        indexCounter = 0
        for line in lines:
            # clean_str already strips and lowercases the line
            words = clean_str(line).split(' ')
            for word in words:
                doc[indexCounter] = vocabulary.index(word)
                indexCounter = indexCounter + 1
                if (indexCounter >= MAX_SEQ_LENGTH):
                    break
            if (indexCounter >= MAX_SEQ_LENGTH):
                break
    return doc

totalFiles = len(truthful_reviews_link) + len(deceptive_reviews_link)
idsMatrix = np.ndarray(shape=(totalFiles, MAX_SEQ_LENGTH), dtype='int32')
labels = np.ndarray(shape=(totalFiles, 2), dtype='int32')

counter = 0
for filePath in truthful_reviews_link:
    idsMatrix[counter] = convertFileToIndexArray(filePath)
    counter = counter + 1

for filePath in deceptive_reviews_link:
    idsMatrix[counter] = convertFileToIndexArray(filePath)
    counter = counter + 1
    
labels[0:len(truthful_reviews_link)] = np.array([1, 0])  # truthful -> [1, 0]
labels[len(truthful_reviews_link):totalFiles] = np.array([0, 1])  # deceptive -> [0, 1]

print('The shape of the ids matrix is ', idsMatrix.shape)
print('The shape of the labels is ', labels.shape)
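One note on performance: vocabulary.index(word) does a linear scan over the whole vocabulary for every single word, so building the ids matrix can be slow for larger corpora. An optional speed-up (not part of the original code) is to precompute a word-to-index dictionary and use it inside convertFileToIndexArray:

# Optional: O(1) lookups instead of the linear scan done by list.index
word2idx = {word: idx for idx, word in enumerate(vocabulary)}
# inside convertFileToIndexArray, replace vocabulary.index(word) with:
#     doc[indexCounter] = word2idx[word]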

We will now shuffle the ids matrix and the labels, then split the data into a training set, a validation set and a test set, using 80% of the data for training, 10% for validation and 10% for testing. Finally, we will use pickle to save them.

"""
Create a training set, a validation set and a test set after mixing the data
80% for the training set
10% for the validation set
10% for the test set
"""
size = idsMatrix.shape[0]
testSize = int(size * 0.1)
shuffledIndex = np.random.permutation(size)
testIndexes = shuffledIndex[0:testSize]
validationIndexes = shuffledIndex[testSize:2*testSize]
trainIndexes = shuffledIndex[2*testSize:size]

test_data = idsMatrix[testIndexes]
test_labels = labels[testIndexes]

validation_data = idsMatrix[validationIndexes]
validation_labels = labels[validationIndexes]

train_data = idsMatrix[trainIndexes]
train_labels = labels[trainIndexes]

print('train data shape ', train_data.shape)
print('train labels shape ', train_labels.shape)
print('validation data shape ', validation_data.shape)
print('validation labels shape ', validation_labels.shape)
print('test data shape ', test_data.shape)
print('test labels shape ', test_labels.shape)

pickle_file = os.path.join('/Users/MacBook/Documents/MLTraining/DECEPTIVE_REVIEWS_ON_HOTEL/', 'data_saved.pickle')

try:
    f = open(pickle_file, 'wb')
    save = {
        'train_data': train_data,
        'train_labels': train_labels,
        'validation_data': validation_data,
        'validation_labels': validation_labels,
        'test_data': test_data,
        'test_labels': test_labels
    }
    
    pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
    f.close()
except Exception as e:
    print('Unable to save data to', pickle_file, ':', e)
    raise

statinfo = os.stat(pickle_file)
print('Compressed pickle size:', statinfo.st_size)

In the next part, we will implement the model in TensorFlow and test its performance.
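When we get there, we will simply load the saved arrays back with something along these lines (a minimal sketch; adjust the path to wherever data_saved.pickle was written):

with open('data_saved.pickle', 'rb') as f:
    save = pickle.load(f)
train_data = save['train_data']
train_labels = save['train_labels']
validation_data = save['validation_data']
validation_labels = save['validation_labels']
test_data = save['test_data']
test_labels = save['test_labels']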

The full code is available on GitHub.

References

Document representation and feature combination for deceptive spam review detection

Perform sentiment analysis with LSTMs, using TensorFlow

Implementing a CNN for Text Classification in TensorFlow
