DECEPTIVE SPAM REVIEW DETECTION WITH CNN USING TENSORFLOW (PART 2)

The full code is available on Github.

In this tutorial we will implement a model similar to the SCNN model from Luyang Li's paper Document representation and feature combination for deceptive spam review detection. In that paper, the SCNN model applies a convolutional neural network (CNN) to detect deceptive spam reviews. A deceptive opinion spam review is a review with fictitious opinions, deliberately written to sound authentic. Deceptive spam review detection can then be thought of as the exercise of taking a review and determining whether it is deceptive or truthful.

In Part 1 of this tutorial, we focused on the data preprocessing phase. In this second part, we will implement the model in TensorFlow.

I am assuming that you have read and understood Part 1 of this tutorial. If not, I recommend first reading DECEPTIVE SPAM REVIEW DETECTION WITH CNN USING TENSORFLOW (PART 1).

Model

In Luyang Li's paper, the architecture of the SCNN model is as follows:

The SCNN model consists of two convolutional layers and a softmax classification layer. The first convolutional layer, called the sentence convolution, is used to produce sentence vector representations from word representations. We can see the detailed architecture of the sentence convolution in Yafeng Ren's paper Neural networks for deceptive opinion spam detection: An empirical study. The architecture looks as follows:

From this architecture, we can see that three convolutional filters are used to produce the sentence representation. In this tutorial, we will also use three convolutional filters, of respective widths 3, 4 and 5.

The second convolutional layer of the SCNN model is called the document convolution. It transforms sentence vectors into a document vector. Given a document with m sentences, we use the sentence vectors s1, s2, ..., sm as inputs and we get the document vector representation as output.

Finally, the softmax classification layer uses the document vector representation as features to identify deceptive spam reviews.

Implementation

To allow various hyperparameter configurations, we put our code into an SCNN_MODEL class and generate the model graph in the __init__ function.

import tensorflow as tf
import numpy as np

class SCNN_MODEL(object):
    '''
        A SCNN model for Deceptive spam reviews detection. 
        Use google word2vec.
    '''
    
    def __init__(self, sentence_per_review, words_per_sentence, wordVectors, embedding_size, 
                filter_widths_sent_conv, num_filters_sent_conv, filter_widths_doc_conv, num_filters_doc_conv, 
                num_classes, l2_reg_lambda=0.0):
        # Implementation...

To instantiate the class we then pass the following arguments (a short instantiation sketch follows this list):

  • sentence_per_review : The number of sentences per review (we set that to 16).
  • words_per_sentence : The number of words per sentence (we set that to 10).
  • wordVectors : The Word2Vec model.
  • embedding_size : The size of each word vector representation (300 in our case).
  • filter_widths_sent_conv : An array that contains the widths of the convolutional filters for the sentence convolution layer. We use [3, 4, 5].
  • num_filters_sent_conv : The number of convolutional filters for the sentence convolution layer. We use 100 filters.
  • filter_widths_doc_conv : An array that contains the widths of the convolutional filters for the document convolution layer. We use [3, 4, 5].
  • num_filters_doc_conv : The number of convolutional filters for the document convolution layer. We use 100.
  • num_classes : The number of classes (2 in our case).
  • l2_reg_lambda : The L2 regularization lambda (0.0 by default).
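
Putting these values together, the instantiation might look like the following sketch. The values are the ones chosen in this tutorial; the variable name scnn and the wordsVectors matrix (loaded from the pickle file built in Part 1) are only illustrative.

# Illustrative sketch: instantiate the SCNN model with the values used in this tutorial.
# wordsVectors is the Word2Vec matrix loaded from the pickle file built in Part 1.
scnn = SCNN_MODEL(
    sentence_per_review=16,
    words_per_sentence=10,
    wordVectors=wordsVectors,
    embedding_size=300,
    filter_widths_sent_conv=[3, 4, 5],
    num_filters_sent_conv=100,
    filter_widths_doc_conv=[3, 4, 5],
    num_filters_doc_conv=100,
    num_classes=2,
    l2_reg_lambda=0.0)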

Input Placeholders

We define four input placeholders for the SCNN model.

#Placeholders for input, output and dropout
self.input_x = tf.placeholder(tf.int32, shape=(None, sentence_per_review * words_per_sentence), name='input_x')
self.input_y = tf.placeholder(tf.int32, shape=(None, num_classes), name='input_y')
self.dropout = tf.placeholder(tf.float32, name='dropout_keep_prob')
self.input_size = tf.placeholder(tf.int32, name='input_size')

The placeholders self.input_x and self.input_y are for the input data and the input labels. We use dropout during training to avoid overfitting; we pass the probability of keeping a neuron to the model through the placeholder self.dropout. Finally, to make reshaping the data inside the model easier, we pass the size of the input data through the placeholder self.input_size.

Reconstruct input vector representation

We split each review in the input data into sentences. We then reconstruct the input vector representation by replacing each word in a sentence with its Word2Vec representation. In order to get the word vectors, we use TensorFlow's embedding lookup function. This function takes two arguments: the embedding matrix (the wordVectors matrix we built so far) and the ids of the words (the input data in our case).

#Reshape the input_x to [input_size*sentence_per_review, words_per_sentence, embedding_size, 1]
with tf.name_scope('Reshape_Input_X'):
    self.x_reshape = tf.reshape(self.input_x, [self.input_size*sentence_per_review, words_per_sentence])
    self.x_emb = tf.nn.embedding_lookup(wordVectors, self.x_reshape)
    shape = self.x_emb.get_shape().as_list()
    self.x_emb_reshape = tf.reshape(self.x_emb, [self.input_size*sentence_per_review, shape[1], shape[2], 1])
    #Cast self.x_emb_reshape from Float64 to Float32
    self.data = tf.cast(self.x_emb_reshape, tf.float32)

Sentence convolution layer

For each filter width contained in the array filter_widths_sent_conv, we create a convolution, a max pooling and a tanh activation. We concatenate the results for all filter widths and add dropout. The output is the vector representation of each sentence.

# Create a convolution + maxpool layer + tanh activation for each filter size
conv_outputs = []
for i, filter_size in enumerate(filter_widths_sent_conv):
    with tf.name_scope('sent_conv-maxpool-tanh-%s' % filter_size):
        # Convolution Layer
        filter_shape = [filter_size, embedding_size, 1, num_filters_sent_conv]
        W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name='W')
        b = tf.Variable(tf.constant(0.1, shape=[num_filters_sent_conv]), name='b')
        conv = tf.nn.conv2d(
            self.data,
            W,
            strides=[1, 1, 1, 1],
            padding='VALID',
            name='conv')
        h = tf.nn.bias_add(conv, b)
        # Maxpooling over the outputs
        pooled = tf.nn.max_pool(
            h,
            ksize=[1, words_per_sentence - filter_size + 1, 1, 1],
            strides=[1, 1, 1, 1],
            padding='VALID',
            name='pool')
        #Apply tanh Activation
        h_output = tf.nn.tanh(pooled, name='tanh')
        conv_outputs.append(h_output)
        
# Combine all the outputs
num_filters_total = num_filters_sent_conv * len(filter_widths_sent_conv)
self.h_combine = tf.concat(conv_outputs, 3)
self.h_combine_flat = tf.reshape(self.h_combine, [-1, num_filters_total])

# Add dropout
with tf.name_scope('dropout'):
    self.h_drop = tf.nn.dropout(self.h_combine_flat, self.dropout)

Reconstruct the input of the document convolution layer

We reshape the output of the sentence convolution layer by putting together the sentence vectors that belong to the same review. We then obtain the input of the document convolution layer.

#Reshape self.h_drop for the input of the document convolution layer
self.conv_doc_x = tf.reshape(self.h_drop, [self.input_size, sentence_per_review, num_filters_total])
self.conv_doc_input = tf.reshape(self.conv_doc_x, [self.input_size, sentence_per_review, num_filters_total, 1])

Document convolution layer

The document convolution layer is similar to the sentence convolution layer: it takes the sentence vectors of a review as input, and its output is the document vector representation.
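
The code for this layer is not repeated here; a minimal sketch, assuming it mirrors the sentence convolution layer above, could look like the following. The names num_filters_total_doc and self.doc_rep are the ones used by the classifier below.

# Sketch of the document convolution layer, mirroring the sentence convolution above.
conv_doc_outputs = []
for i, filter_size in enumerate(filter_widths_doc_conv):
    with tf.name_scope('doc_conv-maxpool-tanh-%s' % filter_size):
        # Convolution over sentence vectors:
        # input is [input_size, sentence_per_review, num_filters_total, 1]
        filter_shape = [filter_size, num_filters_total, 1, num_filters_doc_conv]
        W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name='W')
        b = tf.Variable(tf.constant(0.1, shape=[num_filters_doc_conv]), name='b')
        conv = tf.nn.conv2d(
            self.conv_doc_input,
            W,
            strides=[1, 1, 1, 1],
            padding='VALID',
            name='conv')
        h = tf.nn.bias_add(conv, b)
        # Max-pooling over the remaining sentence positions
        pooled = tf.nn.max_pool(
            h,
            ksize=[1, sentence_per_review - filter_size + 1, 1, 1],
            strides=[1, 1, 1, 1],
            padding='VALID',
            name='pool')
        conv_doc_outputs.append(tf.nn.tanh(pooled, name='tanh'))

# Combine the outputs of all filter widths into the document vector representation
num_filters_total_doc = num_filters_doc_conv * len(filter_widths_doc_conv)
self.doc_combine = tf.concat(conv_doc_outputs, 3)
self.doc_combine_flat = tf.reshape(self.doc_combine, [-1, num_filters_total_doc])

# Add dropout; self.doc_rep is the document vector used by the softmax classifier below
with tf.name_scope('doc_dropout'):
    self.doc_rep = tf.nn.dropout(self.doc_combine_flat, self.dropout)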

Classifier

On top of the document convolution layer, we add a softmax classification layer to identify deceptive spam reviews.

#Softmax classification layer for final scores and predictions
l2_loss = tf.constant(0.0)  # accumulates the L2 regularization loss used in the loss below
with tf.name_scope('output'):
    W = tf.get_variable(
        'W',
        shape=[num_filters_total_doc, num_classes],
        initializer=tf.contrib.layers.xavier_initializer())
    b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name='b')
    l2_loss += tf.nn.l2_loss(W)
    l2_loss += tf.nn.l2_loss(b)
    self.scores = tf.nn.xw_plus_b(self.doc_rep, W, b, name='scores')
    self.predictions = tf.argmax(self.scores, 1, name='predictions')

Loss

We define a standard cross-entropy loss with a softmax layer and an L2 regularizer on top of the final prediction values. For the optimizer, we'll use Adam with the default learning rate of 0.001 (a sketch of the corresponding training operation is shown after the loss code below).

# Compute Mean cross-entropy loss
with tf.name_scope('loss'):
    losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)
    self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss
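
The training operation itself lives in the training script rather than in the model class. A minimal sketch, assuming the scnn model instance from the earlier sketch and an illustrative global_step variable, could look like this:

# Sketch of the training operation using Adam with the default learning rate of 0.001.
# global_step and train_op are illustrative names defined in the training script.
global_step = tf.Variable(0, name='global_step', trainable=False)
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
train_op = optimizer.minimize(scnn.loss, global_step=global_step)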

Accuracy

We define correct prediction and accuracy metrics to track how the model is doing. The correct prediction formulation works by looking at the index of the maximum of the two output scores and checking whether it matches the training label.

# Compute Accuracy
with tf.name_scope('accuracy'):
    correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))
    self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, tf.float32), name='accuracy')

Training

We first load the preprocessed data (the vocabulary, the word2vec model, the training data, the training labels, the validation data and validation labels).

# Load vocabulary and the word2vec model
pickle_file = 'save.pickle'
with open(pickle_file, 'rb') as f :
    save = pickle.load(f)
    wordsVectors = save['wordsVectors']
    vocabulary = save['vocabulary']
    del save  # hint to help gc free up memory
print('Vocabulary and the word2vec loaded')
print('Vocabulary size is ', len(vocabulary))
print('Word2Vec model shape is ', wordsVectors.shape)

#Load training data, training labels, validation data, validation labels
pickle_file = 'data_saved.pickle'
with open(pickle_file, 'rb') as f :
    save = pickle.load(f)
    train_data = save['train_data']
    train_labels = save['train_labels']
    validation_data = save['validation_data']
    validation_labels = save['validation_labels']
    del save  # hint to help gc free up memory
print('train data shape ', train_data.shape)
print('train labels shape ', train_labels.shape)
print('validation data shape ', validation_data.shape)
print('validation labels shape ', validation_labels.shape)

The basic idea of the training loop is that we first define a TensorFlow session. Then, we load in a batch of reviews and their associated labels. Next, we call the session's run function. This function has two arguments. The first is the fetches argument: it defines the values we're interested in computing. The second is the feed_dict: this data structure is where we provide inputs to all of our placeholders, so we feed in our batch of reviews and our batch of labels. This loop is then repeated for a set number of training iterations (a minimal sketch is shown below). The full code for the training is available on Github.
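
A minimal sketch of such a loop, assuming the scnn model and the train_op and global_step defined above, plus an illustrative batch_iter helper and batch_size, num_epochs and dropout_keep_prob hyperparameters (none of these names come from the original code), could look like this:

# Illustrative training loop sketch; batch_iter, batch_size, num_epochs and
# dropout_keep_prob are assumed helpers / hyperparameters, not from the original code.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for batch_data, batch_labels in batch_iter(train_data, train_labels, batch_size, num_epochs):
        feed_dict = {
            scnn.input_x: batch_data,
            scnn.input_y: batch_labels,
            scnn.dropout: dropout_keep_prob,
            scnn.input_size: batch_labels.shape[0]}
        # fetches: the training op plus the metrics we want to track
        _, step, loss, accuracy = sess.run(
            [train_op, global_step, scnn.loss, scnn.accuracy], feed_dict)
        print('step {}, loss {:g}, acc {:g}'.format(step, loss, accuracy))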

Visualizing Results in TensorBoard

The training script writes summaries to an output directory, and by pointing TensorBoard to that directory we can visualize the graph and the summaries we created.

tensorboard --logdir /PATHTOSUMMARIES/
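
The exact summaries are set up in the full training script on Github. A minimal sketch of how loss and accuracy summaries might be written, assuming the scnn model and the training-loop variables from the sketch above and an illustrative out_dir path, is:

# Sketch of writing loss/accuracy summaries for TensorBoard; out_dir is an
# illustrative output directory, not a path from the original code.
loss_summary = tf.summary.scalar('loss', scnn.loss)
acc_summary = tf.summary.scalar('accuracy', scnn.accuracy)
train_summary_op = tf.summary.merge([loss_summary, acc_summary])
train_summary_writer = tf.summary.FileWriter(out_dir, sess.graph)

# Inside the training loop, evaluate the summary op and write it at the current step
summaries = sess.run(train_summary_op, feed_dict)
train_summary_writer.add_summary(summaries, step)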

Evaluate the model

Now we will evaluate the model on the test dataset. We load the test data and test labels with pickle.

#Load test data and test labels
pickle_file = 'data_saved.pickle'
with open(pickle_file, 'rb') as f :
    save = pickle.load(f)
    test_data = save['test_data']
    test_labels = save['test_labels']
    del save  # hint to help gc free up memory
print('test data shape ', test_data.shape)
print('test labels shape ', test_labels.shape)

We can then load the saved meta graph and restore the variables from our checkpoint file. After that, we can run the session and compute the accuracy on the test data.

print("\nEvaluating...\n")

# Evaluation
# ==================================================
CHECKPOINT_DIR='/Users/MacBook/Documents/MLTraining/DECEPTIVE_REVIEWS_ON_HOTEL/runs/1503225456/checkpoints/'
checkpoint_file= tf.train.latest_checkpoint(CHECKPOINT_DIR)
graph = tf.Graph()
with graph.as_default():
    sess = tf.Session()
    with sess.as_default():
        # Load the saved meta graph and restore variables
        saver = tf.train.import_meta_graph("{}.meta".format(checkpoint_file))
        saver.restore(sess, checkpoint_file)
        
        # Get the placeholders from the graph by name
        input_x = graph.get_operation_by_name("input_x").outputs[0]
        input_y = graph.get_operation_by_name("input_y").outputs[0]
        dropout_keep_prob = graph.get_operation_by_name("dropout_keep_prob").outputs[0]
        input_size = graph.get_operation_by_name("input_size").outputs[0]
        
        # Tensors we want to evaluate
        accuracy = graph.get_operation_by_name("accuracy/accuracy").outputs[0]
        
        #Compute the accuracy on the test data
        acc = sess.run(accuracy, {input_x:test_data, input_y:test_labels, dropout_keep_prob:1.0, input_size:test_labels.shape[0]})
        
        print('The test accuracy is ', acc)

Here I get an accuracy of 80%. We could get a better accuracy by playing with the model hyperparameters.

The full code is available on Github.

Thanks for reading. Please leave feedback and questions in the comments!

References

Document representation and feature combination for deceptive spam review detection

Implementing a CNN for Text Classification in TensorFlow

Perform sentiment analysis with LSTMs, using TensorFlow
