Support Vector Machine (SVM) is a popular machine learning technique for classification and regression. Many practitioners favor it because it can reach good accuracy with modest computational cost. Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin have published a practical guide that helps SVM novices obtain acceptable results with this technique. You can download the paper here: A Practical Guide to Support Vector Classification. In this tutorial, we will test the procedure from the practical guide on the Astroparticle dataset using TensorFlow.
Description of the dataset
The Astroparticle dataset is a binary classification dataset. A preprocessed version of the dataset can be downloaded here. The dataset contains 4 features and is split into a training set and a test set. The training set contains 3089 rows and the test set contains 4000 rows.
The variable to predict is a binary variable with values 0 or 1. To match the label convention of the SVM formulation, which uses -1 and 1, we change the value 0 to -1. We perform this change in both the training set and the test set.
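The relabeling step can be sketched in a few lines of NumPy. This is a minimal illustration, not the code from the original notebook; the function name `to_svm_labels` is hypothetical.

```python
import numpy as np

def to_svm_labels(y):
    """Map labels {0, 1} to {-1, 1}; labels that are already 1 are left unchanged."""
    y = np.asarray(y)
    return np.where(y == 0, -1, y)

# Toy labels for illustration only (not taken from the Astroparticle dataset).
y_raw = np.array([0, 1, 1, 0])
y_svm = to_svm_labels(y_raw)
```

The same function would be applied to the labels of both the training set and the test set.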
Exploratory Data Analysis
We start our data exploration by checking whether the dataset is balanced. We plot a bar chart of the labels in the training set and the test set.
The labels are not equally distributed in the training set: there are about twice as many 1 labels as -1 labels. Accuracy is therefore not a good metric for measuring our model's performance during training. We decide to use the F1-score, which is the harmonic mean of precision and recall.
As we have only a few features (4), we look at how each feature behaves depending on the label value.
We define a function that uses Seaborn's FacetGrid and violinplot to plot the distribution of a given feature for each label value.
We see that featureOne tends to take relatively small values around 0 when the label is -1, whereas when the label is 1 its values are relatively high, ranging from 25 to 150. featureOne is therefore a useful feature for our model, as it is correlated with the target variable.
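Before plotting, the class balance can also be checked numerically. The sketch below counts the occurrences of each label with NumPy; `label_counts` is a hypothetical helper and the labels are toy values, not the real dataset.

```python
import numpy as np

def label_counts(y):
    """Return a {label: count} dict for a 1-D array of labels."""
    labels, counts = np.unique(np.asarray(y), return_counts=True)
    return {int(label): int(count) for label, count in zip(labels, counts)}

# Toy labels for illustration only.
y_example = np.array([1, 1, -1, 1, -1, 1])
counts = label_counts(y_example)
```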
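To make the metric concrete, here is a minimal sketch that computes the F1-score by hand from precision and recall. It assumes that label 1 is treated as the positive class, which the tutorial does not state; `f1_score_binary` is a hypothetical name and the labels are toy values.

```python
import numpy as np

def f1_score_binary(y_true, y_pred, positive=1):
    """F1-score of the `positive` class: harmonic mean of precision and recall."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives
    if tp == 0:
        return 0.0  # precision and recall are both 0 (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))

# Toy example: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 1, -1, -1]
y_pred = [1, 1, 1, -1, 1, -1]
f1 = f1_score_binary(y_true, y_pred)  # precision = recall = 0.75
```

In practice, a library implementation such as `sklearn.metrics.f1_score` can be used instead of a hand-written function.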
We perform the same analysis for the remaining features: featureTwo, featureThree and featureFour. Please check the full analysis here: Pratical-Guide-To-SVM.
We will now check whether any two features are highly correlated (Pearson correlation > 0.90). If we find such a pair, we will remove one of the two features or create a new feature from them. We use a Seaborn heatmap to plot the correlation matrix of our features:
The correlation map shows that we don’t have any high correlation (max value is 0.68) between our features.
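The same check can be done without a plot. The sketch below computes the Pearson correlation matrix with `np.corrcoef` and lists the feature pairs above the threshold. It compares the absolute value of the correlation, which is an assumption on our part (a strong negative correlation is just as redundant); the helper name, feature names, and data are made up for illustration.

```python
import numpy as np

def highly_correlated_pairs(X, names, threshold=0.90):
    """Return (name_i, name_j, r) for every feature pair with |r| > threshold."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)  # columns are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Toy data: column "b" is exactly 2 * column "a", column "c" is unrelated.
X_toy = np.array([[1, 2, 1],
                  [2, 4, 0],
                  [3, 6, 1],
                  [4, 8, 0]])
pairs = highly_correlated_pairs(X_toy, ["a", "b", "c"])
```

On this toy data only the pair ("a", "b") is reported. On the Astroparticle features, where the maximum correlation is 0.68, the list would be empty.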
Our next and final step is to standardize our dataset. Standardization rescales each feature so that its mean is 0 and its standard deviation is 1. To visualize its effect, we plot bar charts of the mean and the standard deviation of each feature before and after standardization.
We see that our data is not standardized, because each feature has a different mean and standard deviation. We use sklearn.preprocessing.StandardScaler to standardize our dataset. The bar chart of the mean and the standard deviation looks as follows after the standardization:
The means of our features are now around 0 and the standard deviations are 1.
In the next part, we will implement different SVM models and search for optimal hyperparameters to obtain acceptable results.
The full code is available on GitHub.
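A minimal sketch of this step with `StandardScaler` is shown below. The data is randomly generated for illustration, and fitting the scaler on the training set only and reusing it on the test set is an assumption; the tutorial does not say how the test set was scaled.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 4 features (not the Astroparticle data).
rng = np.random.default_rng(0)
means = [50.0, 0.5, 10.0, 100.0]
stds = [30.0, 0.1, 5.0, 40.0]
X_train = rng.normal(loc=means, scale=stds, size=(200, 4))
X_test = rng.normal(loc=means, scale=stds, size=(50, 4))

# Learn the per-feature mean and standard deviation from the training set only.
scaler = StandardScaler().fit(X_train)

# Apply the same transformation to both sets.
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis=0))  # close to 0 for every feature
print(X_train_std.std(axis=0))   # close to 1 for every feature
```

Because the scaler is fitted on the training set, the test set means and standard deviations will be close to, but not exactly, 0 and 1.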
Thanks for reading. Please leave feedback and questions in the comments!