Classification in Python with Scikit-Learn and Pandas
Classification is a large domain in the field of statistics and machine learning. Generally, classification can be broken down into two areas:
- Binary classification, where we wish to group an outcome into one of two groups.
- Multi-class classification, where we wish to group an outcome into one of multiple (more than two) groups.
In this post, the main focus will be on using a variety of classification algorithms across both of these domains; less emphasis will be placed on the theory behind them.
We can use libraries in Python such as scikit-learn for machine learning models, and Pandas to import data as data frames.
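These can easily be installed and imported into Python as shown below. Note that the package is published on PyPI as scikit-learn, even though it is imported as sklearn:

    $ python3 -m pip install scikit-learn
    $ python3 -m pip install pandas

    import sklearn as sk
    import pandas as pd
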
For binary classification, we are interested in classifying data into one of two groups, which are usually represented as 0s and 1s in our data.
We will look at data regarding coronary heart disease (CHD) in South Africa. The goal is to use different variables such as tobacco usage, family history, LDL cholesterol levels, alcohol usage, obesity and more to predict whether a patient has CHD.
A full description of this dataset is available in the "Data" section of the Elements of Statistical Learning website.
The code below reads the data into a Pandas data frame, and then separates the data frame into a y vector of the response and an X matrix of explanatory variables:
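
    import pandas as pd
    import os

    os.chdir('/Users/stevenhurwitt/Documents/Blog/Classification')

    # Read the CSV; the first row holds the column names
    heart = pd.read_csv('SAHeart.csv', sep=',', header=0)
    heart.head()

    # Column index 9 is taken as the response, columns 0-8 as the explanatory variables
    y = heart.iloc[:, 9]
    X = heart.iloc[:, :9]
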
When running this code, be sure to change the path passed to os.chdir (line 4) to suit your setup.
Logistic Regression is a type of Generalized Linear Model (GLM) that uses a logistic function to model a binary variable based on any kind of independent variables.
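As a rough illustration of the logistic function mentioned above, the sketch below maps a linear combination of a single predictor to a probability between 0 and 1. The intercept, slope and predictor values are hypothetical numbers chosen only for the example; they are not estimated from the heart data.

```python
import numpy as np


def logistic(z):
    """Map any real number (or array) to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical coefficients for one predictor, for illustration only
intercept, slope = -4.0, 0.1

# Hypothetical predictor values
values = np.array([20, 40, 60])

# Linear predictor first, then the logistic transform
probabilities = logistic(intercept + slope * values)
print(probabilities.round(3))
```

Larger values of the linear predictor give probabilities closer to 1, and a linear predictor of exactly 0 gives a probability of 0.5.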
To fit a binary logistic regression with sklearn, we use the LogisticRegression class with multi_class set to "ovr" and fit it to X and y. We can then use the predict method to predict the class labels of new data, as well as the score method to get the mean prediction accuracy:
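
    import sklearn as sk
    from sklearn.linear_model import LogisticRegression
    import pandas as pd
    import os

    os.chdir('/Users/stevenhurwitt/Documents/Blog/Classification')
    heart = pd.read_csv('SAHeart.csv', sep=',', header=0)
    heart.head()

    y = heart.iloc[:, 9]
    X = heart.iloc[:, :9]

    # Fit the model. Recent scikit-learn releases deprecate multi_class;
    # for a binary response it can be dropped without changing the fit.
    LR = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr').fit(X, y)

    # Predicted class labels for the rows from index 460 onward
    LR.predict(X.iloc[460:, :])

    # Mean accuracy on the same data the model was fitted to
    round(LR.score(X, y), 4)
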
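The predict method returns class labels rather than probabilities. If probabilities are wanted, predict_proba can be used instead. The sketch below shows the difference on synthetic data from make_classification, which stands in for the heart dataset here (an assumption made so the example runs without the CSV file).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data standing in for the heart dataset (an assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(solver="lbfgs").fit(X_demo, y_demo)

# Hard class labels (0 or 1) for the first three rows
labels = clf.predict(X_demo[:3])

# Class probabilities: one column per class, each row sums to 1
probs = clf.predict_proba(X_demo[:3])

print(labels)
print(probs.round(3))
```

Each predicted label corresponds to the column with the larger probability in the matching row.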
Support Vector Machines
Support Vector Machines (SVMs) are a more flexible type of classification algorithm: they can perform linear classification, but can also use non-linear basis functions. The following example uses a linear classifier to fit a hyperplane that separates the data into two classes:

    import sklearn as sk
    from sklearn import svm
    import pandas as pd
    import os

    os.chdir('/Users/stevenhurwitt/Documents/Blog/Classification')
    heart = pd.read_csv('SAHeart.csv', sep=',', header=0)

    y = heart.iloc[:, 9]
    X = heart.iloc[:, :9]

    SVM = svm.LinearSVC()
    SVM.fit(X, y)
    SVM.predict(X.iloc[460:, :])
    round(SVM.score(X, y), 4)

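To illustrate the non-linear basis functions mentioned above, here is a minimal sketch comparing a linear SVM with a radial basis function (RBF) kernel SVM. It uses synthetic concentric-circle data from make_circles rather than the heart data (an assumption for illustration), since no straight line can separate such classes.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC, LinearSVC

# Two classes arranged as an inner and an outer circle (synthetic data)
X_demo, y_demo = make_circles(n_samples=300, noise=0.05, factor=0.3, random_state=0)

# A linear decision boundary cannot separate concentric circles
linear_acc = LinearSVC().fit(X_demo, y_demo).score(X_demo, y_demo)

# The RBF kernel allows a curved decision boundary
rbf_acc = SVC(kernel="rbf").fit(X_demo, y_demo).score(X_demo, y_demo)

print(round(linear_acc, 4), round(rbf_acc, 4))
```

On data like this, the kernel SVM should score noticeably higher than the linear one. On real data the gap depends on how non-linear the class boundary actually is.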
Random Forests are an ensemble learning method that fits multiple Decision Trees on subsets of the data and averages the results. We can again fit them using sklearn, and use them to predict outcomes, as well as get mean prediction accuracy:

    import sklearn as sk
    from sklearn.ensemble import RandomForestClassifier

    RF = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
    RF.fit(X, y)
    RF.predict(X.iloc[460:, :])
    round(RF.score(X, y), 4)

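The averaging described above can be checked directly. The sketch below, which uses synthetic data in place of the heart dataset (an assumption), fits a forest, inspects its individual trees through the estimators_ attribute, and compares the forest's predicted probabilities with the mean of the per-tree probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data standing in for the heart dataset (an assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
forest.fit(X_demo, y_demo)

# The fitted forest holds one decision tree per estimator
n_trees = len(forest.estimators_)

# Average the class probabilities predicted by each individual tree
tree_mean = np.mean([tree.predict_proba(X_demo[:5]) for tree in forest.estimators_], axis=0)
forest_probs = forest.predict_proba(X_demo[:5])

print(n_trees)
print(np.allclose(tree_mean, forest_probs))

# Relative importance of each feature, normalized to sum to 1
importances = forest.feature_importances_
print(importances.round(3))
```

The feature_importances_ attribute gives a quick, if rough, view of which explanatory variables the trees relied on most.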
Neural Networks are a machine learning algorithm that involves fitting many hidden layers used to represent neurons that are connected with synaptic activation functions. These essentially use a very simplified model of the brain to model and predict data.
We use sklearn for consistency in this post; however, libraries such as TensorFlow and Keras are better suited to fitting and customizing neural networks, of which there are a few varieties used for different purposes:

    import sklearn as sk
    from sklearn.neural_network import MLPClassifier

    NN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
    NN.fit(X, y)
    NN.predict(X.iloc[460:, :])
    round(NN.score(X, y), 4)

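To make the hidden_layer_sizes argument more concrete, the sketch below fits the same kind of network on synthetic data with nine features (an assumption standing in for the heart data) and prints the shape of each weight matrix. The max_iter value is an illustrative choice, not a setting from the original code.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic binary data with nine features (an assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=1)

# Two hidden layers: 5 units, then 2 units; max_iter is an illustrative choice
net = MLPClassifier(solver="lbfgs", alpha=1e-5, hidden_layer_sizes=(5, 2),
                    random_state=1, max_iter=500)
net.fit(X_demo, y_demo)

# One weight matrix per connection between consecutive layers:
# input -> hidden 1, hidden 1 -> hidden 2, hidden 2 -> output
layer_shapes = [weights.shape for weights in net.coefs_]
print(layer_shapes)
```

With nine inputs and a binary response, the shapes are (9, 5), (5, 2) and (2, 1): the tuple passed to hidden_layer_sizes sets only the middle dimensions.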
While binary classification alone is incredibly useful, there are times when we would like to model and predict data that has more than two classes. Many of the same algorithms can be used with slight modifications.
Additionally, it is common to split data into training and test sets. This means we use a certain portion of the data to fit the model (the training set) and save the remaining portion to evaluate the predictive accuracy of the fitted model (the test set). There's no official rule to follow when deciding on a split proportion, though in most cases you'd want about 70% to be dedicated to the training set and around 30% to the test set.
To explore both multi-class classification and training/test data, we will look at another dataset from the Elements of Statistical Learning website. This is data used to determine which one of eleven vowel sounds was spoken:

    import pandas as pd

    vowel_train = pd.read_csv('vowel.train.csv', sep=',', header=0)
    vowel_test = pd.read_csv('vowel.test.csv', sep=',', header=0)
    vowel_train.head()

    y_tr = vowel_train.iloc[:, 0]
    X_tr = vowel_train.iloc[:, 1:]
    y_test = vowel_test.iloc[:, 0]
    X_test = vowel_test.iloc[:, 1:]

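The vowel data is read from separate training and test files. When a dataset comes as a single file, a 70/30 split like the one described earlier could be sketched with train_test_split. The data here is synthetic, and the stratify option (which keeps class proportions similar in both parts) is an optional choice for the example, not something the original code uses.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data with 100 rows, for illustration only
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

# Hold out 30% of the rows for testing; fix random_state for reproducibility
X_train, X_hold, y_train, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print(X_train.shape, X_hold.shape)
```

With 100 rows this leaves 70 for training and 30 for testing.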
We will now fit models and test them as is normally done in statistics/machine learning: by training them on the training set and evaluating them on the test set. Additionally, since this is multi-class classification, some arguments will have to be changed within each algorithm:

    import pandas as pd
    import sklearn as sk
    from sklearn.linear_model import LogisticRegression
    from sklearn import svm
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neural_network import MLPClassifier

    vowel_train = pd.read_csv('vowel.train.csv', sep=',', header=0)
    vowel_test = pd.read_csv('vowel.test.csv', sep=',', header=0)

    y_tr = vowel_train.iloc[:, 0]
    X_tr = vowel_train.iloc[:, 1:]
    y_test = vowel_test.iloc[:, 0]
    X_test = vowel_test.iloc[:, 1:]

    # Multinomial logistic regression
    # (recent scikit-learn releases deprecate the multi_class argument)
    LR = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(X_tr, y_tr)
    LR.predict(X_test)
    round(LR.score(X_test, y_test), 4)

    # SVM with a one-vs-one decision function
    SVM = svm.SVC(decision_function_shape="ovo").fit(X_tr, y_tr)
    SVM.predict(X_test)
    round(SVM.score(X_test, y_test), 4)

    # Random forest with more and deeper trees than in the binary example
    RF = RandomForestClassifier(n_estimators=1000, max_depth=10, random_state=0).fit(X_tr, y_tr)
    RF.predict(X_test)
    round(RF.score(X_test, y_test), 4)

    # Neural network with larger hidden layers than in the binary example
    NN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(150, 10), random_state=1).fit(X_tr, y_tr)
    NN.predict(X_test)
    round(NN.score(X_test, y_test), 4)

Although the implementations of these models were rather naive (in practice there are a variety of parameters that can and should be varied for each model), we can still compare the predictive accuracy across the models. This will tell us which one is the most accurate for this specific training and test dataset:
|Support Vector Machine||64.07%|
This shows us that for the vowel data, an SVM using the default radial basis function kernel was the most accurate.
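Since the models above were fitted with mostly fixed settings, one possible way to vary parameters is a cross-validated grid search. The sketch below tunes an RBF-kernel SVM on synthetic three-class data. The data and the candidate values in param_grid are assumptions chosen for illustration, not recommended settings for the vowel data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic three-class data standing in for the vowel dataset (an assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=0
)
X_train, X_hold, y_train, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Hypothetical candidate values for the SVM's C and gamma parameters
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}

# 5-fold cross-validation on the training set picks the best combination
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_
holdout_acc = search.score(X_hold, y_hold)

print(best_params)
print(round(holdout_acc, 4))
```

Because the search only ever sees the training set, the held-out accuracy remains a fair estimate of how the tuned model performs on new data.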
To summarize this post, we began by exploring the simplest form of classification: binary. This helped us to model data where our response could take one of two states. We then moved on to multi-class classification, where the response variable can take more than two states. We also saw how to fit and evaluate models with training and test sets. As a next step, we could explore additional ways to refine model fitting across the various algorithms.
Reference: stackabuse.com