Scikit-Learn is one of the most widely-used Machine Learning library in Python. It's optimized and efficient - and its high-level API is simple and easy to use.
Scikit-Learn has a plethora of convenience tools and methods that make preprocessing, evaluating and other painstaking processes as easy as calling a single method - and splitting data between a training and testing set is no exception.
Generally speaking, the rule-of-thumb for splitting data is 80/20 - where 80% of the data is used for training a model, while 20% is used for testing it. This depends on the dataset you're working with, but an 80/20 split is very common and would get you through most datasets just fine.
In this guide - we'll take a look at how to use the
split_train_test() method in Scikit-Learn, and how to configure the parameters so that you have control over the splitting process.
Assuming it isn't already installed - Scikit-Learn can easily be installed via
$ pip install scikit-learn
Once installed, you can import the library itself via:
In most cases, people avoid importing the entire library, as it's pretty vast, and import specific classes or modules that they'll be using specifically.
Note: This tends to mean that people have a hefty import list when using Scikit-Learn.
<h3 id="importanceoftrainingandtestingsets">Importance of Training and Testing Sets</h3>
The most common procedure when training a (basic) model in Machine Learning follows the same rough outline:
- Acquiring and processing data which we'll feed into a model.
Scikit-learn has various datasets to be loaded and used for training the model (iris, diabetes, digits…), mainly for benchmarking/learning.
- Splitting sets into training and test sets
- Building a model and defining the architecture
- Compiling the model
- Training the model
- Verifying the results
The training set is a subset of the whole dataset and we generally don't train a model on the entirety of the data. In non-generative models, a training set usually contains around 80% of the main dataset's data. As the name implies, it is used for training the model. This procedure is also referred to as fitting the model. There are exceptions to this rule, though. For instance, when training Generative Adversarial Networks (GANs) that generate images - how do you test the results? They're highly subjective in some cases, as they represent new instances that were never seen before. In most generative models, at least as of now, a human is typically required to judge the outputs, in which cases, a test set is totally redundant. Additionally, sometimes you need more or less than 20% for testing, and if you're using techniques such as cross-validation, you might want to have a tiny bit less testing data as to not "take away" too much from the training data. For instance, if you have 1.000.000 instances in a dataset, holding out just 5% for a testing set amounts to 50.000 instances, which is most likely more than enough for any model to be tested on. The test set is a subset of the whole dataset, and is used to evaluate the model and check how well it learnt from the training set.
The model mustn't interact or see the test set before evaluating. The data must be unknown when first evaluating, otherwise it's not really testing the model.
What About Validation Sets?
Validation sets are a common sight in professional and academic models. Validation sets are taken out of the training set, and used during training to validate the model's accuracy approximately. The testing set is fully disconnected until the model is finished training - but the validation set is used to validate it during training.
Note: The validation set isn't used for training, and the model doesn't train on the data. It just validates the current epoch. This way - it indirectly trains on the data, as it does affect its prior beliefs, so the validation set can't be used for testing.
Similar to how you'll learn more about your own knowledge if you hear it's incorrect - even if you don't know why. This is why validation sets <em>approximate</em> a models accuracy, and testing sets are still required even when using a validation set.
They help with approximating a model's actual performance during training, so you don't end up with an illusory overfit model without realizing it after testing it via a test set. You can also use validation sets to tune models, and approximately evaluate their ability without exposing them to a testing set.
Deep Learning frameworks such as Keras can display a
val_accuracy besides your regular training
accuracy as a good sign of overfitting. If they start diverging, your model is overfitting during training, and you don't need to waste time training it further when they diverge enough. Additionally, callbacks such as
EarlyStopping can be used to automatically stop a model's training if the
val_accuracy doesn't improve after
n epochs, even if the
accuracy is increasing.
Creating a validation set is easy.
You can, quite literally, just run the
train_test_split() method on the training set, which was already split by the
train_test_split() method and extract a validation set from it. Alternatively, you can manually create a validation set.
The validation set size is typically split similar to a testing set - anywhere between 10-20% of the training set is typical. For huge datasets, you can do much lower than this, but for small datasets, you can take out too much, making it hard for the model to fit the data in the training set.
In the proceeding sections, we'll also take out a validation set using the same
Scikit-Learn's datasets Module
Several clean and popular datasets are available built-into Scikit-Learn, typically used during leearning and for benchmarking models on simple tasks. If you've ever read resources regarding Machine Learning in Python - you've probably seen some of these most popular datasets:
- Iris - set of 3 classes (flowers), with 50 samples per class, used for classification.
- Diabetes - set with a total of 442 samples, used for regression.
- Digits - set of 10 classes (hand-written digits), with ~180 samples per class, used for classification.
- Wine - set of 3 classes (of wine), with total of 178 samples, used for classification.
Each of these datasets can be loaded in through the
datasets module with their respective function:
from sklearn import datasets X_iris, y_iris = datasets.load_iris(return_X_y=True) X_diabetes, y_diabetes = datasets.load_diabetes(return_X_y=True) X_digits, y_digits = datasets.load_digits(return_X_y=True) X_wine, y_wine = datasets.load_wine(return_X_y=True)
Alternatively, you can load in the specific functions instead:
from sklearn.datasets import load_iris from sklearn.datasets import load_diabetes from sklearn.datasets import load_digits from sklearn.datasets import load_wine X_iris, y_iris = load_iris(return_X_y=True) X_diabetes, y_diabetes = load_diabetes(return_X_y=True) X_digits, y_digits = load_digits(return_X_y=True) X_wine, y_wine = load_wine(return_X_y=True)
By default, these methods return a
Bunch object, containing the data and the targets (data and their clases), however, if you set the
return_X_y argument to
True, a tuple of
(data, targets) is returned, denoting the data you'll be training on and the target classes you want your classifier/regression model to hit.
Splitting a Dataset with train_test_split()
train_test_split() method resides in the
from sklearn.model_selection import train_test_split
There are a couple of arguments we can set while working with this method - and the default is very sensible and performs an 75/25 split. In practice, all of Scikit-Learn's default values are fairly reasonable and set to serve well for most tasks. However, it's worth noting what these defaults are, in the cases they don't work that well.
The main two arguments are
test_size, where their values range between
1 and their sum has to be
1. Their values denote the percentage proportion of the dataset, so even if you define just one, such as
test_size is equal to
1 - train_size, and vice versa.
Setting the train_size and test_size Arguments
This is the most common approach, which, leaves us with 4 subsets -
from sklearn.datasets import load_iris from sklearn.model_selection import train_test_split X, y = load_iris(return_X_y=True) X_train, X_test, y_train, y_test = train_test_split(X, y) print(X_train.shape) print(X_test.shape) print(y_train.shape) print(y_test.shape)
Without setting either
test_size the default values kick in, setting the
0.25, and a complementary (
(112, 4) (38, 4) (112,) (38,)
As you can see, train and test sets are split 75%/25%, as there are 112 instances in the
X_train set, and 38 instances in the
Some other split proportions are: 80%/20% (very common), 67%/33% and more rarely 50%/50%.
Setting any of these boils down to defining either one or both of the arguments in the
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8) X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, test_size=0.2) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
All three of these splits would result in the same split of:
(120, 4) (30, 4) (120,) (30,)
Creating a Validation Set with train_test_split()
Validation sets are really useful during training and make your life as a Data Scientist significantly easier.
Whenever possible, try to use a validation set.
There is no built-in function to extract a validation set from a training set, however, since this boils down to just splitting it like before - why not use the same
Let's re-use it to get our hands on a validation set, taking 10% of the data from the training set:
from sklearn.datasets import load_iris from sklearn.model_selection import train_test_split X, y = load_iris(return_X_y=True) X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8) X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, train_size=0.9) print(X_train.shape) print(X_test.shape) print(X_valid.shape)
This won't create a 70%-20%-10% split, as we're splitting 10% from the already split
X_train so we're actually ending up with a 72%-20%-8% split here:
(108, 4) (30, 4) (12, 4)
To account for this, you can either manually set a different number, expecting this, or you could define your proportions upfront, and calculate an updated split to reference the original size, instead of the already truncated size.
To split the data proportionally into a training, testing and validation set - we need to set the
test_size argument on the second function call to:
tests = validationr/(trainr+testr)
$$ Let's load in the Diabetes dataset, as it has more instances (due to rounding, small datasets ofetntimes produce slightly different splits even with same ratios):
from sklearn.datasets import load_diabetes from sklearn.model_selection import train_test_split X, y = load_diabetes(return_X_y=True) print(X.shape)
Say we're aiming for an 80%/10%/10% split - we'd want to have
45 instances respectively. Let's define these rations and split the dataset into a training, testing and validation set with the
from sklearn.datasets import load_diabetes from sklearn.model_selection import train_test_split train_ratio = 0.80 test_ratio = 0.10 validation_ratio = 0.10 X, y = load_diabetes(return_X_y=True) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_ratio) X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size=validation_ratio/(train_ratio+test_ratio)) print(X_train.shape) print(X_test.shape) print(X_valid.shape)
This results in:
(352, 10) (45, 10) (45, 10)
Awesome! Our dataset has successfully been split into three sets, which we can now feed into a model and perform validation during training to tune the hyperparameters.
Sometimes, there's a different numbers of samples for each class in a dataset. Say, one class has 100 samples, the second one has 50, the third one 30, etc. Splitting without this in mind creates an issue when you're training a classification model (though, regression models don't suffer from this).
It is best to somehow split the set, so that it preserves the proportions of the classes. This is a stratified split.
Luckily, the method
train_test_split has an argument called
stratify which takes an array which defines the number of samples by class, when splitting, to stay proportional:
from sklearn.datasets import load_iris from sklearn.model_selection import train_test_split X, y = load_iris(return_X_y=True) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
In a lot of cases, you can simply use the
y NumPy array from your dataset for a good
stratify split array. This ensures that your model can fight the lack of balance between instances of classes and becomes less biased towards some.
In this guide, we got familiar with some of the Scikit-Learn library and its
datasets module. You've learned what training, testing and validation sets are, where they're applied and the benefits of validating your models.
We've taken a look at how to employ the
train_test_split() method to split your data into a training and testing set, as well as how to separate a validation set, dynamically perserving the ratios of these sets.