Splitting Data For Machine Learning With SCIKIT-LEARN


This is an exact duplication of a blog post by Ben Alex Keen.

I have only added a few comments of my own to embellish it. I am posting it so that I can come back and refer to it in case I forget. Most people who write about these topics assume that everyone else knows what they are talking about. I don’t blame them; that’s what happens when an old fool decides to learn something about machine learning. I never really understood the concept behind the function train_test_split(). Ben explains it very succinctly. Thank you!!!

scikit-learn provides a helpful function for partitioning data, which splits out your data into a training set and a test set.

In this post we’ll show how it works.

We’ll create some fake data and then split it up into a training set and a test set.

Let’s imagine our data is modelled as follows:

y = 1 if x_0 + x_1 <= 10, otherwise y = 0

# That simply means the value of y is “1” if x_0 + x_1 is at most 10; if the sum is greater than 10, the value of y is “0”. I think that was common sense.


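import pandas as pd
import numpy as np

np.random.seed(10)  # fix the seed so the fake data is the same on every run

# 20 random integers per column, each between 1 and 10 inclusive
# (the upper bound 11 passed to randint is exclusive)
X = pd.DataFrame({
    'x_0': np.random.randint(1, 11, 20),
    'x_1': np.random.randint(1, 11, 20),
})

# The algebraic function written in Python: y is 1 if x_0 + x_1 <= 10, else 0
y = (X['x_0'] + X['x_1']).map(lambda x: 1 if x <= 10 else 0)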

So we start with 20 samples, each made up of two random integers between 1 and 10; this is our feature set.


Notice how the values of y line up with the features:
10 + 5 > 10, therefore y = 0
5 + 2 < 10, therefore y = 1, and so on and so forth.
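One way to check this alignment for every row, rather than by eye, is to put the features, their sum and the label side by side. This is only a minimal sketch: the names `check` and `total` are my own and are not part of Ben's post.

```python
import numpy as np
import pandas as pd

# Recreate the fake data exactly as in the post
np.random.seed(10)
X = pd.DataFrame({
    'x_0': np.random.randint(1, 11, 20),
    'x_1': np.random.randint(1, 11, 20),
})
y = (X['x_0'] + X['x_1']).map(lambda x: 1 if x <= 10 else 0)

# 'check' and 'total' are illustrative names, not from the original post
check = X.assign(total=X['x_0'] + X['x_1'], y=y)
print(check.head())

# The label should be 1 exactly when the sum is at most 10
rule_holds = ((check['total'] <= 10).astype(int) == check['y']).all()
print(rule_holds)  # True
```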

From these, we want to get a training set and a test set, so we can use train_test_split.

We provide the proportion of data to use as a test set (the parameter test_size), and we can also provide the parameter random_state, which is a seed to ensure repeatable results.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
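To see what "repeatable" means for random_state, the sketch below runs the same split twice with the same seed and compares which rows ended up in each set. The variable names (`a_train`, `b_test` and so on) are made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Recreate the fake data exactly as in the post
np.random.seed(10)
X = pd.DataFrame({
    'x_0': np.random.randint(1, 11, 20),
    'x_1': np.random.randint(1, 11, 20),
})
y = (X['x_0'] + X['x_1']).map(lambda x: 1 if x <= 10 else 0)

# Two calls with the same random_state (names here are illustrative)
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.25, random_state=1)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.25, random_state=1)

# The same rows land in the test set, in the same order
print(a_test.index.equals(b_test.index))  # True

# A different seed is free to pick different rows
c_train, c_test, _, _ = train_test_split(X, y, test_size=0.25, random_state=2)
print(list(a_test.index))
print(list(c_test.index))
```

If random_state is left out, the rows are shuffled differently on each run, which makes results harder to compare between runs.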

So what are X_train, X_test, y_train and y_test? Going by the function described above, “X” holds the data that might influence the value of “y”. We split the data into two sets, “train” and “test”. The “training” data helps the algorithm learn about the data, i.e. to predict the value of “y” based on the characteristics of the dataset described by the columns in “X”. Once the training is done, we need to validate it by comparing the model’s predictions for X_test with the actual values in y_test.
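The train-then-validate idea can be sketched as follows. Ben's post does not fit a model, so the choice of DecisionTreeClassifier here is purely my assumption for illustration; any scikit-learn classifier would be used the same way.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Recreate the fake data and the split exactly as in the post
np.random.seed(10)
X = pd.DataFrame({
    'x_0': np.random.randint(1, 11, 20),
    'x_1': np.random.randint(1, 11, 20),
})
y = (X['x_0'] + X['x_1']).map(lambda x: 1 if x <= 10 else 0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Assumption: a decision tree stands in for "the algorithm"
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)          # learn from the training data only

predictions = model.predict(X_test)  # predict y for rows the model has not seen
accuracy = accuracy_score(y_test, predictions)  # compare with the true y
print(accuracy)
```

The point is that y_test is never shown to the model during fitting; it is held back only to score the predictions.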

Notice that we split the dataset of 20 samples in a 3:1 ratio (test_size=0.25). Therefore

len(X_train): 15 & len(y_train): 15 vs. len(X_test): 5 & len(y_test): 5
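These sizes can be confirmed directly. The short sketch below repeats the split from the post, prints the four lengths, and also checks that no row appears in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Recreate the fake data and the split exactly as in the post
np.random.seed(10)
X = pd.DataFrame({
    'x_0': np.random.randint(1, 11, 20),
    'x_1': np.random.randint(1, 11, 20),
})
y = (X['x_0'] + X['x_1']).map(lambda x: 1 if x <= 10 else 0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

print(len(X_train), len(y_train))  # 15 15
print(len(X_test), len(y_test))    # 5 5

# Train and test rows do not overlap, and X and y stay aligned by index
print(set(X_train.index).isdisjoint(X_test.index))  # True
print(X_test.index.equals(y_test.index))            # True
```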
