This is an exact duplication of a blog post by Ben Alex Keen.
I have just added a few comments to embellish it. I am posting it so that I can go back and refer to it in case I forget. Most people who write about these topics assume that everyone else knows what they are talking about. I don’t blame them; that’s what happens when an old fool decides to learn something about machine learning. I never really understood the concept behind the function train_test_split(). Ben explains it very succinctly. Thank you!!!
scikit-learn provides a helpful function for partitioning data, which splits out your data into a training set and a test set.
In this post we’ll show how it works.
We’ll create some fake data and then split it up into test and train.
Let’s imagine our data is modelled as follows: y = 1 if x_0 + x_1 <= 10, and y = 0 otherwise.

# That simply means the value of y is “1” if x_0 + x_1 <= 10; if the sum is greater than 10, the value of y is “0”. I think that was common sense.
import pandas as pd
import numpy as np
np.random.seed(10)
X = pd.DataFrame({
    'x_0': np.random.randint(1, 11, 20),  # Generate 20 random integers, min 1, max 10
    'x_1': np.random.randint(1, 11, 20),  # Generate 20 random integers, min 1, max 10
})
# The algebraic function written in Python
y = (X['x_0'] + X['x_1']).map(lambda x: 1 if x <= 10 else 0)
So we start with 20 samples, each made of two random integers between 1 and 10. This is our feature set.


Notice how the values of X and y align, for example:
10 + 5 > 10, therefore y = 0
5 + 2 <= 10, therefore y = 1, and so on and so forth.
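To check that alignment for every row rather than a couple of examples, the features and labels can be placed side by side. This is only a minimal sketch that reuses the data generation above; the names `table`, `total` and `recomputed` are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

# Same fake data as in the post
np.random.seed(10)
X = pd.DataFrame({
    'x_0': np.random.randint(1, 11, 20),
    'x_1': np.random.randint(1, 11, 20),
})
y = (X['x_0'] + X['x_1']).map(lambda x: 1 if x <= 10 else 0)

# Put the features, their sum and the label in one table (illustrative names)
table = X.assign(total=X['x_0'] + X['x_1'], y=y)
print(table.head())

# Recompute the label without the lambda and confirm every row agrees
recomputed = (table['total'] <= 10).astype(int)
print((recomputed == table['y']).all())
```

If the last line prints True, every label matches the rule y = 1 when x_0 + x_1 <= 10.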
From these, we want to get a training set and a test set of data, so we can use train_test_split.
We provide the proportion of data to use as a test set, and we can also provide the parameter random_state, which is a seed to ensure repeatable results.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
So what are X_train, X_test, y_train and y_test? Well, from the function described above, “X” holds the data that might influence the value of “y”. We split the data into two sets, “train” and “test”. The “training” data helps the algorithm learn about the data, i.e. to be able to predict the value of “y” based on the characteristics of the dataset described by the columns in “X”. Once the training is done, we validate it by predicting “y” for X_test and comparing those predictions with the actual values in y_test.
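As a rough illustration of that train-then-validate idea, the sketch below fits a classifier on the training set and scores it on the test set. The choice of `DecisionTreeClassifier` and `accuracy_score` is an assumption made here purely for illustration; the original post does not fit a model.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same fake data and split as in the post
np.random.seed(10)
X = pd.DataFrame({
    'x_0': np.random.randint(1, 11, 20),
    'x_1': np.random.randint(1, 11, 20),
})
y = (X['x_0'] + X['x_1']).map(lambda x: 1 if x <= 10 else 0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Assumed model, chosen only for illustration
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)           # learn only from the training rows

predictions = model.predict(X_test)   # predict y for rows the model has not seen
accuracy = accuracy_score(y_test, predictions)
print(f"accuracy on the test set: {accuracy:.2f}")
```

With only 5 test rows the score moves in steps of 0.2, so it is better read as a sanity check than as a reliable estimate.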
Notice that test_size=0.25 splits the 20 samples in a 3:1 ratio. Therefore
len(X_train): 15 & len(y_train): 15 vs. len(X_test): 5 & len(y_test): 5
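Those sizes, and the repeatability that random_state is meant to give, can be checked directly. The sketch below is a minimal check under the same setup as the post; the names `X_train_again` and `X_test_again` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Same fake data as in the post
np.random.seed(10)
X = pd.DataFrame({
    'x_0': np.random.randint(1, 11, 20),
    'x_1': np.random.randint(1, 11, 20),
})
y = (X['x_0'] + X['x_1']).map(lambda x: 1 if x <= 10 else 0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)
print(len(X_train), len(X_test), len(y_train), len(y_test))  # 15 5 15 5

# Features and labels stay paired: each split shares the same row index
print(X_train.index.equals(y_train.index))
print(X_test.index.equals(y_test.index))

# No row ends up in both the training set and the test set
print(set(X_train.index) & set(X_test.index))

# Calling again with the same random_state selects the same rows
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.25, random_state=1)
print(X_test.index.equals(X_test_again.index))
```

Changing random_state to another value, or leaving it out, would generally pick a different set of 5 test rows.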