# Splitting Data for Machine Learning with scikit-learn

This is an exact duplication of a blog post by Ben Alex Keen.

I have just added a few comments to embellish it. I am posting it so that I can go back and refer to it in case I forget. Most people who write about these topics assume that everyone else knows what they are talking about. I don’t blame them, because that’s what happens when an old fool decides to learn something about machine learning. I never really understood the concept behind the function `train_test_split()`. Ben explains it very succinctly. Thank you!

scikit-learn provides a helpful function for partitioning data, which splits your data into a training set and a test set.

In this post we’ll show how it works.

We’ll create some fake data and then split it up into test and train.

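Let’s imagine our data is modelled as follows:

$$
y = \begin{cases} 1 & \text{if } x_0 + x_1 \le 10 \\ 0 & \text{otherwise} \end{cases}
$$

That simply means the value of y is “1” if x_0 + x_1 <= 10; otherwise, if the sum is greater than 10, the value of y is “0”. I think that was common sense.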

```python
import pandas as pd
import numpy as np

np.random.seed(10)  # fix the seed so the fake data is reproducible

# 20 random integers per feature, each between 1 and 10 inclusive
X = pd.DataFrame({
    'x_0': np.random.randint(1, 11, 20),
    'x_1': np.random.randint(1, 11, 20),
})

# The labelling rule written in Python: y = 1 if x_0 + x_1 <= 10, else 0
y = (X['x_0'] + X['x_1']).map(lambda x: 1 if x <= 10 else 0)
```

So we start with 20 samples, each made up of two random integers between 1 and 10; this is our feature set.
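
To look at the features and labels side by side, something like the following sketch could be used. It rebuilds the same fake data, then joins `X` and `y` into one table. The names `table`, `total`, and `rule_holds` are not from the original post; they are helpers introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the fake data exactly as in the post
np.random.seed(10)
X = pd.DataFrame({
    'x_0': np.random.randint(1, 11, 20),
    'x_1': np.random.randint(1, 11, 20),
})
y = (X['x_0'] + X['x_1']).map(lambda x: 1 if x <= 10 else 0)

# Put features, their sum, and the label in one table for inspection
# ('total' is an illustrative helper column, not part of the feature set)
table = X.assign(total=X['x_0'] + X['x_1'], y=y)
print(table)

# Check that every row follows the rule: y is 1 exactly when total <= 10
rule_holds = bool(((table['total'] <= 10).astype(int) == table['y']).all())
print(rule_holds)
```

Printing the table makes it easy to check each row by eye, and the final check does the same comparison for all 20 rows at once.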

Notice how the values align: 10 + 5 > 10, therefore y = 0; 5 + 2 < 10, therefore y = 1; and so on.

From these, we want to get a training set and a test set, so we can use `train_test_split`.

We provide the proportion of the data to use as the test set via `test_size`, and we can also provide the parameter `random_state`, which is a seed that ensures repeatable results.

`from sklearn.model_selection import train_test_split`

`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)`

With `test_size=0.25`, 5 of the 20 samples go to the test set and the remaining 15 go to the training set:

`len(X_train): 15 & len(y_train): 15 vs. len(X_test): 5 & len(y_test): 5`
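
As a rough sketch of what this means in practice, the snippet below repeats the split and checks three things: the sizes, that features and labels stay paired row by row, and that the same `random_state` gives the same split again. The second set of variables (`X_train2` and so on) and the boolean names are introduced here only for the comparison; they are not in the original post.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild the fake data exactly as in the post
np.random.seed(10)
X = pd.DataFrame({
    'x_0': np.random.randint(1, 11, 20),
    'x_1': np.random.randint(1, 11, 20),
})
y = (X['x_0'] + X['x_1']).map(lambda x: 1 if x <= 10 else 0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# 25% of 20 rows is 5, so 5 rows are held out and 15 are left for training
print(len(X_train), len(X_test))  # 15 5

# Features and labels are split together, so their row indices still match
paired = X_train.index.equals(y_train.index) and X_test.index.equals(y_test.index)
print(paired)

# Calling again with the same random_state selects the same rows
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.25, random_state=1)
repeatable = X_test.index.equals(X_test2.index)
print(repeatable)
```

Leaving `random_state` unset would generally give a different selection of rows on each run, which is why fixing it is handy when you want to come back to the same example later.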