Split the available dataset into training and test sets

June 03, 2021

How to split a dataset into training and test sets

We train a model on data that we call the training data or training set. Each example in the training set already has the actual value that the model should predict, so the learning algorithm adjusts the model's parameters to fit those examples.

But how do we know whether the trained model is actually good?
For that, we have the test data (test set): separate data for which we also know the actual values, but which was never shown to the model during training. If the trained model also performs well on the test set, we can be reasonably confident that the machine learning model generalizes to new data.

If the model is never tested and is built only to perform well on the training data, its parameters may end up good only for predicting the values of the examples in the training set. Such a model does not generalize. This is called overfitting.

Holding out a test set keeps us from ending up with a useless model that is good only on the training set and not general enough.
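To make the idea of overfitting concrete, here is a minimal sketch. It assumes synthetic data (y = 3x plus noise) and an unrestricted decision tree, neither of which comes from this post's dataset; they are chosen only because such a tree can memorize its training set. It uses train_test_split, which is explained in the steps that follow.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): y = 3x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + rng.normal(0, 2, size=200)

# Hold out 30% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A decision tree with no depth limit can memorize the training rows
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)  # R^2 on data the model has seen
test_score = model.score(X_test, y_test)     # R^2 on data the model has not seen
print(f"train R^2: {train_score:.3f}")
print(f"test  R^2: {test_score:.3f}")
```

The training score alone looks perfect because the tree has also fitted the noise. Only the lower test score reveals how the model behaves on data it has not seen.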

Split a dataset into training and test sets in Python:

First, we load the dataset.
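The dataset used in this post is not reproduced here, so the sketch below loads a small hypothetical CSV with pandas. The column names (hours_studied, previous_score, final_score) and the values are made up for illustration. With a real file you would pass its path to pd.read_csv instead of the in-memory text.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV file (made-up columns and values).
# With a real file: df = pd.read_csv("path/to/your_file.csv")
csv_text = """hours_studied,previous_score,final_score
1,52,50
2,55,54
3,60,61
4,58,63
5,65,70
6,70,74
7,68,77
8,75,82
9,80,88
10,85,93
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first five rows
print(df.shape)    # (number of rows, number of columns)
```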

Now we declare the X and y variables (the features and the target), which we pass to the fit method when training our model.
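One common way to do this, sketched with a hypothetical DataFrame (the column names are assumptions, not the ones from this post's dataset): every column except the target goes into X, and the target column becomes y.

```python
import pandas as pd

# Hypothetical data (made-up columns and values for illustration)
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "previous_score": [52, 55, 60, 58, 65, 70, 68, 75, 80, 85],
    "final_score": [50, 54, 61, 63, 70, 74, 77, 82, 88, 93],
})

# X: the input features (a 2-D table), y: the value we want to predict (1-D)
X = df.drop(columns=["final_score"])
y = df["final_score"]

print(X.shape)  # rows x feature columns
print(y.shape)  # one target value per row
```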

Now we use the train_test_split function from the sklearn library's model_selection module to split the data into training and test data.

Here test_size = 0.3 means that 30% of the data becomes the test set and the remaining 70% is used to train our ML model.
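A minimal sketch of the split, assuming 100 rows of placeholder numbers rather than the dataset from this post, so that the 70/30 proportions are easy to verify:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (an assumption for illustration): 100 rows, 2 features
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# test_size=0.3 holds out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

print(X_train.shape, y_train.shape)  # training portion
print(X_test.shape, y_test.shape)    # test portion
```

With 100 rows, 70 end up in the training set and 30 in the test set. Which rows land in which set changes from run to run unless random_state is fixed.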

Let's check how X_train and y_train look. NOTE that the rows are selected randomly.
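The sketch below shows one way to inspect the result, using a tiny hypothetical DataFrame (made-up columns feature and target). It also passes random_state, which is not mentioned above but is a real train_test_split parameter that makes the random selection repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data (made-up for illustration)
df = pd.DataFrame({"feature": range(10), "target": [v * 2 for v in range(10)]})
X = df[["feature"]]
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Rows keep their original index labels, so the shuffling is visible
print(X_train.head())
print(y_train.head())

# X_train and y_train stay aligned: same index labels, same order
aligned = list(X_train.index) == list(y_train.index)

# The same random_state reproduces the same split
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.3, random_state=42)
same_split = X_train.index.equals(X_train_again.index)
print(aligned, same_split)
```

Fixing random_state is helpful when you want to compare models on exactly the same split. Leaving it out gives a different random split each time.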

Hope you understand.

If you want to download the dataset, click here.
