
[ Learning curves with separate train/test data in scikit-learn instead of cross-validation ]

I have my training and testing data stored separately (loaded from different CSV files into different pandas DataFrames), and I want to plot a learning curve using this training and testing data, rather than the train/test splits generated from the training set itself by cross-validation (which seems to be the usual way learning_curve works).

It seems like scikit-learn expects the training and testing data to be in the same DataFrame, but that way the classifier would learn from the test data as well, which is not what I want.

How can I go about solving this problem? I am new to scikit-learn.

Answer 1

You will need to keep your training and test data separate (at least in separate variables within the code). The learning curve can then be computed on the training set alone; learning_curve splits off its own validation folds from that data by cross-validation. This way you can tune your experiment without touching the test set (and so avoid overfitting to it).
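A minimal sketch of computing the learning curve on the training set only. The synthetic data from make_classification, the train_test_split call that stands in for the two CSV files, and the choice of LogisticRegression are all assumptions for illustration; none of them come from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve, train_test_split

# Synthetic stand-in for the two separately loaded CSV files
# (assumption: a binary classification task).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Only the training set is passed in, so the test set is never seen here.
# learning_curve carves validation folds out of X_train by cross-validation.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X_train,
    y_train,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
)

# Each score array has one row per training size and one column per CV fold;
# averaging over the folds gives one point per training size.
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
```

Plotting mean_train and mean_val against train_sizes (for example with matplotlib) gives the usual learning curve, with X_test and y_test still untouched.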

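To verify how well you are doing on the test set, be aware that scikit-learn's validation_curve also scores by cross-validation on the data you pass it (it varies a hyperparameter rather than the training set size), so on its own it does not evaluate against a separate test set. To score against your own test set, either fit on the training data and call score on the test data directly, or pass learning_curve a cv object such as PredefinedSplit that marks the test rows as the only test fold.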
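One way to get a learning curve scored against a separately loaded test set is to stack both sets and tell learning_curve which rows belong to which side via PredefinedSplit. The sketch below is an illustration under assumptions, not the only approach: the synthetic data, the train_test_split call standing in for the two CSV files, and LogisticRegression are all hypothetical. In PredefinedSplit, a test_fold value of -1 keeps a row out of every test set, so the training rows are never scored on and the test rows are never trained on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import PredefinedSplit, learning_curve, train_test_split

# Synthetic stand-in for the two separately loaded CSV files (assumption).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Stack both sets, then label each row:
#   -1 -> never placed in a test fold (the training rows)
#    0 -> member of the single test fold (the test rows)
X_all = np.vstack([X_train, X_test])
y_all = np.concatenate([y_train, y_test])
test_fold = np.concatenate(
    [np.full(len(X_train), -1), np.zeros(len(X_test), dtype=int)]
)
cv = PredefinedSplit(test_fold)

# With a single predefined split, the model is fit on growing subsets of the
# training rows and scored on the fixed test rows each time.
train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X_all,
    y_all,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=cv,
)

# There is only one fold, so each score array has a single column.
train_curve = train_scores.ravel()
test_curve = test_scores.ravel()
```

With pandas DataFrames, pd.concat would play the role of np.vstack here. By default learning_curve takes the first n training rows at each size, so if the training CSV is sorted by label, passing shuffle=True (with a random_state) is probably worth doing.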