I have a question about the following code.

import numpy as np
from sklearn.model_selection import KFold

x = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [3, 4], [3, 4], [3, 4], [3, 4], [3, 4], [3, 4]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 1])
kf = KFold(n_splits=4)
for train_idx, test_idx in kf.split(x, y):  #
    print("train_idx:", train_idx, "test_idx:", test_idx)

On the line marked with # in this code, both arrays x and y are passed as arguments to kf.split. Is this y necessary? The same question applies to StratifiedKFold. The objective variable y is paired with x, so once an index is chosen for x, the corresponding y is simply the element at that same index. In fact, in the official documentation the argument y defaults to None, and when I rewrite the code at hand as

for train_idx, test_idx in kf.split(x):  #
    print("train_idx:", train_idx, "test_idx:", test_idx)

it appears to work correctly.
However, many sample codes pass both the explanatory variables and the objective variable as arguments.
Reference books add to the confusion: depending on the book, either only x or both x and y are passed.

Why not just split using the explanatory variables alone?
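
To make the question concrete, here is a minimal sketch that compares the indices produced with and without y for the same data. The names splits_with_y, splits_without_y and same are introduced only for this illustration; it assumes the default KFold without shuffling.

```python
import numpy as np
from sklearn.model_selection import KFold

x = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [3, 4],
              [3, 4], [3, 4], [3, 4], [3, 4], [3, 4]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 1])

kf = KFold(n_splits=4)  # no shuffling, so the folds are deterministic

# Collect the (train_idx, test_idx) pairs for both call styles.
splits_with_y = list(kf.split(x, y))
splits_without_y = list(kf.split(x))

# Compare the two results fold by fold.
same = all(
    np.array_equal(tr_a, tr_b) and np.array_equal(te_a, te_b)
    for (tr_a, te_a), (tr_b, te_b) in zip(splits_with_y, splits_without_y)
)
print("identical splits:", same)
```

If the two call styles really are equivalent for KFold, the comparison should report that every fold matches.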

  • Answer # 1

    For KFold.split, y is optional. It works even if you omit it, but passing it is better for readability. Also, for the reason described below (y is required for StratifiedKFold.split), passing it lets you swap KFold and StratifiedKFold easily (compatible code).

    sklearn.model_selection.KFold — scikit-learn 0.21.3 documentation

    y is a required positional argument in StratifiedKFold.split. As the documentation puts it, "Stratification is done based on the y labels.", so stratification is not possible without y.

    sklearn.model_selection.StratifiedKFold — scikit-learn 0.21.3 documentation
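
The points in this answer can be sketched roughly as follows, reusing the data from the question. The variable names missing_y_error and zeros_per_test_fold are hypothetical, chosen only for this example. The sketch first calls StratifiedKFold.split without y to show that it is rejected, then runs one loop over both splitters with y passed, counting how many class-0 samples land in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

x = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [3, 4],
              [3, 4], [3, 4], [3, 4], [3, 4], [3, 4]])
y = np.array([1, 1, 1, 0, 0, 0, 0, 1, 1, 1])

# Omitting y fails for StratifiedKFold because y is a required argument.
try:
    StratifiedKFold(n_splits=4).split(x)
    missing_y_error = None
except TypeError as exc:
    missing_y_error = exc
print("error without y:", missing_y_error)

# Passing y lets the same loop work unchanged with either splitter.
zeros_per_test_fold = {}
for splitter in (KFold(n_splits=4), StratifiedKFold(n_splits=4)):
    name = type(splitter).__name__
    zeros_per_test_fold[name] = [
        int(np.sum(y[test_idx] == 0)) for _, test_idx in splitter.split(x, y)
    ]
    print(name, "class-0 samples per test fold:", zeros_per_test_fold[name])
```

With plain KFold the class-0 samples fall wherever they happen to sit in the array, while StratifiedKFold uses y to spread them across the folds, which is why it cannot work from x alone.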