
My understanding still has some vague points, so please point out any mistakes I have made.
My understanding is as follows.

"Learning data": Data used for learning.
"Verification data": Data for confirming whether overfitting is performed. Don't wear it with learning data.
"Test data": Data for confirming generalization performance to see if it can be applied to each unknown. Don't suffer from both training data and verification data.
"Cross-validation": A method for verifying when the amount of data is small.
"K-folded cross-validation method": The training data is divided into K pieces, and one of them is used as "validation data" for verification.

With that level of understanding, when I looked into K-fold cross-validation, I found some sites running K-fold cross-validation across the entire dataset without splitting off any test data.
Which is the correct method for cross-validation?
Or is running it over everything acceptable in practice when there is no need for a separate test set in the first place?
The following are examples of such sites. Thank you in advance for your answers.

https://qiita.com/matsukura04583/items/042fcbf1bc594dfca7a4
https://data-analysis-stats.jp/python/%E3%83%9B%E3%83%BC%E3%83%AB%E3%83%89%E3%82%A2%E3%82%A6%E3%83%88%E6%A4%9C%E8%A8%BC%E3%81%A8%E4%BA%A4%E5%B7%AE%E6%A4%9C%E8%A8%BC/

In cross-validation, the prepared data is divided into K pieces. In the first round, one piece is used as test data and the remaining pieces are used as training data for training and evaluation. The second round uses a different piece as test data, the third round uses yet another piece, and so on.
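
A minimal sketch of the pattern those sites follow (cross-validation over the whole dataset, with no separate test split), assuming scikit-learn and its bundled iris dataset purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)           # illustrative dataset
model = LogisticRegression(max_iter=1000)

# 5 folds: each fold is used once for evaluation, the other 4 for training.
scores = cross_val_score(model, X, y, cv=5)
print(scores, scores.mean())
```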

  • Answer #1

    The answer to the question "Which is the correct method for cross-validation?" is that both are correct. The reasoning is as follows.

    My own understanding was also fuzzy, so I took this question as an opportunity to investigate. For basic topics like this that are easy to get wrong, I think it is better to rely on authoritative sources rather than assorted blogs and tech sites. So I checked the University of Tokyo Matsuo Lab's DL4US as a representative Japanese source and the Google Machine Learning Crash Course as a representative US source.

    As a result, both present the same view, but it differs from the one in the link in another answer (which is also how I had understood it myself).

    Description of the dataset in DL4US
    Google Machine Learning Crash Course - Training and Test Sets: Splitting Data
    Google Machine Learning Crash Course - Validation Set: Another Partition

    In DL4US, the description is: "It is common practice to prepare a test (validation) dataset in advance, separately from the training dataset, and to evaluate prediction accuracy on that test (validation) dataset after training. (Strictly speaking, it is called test data when it is used only for evaluation, and validation data when the evaluation is also used for model selection (tuning hyperparameters, etc.). After the model is selected based on the validation evaluation, the final evaluation is measured on the test data.)"

    Google explains training/test and training/validation/test separately, and for the latter says: "'Tweaking a model' means adjusting anything about the model you can dream up, from changing the learning rate, to adding or removing features, to designing a completely new model from scratch. At the end of this workflow, you pick the model that does best on the test set. You can greatly reduce your chances of overfitting by partitioning the data set into the three subsets shown in the following figure." (Loose translation: you tweak the model and use the test data to pick the best one, but to reduce overfitting it is better to divide the dataset into three parts.)

    Putting these two sources together, the following can be seen.

    The idea described on Japanese blogs and tech sites, i.e. splitting into training data + test data and then further splitting the training data into training and validation parts, does not appear in at least these two sources.

    It is not that a 3-way split is presented as a better model than a 2-way split; the 2-way and 3-way splits are presented as parallel options.

    The selection policy is that a 3-way split is preferable when model tuning and selection are involved.

    In short, if no model tuning or selection is involved, or if it is involved but there is little risk of overfitting, all the data can be cross-validated. In general, however, models are tuned and selected in some way, so I think the 3-way split is the more common case (a sketch of that workflow follows below).
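
    A minimal sketch of that 3-way workflow, assuming scikit-learn; the dataset, the SVC model, and the parameter grid are illustrative choices, not something prescribed by DL4US or Google:

    ```python
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)                      # illustrative dataset
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)               # hold out the test set first

    # Model selection: cross-validation on the training portion only.
    search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
    search.fit(X_train, y_train)

    # One final evaluation on the untouched test set.
    print(search.best_params_, search.score(X_test, y_test))
    ```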

    I found this line of thinking convincing. What do you think?

    * I am not claiming that the approach described on Japanese blogs and tech sites is incorrect; I am only pointing out that it differs from these sources. In implementation, the data is first split into train and test, and the validation set is then cut out of the train portion, so the approach described on Japanese blogs and tech sites is easy to understand from an implementation point of view (see the sketch below).
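
    For reference, a minimal sketch of that implementation order (split off the test set first, then cut the validation set out of the train portion); the placeholder arrays and the 80/20 and 75/25 ratios are only an example:

    ```python
    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(100, 4)        # placeholder features
    y = np.random.randint(0, 2, 100)  # placeholder labels

    # First split off the test set, then cut the validation set out of the rest.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)
    # Result: 60% train, 20% validation, 20% test.
    ```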