The other day, I received advice that it would be better to use a train/validation/test split when selecting operational parameters, and I am thinking of trying it.
At present, I am considering the following ratios (a split sketch follows the list):
- train 80%
- validation 10%
- test 10%
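For reference, a minimal sketch of such an 80/10/10 split using scikit-learn's train_test_split (the dummy data is just an assumption for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Dummy data for illustration only; replace with your own X and y.
X, y = make_classification(n_samples=1000, random_state=42)

# 80% train; then split the remaining 20% in half -> 10% validation, 10% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42)
```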
Result example 1
- validation accuracy: 60%
- test accuracy: 60%

Result example 2
- validation accuracy: 70%
- test accuracy: 50%

Result example 3
- validation accuracy: 50%
- test accuracy: 70%
Given the above results, I think the one with the smaller difference in accuracy (result example 1) is better. Is that correct?
How should I adjust if the difference between validation and test accuracy is large?
▼ Environment etc.
- Windows 10
- Python 3.7
- Machine learning: XGBoost
- Optuna for parameter tuning
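For context, a minimal sketch of how the validation set could drive the Optuna search while the test set is held back for a single final check (the parameter ranges and trial count are assumptions, not a recommendation):

```python
import optuna
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Dummy data for illustration; replace with your own X and y.
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

def objective(trial):
    # Optuna tunes against validation accuracy only.
    params = {
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
    }
    model = xgb.XGBClassifier(**params)
    model.fit(X_train, y_train)
    return accuracy_score(y_valid, model.predict(X_valid))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)

# The test set is touched exactly once, after tuning is finished.
best = xgb.XGBClassifier(**study.best_params)
best.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, best.predict(X_test)))
```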
Answer # 1
There is no particularly general rule; the split is adjusted to fit the data and the purpose.
The more training data you have, the better the model can learn, but the smaller the validation and test sets become, the noisier the evaluation gets and the easier it is to overfit to them while tuning.
If I had to give a number, the training data is usually somewhere around 50% to 80%.
There are also techniques such as cross-validation, rather than a single train/validation split.
(Reference: I tried to sort out the types of cross validation)
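As a minimal sketch, 5-fold cross-validation with scikit-learn could look like this (the fold count and the bare XGBClassifier are assumptions):

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Dummy data for illustration; replace with your own X and y.
X, y = make_classification(n_samples=1000, random_state=42)

# Each sample is used for validation exactly once across the 5 folds.
scores = cross_val_score(xgb.XGBClassifier(), X, y, cv=5, scoring="accuracy")
print("mean accuracy:", scores.mean(), "+/-", scores.std())
```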
I think that is correct. However, the more meaningful comparison is between the training accuracy and the validation/test accuracy.
If the gap is large, check whether the data distribution differs between the training, validation, and test sets, and re-split so that it is balanced if it is biased.
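For example, stratifying the split keeps the class ratio comparable across all three sets; a sketch assuming classification labels in y:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced dummy data to make the effect of stratification visible.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)

# stratify= keeps roughly the same class ratio in every split.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
```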
In addition, in the case of result example 2 (in other words, when the training accuracy is high and the validation/test accuracy is low), there is a high possibility of overfitting, so the following can be considered (see the sketch after this list):
- adjust the algorithm's hyperparameters to suppress overfitting
- increase the ratio of training data
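As a sketch, XGBoost hyperparameters that commonly rein in overfitting look like this (the concrete values are assumptions to show the direction; they should be tuned, e.g., with Optuna):

```python
import xgboost as xgb

# Each setting either constrains tree complexity or adds regularization.
model = xgb.XGBClassifier(
    max_depth=3,           # shallower trees -> simpler model
    min_child_weight=5,    # require more evidence before splitting
    subsample=0.8,         # sample 80% of the rows for each tree
    colsample_bytree=0.8,  # sample 80% of the features for each tree
    reg_alpha=0.1,         # L1 regularization on leaf weights
    reg_lambda=1.0,        # L2 regularization on leaf weights
)
```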
For result example 3 (in other words, when the training accuracy is low and the validation/test accuracy is high), it is a little harder to reason about, but I would first check whether the distribution of each dataset is biased.