What is a typical ratio for each data split when using XGBoost?

The other day, I was advised that it would be better to split the data into train/validation/test sets when tuning parameters, and I am planning to try it.

At present, I am considering the following ratios (a rough sketch of the split I have in mind follows the list).

  • train 80%
  • validation 10%
  • test 10%
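
For reference, this is roughly how I imagine making the 80/10/10 split (a minimal sketch, assuming scikit-learn's train_test_split; X and y are placeholders for my features and labels):

    from sklearn.model_selection import train_test_split

    # First split off 20% as a temporary holdout set (80% remains for training).
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    # Split the holdout in half: 10% validation and 10% test overall.
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, random_state=42, stratify=y_tmp
    )
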
How should I interpret the accuracy on the validation and test data when evaluating the trained model?

Result example 1
validation accuracy 60%
test accuracy 60%

Result example 2
validation accuracy 70%
test accuracy 50%

Result example 3
validation accuracy 50%
test accuracy 70%

In the result examples above, I think the one with the smaller difference in accuracy (result example 1) is better. Is that correct?
How should I adjust things if the difference between the validation and test accuracy is large?


▼ Environment etc.
Windows 10
Python 3.7
Machine learning: XGBoost
I am using optuna to adjust the parameters.

  • Answer # 1

    What is a typical ratio for each data split when using XGBoost?

    There is no particularly standard ratio; I think it is adjusted according to the data and the purpose.
    The more training data you have, the better the model can learn, but the less validation and test data you have, the more likely you are to end up overfitting (and the less reliable the evaluation becomes).
    If I had to give a rough guideline, the training data is usually somewhere around 50% to 80%.
    There are also techniques such as cross-validation, rather than a single train/validation split (a sketch follows the reference below).
    Reference: I tried to sort out the types of cross validation
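
    As an illustration, a minimal sketch of 5-fold cross-validation with XGBoost, assuming the scikit-learn wrapper (XGBClassifier) and placeholder data X, y:

        from sklearn.model_selection import StratifiedKFold, cross_val_score
        from xgboost import XGBClassifier

        model = XGBClassifier()
        # Stratified folds keep the class ratio roughly the same in every fold.
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        print(scores.mean(), scores.std())  # average accuracy and its spread across folds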

    How should I interpret the accuracy on the validation and test data when evaluating the trained model?

    In the result examples above, I think the one with the smaller difference in accuracy (result example 1) is better. Is that correct?
    How should I adjust things if the difference between the validation and test accuracy is large?

    I think that is correct. However, it is usually more a matter of comparing the training accuracy with the validation/test accuracy.
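
    For example, a short sketch that prints the accuracy on all three splits (model is assumed to be a fitted XGBClassifier; the X_*/y_* variables are placeholders for the splits):

        from sklearn.metrics import accuracy_score

        splits = {
            "train": (X_train, y_train),
            "validation": (X_val, y_val),
            "test": (X_test, y_test),
        }
        for name, (X_part, y_part) in splits.items():
            acc = accuracy_score(y_part, model.predict(X_part))
            print(f"{name} accuracy: {acc:.3f}")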

    If the gap is large, check the distribution of the data in the training, validation, and test sets, and re-split (shuffle) it if it is biased.
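
    For instance, a quick sketch for checking whether the class ratios differ between the splits (pandas is assumed; y_train, y_val, y_test are placeholders for the labels of each split):

        import pandas as pd

        for name, labels in [("train", y_train), ("validation", y_val), ("test", y_test)]:
            # Relative frequency of each class in this split.
            ratios = pd.Series(labels).value_counts(normalize=True)
            print(name, ratios.round(3).to_dict())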

    In addition:

    For result example 2 (in other words, when the training accuracy is high and the validation/test accuracy is low),
    there is a high possibility of overfitting, so options include:
      • adjusting the hyperparameters of the algorithm to suppress overfitting (see the sketch after this list)
      • increasing the ratio of training data
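
    As a concrete illustration, a minimal sketch of regularization-oriented settings in the XGBoost scikit-learn API (the parameter names are real XGBoost parameters, but the values are only placeholders; in practice they would form the search space handed to optuna):

        from xgboost import XGBClassifier

        model = XGBClassifier(
            max_depth=4,           # shallower trees generalize better
            min_child_weight=5,    # require more samples per leaf
            subsample=0.8,         # row subsampling per tree
            colsample_bytree=0.8,  # column subsampling per tree
            reg_lambda=1.0,        # L2 regularization
            learning_rate=0.05,    # smaller steps per boosting round
            n_estimators=500,
        )
        # early_stopping_rounds stops boosting when the validation score stops
        # improving (in newer xgboost versions it is passed to the constructor
        # instead of fit()).
        model.fit(
            X_train, y_train,
            eval_set=[(X_val, y_val)],
            early_stopping_rounds=50,
            verbose=False,
        )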

    For result example 3 (in other words, when the training accuracy is low and the validation/test accuracy is high),
    it is a little harder to reason about, but I would first check whether the distribution of each split is biased.