
I am training a model with scikit-learn's RandomForestRegressor and visualizing the trees with the Graphviz module. RandomForestRegressor has a bootstrap argument that controls whether bootstrap sampling is used, and I noticed that the samples value shown in the Graphviz output changes depending on that setting. For example, with 85 training samples I got:

When bootstrap = True: samples = 56
When bootstrap = False: samples = 85

This is consistent with the statement that, with bootstrap, about 66% of the original training samples are used, with duplicates. However, bootstrap draws as many samples as there are in the training set, so seeing the display change from samples = 85 to samples = 56 feels a little strange when I think about how many samples are actually used for training. Of course, if samples means the number of distinct samples used for training, then I would conclude that the bootstrap sampling is being performed correctly.
I would appreciate it if someone could explain this.

  • Answer # 1

    The bootstrap in a random forest draws N samples at random, with replacement, from the N training samples.
    Duplicates therefore occur at random. If M is the number of unique samples after the draw, then for reasonably large N the following relationship holds.

    M ≈ N × (0.63 to 0.64)

    This may sound strange, but the link cited by the questioner gives the theoretical background. If you instead ask whether one specific sample is drawn (hit) at least once, this becomes the well-known gacha problem: the probability of at least one hit in N draws at odds of 1/N each is 1 - (1 - 1/N)^N, which approaches 1 - 1/e, about 63%, as N grows.
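    As a rough check of the 63% figure, here is a minimal sketch using only the Python standard library. The function names are made up for this illustration and have nothing to do with scikit-learn's internals. It compares the formula 1 - (1 - 1/N)^N with a simulated bootstrap draw for N = 85.

```python
import math
import random


def expected_unique_fraction(n):
    """Probability that one given sample is drawn at least once
    in n draws with replacement from n samples."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_unique_fraction(n, trials=2000, seed=0):
    """Average fraction of unique samples over repeated bootstrap draws."""
    rng = random.Random(seed)
    total_unique = 0
    for _ in range(trials):
        # One bootstrap draw: n indices chosen with replacement.
        drawn = {rng.randrange(n) for _ in range(n)}
        total_unique += len(drawn)
    return total_unique / (trials * n)


if __name__ == "__main__":
    n = 85
    print("formula   :", expected_unique_fraction(n))
    print("simulation:", simulated_unique_fraction(n))
    print("limit 1-1/e:", 1.0 - math.exp(-1.0))
```

    For N = 85 the formula gives roughly 0.634, i.e. about 54 unique samples on average, so an observed value of 56 in a single tree is within the expected random variation.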

    From this we can infer that Graphviz outputs samples = M (the number of unique samples). With bootstrap = False, every sample is selected exactly once, so samples = N (= M).

    By the nature of a decision tree, identical samples always fall on the same side of a split. Having several copies of a sample therefore acts only as a weight on that sample. So, staying faithful to the principle of random forests, it is more convenient to manage the data internally as M unique weighted samples than as N samples with duplicates. I think that is why the value is displayed this way.
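    The weighting idea above can be sketched as follows. This only illustrates the reasoning; it is not scikit-learn's code, and the helper names bootstrap_as_weights and split_counts are hypothetical. A bootstrap draw is stored as one weight per unique sample, and a split then reports both the unique count and the weighted count.

```python
import random
from collections import Counter


def bootstrap_as_weights(n, seed=0):
    """Draw n indices with replacement and return {index: times_drawn}.
    Indices that were never drawn are simply absent."""
    rng = random.Random(seed)
    return Counter(rng.randrange(n) for _ in range(n))


def split_counts(weights, values, threshold):
    """Return (unique, weighted) sample counts on the left side of a split.
    All copies of a sample share one feature value, so they move together."""
    left = [i for i in weights if values[i] <= threshold]
    return len(left), sum(weights[i] for i in left)


if __name__ == "__main__":
    n = 85
    weights = bootstrap_as_weights(n)
    print("unique samples M :", len(weights))
    print("total weight N   :", sum(weights.values()))
    # Toy feature: the value of sample i is just i.
    print("left of split    :", split_counts(weights, list(range(n)), 40))
```

    The total weight always equals N, while the number of keys is M, which is the quantity the answer suspects is shown as samples.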

    For these reasons the inference seems correct, but unfortunately I could not find any statement online that confirms or refutes it.