I am training a model with scikit-learn's RandomForestRegressor and visualizing the trees with the Graphviz module. RandomForestRegressor takes a bootstrap argument that controls whether bootstrap sampling is performed, and I noticed that the samples value shown in the Graphviz output changes depending on that setting. For example, with 85 training samples I get:
bootstrap = True: samples = 56
bootstrap = False: samples = 85
This is consistent with the statement that, with bootstrap, about 66% of the original training samples are used, with duplicates. However, bootstrap draws as many samples as the training set contains, so seeing samples = 56 instead of samples = 85 feels a little strange if the value is meant to be the number of samples used for training. Of course, if it is the number of distinct samples used for training, then the bootstrap sampling is being performed correctly.
I would appreciate it if someone could explain this.
Answer # 1
The Random Forest bootstrap draws N samples at random from the N training samples, with replacement.
Duplicates therefore occur at random. If M is the number of unique samples after the draw, then for reasonably large N the following relationship holds approximately.
M ≈ 0.63 N to 0.64 N
It sounds strange, but the link the questioner cited has a theoretical background. If you view this as the question of whether one specific sample is drawn (hit) at least once in N draws, each with probability 1/N, it becomes the well-known gacha problem: the probability is 1 - (1 - 1/N)^N, which approaches 1 - 1/e as N grows. Again, the value is about 63%.
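The roughly 63% figure can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names, the number of trials, and the seed are illustrative choices and do not come from scikit-learn.

```python
import random


def expected_unique_fraction(n):
    """Probability that one given sample is drawn at least once
    in n draws with replacement from n samples."""
    return 1.0 - (1.0 - 1.0 / n) ** n


def simulated_unique_fraction(n, trials=2000, seed=0):
    """Average fraction of unique samples over repeated bootstrap draws."""
    rng = random.Random(seed)
    total_unique = 0
    for _ in range(trials):
        # One bootstrap draw: n indices chosen with replacement.
        drawn = {rng.randrange(n) for _ in range(n)}
        total_unique += len(drawn)
    return total_unique / (trials * n)


n = 85  # number of training samples in the question
print("closed form :", expected_unique_fraction(n))
print("simulation  :", simulated_unique_fraction(n))
```

A single draw fluctuates around this average, so an individual tree reporting 56 unique samples out of 85 (about 66%) is not surprising.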
Therefore, it can be inferred that Graphviz outputs samples = M (the number of unique samples). If bootstrap = False, all samples are selected, so samples = N (= M).
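One way to probe this inference is to look at the root node of each fitted tree directly. The sketch below assumes that the samples value shown by Graphviz corresponds to the tree_.n_node_samples attribute, and that tree_.weighted_n_node_samples holds the weighted count; the data is random and purely illustrative. It is an empirical check on whatever scikit-learn version is installed, not a documented guarantee.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data with the same sample count as in the question.
rng = np.random.RandomState(0)
N = 85
X = rng.rand(N, 3)
y = X[:, 0] + 0.1 * rng.randn(N)


def root_counts(bootstrap):
    """Return (n_node_samples, weighted_n_node_samples) at the root of each tree."""
    forest = RandomForestRegressor(
        n_estimators=5, bootstrap=bootstrap, random_state=0
    )
    forest.fit(X, y)
    return [
        (int(est.tree_.n_node_samples[0]),
         float(est.tree_.weighted_n_node_samples[0]))
        for est in forest.estimators_
    ]


with_bootstrap = root_counts(True)
without_bootstrap = root_counts(False)
print("bootstrap=True :", with_bootstrap)
print("bootstrap=False:", without_bootstrap)
```

If the inference holds, the first number in each pair is below N when bootstrap=True and equal to N when bootstrap=False, while the weighted count stays at N in both cases.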
By the nature of a decision tree, identical samples always fall on the same side of any split. Having several copies of a sample therefore has no effect other than weighting it. So, staying true to the principle of random forests, it is more convenient to manage the data internally as M unique weighted samples than as N samples with duplicates, and I think that is why the value is displayed this way.
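The equivalence between duplicated samples and weighted unique samples can be illustrated with a small standard-library sketch. The feature values, targets, and threshold below are made up for illustration, and this is not how scikit-learn itself is written; it only shows that both representations give the same per-side counts and the same mean.

```python
import random
from collections import Counter

rng = random.Random(0)
N = 85
feature = [rng.random() for _ in range(N)]  # hypothetical feature values
target = [rng.random() for _ in range(N)]   # hypothetical target values

# Bootstrap draw: N indices with replacement.
indices = [rng.randrange(N) for _ in range(N)]
# Compact form: unique index -> number of times it was drawn (its weight).
weights = Counter(indices)

# Mean of the target, computed both ways.
mean_duplicated = sum(target[i] for i in indices) / N
mean_weighted = (
    sum(target[i] * w for i, w in weights.items()) / sum(weights.values())
)

# Number of samples sent to the left child by an arbitrary threshold.
threshold = 0.5
left_duplicated = sum(1 for i in indices if feature[i] <= threshold)
left_weighted = sum(w for i, w in weights.items() if feature[i] <= threshold)

print("unique samples M:", len(weights), "of N =", N)
print("means           :", mean_duplicated, mean_weighted)
print("left-side counts:", left_duplicated, left_weighted)
```

Because every copy of a sample has the same feature value, the copies can never be separated by a split, which is why a count per unique sample carries all the information.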
From the above, this speculation seems to be correct, but unfortunately I could not find any statement online that supports or disproves it.