
I have a question about machine learning.
We are building a machine learning model with a very small dataset (fewer than 100 samples). It is a binary classification task, and the two classes have almost the same number of samples. However, I noticed that the score changes greatly depending on how the data is split into training and test sets.
If the score changes significantly with the random_state value of train_test_split, should I conclude that the model is overfitting?

In addition, although the class labels we want to predict are fairly balanced, the dataset is very small to begin with, and we believe the data itself is biased. (To compare with the iris data: it is as if iris.target were well balanced, but sepal length and the other features covered only a fairly narrow range.)
I suspect that some degree of overfitting is unavoidable when machine learning is applied to data like this. Is that right? My attempts to reduce overfitting have not gone well, so I am wondering where to compromise.
・ The dataset is small (it is hard to enlarge right away and will only grow slowly)
・ The number of features is relatively small

As for the latter question, even an offhand comment would be welcome, so I'd appreciate any answer.

  • Answer # 1

    In the situation you describe, the problem is "insufficient data" rather than "overfitting".

    Due to the lack of data, the results are highly dependent on the choice of training and test data. The results will change depending on whether the selected training data and test data have similar tendencies.
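
    To see how strongly the split alone can move the score, here is a minimal sketch. It assumes scikit-learn is installed and uses synthetic data from make_classification as a stand-in for your dataset; the model (LogisticRegression), the helper name split_scores, and all sizes are illustrative choices, not taken from your question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 80 samples, 5 features, 2 roughly balanced classes.
X, y = make_classification(n_samples=80, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)


def split_scores(X, y, seeds, test_size=0.25):
    """Return the test accuracy of the same model for each split seed."""
    scores = []
    for seed in seeds:
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return np.array(scores)


# Only random_state changes between runs; the model and the data stay the same.
scores = split_scores(X, y, seeds=range(30))
print(f"min={scores.min():.2f} max={scores.max():.2f} std={scores.std():.2f}")
```

    With a test set of only about 20 samples, one misclassified sample changes the accuracy by 0.05, so a wide gap between min and max is expected even when the model itself is reasonable.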

    There is no hard rule for where "insufficient data" ends and "overfitting" begins. However, I venture to say "insufficient data" because, from what you describe, measures against overfitting such as regularization and dropout seem to have no effect. When data is insufficient, the basic remedy is to collect more data, but cross-validation and the methods described in the reference below are also effective.
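
    As a rough illustration of the cross-validation suggestion, the sketch below averages the score over many stratified splits instead of trusting a single train_test_split. It assumes scikit-learn; the synthetic data, the LogisticRegression model, and the choice of 5 folds repeated 10 times are placeholders to adapt to your own case.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in (assumption) for a small, balanced binary dataset.
X, y = make_classification(n_samples=80, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# 5 stratified folds, reshuffled and repeated 10 times: 50 scores in total.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Report the mean together with the spread, not a single number.
print(f"accuracy: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f} "
      f"over {len(cv_scores)} folds")
```

    The mean is a steadier estimate than any single split, and the standard deviation tells you how much of the variation you saw comes from the split itself.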

    Reference: Data augmentation and transfer learning

    In machine learning, before writing any model code, it is important to visualize the data itself with graphs and grasp the overall tendency of the data. Start with that. You will then see intuitively that the data varies a lot relative to its amount, and that there is not enough data yet.
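
    One simple way to do this is a per-feature histogram with the two classes overlaid. This is only a sketch, assuming matplotlib and scikit-learn are available; the synthetic data and the helper name plot_feature_histograms are hypothetical, so replace X and y with your own arrays.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

# Synthetic stand-in (assumption) for a small binary dataset with 4 features.
X, y = make_classification(n_samples=80, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)


def plot_feature_histograms(X, y):
    """Draw one histogram per feature, overlaying class 0 and class 1."""
    n_features = X.shape[1]
    fig, axes = plt.subplots(1, n_features, figsize=(3 * n_features, 3),
                             squeeze=False)
    axes = axes[0]  # single row of axes
    for j, ax in enumerate(axes):
        for label in (0, 1):
            ax.hist(X[y == label, j], bins=10, alpha=0.5,
                    label=f"class {label}")
        ax.set_title(f"feature {j}")
    axes[0].legend()
    fig.tight_layout()
    return fig


fig = plot_feature_histograms(X, y)
```

    Call fig.savefig("feature_histograms.png") to write the figure to a file. Sparse, lumpy histograms and features that cover only a narrow range are the visual signs that the data is too thin for the variation it contains.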