I have a question about machine learning.
We are building a machine learning model with a very small dataset (fewer than 100 samples). It is a binary classification task, and the two classes have almost the same number of samples. However, I noticed that the score changes greatly depending on how the data is split into training and test sets.
If the score changes significantly with the random_state value of train_test_split, does that mean I am overfitting?
In addition, although the class labels are fairly balanced, the dataset is very small to begin with, so we believe the data itself is biased. (To compare with the iris data: it is as if iris.target were well balanced, but sepal length and the other features covered only a fairly narrow range.)
I suspect that some degree of overfitting is unavoidable when machine learning is applied to data like this. Is that right? My attempts to reduce overfitting have not gone well, so I am wondering where to compromise.
・ The number of samples is small (it is difficult to increase right away, and it will grow only slowly)
・ The number of features is relatively small
As for the latter question, even an offhand comment would be fine, so I'd appreciate any answer.
Answer # 1
In your case, I think the problem is "insufficient data" rather than "overfitting".
Due to the lack of data, the results are highly dependent on the choice of training and test data. The results will change depending on whether the selected training data and test data have similar tendencies.
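To get a feel for how much of the swing can come from the test set size alone, here is a rough sketch using the binomial standard error of an accuracy estimate. The numbers are hypothetical (a true accuracy of 0.8 and a 20-sample test set, which is what a 20% hold-out of 100 samples gives), and the function name is my own. It ignores the extra variation caused by the training set changing as well, so the real spread is likely larger.

```python
import math


def accuracy_standard_error(true_accuracy: float, n_test: int) -> float:
    """Standard error of an accuracy estimate on n_test independent samples.

    Assumes each test prediction is an independent Bernoulli trial
    (binomial model), which is only an approximation.
    """
    return math.sqrt(true_accuracy * (1.0 - true_accuracy) / n_test)


# Hypothetical setting: true accuracy 0.8, test sets of increasing size.
for n_test in (20, 200, 2000):
    se = accuracy_standard_error(0.8, n_test)
    print(f"n_test={n_test:5d}  standard error={se:.3f}")
```

With only 20 test samples the standard error is roughly 0.09, so scores a good ten points above or below the true value are expected just from which samples land in the test set.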
There is no clear-cut boundary between "insufficient data" and "overfitting". I chose to call it "insufficient data" because, judging from your question, measures against overfitting such as regularization and dropout seem to have had no effect. If the problem is a data shortage, the basic remedy is to collect more data, but cross-validation and the methods described in the reference below are also effective measures.
Reference: Data augmentation and transfer learning
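As a minimal sketch of the cross-validation idea, the following uses scikit-learn's repeated stratified k-fold instead of a single train_test_split. The dataset here is synthetic (make_classification standing in for your roughly 90 samples), and the logistic regression pipeline is just a placeholder assumption, not a recommendation for your data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 90 samples, 5 features, 2 balanced classes.
X, y = make_classification(
    n_samples=90, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

# Placeholder model; scaling is inside the pipeline so it is refit on each fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5 folds repeated 10 times with different shuffles gives 50 scores.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(f"mean accuracy = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Reporting the mean together with the standard deviation over many splits makes the result far less sensitive to any one random_state, and the standard deviation itself tells you how unstable a single split is.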
In machine learning, before writing any code, it is important to visualize the data itself with graphs and grasp its overall tendencies. Start with that. You will be able to see intuitively that the data varies a lot relative to its amount, and that there is not yet enough of it.
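One simple way to do that first look is a per-class histogram of each feature. The sketch below uses the iris data mentioned in the question as a stand-in, with matplotlib; substitute your own feature matrix and labels.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Stand-in data: replace X and y with your own features and labels.
iris = load_iris()
X, y = iris.data, iris.target

# One panel per feature, with the classes overlaid in each panel.
fig, axes = plt.subplots(1, X.shape[1], figsize=(4 * X.shape[1], 3))
for j, ax in enumerate(axes):
    for label in sorted(set(y)):
        ax.hist(X[y == label, j], bins=10, alpha=0.5,
                label=iris.target_names[label])
    ax.set_title(iris.feature_names[j])
axes[0].legend()
fig.tight_layout()
# In a script, save the figure with fig.savefig(<path>) to view it.
```

If the class histograms overlap heavily, or a feature only covers a narrow range, that is visible here before any model is trained.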